Spark application deployment
Deploying Apache Spark applications effectively is crucial for leveraging Spark's full potential in big data processing. In this guide, we will cover Spark application deployment, including packaging dependencies, using spark-submit, and executing an example Spark application.
Before deploying, we must ensure that the application code is correctly structured. If the code has dependencies on external libraries or other projects, we need to package them along with the Spark code. The best approach is to create an assembly jar (uber jar) using SBT or Maven. This will include all required dependencies, except for Spark and Hadoop jars, which are provided by cluster managers at runtime.
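For instance, with sbt and the sbt-assembly plugin, a build definition along these lines marks Spark as a provided dependency so it stays out of the assembly jar. This is a minimal sketch; the project name and versions are illustrative and should match your cluster:
// build.sbt (illustrative versions; adjust to your environment)
name := "spark-application"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
// project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Running sbt assembly then produces a single jar containing the application together with its non-Spark dependencies.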
To submit a Spark job, we use the spark-submit command. A typical spark-submit command looks like this:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
<application-jar> \
[application-arguments]
--class: The main entry point of the application (e.g., org.apache.spark.examples.SparkPi).
--master: The master URL for the Spark cluster (e.g., spark://23.195.26.187:7077).
--deploy-mode: Specifies whether the driver runs on a worker node inside the cluster (cluster) or externally on the submitting machine (client).
--conf: Allows setting arbitrary Spark configuration properties as key=value pairs.
application-jar: Path to the compiled JAR file containing the application.
application-arguments: Arguments passed to the main method of the main class.
As a concrete example, here is the source of the SparkPi application that ships with Spark:
package org.apache.spark.examples
import scala.math.random
import org.apache.spark.sql.SparkSession

object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("Spark Pi").getOrCreate()
    // Number of partitions; defaults to 2 when no argument is given.
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    // Draw random points in the square [-1, 1] x [-1, 1] and count
    // how many land inside the unit circle.
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}
This Spark application estimates the value of π using a Monte Carlo simulation: it generates random points in the square [-1, 1] × [-1, 1] and counts those that fall within the unit circle. Since the square has area 4 and the unit circle has area π, the fraction of points inside the circle approximates π/4, which is why the count is multiplied by 4.
Locate the Spark examples JAR (available in Spark installation):
/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar
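The exact path depends on the Spark version and how it was installed (the path above comes from a Homebrew install); on a Unix-like system, you can locate the jar with something like:
find /usr/local -name "spark-examples*.jar" 2>/dev/null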
Run the application using spark-submit:
spark-submit --class org.apache.spark.examples.SparkPi \
/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 10
SparkPi does not write result files; it prints its estimate to the driver console. After execution, look for a line of the form "Pi is roughly 3.14..." in the output.
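The same application can also be submitted to a cluster by filling in the options described earlier. Here is a sketch; the master URL, deploy mode, and memory setting are placeholders to adapt to your environment:
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://23.195.26.187:7077 \
  --deploy-mode cluster \
  --conf spark.executor.memory=2g \
  /usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 100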
Apache Spark can be deployed in several modes: local mode, standalone cluster mode, on YARN, on Mesos, and (in newer releases) on Kubernetes.
The rest of this tutorial walks through building and submitting a simple application with spark-submit; for simplicity, the example runs against a local master rather than a full cluster.
Before deploying a Spark application, ensure that Spark is installed and that the Spark core JAR file is available for compilation. Create a directory to store the Spark application files:
mkdir spark-application
cd spark-application
Download the Spark core JAR file needed for compilation into this directory:
wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10-1.3.0.jar
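The next step compiles a file named SparkWordCount.scala, whose source is not shown above. A minimal sketch consistent with the output shown later might look like this (the input file name in.txt and the output directory outfile are assumptions):
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // Create a Spark context for the word-count job.
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount"))
    // Read the input file, split each line into words,
    // and count the occurrences of each word.
    val counts = sc.textFile("in.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Write (word, count) pairs to the outfile directory.
    counts.saveAsTextFile("outfile")
    sc.stop()
  }
}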
Compile the application using scalac with the required classpath:
scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala
Package the compiled class files into a JAR file. The Spark jars themselves do not need to be bundled, since they are provided at runtime:
jar -cvf wordcount.jar SparkWordCount*.class
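To double-check what was packaged, you can list the jar's contents with the standard JDK tool:
jar -tf wordcount.jar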
Submit the Spark job using spark-submit:
spark-submit --class SparkWordCount --master local wordcount.jar
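As an optional variation, you can use several local cores by passing local[N] (or local[*] for all available cores) as the master:
spark-submit --class SparkWordCount --master "local[4]" wordcount.jar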
If the execution is successful, you will see log output similar to the following:
Successfully started service 'sparkDriver' on port 42954
MemoryStore started with capacity 267.3 MB
Started SparkUI at http://192.168.1.217:4040
Added JAR file:/home/hadoop/spark-application/wordcount.jar
Stopped Spark web UI at http://192.168.1.217:4040
After execution, check the output directory:
cd outfile
ls
Expected output (one part file per output partition, plus a _SUCCESS marker):
part-00000 part-00001 _SUCCESS
To view the output:
cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
cat part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
Deploying an Apache Spark application involves setting up Spark, compiling the application, creating a JAR file, and submitting it using spark-submit. Understanding these steps is essential for running Spark applications efficiently in different environments.
By following this guide, you’ll be able to deploy your Spark applications successfully and optimize their performance. 🚀