Apache Spark Deployment: A Complete Guide with Examples

Introduction to Apache Spark Deployment

Deploying Apache Spark applications effectively is crucial for leveraging its full potential in big data processing. In this guide, we will cover Spark application deployment, including packaging dependencies, using spark-submit, and executing an example Spark application.

 

Steps for Deploying a Spark Application

Step 1: Preparing the Application Code

Before deploying, we must ensure that the application code is correctly structured. If the code has dependencies on external libraries or other projects, we need to package them along with the Spark code. The best approach is to create an assembly jar (uber jar) using SBT or Maven. This will include all required dependencies, except for Spark and Hadoop jars, which are provided by cluster managers at runtime.
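
As an illustration, a minimal sbt build for such an assembly jar might look like the sketch below (the project name, versions, and the sbt-assembly plugin line are assumptions; adjust them to your environment). The key point is marking the Spark dependency as provided so it stays out of the assembly jar.

// build.sbt -- minimal sketch for building an assembly (uber) jar with sbt-assembly
name := "spark-pi"
scalaVersion := "2.12.18"

// "provided" keeps Spark out of the assembly jar; the cluster supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"

// project/plugins.sbt (separate file):
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

Running sbt assembly then produces a single jar under target/ that can be handed to spark-submit.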

Step 2: Compiling and Creating an Assembly JAR

To compile and package your Spark application, use the following command:

scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkPi.scala

Next, create a JAR file for the application:

jar -cvf sparkpi.jar SparkPi*.class

Step 3: Submitting a Spark Application

To submit a Spark job, we use the spark-submit command. A common spark-submit command looks like this:

./bin/spark-submit \  
  --class <main-class> \  
  --master <master-url> \  
  --deploy-mode <deploy-mode> \  
  --conf <key>=<value> \  
  <application-jar> \  
  [application-arguments]

Explanation of Commonly Used Options:

  • --class: The main entry point of the application (e.g., org.apache.spark.examples.SparkPi).
  • --master: The master URL for the Spark cluster (e.g., spark://23.195.26.187:7077).
  • --deploy-mode: Specifies whether the driver runs on a worker node (cluster) or externally (client).
  • --conf: Allows setting arbitrary Spark configurations.
  • application-jar: Path to the compiled JAR file.
  • application-arguments: Arguments passed to the main method.
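
Putting these options together, a submission of the SparkPi example to a standalone cluster in cluster deploy mode might look like this (the master URL matches the example above; the jar path and memory setting are placeholders):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://23.195.26.187:7077 \
  --deploy-mode cluster \
  --conf spark.executor.memory=2g \
  /path/to/spark-examples.jar \
  100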

Example: Running a SparkPi Application

Scala Code for SparkPi Application

package org.apache.spark.examples
import scala.math.random
import org.apache.spark.sql.SparkSession

object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Spark Pi").getOrCreate()
    // Number of partitions (slices); defaults to 2 when no argument is given.
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    // Sample random points in the square [-1, 1] x [-1, 1] and count those inside the unit circle.
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    // The fraction of points inside the circle approximates pi / 4.
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}

This Spark application estimates the value of π using a Monte Carlo simulation: it generates random points in the square [-1, 1] × [-1, 1] and counts those that fall within the unit circle. Because the points are uniformly distributed, the fraction that lands inside the circle approaches the area ratio π/4, so π ≈ 4 · count / n, which is exactly what the final println computes.

Executing the SparkPi Application

Locate the Spark examples JAR, which is bundled with the Spark installation:

/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar

Run the application using spark-submit:

spark-submit --class org.apache.spark.examples.SparkPi \  
/usr/local/opt/apache-spark/libexec/examples/jars/spark-examples_2.11-2.4.3.jar 10

Step 4: Checking Output

The SparkPi example does not write output files; it prints its estimate to the driver's standard output, so look for a line such as "Pi is roughly 3.14..." in the console (client mode) or in the driver log (cluster mode). Applications that save results to a directory, such as the word-count walkthrough later in this guide, are checked by listing that output directory (see Step 6 below).

What is Apache Spark Deployment?

Apache Spark can be deployed in different modes:

  • Local Mode: Runs Spark on a single machine, often used for testing.
  • Standalone Mode: Runs Spark in a cluster without an external resource manager.
  • YARN Mode: Uses Hadoop’s YARN resource manager to deploy Spark.
  • Kubernetes Mode: Runs Spark applications in a Kubernetes cluster.

This tutorial focuses on submitting Spark applications with spark-submit; the examples below run with the local master, and the same commands work against a standalone cluster by changing the --master URL.
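
The mode is selected through the --master option of spark-submit. The following sketch shows typical master URLs for each mode (host names, ports, and the class/jar placeholders are illustrative):

# Local mode: run with as many worker threads as logical cores
spark-submit --master "local[*]" --class <main-class> <application-jar>

# Standalone mode: point at the standalone cluster master
spark-submit --master spark://<master-host>:7077 --class <main-class> <application-jar>

# YARN mode: the cluster location comes from the Hadoop configuration (HADOOP_CONF_DIR)
spark-submit --master yarn --deploy-mode cluster --class <main-class> <application-jar>

# Kubernetes mode: point at the Kubernetes API server
# (also requires a container image, e.g. --conf spark.kubernetes.container.image=<image>)
spark-submit --master k8s://https://<k8s-apiserver>:6443 --deploy-mode cluster --class <main-class> <application-jar>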

Prerequisites for Spark Deployment

Before deploying a Spark application, ensure the following:

  • Apache Spark is installed.
  • Java and Scala are installed.
  • The Spark core JAR file is available.
  • The application JAR file is compiled.
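
A quick way to confirm the first three prerequisites is to print the installed versions from the command line:

java -version
scala -version
spark-submit --version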

Steps to Deploy an Apache Spark Application

Step 1: Set Up Spark Application Directory

Create a directory to store Spark application files:

mkdir spark-application
cd spark-application

Step 2: Download Required Spark JAR Files

Download the necessary Spark core JAR file for compilation:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.3.0/spark-core_2.10-1.3.0.jar

Since the previous step already changed into the spark-application directory, the JAR is downloaded straight into it.
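
The remaining steps compile and package a word-count application whose source file is not reproduced in this guide. A minimal SparkWordCount.scala consistent with the commands and the outfile directory used below might look like the following sketch (the input-path handling and the hard-coded output directory are assumptions):

// SparkWordCount.scala -- a minimal word-count sketch, not the original source
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    // Read the input file passed as the first argument, or a default path.
    val input = if (args.length > 0) args(0) else "input.txt"
    val counts = sc.textFile(input)
      .flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word
    counts.saveAsTextFile("outfile")      // the directory inspected in Step 6
    sc.stop()
  }
}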

Step 3: Compile the Spark Application

Compile the application using scalac with the required classpath:

scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala

Step 4: Create a JAR File

Package the compiled class files into a JAR file:

jar -cvf wordcount.jar SparkWordCount*.class
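
Optionally, list the jar's contents to confirm that the compiled classes were packaged:

jar -tf wordcount.jar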

Step 5: Submit Spark Application

Submit the Spark job using spark-submit:

spark-submit --class SparkWordCount --master local wordcount.jar
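
If the application expects an input path as its first argument (as in the sketch above), append it after the jar, for example:

spark-submit --class SparkWordCount --master local wordcount.jar input.txt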

If the execution is successful, you will see an output similar to:

Successfully started service 'sparkDriver' on port 42954
MemoryStore started with capacity 267.3 MB
Started SparkUI at http://192.168.1.217:4040
Added JAR file:/home/hadoop/spark-application/wordcount.jar
Stopped Spark web UI at http://192.168.1.217:4040

Step 6: Verify the Output

After execution, check the output directory:

cd outfile
ls

Expected output:

part-00000  part-00001  _SUCCESS

To view the output:

cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
cat part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)

Conclusion

Deploying an Apache Spark application involves setting up Spark, compiling the application, creating a JAR file, and submitting it using spark-submit. Understanding these steps is essential for running Spark applications efficiently in different environments.

Next Steps:

  • Learn about Spark RDD operations.
  • Explore Spark SQL for data analysis.
  • Understand Spark Streaming for real-time processing.

By following this guide, you’ll be able to deploy your Spark applications successfully and optimize their performance. 🚀

 
