Apache Spark Configuration Guide: Optimize Spark Shell for Performance (2025)

3/24/2025

Introduction

Apache Spark provides several ways to configure an application and its cluster for performance, resource management, and logging. This guide covers the three primary configuration methods:

  • Spark Properties – Control application parameters via SparkConf or system properties.
  • Environment Variables – Set per-machine settings (e.g., IP addresses) via conf/spark-env.sh.
  • Logging – Configure logging behavior using log4j2.properties.

[Figure: Apache Spark UI environment tab showing active configurations]

Spark Properties

Spark properties are application-specific and can be set using a SparkConf object or Java system properties.

Basic Configuration Example

import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 2 worker threads and a descriptive application name
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
  • local[2]: Runs Spark locally with 2 threads for parallelism.
  • Time/Size Units: Durations take suffixes such as ms, s, m, h; byte sizes take kb, mb, gb (see the sketch below).
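
For example, sized and timed values are passed as suffixed strings on the same SparkConf; a minimal sketch (the chosen properties and values are illustrative, not tuning advice):

import org.apache.spark.SparkConf

// Byte sizes and durations are plain strings with unit suffixes
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "4g")     // byte size: 4g
  .set("spark.network.timeout", "120s")   // duration: 120 seconds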

Dynamic Loading

Avoid hardcoding configurations by passing them at runtime:

./bin/spark-submit --name "MyApp" --master local[4] --conf spark.eventLog.enabled=false  
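
For dynamic loading to take full effect, the application should build its SparkContext from an empty SparkConf, so that spark-submit (and conf/spark-defaults.conf) supply the settings at launch time; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

// Nothing is hardcoded here; the master URL, app name, and other
// properties are injected by spark-submit at launch time.
val sc = new SparkContext(new SparkConf())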

Configuration Precedence

  1. SparkConf settings (highest priority)
  2. Command-line --conf flags
  3. spark-defaults.conf file
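
To confirm which value actually won, the resolved configuration can be read back from the running context; a minimal sketch (the property name is just an example):

// sc is an existing SparkContext; getConf returns a copy of the merged configuration
println(sc.getConf.get("spark.eventLog.enabled", "false"))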

Environment Variables

Configure node-specific settings (e.g., IP, ports) via conf/spark-env.sh. Key variables:

Variable           Purpose
JAVA_HOME          Java installation path.
SPARK_LOCAL_IP     Binds Spark to a specific IP.
SPARK_PUBLIC_DNS   Hostname advertised to the cluster.
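
As an illustration, a minimal conf/spark-env.sh might look like the sketch below; it is an ordinary shell script sourced by Spark's launch scripts, and every path and address shown is a placeholder, not a recommendation:

# conf/spark-env.sh -- an ordinary shell script; values below are placeholders
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk        # Java installation path
export SPARK_LOCAL_IP=192.168.1.10                   # IP address Spark binds to on this node
export SPARK_PUBLIC_DNS=spark-node-1.example.com     # hostname advertised to the cluster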

Logging Configuration

Customize logging via log4j2.properties:

Copy the Template

cp conf/log4j2.properties.template conf/log4j2.properties  

Adjust log levels (e.g., INFO, ERROR) and appenders for better debugging and monitoring.
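
For quick, per-application tweaks, the log level can also be overridden programmatically with SparkContext.setLogLevel instead of editing the file; a minimal sketch:

// sc is an existing SparkContext; valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF
sc.setLogLevel("WARN")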

Key Configuration Parameters

Application Properties

Property               Default  Description
spark.app.name         (none)   Application name (visible in UI/logs).
spark.driver.memory    1g       Memory for the driver process.
spark.executor.memory  1g       Memory per executor.
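
These can be set on a SparkConf before the context is created, as in the sketch below. Note that in client mode spark.driver.memory cannot be set through SparkConf inside the application, because the driver JVM has already started; set it via spark-submit --driver-memory or spark-defaults.conf instead.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                  // spark.app.name
  .set("spark.executor.memory", "2g")   // memory per executor
val sc = new SparkContext(conf)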

Execution Behavior

Property                      Default  Description
spark.default.parallelism     Varies   Default number of partitions.
spark.sql.shuffle.partitions  200      Partitions for shuffles in SQL.
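
Both are ordinary Spark properties and can be supplied the same way; a minimal sketch (the values are illustrative, not tuning advice):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.default.parallelism", "8")        // default partition count for RDD operations
  .set("spark.sql.shuffle.partitions", "200")   // partitions used by SQL/DataFrame shuffles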

Dynamic Allocation

Property                              Default  Description
spark.dynamicAllocation.enabled       false    Enables dynamic executor scaling.
spark.dynamicAllocation.minExecutors  0        Minimum executors to retain.
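
Dynamic allocation also needs a way to keep shuffle data usable when executors are removed, via either the external shuffle service or shuffle tracking; a minimal sketch (the executor counts are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  // Keeps shuffle data usable after executor removal; an external shuffle
  // service (spark.shuffle.service.enabled=true) is the alternative.
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")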

Viewing Configurations

Check active settings in the Spark UI under the Environment tab: http://<driver>:4040.
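
The same information can also be dumped from within a running application; a minimal sketch:

// sc is an existing SparkContext; getAll returns every explicitly set property
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }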

Conclusion

Properly configuring Spark ensures optimal resource utilization and performance. Use:

  • SparkConf for application-specific settings.
  • spark-env.sh for cluster-wide machine configurations.
  • log4j2.properties for fine-grained logging control.

For advanced tuning, refer to the Spark documentation.
