Broadcast Variables & Accumulators in Apache Spark and PySpark

8/17/2025


When working with Apache Spark, performance optimization is crucial for handling large-scale data. Two important features that make Spark efficient are Broadcast Variables and Accumulators. These tools help reduce data shuffling, improve speed, and support better cluster-wide communication.

In this tutorial, we’ll explore what broadcast variables and accumulators are, their importance, and practical examples in PySpark and Scala.



What are Broadcast Variables in Spark?

A broadcast variable allows the programmer to keep a read-only copy of data cached on each worker node rather than shipping a copy with every task.

Instead of sending the same data multiple times, Spark broadcasts it once and reuses it across nodes. This improves efficiency, especially when working with lookup tables, configurations, or reference data.

Example: Broadcast Variable in PySpark

from pyspark import SparkContext

sc = SparkContext("local", "Broadcast Example")

# Broadcast variable
lookup_data = {"A": "Apple", "B": "Banana", "C": "Cherry"}
broadcast_var = sc.broadcast(lookup_data)

# Sample RDD
rdd = sc.parallelize(["A", "B", "C", "A", "B"])

# Use broadcast variable
result = rdd.map(lambda x: (x, broadcast_var.value[x])).collect()

print(result)

✅ Output: [('A', 'Apple'), ('B', 'Banana'), ('C', 'Cherry'), ('A', 'Apple'), ('B', 'Banana')]
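
If the lookup data is no longer needed later in the job, the cached copies can be released explicitly. A small optional addition to the example above, reusing broadcast_var and sc:

# Free the cached copies on the executors; the variable can be re-broadcast if used again
broadcast_var.unpersist()

# Stop the context once the job is finished
sc.stop()

unpersist() only drops the executor-side copies, while destroy() releases the broadcast variable permanently so it cannot be used afterwards.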


What are Accumulators in Spark?

An accumulator is a variable that workers can only “add” to through an associative and commutative operation, and whose value can be read only by the driver program.

They are often used for counters, sums, and debugging. Unlike regular variables, accumulators are write-only from workers and readable on the driver.

Example: Accumulator in PySpark

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator Example")

# Create an accumulator
acc = sc.accumulator(0)

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Add values to accumulator
def add_func(x):
    global acc
    acc += x

rdd.foreach(add_func)

print("Accumulator Value:", acc.value)

✅ Output: Accumulator Value: 15
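
One behaviour worth noting: accumulator updates placed inside a transformation such as map() are not applied until an action runs, because transformations are lazy. A small sketch reusing sc and rdd from the example above:

# Updates inside a transformation take effect only when an action triggers execution
lazy_acc = sc.accumulator(0)
mapped = rdd.map(lambda x: lazy_acc.add(x) or x)  # add() returns None, so 'or x' keeps the element

print("Before action:", lazy_acc.value)  # still 0, map() has not executed yet
mapped.count()                           # the action forces evaluation
print("After action:", lazy_acc.value)   # now 15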


Broadcast Variables vs Accumulators

Feature | Broadcast Variable | Accumulator
Purpose | Distribute read-only data across nodes | Aggregate values across nodes
Access | Read-only | Write-only from workers, readable on the driver
Use Case | Lookup tables, configurations, reference data | Counters, sums, debugging
Scope | Available to all workers | Updated by workers, collected at the driver

Use Cases in Real-World Applications

  1. Broadcast Variables

    • Distributing machine learning models to all worker nodes.

    • Sharing lookup/reference tables in ETL jobs.

    • Caching common datasets for repeated access.

  2. Accumulators

    • Counting bad records in data pipelines (see the PySpark sketch after this list).

    • Monitoring metrics (e.g., number of null values).

    • Summing transaction amounts across a cluster.
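
To make the bad-record use case concrete, here is a minimal PySpark sketch; the record format and the validation rule are made up for illustration, and the accumulator is used purely for monitoring rather than for business logic.

from pyspark import SparkContext

sc = SparkContext("local", "Bad Record Counter")

# Hypothetical CSV-like records; the last two are malformed
raw = sc.parallelize(["id1,100", "id2,250", "id3,", "id4,abc"])

bad_records = sc.accumulator(0)

def parse(line):
    parts = line.split(",")
    try:
        return (parts[0], float(parts[1]))
    except (IndexError, ValueError):
        bad_records.add(1)  # counted on the workers, read back on the driver
        return None

parsed = raw.map(parse).filter(lambda r: r is not None)

print("Good records:", parsed.collect())       # the action forces evaluation
print("Bad records seen:", bad_records.value)  # 2 for this sample data

sc.stop()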


1. Broadcast Variables in Spark

Example in Scala:

import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Broadcast Example").getOrCreate()
    val sc = spark.sparkContext

    // Large lookup map
    val lookupMap = Map(1 -> "Apple", 2 -> "Banana", 3 -> "Cherry")
    val broadcastVar = sc.broadcast(lookupMap)

    val data = sc.parallelize(Seq(1, 2, 3, 4))

    val result = data.map(x => broadcastVar.value.getOrElse(x, "Unknown"))
    result.collect().foreach(println)

    spark.stop()
  }
}

Output:

Apple
Banana
Cherry
Unknown

2. Accumulators in Spark

What are Accumulators?

Accumulators are variables used to perform aggregations across tasks, such as counters or sums. They are write-only from workers and readable only on the driver.

They are commonly used for monitoring, debugging, or implementing custom metrics.

Example in Scala:

import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Accumulator Example").getOrCreate()
    val sc = spark.sparkContext

    // Define an accumulator
    val accumulator = sc.longAccumulator("My Accumulator")

    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

    data.foreach(x => accumulator.add(x))

    println("Accumulator Value: " + accumulator.value)

    spark.stop()
  }
}

Output:

Accumulator Value: 15

3. Key Differences Between Broadcast Variables and Accumulators

Feature | Broadcast Variables | Accumulators
Purpose | Share read-only data across nodes | Aggregate values across tasks
Accessibility | Read-only on workers | Write-only on workers
Example Use Case | Lookup tables, configuration data | Counting, summation, metrics
Data Movement | Cached on each worker node | Values sent back to the driver


Best Practices

  • Use broadcast variables for large reference datasets instead of shipping them with every task or shuffling them in a regular join (see the DataFrame sketch after this list).

  • Use accumulators for counters and debugging, not for core business logic, since their updates are not guaranteed to be applied exactly once when tasks are retried.

  • Be careful when relying on accumulator values in production: updates made inside transformations may be applied more than once if tasks or stages are re-executed, while updates made inside actions are applied only once.
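
To make the first point concrete, here is a minimal PySpark DataFrame sketch of a broadcast join; the tables and column names are hypothetical, and broadcast() is a hint asking the optimizer to ship the small table to every executor instead of shuffling the larger one.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("Broadcast Join Example").getOrCreate()

# Hypothetical fact table and small reference table
orders = spark.createDataFrame(
    [(1, "A", 100.0), (2, "B", 250.0), (3, "C", 75.0)],
    ["order_id", "fruit_code", "amount"])

fruits = spark.createDataFrame(
    [("A", "Apple"), ("B", "Banana"), ("C", "Cherry")],
    ["fruit_code", "fruit_name"])

# The broadcast() hint avoids shuffling 'orders' for the join
joined = orders.join(broadcast(fruits), on="fruit_code", how="left")
joined.show()

spark.stop()

Spark can also broadcast small tables automatically based on spark.sql.autoBroadcastJoinThreshold; the explicit hint simply makes the intent visible in the code.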


Conclusion

Both Broadcast Variables and Accumulators are essential tools in Apache Spark for optimizing performance and managing distributed data processing.

  • Use Broadcast Variables when you need to efficiently share large read-only datasets across executors.

  • Use Accumulators when you need to collect metrics or perform aggregations from worker nodes back to the driver.

Mastering these concepts helps in writing efficient, scalable, and optimized Spark applications.