Broadcast Variables & Accumulators in Apache Spark and PySpark
When working with Apache Spark, performance optimization is crucial for handling large-scale data. Two important features that make Spark efficient are Broadcast Variables and Accumulators. These tools help reduce data shuffling, improve speed, and support better cluster-wide communication.
In this tutorial, we’ll explore what broadcast variables and accumulators are, their importance, and practical examples in PySpark and Scala.
A broadcast variable allows the programmer to keep a read-only copy of data cached on each worker node rather than shipping a copy with every task.
Instead of sending the same data multiple times, Spark broadcasts it once and reuses it across nodes. This improves efficiency, especially when working with lookup tables, configurations, or reference data.
from pyspark import SparkContext
sc = SparkContext("local", "Broadcast Example")
# Broadcast variable
lookup_data = {"A": "Apple", "B": "Banana", "C": "Cherry"}
broadcast_var = sc.broadcast(lookup_data)
# Sample RDD
rdd = sc.parallelize(["A", "B", "C", "A", "B"])
# Use broadcast variable
result = rdd.map(lambda x: (x, broadcast_var.value[x])).collect()
print(result)
✅ Output: [('A', 'Apple'), ('B', 'Banana'), ('C', 'Cherry'), ('A', 'Apple'), ('B', 'Banana')]
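Broadcast variables stay cached on the executors until they are released. Continuing the example above, a short sketch of freeing that memory once the lookup table is no longer needed:

# Drop the cached copies on the executors; Spark re-broadcasts the value
# automatically if it is used again
broadcast_var.unpersist()

# Or remove it permanently from executors and driver
# (broadcast_var must not be used after destroy())
broadcast_var.destroy()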
An accumulator is a variable that workers can “add” to using an associative operation, and the result is only available to the driver program.
They are often used for counters, sums, and debugging. Unlike regular variables, accumulators are write-only from workers and readable on the driver.
from pyspark import SparkContext
sc = SparkContext("local", "Accumulator Example")
# Create an accumulator
acc = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Add values to accumulator
def add_func(x):
    global acc
    acc += x

rdd.foreach(add_func)
print("Accumulator Value:", acc.value)
✅ Output: Accumulator Value: 15
| Feature | Broadcast Variable | Accumulator |
|---|---|---|
| Purpose | Distribute read-only data across nodes | Aggregate values across nodes |
| Access | Read-only | Write-only from workers, readable on the driver |
| Use Case | Lookup tables, configurations, reference data | Counters, sums, debugging |
| Scope | Available to all workers | Updated by workers, collected at the driver |
Common use cases for Broadcast Variables:
- Distributing machine learning models to all worker nodes.
- Sharing lookup/reference tables in ETL jobs (see the DataFrame sketch after this list).
- Caching common datasets for repeated access.
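To make the lookup-table use case concrete, the DataFrame API exposes the same idea through a broadcast join hint. The sketch below is illustrative only; the tables and column names (orders, fruits, fruit_code) are made up for the example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("Broadcast Join Sketch").getOrCreate()

# Hypothetical data: a large fact table and a small reference table
orders = spark.createDataFrame(
    [(1, "A", 100.0), (2, "B", 250.0), (3, "C", 75.0)],
    ["order_id", "fruit_code", "amount"])
fruits = spark.createDataFrame(
    [("A", "Apple"), ("B", "Banana"), ("C", "Cherry")],
    ["fruit_code", "fruit_name"])

# broadcast() hints Spark to replicate the small table to every executor,
# so the large table does not need to be shuffled for the join
joined = orders.join(broadcast(fruits), on="fruit_code", how="left")
joined.show()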
Common use cases for Accumulators:
- Counting bad records in data pipelines (see the sketch after this list).
- Monitoring metrics (e.g., number of null values).
- Summing transaction amounts across a cluster.
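As a rough sketch of the bad-record use case (the input values and the parse helper are hypothetical), an accumulator can count parse failures without stopping the job:

from pyspark import SparkContext

sc = SparkContext("local", "Bad Record Counter")
bad_records = sc.accumulator(0)

# Hypothetical raw input; two of the records are malformed
raw = sc.parallelize(["10", "25", "not_a_number", "42", ""])

def parse(value):
    try:
        return int(value)
    except ValueError:
        bad_records.add(1)   # count the failure, keep the pipeline running
        return None

parsed = raw.map(parse).filter(lambda v: v is not None)
print("Parsed values:", parsed.collect())   # the action triggers the updates
print("Bad records:", bad_records.value)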
The same broadcast pattern in Scala:

import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Broadcast Example").getOrCreate()
    val sc = spark.sparkContext

    // Lookup map (in practice this could be a large reference table)
    val lookupMap = Map(1 -> "Apple", 2 -> "Banana", 3 -> "Cherry")
    val broadcastVar = sc.broadcast(lookupMap)

    val data = sc.parallelize(Seq(1, 2, 3, 4))
    val result = data.map(x => broadcastVar.value.getOrElse(x, "Unknown"))

    result.collect().foreach(println)
    spark.stop()
  }
}
✅ Output:
Apple
Banana
Cherry
Unknown
Accumulators are variables used to perform aggregations across tasks, such as counters or sums. They are write-only for workers and readable only on the driver program.
They are commonly used for monitoring, debugging, or implementing custom metrics.
import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Accumulator Example").getOrCreate()
    val sc = spark.sparkContext

    // Define an accumulator
    val accumulator = sc.longAccumulator("My Accumulator")

    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
    data.foreach(x => accumulator.add(x))

    println("Accumulator Value: " + accumulator.value)
    spark.stop()
  }
}
✅ Output:
Accumulator Value: 15
| Feature | Broadcast Variables | Accumulators |
|---|---|---|
| Purpose | Share read-only data across nodes | Aggregate values across tasks |
| Accessibility | Read-only on workers | Write-only on workers |
| Example Use Case | Lookup tables, configuration data | Counting, summation, metrics |
| Data Movement | Cached on each worker node | Values sent back to the driver |
Use broadcast variables for large reference datasets instead of shipping the same data with every task or shuffling it in repeated joins.
Use accumulators for counters and debugging, not for main business logic, since updates are not guaranteed to be applied exactly once when tasks are retried.
Be careful with accumulators in production: updates made inside transformations may be applied more than once if a stage is recomputed, whereas updates made inside actions (such as foreach) are applied exactly once per task. The sketch below illustrates the transformation caveat.
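A minimal sketch of that caveat (the names are made up for the illustration): when the accumulator is updated inside a transformation and the uncached RDD is evaluated twice, the count doubles.

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator Caveat")
counter = sc.accumulator(0)

def tag(x):
    counter.add(1)       # update inside a transformation -- not retry-safe
    return x

rdd = sc.parallelize([1, 2, 3]).map(tag)   # not cached, so it can be recomputed

rdd.count()
print(counter.value)     # 3 after the first evaluation
rdd.count()
print(counter.value)     # 6: the updates were applied again on re-evaluation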
Both Broadcast Variables and Accumulators are essential tools in Apache Spark for optimizing performance and managing distributed data processing.
Use Broadcast Variables when you need to efficiently share large read-only datasets across executors.
Use Accumulators when you need to collect metrics or perform aggregations from worker nodes back to the driver.
Mastering these concepts helps in writing efficient, scalable, and optimized Spark applications.