# Caching & Persistence in Spark with Scala and PySpark
Efficient memory management and computation optimization are essential when working with big data. In Apache Spark, caching and persistence play a key role in improving performance by reusing intermediate results instead of recomputing them multiple times. In this tutorial, we’ll explore caching and persistence in Spark, along with practical examples in Scala and PySpark.
When Spark executes a job, it builds a Directed Acyclic Graph (DAG) of transformations. By default, intermediate results are not stored—they are recomputed whenever needed. This can be inefficient for iterative algorithms or multiple actions on the same DataFrame/RDD.
To optimize this, Spark provides:
- Cache – Stores the RDD/DataFrame in memory for quick reuse.
- Persist – Stores the RDD/DataFrame in memory, on disk, or both, depending on the storage level you choose.
## Cache vs Persist

| Feature | Cache | Persist |
|---|---|---|
| Default Storage | MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames | Configurable (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.) |
| Flexibility | No arguments; always uses the default level | Supports multiple storage levels |
| Use Case | Smaller datasets that fit in RAM | Larger datasets, or when you need control over memory/disk trade-offs and replication |
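Before the full examples, here is a minimal sketch of what caching actually changes (a toy DataFrame built with spark.range; the app name and column names are made up for illustration). Without a cache, every action recomputes the DataFrame from its source; after cache() and a first action, the physical plan reads from the in-memory copy instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheVsRecompute").getOrCreate()

# Toy DataFrame standing in for an expensive read or transformation
df = spark.range(10_000_000).selectExpr("id", "id % 10 AS bucket")

df.explain()   # physical plan recomputes from the Range source on every action

df.cache()     # mark the DataFrame for caching (lazy)
df.count()     # first action materializes the cache

df.explain()   # plan now scans the cached in-memory relation instead of recomputing
```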
## Example: Caching in PySpark

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("CachingExample").getOrCreate()

# Load DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Cache DataFrame
df.cache()

# Perform action
print("Row count:", df.count())

# Cached data will be reused
print("Average Age:", df.groupBy().avg("age").collect())
```
✅ df.cache() marks the DataFrame for in-memory storage; the data is actually cached when the first action (the count) runs, and later actions such as the aggregation reuse it.
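If you want to check whether the DataFrame is really cached, or free the memory once you are done with it, a short sketch using the same df (the is_cached and storageLevel properties and unpersist() are standard PySpark DataFrame members):

```python
# Check whether df is currently marked as cached, and at what level
print("Is cached:", df.is_cached)
print("Storage level:", df.storageLevel)

# Release the cached data once it is no longer needed
df.unpersist()
```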
## Example: Persisting in PySpark

```python
from pyspark.storagelevel import StorageLevel

# Persist DataFrame with MEMORY_AND_DISK
df.persist(StorageLevel.MEMORY_AND_DISK)

print("Row count:", df.count())
```

✅ With MEMORY_AND_DISK, partitions that do not fit in memory are spilled to disk instead of being recomputed.
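One detail worth knowing, sketched below with the same df (so the variable name is assumed from the example above): persisting an already-persisted DataFrame does not change its storage level; Spark keeps the original level and logs a warning. Call unpersist() first if you want to switch levels.

```python
from pyspark.storagelevel import StorageLevel

# Switching levels requires dropping the existing one first
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)
print(df.storageLevel)   # now reflects DISK_ONLY
```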
## Example: Caching in Scala

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CacheExample").getOrCreate()

    // Infer the schema so that "age" is numeric and can be averaged
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")

    // Cache DataFrame
    df.cache()

    println("Row count: " + df.count())
    println("Average Age: " + df.groupBy().avg("age").collect().mkString)

    spark.stop()
  }
}
```
## Example: Persisting in Scala

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistExample").getOrCreate()

    val df = spark.read.option("header", "true").csv("data.csv")

    // Persist DataFrame with MEMORY_AND_DISK
    df.persist(StorageLevel.MEMORY_AND_DISK)

    println("Row count: " + df.count())

    spark.stop()
  }
}
```
## When to Use Cache vs Persist

Use cache when:

- The data fits comfortably in memory.
- You access the same DataFrame/RDD repeatedly.

Use persist when:

- The dataset is too large to keep entirely in memory.
- You want replicated storage levels so lost partitions can be recovered without recomputation (see the sketch right after this list).
- You need more control over how and where the data is stored.
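As a minimal sketch of the replication point (reusing the df from the PySpark examples, so the variable name is an assumption): a replicated level such as MEMORY_AND_DISK_2 keeps two copies of each cached partition, which lets Spark recover a lost copy without recomputing it from the lineage.

```python
from pyspark.storagelevel import StorageLevel

# Keep two replicas of each cached partition, spilling to disk when memory is full
df.unpersist()                              # drop the previously assigned level
df.persist(StorageLevel.MEMORY_AND_DISK_2)
print(df.storageLevel)                      # replication factor of 2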
## Common Storage Levels

- MEMORY_ONLY – Default for RDD persistence; stores the RDD/DataFrame in memory only.
- MEMORY_AND_DISK – Stores in memory; spills partitions to disk when memory runs out.
- DISK_ONLY – Stores only on disk.
- MEMORY_ONLY_SER – Stores a serialized format in memory (more space-efficient; Java/Scala only, since PySpark always stores data serialized).
- MEMORY_AND_DISK_SER – Serialized in memory, spilling to disk when needed (Java/Scala only).
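To make the flags behind these names concrete, the sketch below prints a few of PySpark's StorageLevel constants. Each level is just a combination of useDisk, useMemory, deserialized, and a replication factor (the _SER variants are not exposed in PySpark):

```python
from pyspark.storagelevel import StorageLevel

levels = {
    "MEMORY_ONLY": StorageLevel.MEMORY_ONLY,
    "MEMORY_AND_DISK": StorageLevel.MEMORY_AND_DISK,
    "DISK_ONLY": StorageLevel.DISK_ONLY,
    "MEMORY_AND_DISK_2": StorageLevel.MEMORY_AND_DISK_2,
}

for name, level in levels.items():
    # Each StorageLevel is defined by where data lives and how many replicas exist
    print(f"{name}: disk={level.useDisk}, memory={level.useMemory}, "
          f"deserialized={level.deserialized}, replication={level.replication}")
```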
## Conclusion

Caching and persistence in Spark are powerful techniques for optimizing job execution. While cache is shorthand for persisting with the default storage level, persist gives you explicit control over where and how data is stored. Used wisely in Scala and PySpark, they can significantly improve the performance of iterative computations and complex pipelines.