# Caching & Persistence in Spark with Scala and PySpark
Efficient memory management and computation optimization are essential when working with big data. In Apache Spark, caching and persistence play a key role in improving performance by reusing intermediate results instead of recomputing them multiple times. In this tutorial, we’ll explore caching and persistence in Spark, along with practical examples in Scala and PySpark.
When Spark executes a job, it builds a Directed Acyclic Graph (DAG) of transformations. By default, intermediate results are not stored—they are recomputed whenever needed. This can be inefficient for iterative algorithms or multiple actions on the same DataFrame/RDD.
To optimize this, Spark provides:
- Cache – Stores the RDD/DataFrame in memory for quick reuse.
- Persist – Stores the RDD/DataFrame in memory, on disk, or both, depending on the storage level you choose.
## Cache vs Persist

| Feature | Cache | Persist |
|---|---|---|
| Default Storage | MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames | Configurable (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.) |
| Flexibility | No arguments; always uses the default level | Supports multiple storage levels |
| Use Case | Smaller datasets that fit in RAM | Larger datasets, or when you need control over memory/disk trade-offs and replication |
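Before the full examples, here is a minimal sketch of what caching actually changes (a toy DataFrame built with spark.range; the app name and column names are made up for illustration). Without a cache, every action recomputes the DataFrame from its source; after cache() and a first action, the physical plan reads from the in-memory copy instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheVsRecompute").getOrCreate()

# Toy DataFrame standing in for an expensive read or transformation
df = spark.range(10_000_000).selectExpr("id", "id % 10 AS bucket")

df.explain()   # physical plan recomputes from the Range source on every action

df.cache()     # mark the DataFrame for caching (lazy)
df.count()     # first action materializes the cache

df.explain()   # plan now scans the cached in-memory relation instead of recomputing
```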
## Example: Caching in PySpark

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("CachingExample").getOrCreate()

# Load DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Cache DataFrame
df.cache()

# Perform action
print("Row count:", df.count())

# Cached data will be reused
print("Average Age:", df.groupBy().avg("age").collect())
```
✅ df.cache() marks the DataFrame for in-memory storage; the data is actually cached when the first action (the count) runs, and later actions such as the aggregation reuse it.
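If you want to check whether the DataFrame is really cached, or free the memory once you are done with it, a short sketch using the same df (the is_cached and storageLevel properties and unpersist() are standard PySpark DataFrame members):

```python
# Check whether df is currently marked as cached, and at what level
print("Is cached:", df.is_cached)
print("Storage level:", df.storageLevel)

# Release the cached data once it is no longer needed
df.unpersist()
```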
## Example: Persisting in PySpark

```python
from pyspark.storagelevel import StorageLevel

# Persist DataFrame with MEMORY_AND_DISK
df.persist(StorageLevel.MEMORY_AND_DISK)

print("Row count:", df.count())
```

✅ With MEMORY_AND_DISK, partitions that do not fit in memory are spilled to disk instead of being recomputed.
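One detail worth knowing, sketched below with the same df (so the variable name is assumed from the example above): persisting an already-persisted DataFrame does not change its storage level; Spark keeps the original level and logs a warning. Call unpersist() first if you want to switch levels.

```python
from pyspark.storagelevel import StorageLevel

# Switching levels requires dropping the existing one first
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)
print(df.storageLevel)   # now reflects DISK_ONLY
```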
## Example: Caching in Scala

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CacheExample").getOrCreate()

    // Infer the schema so that "age" is numeric and can be averaged
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")

    // Cache DataFrame
    df.cache()

    println("Row count: " + df.count())
    println("Average Age: " + df.groupBy().avg("age").collect().mkString)

    spark.stop()
  }
}
```
## Example: Persisting in Scala

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistExample").getOrCreate()

    val df = spark.read.option("header", "true").csv("data.csv")

    // Persist DataFrame with MEMORY_AND_DISK
    df.persist(StorageLevel.MEMORY_AND_DISK)

    println("Row count: " + df.count())

    spark.stop()
  }
}
```
## When to Use Cache vs Persist

Use cache when:

- The data fits comfortably in memory.
- You access the same DataFrame/RDD repeatedly.

Use persist when:

- The dataset is too large to keep entirely in memory.
- You want replicated storage levels so lost partitions can be recovered without recomputation (see the sketch right after this list).
- You need more control over how and where the data is stored.
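As a minimal sketch of the replication point (reusing the df from the PySpark examples, so the variable name is an assumption): a replicated level such as MEMORY_AND_DISK_2 keeps two copies of each cached partition, which lets Spark recover a lost copy without recomputing it from the lineage.

```python
from pyspark.storagelevel import StorageLevel

# Keep two replicas of each cached partition, spilling to disk when memory is full
df.unpersist()                              # drop the previously assigned level
df.persist(StorageLevel.MEMORY_AND_DISK_2)
print(df.storageLevel)                      # replication factor of 2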
## Common Storage Levels

- MEMORY_ONLY – Default for RDD persistence; stores the RDD/DataFrame in memory only.
- MEMORY_AND_DISK – Stores in memory; spills partitions to disk when memory runs out.
- DISK_ONLY – Stores only on disk.
- MEMORY_ONLY_SER – Stores a serialized format in memory (more space-efficient; Java/Scala only, since PySpark always stores data serialized).
- MEMORY_AND_DISK_SER – Serialized in memory, spilling to disk when needed (Java/Scala only).
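To make the flags behind these names concrete, the sketch below prints a few of PySpark's StorageLevel constants. Each level is just a combination of useDisk, useMemory, deserialized, and a replication factor (the _SER variants are not exposed in PySpark):

```python
from pyspark.storagelevel import StorageLevel

levels = {
    "MEMORY_ONLY": StorageLevel.MEMORY_ONLY,
    "MEMORY_AND_DISK": StorageLevel.MEMORY_AND_DISK,
    "DISK_ONLY": StorageLevel.DISK_ONLY,
    "MEMORY_AND_DISK_2": StorageLevel.MEMORY_AND_DISK_2,
}

for name, level in levels.items():
    # Each StorageLevel is defined by where data lives and how many replicas exist
    print(f"{name}: disk={level.useDisk}, memory={level.useMemory}, "
          f"deserialized={level.deserialized}, replication={level.replication}")
```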
## Conclusion

Caching and persistence in Spark are powerful techniques for optimizing job execution. While cache is shorthand for persisting with the default storage level, persist gives you explicit control over where and how data is stored. Used wisely in Scala and PySpark, they can significantly improve the performance of iterative computations and complex pipelines.