Caching & Persistence in Spark with Scala and PySpark

8/17/2025


Efficient memory management and computation optimization are essential when working with big data. In Apache Spark, caching and persistence play a key role in improving performance by reusing intermediate results instead of recomputing them multiple times. In this tutorial, we’ll explore caching and persistence in Spark, along with practical examples in Scala and PySpark.



What Are Caching and Persistence in Spark?

When Spark executes a job, it builds a Directed Acyclic Graph (DAG) of transformations. By default, intermediate results are not stored—they are recomputed whenever needed. This can be inefficient for iterative algorithms or multiple actions on the same DataFrame/RDD.

To optimize this, Spark provides:

  • Cache – Stores the RDD/DataFrame with a default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets) for quick reuse.

  • Persist – Stores the RDD/DataFrame with a storage level of your choice (memory, disk, or a combination).
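
To see both calls side by side, here is a minimal, self-contained sketch; the app name and the tiny two-row DataFrame are illustrative placeholders:

from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("CacheVsPersist").getOrCreate()

# A tiny in-memory DataFrame stands in for real data
df = spark.createDataFrame([("a", 30), ("b", 40)], ["name", "age"])

# cache() is shorthand for persist() with the default storage level
df.cache()
print(df.count())  # the first action materializes the cache
df.unpersist()     # release before assigning a different level

# persist() takes an explicit storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())

spark.stop()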


Cache vs Persist in Spark

| Feature | Cache | Persist |
| --- | --- | --- |
| Default Storage | MEMORY_ONLY (RDDs), MEMORY_AND_DISK (DataFrames/Datasets) | Configurable (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.) |
| Flexibility | Fixed default storage level | Supports multiple storage levels |
| Use Case | Best for smaller datasets that fit in RAM | Useful for larger datasets or replication needs |
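
Because the default differs between RDDs and DataFrames, it can be worth checking what Spark actually assigned. A small sketch using the DataFrame.storageLevel property (the toy DataFrame is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageLevelCheck").getOrCreate()

df = spark.createDataFrame([(1,), (2,)], ["x"])
df.cache()

# For a cached DataFrame this typically prints
# "Disk Memory Deserialized 1x Replicated", i.e. MEMORY_AND_DISK
print(df.storageLevel)

spark.stop()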

Caching in PySpark

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("CachingExample").getOrCreate()

# Load DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Cache DataFrame
df.cache()

# Perform action
print("Row count:", df.count())

# Cached data will be reused
print("Average Age:", df.groupBy().avg("age").collect())

✅ df.cache() marks the DataFrame for caching. Caching is lazy: the data is materialized by the first action (here, count()) and reused by subsequent actions.
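
Caching can also be confirmed and undone explicitly. A short sketch of the related calls, continuing with the df from above:

# is_cached reports whether the DataFrame is marked for caching
print(df.is_cached)  # True right after cache(), even before any action runs

# Release the cached data once it is no longer needed
df.unpersist()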


Persisting in PySpark

from pyspark.storagelevel import StorageLevel

# Persist DataFrame with MEMORY_AND_DISK
df.persist(StorageLevel.MEMORY_AND_DISK)

print("Row count:", df.count())

✅ With MEMORY_AND_DISK, partitions that don't fit in memory are spilled to disk instead of being recomputed.
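
One caveat worth knowing: Spark will not change the storage level of data that is already marked as persisted, so unpersist first when switching levels. A sketch, continuing with the same df:

from pyspark.storagelevel import StorageLevel

# Unpersist first, then assign the new storage level
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)
print("Row count:", df.count())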


Caching in Spark with Scala

import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CacheExample").getOrCreate()

    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")

    // Cache DataFrame
    df.cache()

    println("Row count: " + df.count())
    println("Average Age: " + df.groupBy().avg("age").collect().mkString)

    spark.stop()
  }
}

Persistence in Spark with Scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PersistExample").getOrCreate()

    val df = spark.read.option("header", "true").csv("data.csv")

    // Persist DataFrame with MEMORY_AND_DISK
    df.persist(StorageLevel.MEMORY_AND_DISK)

    println("Row count: " + df.count())

    spark.stop()
  }
}

When to Use Cache and Persist in Spark?

  1. Cache when:

    • The data fits comfortably in the default storage level.

    • You are repeatedly accessing the same DataFrame/RDD (see the sketch after this list).

  2. Persist when:

    • The dataset is too large for memory alone.

    • You need replication for fault tolerance (the _2 storage levels keep a second copy of each partition).

    • You want more control over storage levels.
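
The repeated-access case is the classic win for cache(). In this minimal PySpark sketch (the data and column names are illustrative), one filtered DataFrame feeds several actions, so the scan and filter run once instead of three times:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("IterativeReuse").getOrCreate()

df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["value", "bucket"])

# One filtered DataFrame feeds several actions
filtered = df.filter(F.col("value") > 10).cache()

print(filtered.count())                           # materializes the cache
print(filtered.agg(F.avg("value")).first())       # reuses cached partitions
print(filtered.groupBy("bucket").count().collect())

filtered.unpersist()
spark.stop()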


Storage Levels in Spark Persistence

  • MEMORY_ONLY – Default for RDDs; stores deserialized objects in memory only.

  • MEMORY_AND_DISK – Default for DataFrames/Datasets; stores in memory and spills to disk if needed.

  • DISK_ONLY – Stores only on disk.

  • MEMORY_ONLY_SER – Stores a serialized format in memory (more space-efficient at the cost of CPU; Scala/Java only, since PySpark data is always stored serialized).

  • MEMORY_AND_DISK_SER – Serialized in memory, spilling to disk if needed (Scala/Java only).
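
The replicated variants are available in PySpark as built-in constants (e.g., MEMORY_AND_DISK_2), and PySpark's StorageLevel constructor lets you compose a custom level. A sketch, assuming a DataFrame df that is not yet persisted:

from pyspark.storagelevel import StorageLevel

# Built-in replicated variant: keeps two copies of each partition
df.persist(StorageLevel.MEMORY_AND_DISK_2)

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
# composes a custom level, e.g. disk-only with two replicas
disk_only_2 = StorageLevel(True, False, False, False, 2)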


Conclusion

Caching and persistence in Spark are powerful techniques for optimizing job execution. While cache is shorthand for persisting with the default storage level, persist gives you explicit control over where and how the data is stored. Used judiciously in Scala and PySpark, they can significantly improve performance for iterative computations and complex pipelines.

