Resilient Distributed Dataset (RDD)

Updated: 01/20/2021 by Computer Hope

An RDD (Resilient Distributed Dataset) is the fundamental data structure and primary data abstraction of Apache Spark, provided by Spark Core. An RDD is a fault-tolerant, immutable, distributed collection of objects. Immutable means that once you create an RDD you cannot change it; applying an operation to it produces a new RDD instead.

An RDD is a distributed memory abstraction that lets a programmer perform in-memory computations on a large cluster. One of its most important advantages is fault tolerance: if part of the data is lost because of a failure, Spark recovers it automatically by recomputing it from the operations that originally produced it.
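
The following is a minimal sketch, in plain Python, of how immutability and automatic recovery can fit together. It is a toy single-process model and not Spark's API: the class ToyRDD and its lose_data method are hypothetical names invented for illustration. The idea it shows is that a derived dataset remembers its parent and the function applied to it, so lost results can be rebuilt on demand.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(
        self,
        source: Optional[List[int]] = None,
        parent: Optional["ToyRDD"] = None,
        fn: Optional[Callable[[int], int]] = None,
    ) -> None:
        # The input is copied into a tuple so it can never be modified.
        self._source = tuple(source) if source is not None else None
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # lineage: the function applied to the parent
        self._cache: Optional[List[int]] = None  # computed result; may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never changes self; it returns a new ToyRDD.
        return ToyRDD(parent=self, fn=fn)

    def collect(self) -> List[int]:
        # Rebuild the result from the lineage whenever it is missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose_data(self) -> None:
        # Simulate a failure that wipes the computed result.
        self._cache = None


base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)  # new dataset; base is untouched

first = doubled.collect()
doubled.lose_data()                  # pretend the node holding the data failed
recovered = doubled.collect()        # recomputed from base and the map function
```

In this sketch, first and recovered hold the same values, and base still contains its original elements. A real cluster tracks this kind of information per partition across many machines, which the toy model does not attempt.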

Features of Spark RDD

  • Lazy Evaluation
  • In-Memory Computation
  • Fault Tolerance
  • Immutability
  • Partitioning
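
Two of the features listed above, lazy evaluation and partitioning, can be illustrated with a small plain-Python sketch. This is a toy model under stated assumptions, not Spark's API: split_into_partitions, LazyPartitioned, and traced_square are hypothetical names made up for this example. It shows data split into chunks, and a map step that is only recorded until collect asks for the result.

```python
from typing import Callable, List, Tuple


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions: List[List[int]] = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


class LazyPartitioned:
    """Toy partitioned dataset that defers work (hypothetical, not Spark's API)."""

    def __init__(
        self,
        partitions: List[List[int]],
        pending: Tuple[Callable[[int], int], ...] = (),
    ) -> None:
        self._partitions = partitions
        self._pending = pending  # functions recorded but not yet applied

    def map(self, fn: Callable[[int], int]) -> "LazyPartitioned":
        # Only record the function; nothing is computed here.
        return LazyPartitioned(self._partitions, self._pending + (fn,))

    def collect(self) -> List[int]:
        # The action: apply every recorded function, one partition at a time.
        result: List[int] = []
        for partition in self._partitions:
            for value in partition:
                for fn in self._pending:
                    value = fn(value)
                result.append(value)
        return result


calls: List[int] = []


def traced_square(x: int) -> int:
    calls.append(x)  # record each invocation so laziness can be observed
    return x * x


parts = split_into_partitions(list(range(1, 8)), 3)
dataset = LazyPartitioned(parts).map(traced_square)
calls_before_action = len(calls)  # still zero: map did no work
squares = dataset.collect()       # the work happens here
```

In Spark itself, operations such as map are transformations that are evaluated lazily, and operations such as collect are actions that trigger the computation. The toy loop above processes partitions one after another, whereas a cluster can process them in parallel on different machines.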