What Is The Difference Between Persist() And Cache()?
#spark #rdd #dataframe #cache #persist
When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
No. RDDs are evaluated lazily: by default nothing is stored in memory, and every action recomputes the RDD from its lineage. You have to call cache() or persist() explicitly if you want the data kept around.

val textFile = sc.textFile("/tmp/user.txt")
textFile.cache()

cache() is useful when the lineage of the RDD branches out. Say you want to filter the words of the file above into separate counts of positive and negative words. You could do that like this:
val textFile = sc.textFile("/tmp/user.txt")
val wordsRDD = textFile.flatMap(line => line.split(" "))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
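The pattern above can be sketched as a minimal, self-contained program running Spark in local mode. The word lists and the session name here are illustrative assumptions, since isPositive and isNegative are not defined in the article:

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  def main(args: Array[String]): Unit = {
    // Local Spark session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("good", "bad", "great", "awful"))
    words.cache() // mark the RDD for in-memory storage; nothing happens yet

    val positive = Set("good", "great") // assumed stand-in for isPositive/isNegative

    // First action: computes the RDD and caches its partitions.
    val positiveCount = words.filter(w => positive.contains(w)).count()
    // Second action: served from the cache instead of re-reading the source.
    val negativeCount = words.filter(w => !positive.contains(w)).count()

    println(s"positive=$positiveCount negative=$negativeCount")

    words.unpersist() // release the cached partitions when done
    spark.stop()
  }
}
```

Without the cache() call, each of the two count() actions would recompute wordsRDD from scratch.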
cache() doesn't take any parameters; it is shorthand for persist() with the default storage level. Calling cache() on an RDD persists the objects in memory only (MEMORY_ONLY); on a DataFrame or Dataset the default level is MEMORY_AND_DISK.
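For example, on an RDD the two calls below are interchangeable (a sketch assuming an existing RDD named rdd):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is just persist() with the default storage level for the type:
rdd.cache()                            // for an RDD, equivalent to the line below
rdd.persist(StorageLevel.MEMORY_ONLY)  // the default level cache() applies to RDDs
```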
There are two flavours of persist(): a no-argument version that uses the default storage level, and a version that takes an explicit StorageLevel, for example:

import org.apache.spark.storage.StorageLevel

rdd.persist()                              // default level (MEMORY_ONLY for an RDD)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized objects in memory
df.persist(StorageLevel.DISK_ONLY)         // partitions stored on disk only
ds.persist(StorageLevel.MEMORY_AND_DISK)   // memory first, spilling to disk
Based on the provided StorageLevel, the behaviour of the persisted objects will vary.
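You can check which level was actually applied with getStorageLevel. A small sketch, assuming an existing SparkContext sc; note that changing the level of an already-persisted RDD requires calling unpersist() first:

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 100)
nums.persist(StorageLevel.MEMORY_ONLY_SER) // keep serialized objects in memory
println(nums.getStorageLevel.useMemory)    // true: this level uses memory
println(nums.getStorageLevel.useDisk)      // false: MEMORY_ONLY_SER never spills to disk

nums.unpersist()                           // required before persisting at a new level
nums.persist(StorageLevel.MEMORY_AND_DISK)
```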
In short, persist() and cache() let you control how Spark keeps RDD and DataFrame data in memory or on disk across actions.
This solution is provided by Shubham Mishra.
This article is contributed by the Developer Indian team. Please write comments if you find anything incorrect or want to share more information about the topic discussed above.