Top Spark Interview Questions and Answers 2023

Updated: 01/01/2023 by Computer Hope
Top 50 Spark Interview Questions.

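Q.1 What is the use of coalesce in Spark?
Ans: Spark's coalesce method reduces the number of partitions in a DataFrame or RDD. It merges existing partitions instead of performing a full shuffle, so it is usually cheaper than repartition when you only need fewer partitions.
For example, if data read from a CSV file sits in an RDD with four partitions, calling coalesce(2) on it returns a new RDD with two partitions.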
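
To make the idea concrete, here is a minimal sketch in plain Python, not Spark code. It models partitions as a list of lists and merges contiguous partitions into fewer groups without moving individual records between groups. The function name coalesce and the sample data are illustrative assumptions; in PySpark the real call would be rdd.coalesce(2) or df.coalesce(2), and Spark's actual grouping also takes data locality into account.

```python
from typing import List, TypeVar

T = TypeVar("T")


def coalesce(partitions: List[List[T]], num_partitions: int) -> List[List[T]]:
    """Toy model: merge contiguous partitions into at most num_partitions groups."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    current = len(partitions)
    # Asking for more partitions than exist changes nothing,
    # which mirrors coalesce without a shuffle.
    if num_partitions >= current:
        return [list(p) for p in partitions]
    merged = []
    for i in range(num_partitions):
        start = (i * current) // num_partitions
        end = ((i + 1) * current) // num_partitions
        # Concatenate whole parent partitions; no record is split or re-hashed.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


# Hypothetical data: four partitions, as in the CSV example above.
four_partitions = [["a", "b"], ["c"], ["d", "e"], ["f"]]
two_partitions = coalesce(four_partitions, 2)
print(len(two_partitions), two_partitions)
# prints: 2 [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Because whole partitions are glued together, the result can be unevenly sized; that is the trade-off for avoiding a shuffle.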

Q.2 What is the significance of Resilient Distributed Datasets in Spark?
Ans: Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
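
The first way of creating an RDD, parallelizing a collection in the driver program, can be pictured with a small plain-Python sketch. This is a toy model, not Spark's API: the helper names parallelize and square_partition are assumptions made for illustration, and a thread pool stands in for the cluster's executors. In PySpark the corresponding call would be sc.parallelize(data, 4).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def parallelize(data: List[int], num_slices: int) -> List[List[int]]:
    """Toy model: split a driver-side collection into contiguous partitions."""
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]


def square_partition(partition: List[int]) -> List[int]:
    """Work applied to one partition, independent of all the others."""
    return [x * x for x in partition]


partitions = parallelize(list(range(1, 9)), 4)

# Each partition is processed independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    squared = list(pool.map(square_partition, partitions))

print(partitions)  # prints: [[1, 2], [3, 4], [5, 6], [7, 8]]
print(squared)     # prints: [[1, 4], [9, 16], [25, 36], [49, 64]]
```

Since each output partition depends only on its own input partition, a lost result can be rebuilt by rerunning square_partition on that one input, which is the intuition behind the fault tolerance described above.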

Q.3 Which languages does Apache Spark support?
Ans: Apache Spark is written in Scala, and many developers use Scala to build Spark applications. It also provides APIs in Java, Python, and R.

Q.4 What are the different methods to run Spark over Apache Hadoop?
Ans: There are three methods to run Spark in a Hadoop cluster:
1. Standalone deployment
2. Hadoop YARN deployment
3. Spark in MapReduce (SIMR)

Q.5 What is SparkSession in Apache Spark?
Ans: SparkSession is the unified entry point of a Spark application, introduced in Spark 2.0. It provides a way to interact with Spark’s functionality through fewer constructs. Instead of separate SparkContext, HiveContext, and SQLContext objects, all of them are now encapsulated in a SparkSession.

Q.6 What operations does RDD support?
Ans: RDDs support two types of operations: transformations and actions.

Transformations: Transformations create a new RDD from an existing RDD; examples include map, reduceByKey, and filter. Transformations are computed lazily: they are executed only when an action needs their result.

Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system.
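
The split between lazy transformations and eager actions can be sketched with a toy class in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: LazyDataset, double, and the calls list are hypothetical names introduced only for this illustration. Transformations just record a step in the lineage; an action replays the lineage over the source data.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source: Iterable[Any], lineage: Tuple = ()):
        self._source = list(source)
        self._lineage = lineage  # ordered (kind, function) steps

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    # Actions: replay the recorded lineage and return a result to the caller.
    def collect(self) -> List[Any]:
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self) -> int:
        return len(self.collect())


calls = []


def double(x: int) -> int:
    calls.append(x)  # record when the work actually happens
    return x * 2


numbers = LazyDataset(range(1, 6))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)  # no action yet, so nothing has run

result = evens_doubled.collect()   # the action triggers the computation
print(calls_before_action)  # prints: []
print(result, calls)        # prints: [4, 8] [2, 4]
```

Note that numbers itself is never modified: each transformation returns a new dataset that remembers how to derive its rows from the original source.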

List of Spark Interview Questions.

  1. What Is The Difference Between Persist() And Cache()?
  2. What do you understand by SchemaRDD?
  3. What Is The Advantage Of A Parquet File?
  4. Explain Spark.
  5. Can you explain the main features of Apache Spark?
  6. What is Apache Spark?
  7. Explain the concept of Sparse Vector.
  8. What is the method for creating a data frame?
  9. Explain what is SchemaRDD.
  10. Explain what are accumulators.
  11. Explain the core of Spark.
  12. Explain how data is interpreted in Spark.
  13. How many forms of transformations are there?
  14. What is Apache Spark?
  15. What is Spark SQL?
  16. Explain Spark SQL caching and uncaching?
  17. What are the components of Apache Spark Ecosystem?
  18. What is Spark Core?
  19. Which languages does Apache Spark support?
  20. How is Apache Spark better than Hadoop?
  21. What are the different methods to run Spark over Apache Hadoop?
  22. What is SparkContext in Apache Spark?
  23. What is SparkSession in Apache Spark?
  24. SparkSession vs SparkContext in Apache Spark.
  25. What are the abstractions of Apache Spark?
  26. How can we create RDD in Apache Spark?
  27. Why is Spark RDD immutable?
  28. Explain the term paired RDD in Apache Spark
  29. How is RDD in Spark different from Distributed Storage Management?
  30. Explain transformation and action in RDD in Apache Spark.
  31. What are the types of Apache Spark transformation?
  32. Explain the RDD properties.
  33. What is lineage graph in Apache Spark?
  34. Explain the terms Spark Partitions and Partitioners.
  35. By Default, how many partitions are created in RDD in Apache Spark?
  36. What are Spark DataFrames?
  37. What are the benefits of DataFrames in Spark?
  38. What is a Spark Dataset?
  39. What are the advantages of Datasets in Spark?
  40. What is Directed Acyclic Graph in Apache Spark?
  41. What is the need for Spark DAG?
  42. What is the difference between DAG and Lineage?
  43. Explain the concept of “persistence”?
  44. What is Map-Reduce learning function?
  45. When processing data from HDFS, does the code run close to the data?
  46. Does Spark also contain the storage layer?
  47. What is the difference between Caching and Persistence in Apache Spark?
  48. What are the limitations of Apache Spark?
  49. List the advantage of Parquet file in Apache Spark.
  50. What is lazy evaluation in Spark?
  51. What are the benefits of Spark lazy evaluation?
  52. What are the ways to launch Apache Spark over YARN?
  53. Explain the various cluster managers in Apache Spark.
  54. How much faster is Apache Spark than Hadoop?
  55. Different Running Modes of Apache Spark
  56. What are the different ways of representing data in Spark?
  57. What is the write-ahead log (journaling) in Spark?
  58. Explain the Catalyst query optimizer in Apache Spark.
  59. What are shared variables in Apache Spark?
  60. How does Apache Spark handle accumulated metadata?
  61. What is the Apache Spark machine learning library (MLlib)?
  62. List commonly used machine learning algorithms.
  63. What is the difference between Dataset (DS), DataFrame (DF), and RDD?
  64. What is Speculative Execution in Apache Spark?
  65. How can data transfer be minimized when working with Apache Spark?
  66. What are the cases where Apache Spark surpasses Hadoop?
  67. What is an action, and how does it process data in Apache Spark?
  68. How is fault tolerance achieved in Apache Spark?
  69. What is the role of Spark Driver in spark applications?
  70. What is worker node in Apache Spark cluster?
  71. Why is Transformation lazy in Spark?
  72. Can I run Apache Spark without Hadoop?
  73. Explain Accumulator in Spark.
  74. What is the role of Driver program in Spark Application?
  75. How do you identify whether a given operation in your program is a transformation or an action?
  76. Name the two types of shared variables available in Apache Spark.
  77. What are the common mistakes developers make while using Apache Spark?
  78. By Default, how many partitions are created in RDD in Apache Spark?
  79. Why do we need compression, and which compression formats are supported?
  80. Explain the filter transformation.
  81. How do you start and stop Spark in the interactive shell?
  82. Explain sortByKey() operation.
  83. Explain the distinct(), union(), intersection() and subtract() transformations in Spark
  84. Explain foreach() operation in Apache Spark
  85. groupByKey vs reduceByKey in Apache Spark
  86. Explain mapPartitions() and mapPartitionsWithIndex()
  87. What is Map in Apache Spark?
  88. What is FlatMap in Apache Spark?
  89. Explain fold() operation in Spark.
  90. Explain API createOrReplaceTempView()
  91. Explain values() operation in Apache Spark.
  92. Explain keys() operation in Apache Spark.
  93. Explain textFile vs wholeTextFiles in Spark
  94. Explain cogroup() operation in Spark
  95. Explain pipe() operation in Apache Spark
  96. Explain Spark coalesce() operation
  97. Explain the repartition() operation in Spark
  98. Explain fullOuterJoin() operation in Apache Spark
  99. Explain Spark leftOuterJoin() and rightOuterJoin() operations
  100. Explain Spark join() operation
  101. Explain the top() and takeOrdered() operation
  102. Explain first() operation in Spark
  103. Explain sum(), max(), min() operation in Apache Spark
  104. Explain countByValue() operation in Apache Spark RDD
  105. Explain the lookup() operation in Spark
  106. Explain Spark countByKey() operation
  107. Explain Spark saveAsTextFile() operation
  108. Explain reduceByKey() Spark operation
  109. Explain the operation reduce() in Spark
  110. Explain the action count() in Spark RDD
  111. Explain Spark map() transformation
  112. Explain the flatMap() transformation in Apache Spark
  113. What are the limitations of Apache Spark?
  114. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?
  115. Explain Spark streaming
  116. What is DStream in Apache Spark Streaming?
  117. What is a paired RDD?
  118. What is implied by the treatment of memory in Spark?
  119. Explain the Directed Acyclic Graph.
  120. Explain the lineage graph.
  121. What are the various levels of persistence in Apache Spark?
  122. Explain lazy evaluation in Spark.
  123. Explain the advantage of a lazy evaluation.
  124. What are Spark’s key features?
  125. Explain PageRank.
  126. What are broadcast variables?
  127. What is piping, or the pipe() technique?
  128. What are broadcast variables?
  129. What is the difference between map() and flatMap()?
  130. On which port is the Spark UI available?
  131. What is the difference between createOrReplaceTempView and createGlobalTempView?
  132. What are the types of transformations on DStreams?
  133. What is Shuffling in Spark?
  134. Name different types of data sources available in SparkSQL.
  135. What is the role of a Spark Driver?
  136. What do you understand by typed and untyped datasets?
  137. How do the logical planning and physical planning processes take place in Spark?
  138. What do you understand by cost-based optimization (CBO) in Spark SQL?
  139. What are Partitions? How will you control Partitions in Spark?
  140. What are ML Pipelines and their key components?
  141. What is serialization? How will you handle serialization issues in Spark?