Top Apache Spark Architecture Interview Questions
Apache Spark's architecture follows a master-worker design in which the Driver acts as the master and Executors act as the workers that process distributed data.
The main components are:
Driver Program
Cluster Manager
Executors
Worker Nodes
The Driver:
Converts user code into an execution plan (a DAG)
Schedules tasks
Communicates with executors
Tracks job progress
Executors are responsible for:
Running tasks
Storing intermediate data
Sending results back to the Driver
A worker node is a machine in the cluster that runs executors and performs computations.
Cluster Manager:
Allocates resources
Manages cluster nodes
Launches executors
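The division of labor between Driver, Cluster Manager, and Executors can be sketched with a toy model in plain Python (this is illustrative only, not Spark's API): the "driver" splits a job into per-partition tasks, a thread pool plays the role of the cluster's executors, and results flow back to the driver.

```python
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # An "executor" runs one task over one data partition.
    return sum(partition)

def driver(data, num_partitions=4):
    # The "driver" splits the job into tasks, one per partition...
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # ...the pool stands in for executors launched by the cluster manager...
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(task, partitions))
    # ...and partial results are sent back to the driver and combined.
    return sum(partial_results)

print(driver(list(range(100))))  # 4950
```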
Execution flow:
The user submits an application
The Driver creates a DAG
The DAG is divided into stages
Tasks are assigned to executors
Results are returned to the Driver
A Directed Acyclic Graph (DAG) represents the logical execution plan of the job's tasks.
A stage is a group of tasks that can be executed without shuffling.
A task is the smallest unit of work sent to an executor.
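The DAG → stages → tasks hierarchy can be illustrated with a small planner. This is a toy sketch, not Spark's internals; the plan format and the `split_into_stages` helper are made-up names. The key idea it shows is real: a wide (shuffle) transformation ends the current stage, and each stage spawns one task per partition.

```python
# Each operation is (name, needs_shuffle). A shuffle ends the current stage.
plan = [("map", False), ("filter", False), ("groupByKey", True),
        ("mapValues", False), ("sortByKey", True), ("map", False)]

def split_into_stages(plan):
    stages, current = [], []
    for name, needs_shuffle in plan:
        current.append(name)
        if needs_shuffle:          # wide transformation: stage boundary
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(plan)
# One task per partition is created for each stage.
num_partitions = 3
tasks = [(i, p) for i, _ in enumerate(stages) for p in range(num_partitions)]
print(stages)      # [['map', 'filter', 'groupByKey'], ['mapValues', 'sortByKey'], ['map']]
print(len(tasks))  # 9
```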
Local Mode: Everything runs on a single machine, for testing.
Client Mode: The Driver runs on the client machine, useful for debugging.
Cluster Mode: The Driver runs inside the cluster, suitable for production.
Client Mode → Driver outside cluster
Cluster Mode → Driver inside cluster
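These modes are selected with spark-submit flags. A sketch of the three invocations, where yarn and app.py are placeholder values for your cluster manager and application:

```shell
# Local mode: everything runs in one JVM on this machine.
spark-submit --master "local[*]" app.py

# Client mode: the driver runs here; executors run in the cluster.
spark-submit --master yarn --deploy-mode client app.py

# Cluster mode: the driver also runs inside the cluster (production).
spark-submit --master yarn --deploy-mode cluster app.py
```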
Through RDD lineage, Spark recomputes lost partitions rather than relying on data replication.
Transformations are lazy: execution is delayed until an action (such as collect or count) is triggered.
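Lazy evaluation can be mimicked in plain Python (a sketch, not Spark code): transformations only record what to do, and nothing executes until an "action" like collect is called.

```python
class LazyDataset:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, f):                      # transformation: only records f
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, pred):                # transformation: only records pred
        return LazyDataset(self.data, self.ops + (("filter", pred),))

    def collect(self):                     # action: now the pipeline runs
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; only the action below triggers computation.
print(ds.collect())  # [0, 4, 16, 36, 64]
```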
Tasks are scheduled as close to their data as possible to reduce network cost (data locality).
If an executor fails, its tasks are reassigned to other executors, and lost partitions are recomputed using lineage.
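Recovery through lineage can be sketched like this (illustrative Python, not Spark internals): each child dataset remembers its parent and the transformation that produced it, so a lost partition can be rebuilt from its parent partition alone instead of restored from a replica.

```python
# Lineage: the child dataset records its parent and the transformation applied.
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}          # parent partitions
transform = lambda x: x * 10                         # recorded derivation

child = {i: [transform(x) for x in part] for i, part in source.items()}

# Simulate losing partition 1 (e.g., its executor crashed).
del child[1]

# Recover: re-apply the recorded transformation to the parent partition only.
child[1] = [transform(x) for x in source[1]]
print(sorted(child[1]))  # [30, 40]
```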
A shuffle is the redistribution of data across partitions during wide transformations such as groupByKey or join.
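A shuffle can be illustrated with hash partitioning in plain Python (a simplified sketch; real Spark shuffles also write shuffle files and move data over the network): records are redistributed so that all records with the same key end up in the same partition.

```python
def shuffle(partitions, num_out):
    """Redistribute (key, value) records so equal keys land in one partition."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # In real Spark, this append may be a cross-node network transfer.
            out[hash(key) % num_out].append((key, value))
    return out

before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
after = shuffle(before, num_out=2)
# Both ("a", 1) and ("a", 3) are now in the same partition,
# ready for a grouping or join to run locally.
```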
In-memory processing
DAG optimization
Reduced disk I/O
By distributing data across executors and processing in parallel.
Increase partitions
Cache data
Avoid unnecessary shuffles
Narrow → No shuffle
Wide → Requires shuffle
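The narrow/wide distinction in the same toy terms (illustrative, not Spark's API): a narrow transformation produces each output partition from exactly one input partition, while a wide one may need records from every input partition, which forces a shuffle.

```python
partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow (e.g., map): each output partition depends on one input partition,
# so the work stays local to each partition.
narrow = [[x * 2 for x in part] for part in partitions]

# Wide (e.g., grouping by even/odd): an output partition needs records
# from every input partition, so data must be redistributed first.
wide = {0: [], 1: []}
for part in partitions:
    for x in part:
        wide[x % 2].append(x)

print(narrow)            # [[2, 4, 6], [8, 10, 12]]
print(wide[0], wide[1])  # [2, 4, 6] [1, 3, 5]
```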
Yes. Spark can run without Hadoop by using its standalone cluster manager.
Focus on explaining this flow clearly:
Driver → DAG → Stages → Tasks → Executors → Result