Spark Directed Acyclic Graph
As I mentioned earlier, the Spark driver divides the stages of the DAG into tasks. Here, you can see that each stage is divided into two tasks. But why did Spark create only two tasks per stage? It depends on the number of partitions: in this program we have only two partitions, so each stage is divided into two tasks, and a single task runs on a single partition. The number of tasks for a job is:
number of tasks = (number of stages) * (number of partitions)
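For example, here is a minimal sketch (the data, app name, and local[2] master are illustrative assumptions) of a job over an RDD with two partitions: the shuffle introduced by reduceByKey splits it into two stages, each running two tasks, for four tasks in total.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoPartitionJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("two-partition-demo").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    // Two partitions, so every stage of this job runs two tasks.
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)

    // reduceByKey requires a shuffle, which splits the job into two stages
    // (map side and reduce side); 2 stages * 2 partitions = 4 tasks in total.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}
```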
Now, I think you may have a clear picture of how Spark works internally.
Jobs Tab
The Jobs tab displays a summary page of all jobs in the Spark application and a details page for each job. The summary page shows high-level information, such as the status, duration, and progress of all jobs and the overall event timeline. When you click on a job on the summary page, you see the details page for that job. The details page further shows the event timeline, DAG visualization, and all stages of the job.
List of stages (grouped by state: active, pending, completed, skipped, and failed)
- Stage ID
- Description of the stage
- Submitted timestamp
- Duration of the stage
- Tasks progress bar
- Input: Bytes read from storage in this stage
- Output: Bytes written to storage in this stage
- Shuffle read: Total shuffle bytes and records read, including both data read locally and data read from remote executors
- Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage
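As a sketch of how these columns get filled in (the file paths are placeholders, and an existing SparkContext named sc, such as the one in spark-shell, is assumed), a job that reads a text file, shuffles, and saves its result reports Input for the first stage, Shuffle write and Shuffle read around the reduceByKey boundary, and Output for the final stage; setting a job description also makes the job easy to spot on the summary page:

```scala
// Assumes an existing SparkContext named sc (e.g. a spark-shell session);
// the input and output paths are placeholders.
sc.setJobDescription("word count over a text file")

val counts = sc
  .textFile("data/words.txt")               // reported as Input on the first stage
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // shuffle write, then shuffle read

counts.saveAsTextFile("out/word-counts")    // reported as Output on the final stage
```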
Stages Tab
The Stages tab displays a summary page that shows the current state of all stages of all jobs in the Spark application. At the beginning of the page is a summary with the count of all stages by status (active, pending, completed, skipped, and failed).
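The same per-stage counts can also be read programmatically through SparkStatusTracker; here is a small sketch, assuming an active SparkContext named sc (note that getActiveStageIds only covers stages that are currently running, not the pending or completed ones the UI also lists):

```scala
// Assumes an active SparkContext named sc.
val tracker = sc.statusTracker

for {
  stageId <- tracker.getActiveStageIds()
  info    <- tracker.getStageInfo(stageId)
} {
  println(s"stage $stageId (${info.name()}): " +
    s"${info.numCompletedTasks()}/${info.numTasks()} tasks completed, " +
    s"${info.numActiveTasks()} active, ${info.numFailedTasks()} failed")
}
```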
Stage detail
The stage detail page begins with information such as total time across all tasks, locality level summary, Shuffle Read Size / Records, and associated job IDs. It then lists metrics for the tasks in the stage, including:
- Task deserialization time is the time spent deserializing the task on an executor before running it.
- Duration of tasks.
- GC time is the total JVM garbage collection time.
- Result serialization time is the time spent serializing the task result on an executor before sending it back to the driver.
- Getting result time is the time that the driver spends fetching task results from workers.
- Scheduler delay is the time the task waits to be scheduled for execution.
- Peak execution memory is the maximum memory used by the internal data structures created during shuffles, aggregations and joins.
- Shuffle Read Size / Records is the total shuffle bytes and records read, including both data read locally and data read from remote executors.
- Shuffle Read Blocked Time is the time that tasks spent blocked waiting for shuffle data to be read from remote machines.
- Shuffle Remote Reads is the total shuffle bytes read from remote executors.
- Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
- Shuffle spill (disk) is the size of the serialized form of the data on disk.
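These per-task metrics are also delivered to SparkListener callbacks; below is a sketch (assuming an existing SparkContext named sc) that logs a few of them as each task finishes:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Assumes an existing SparkContext named sc; logs a few of the metrics
// shown on the stage detail page as each task finishes.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      println(
        s"stage ${taskEnd.stageId}, task ${taskEnd.taskInfo.taskId}: " +
        s"run=${m.executorRunTime} ms, gc=${m.jvmGCTime} ms, " +
        s"deserialize=${m.executorDeserializeTime} ms, " +
        s"resultSer=${m.resultSerializationTime} ms, " +
        s"peakMem=${m.peakExecutionMemory} B, " +
        s"shuffleRead=${m.shuffleReadMetrics.totalBytesRead} B, " +
        s"spillMem=${m.memoryBytesSpilled} B, spillDisk=${m.diskBytesSpilled} B")
    }
  }
})
```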