Explain YARN Job Scheduling – Hadoop Tutorial
YARN Job Scheduling
In the Hadoop ecosystem, YARN (Yet Another Resource Negotiator) plays a crucial role in managing resources and scheduling jobs across the cluster. Efficient job scheduling is essential for ensuring fair resource allocation, maximizing cluster utilization, and maintaining performance for multiple users and applications.
This tutorial explains the YARN job scheduling mechanism, its components, and the different scheduling policies available in Hadoop.
Job scheduling in YARN is the process of allocating cluster resources (memory and CPU virtual cores, handed out as containers) to jobs and applications according to the configured scheduling policy. Because many users and applications may run at the same time, the scheduler decides how those resources are shared among them, within the capacity or fairness limits the cluster administrator has configured.
Three components take part in scheduling:
ResourceManager (RM):
Arbitrates resources across the entire cluster.
Runs the Scheduler, which decides how resources are distributed among applications.
ApplicationMaster (AM):
Requests resources from the ResourceManager.
Manages tasks for a single application.
NodeManager (NM):
Launches and monitors containers on its node; tasks run inside those containers.
Reports node health and container status back to the ResourceManager through periodic heartbeats.
YARN ships with three pluggable schedulers, FIFO, Capacity, and Fair, each suited to different use cases:
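FIFO Scheduler:
Runs jobs in the order they are submitted (First-In-First-Out).
Simple to configure, but a large job at the head of the queue can hold up every job behind it, which makes it a poor fit for multi-user clusters.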
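To make the head-of-queue blocking concrete, here is a minimal Python sketch of FIFO ordering. It is a toy model, not YARN code: the function name `fifo_schedule`, the job tuples, and the assumption that a job starts only once all of its containers are free are all simplifications introduced for illustration.

```python
import heapq


def fifo_schedule(jobs, total_containers):
    """Toy FIFO model (hypothetical, not the YARN implementation).

    jobs: list of (name, containers_needed, runtime) in submission order.
    Returns {name: (start_time, finish_time)}.
    Assumption: a job starts only when all of its containers are free,
    and no job may overtake the one submitted before it.
    """
    running = []            # min-heap of (finish_time, containers_held)
    free = total_containers
    clock = 0
    timeline = {}
    for name, need, runtime in jobs:
        if need > total_containers:
            raise ValueError(f"{name} can never fit on this cluster")
        # The job at the head of the queue blocks everything behind it
        # until enough running jobs have finished to make room.
        while free < need:
            finish_time, released = heapq.heappop(running)
            clock = max(clock, finish_time)
            free += released
        timeline[name] = (clock, clock + runtime)
        heapq.heappush(running, (clock + runtime, need))
        free -= need
    return timeline


# A 5-unit job waits 100 units because a large job was submitted first.
print(fifo_schedule([("large", 8, 100), ("small", 2, 5)], total_containers=8))
# {'large': (0, 100), 'small': (100, 105)}
```

The small job needs only two containers for five time units, yet it cannot start until the large job finishes. That waiting time is the cost that the Capacity and Fair schedulers are designed to avoid.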
Capacity Scheduler:
Allows multiple organizations or teams to share a cluster.
Divides resources into hierarchical queues, each with a guaranteed minimum capacity.
Lets a queue borrow idle capacity from other queues, which keeps utilization high while still supporting multiple tenants.
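The idea of guaranteed queue capacity plus borrowing of idle resources can be sketched in a few lines of Python. This is an illustrative model only: the function `capacity_allocate`, the queue names, and the "visit queues in declaration order" rule are assumptions made for the example, and the real Capacity Scheduler also enforces per-queue maximum capacity, user limits, and nested queues.

```python
def capacity_allocate(total, queue_percent, demand):
    """Toy capacity model (hypothetical, not the YARN implementation).

    total: containers in the cluster.
    queue_percent: {queue: guaranteed share in percent}, summing to 100.
    demand: {queue: containers currently requested}.
    """
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue capacities must sum to 100 percent")
    guaranteed = {q: total * pct // 100 for q, pct in queue_percent.items()}

    # Step 1: each queue receives its demand, capped at its guarantee.
    alloc = {q: min(demand.get(q, 0), guaranteed[q]) for q in queue_percent}

    # Step 2: capacity left idle is lent to queues with unmet demand.
    # Assumption: queues are visited in declaration order.
    spare = total - sum(alloc.values())
    for q in queue_percent:
        extra = min(spare, demand.get(q, 0) - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc


queues = {"engineering": 60, "analytics": 40}   # hypothetical queue names

# Both queues busy: each is held to its guarantee.
print(capacity_allocate(100, queues, {"engineering": 100, "analytics": 100}))
# {'engineering': 60, 'analytics': 40}

# Engineering is mostly idle: analytics borrows the unused capacity.
print(capacity_allocate(100, queues, {"engineering": 30, "analytics": 70}))
# {'engineering': 30, 'analytics': 70}
```

The second call shows why the guarantee is a minimum rather than a hard partition: capacity that one tenant is not using does not sit idle.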
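Fair Scheduler:
Distributes resources so that, over time, all running applications receive a roughly equal share.
Prevents any single job from monopolizing the cluster.
Supports preemption: when enabled, the scheduler can reclaim containers from applications running above their fair share and hand them to applications running below it.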
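One common way to define a "fair" split is max-min fairness: divide the cluster equally, and redistribute whatever an application cannot use to the others. The Python sketch below illustrates that idea and a simple preemption target. The names `fair_shares` and `over_fair_share` are hypothetical, and the model tracks a single resource with equal weights, which is a simplification of what the Fair Scheduler actually supports.

```python
def fair_shares(total, demands):
    """Max-min fair split of `total` units (toy model, single resource)."""
    shares = {app: 0.0 for app in demands}
    unsatisfied = {app for app, need in demands.items() if need > 0}
    remaining = float(total)
    while unsatisfied and remaining > 1e-9:
        equal_slice = remaining / len(unsatisfied)
        satisfied_now = set()
        for app in unsatisfied:
            # An app never receives more than it asked for.
            grant = min(equal_slice, demands[app] - shares[app])
            shares[app] += grant
            remaining -= grant
            if demands[app] - shares[app] <= 1e-9:
                satisfied_now.add(app)
        if not satisfied_now:
            break                      # every app took a full slice
        unsatisfied -= satisfied_now   # leftovers go to the rest next round
    return shares


def over_fair_share(usage, shares):
    """Amount each app holds above its fair share (candidate for preemption)."""
    return {app: usage[app] - shares[app]
            for app in usage if usage[app] > shares[app]}


shares = fair_shares(100, {"mapreduce": 90, "spark": 20})
print(shares)   # {'mapreduce': 80.0, 'spark': 20.0}

# If the MapReduce job already holds the whole cluster when Spark arrives,
# 20 units are above its fair share and could be reclaimed.
print(over_fair_share({"mapreduce": 100, "spark": 0}, shares))
# {'mapreduce': 20.0}
```

Note that the small job receives everything it asked for, and the large job receives the rest rather than being capped at half the cluster.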
A job moves through the following steps:
A user submits a job to the ResourceManager.
The ResourceManager allocates a container in which the job's ApplicationMaster is started.
The ApplicationMaster requests resources for the job's tasks.
The Scheduler in the ResourceManager applies the configured scheduling policy (FIFO, Capacity, or Fair) to allocate resources.
The tasks are launched in containers on the NodeManagers.
The ApplicationMaster tracks job progress, and the final status and results are reported back to the user.
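The six steps above can be traced in a small Python simulation. Everything here is hypothetical scaffolding: the class names mirror the YARN components but are not the Hadoop API, the state strings are invented for the example, and the "first node with a free slot" rule merely stands in for whichever scheduling policy is configured.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Stand-in for a worker node with a fixed number of container slots."""
    name: str
    free_slots: int
    launched: list = field(default_factory=list)

    def launch(self, label):
        self.free_slots -= 1
        self.launched.append(label)


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        # Step 4: placeholder policy, first node with a free slot wins.
        for node in self.nodes:
            if node.free_slots > 0:
                node.launch(label)       # Step 5: container starts on a node
                return node.name
        return None                      # cluster full; request stays unmet

    def submit(self, job, num_tasks):    # Step 1: user submits the job
        # Step 2: the first container hosts the ApplicationMaster.
        if self.allocate(f"{job}/AM") is None:
            return {"job": job, "state": "AM_NOT_STARTED",
                    "launched": 0, "pending": num_tasks}
        return ApplicationMaster(job, num_tasks, self).run()


class ApplicationMaster:
    def __init__(self, job, num_tasks, rm):
        self.job, self.num_tasks, self.rm = job, num_tasks, rm

    def run(self):
        launched = 0
        for i in range(self.num_tasks):  # Step 3: one request per task
            if self.rm.allocate(f"{self.job}/task-{i}") is not None:
                launched += 1
        pending = self.num_tasks - launched
        state = "ALL_TASKS_LAUNCHED" if pending == 0 else "TASKS_PENDING"
        # Step 6: progress is reported back to the submitter.
        return {"job": self.job, "state": state,
                "launched": launched, "pending": pending}


nodes = [NodeManager("nm1", 2), NodeManager("nm2", 2)]
rm = ResourceManager(nodes)
print(rm.submit("wordcount", 3))
# {'job': 'wordcount', 'state': 'ALL_TASKS_LAUNCHED', 'launched': 3, 'pending': 0}
```

The ApplicationMaster itself occupies a container, so a four-slot cluster has room for only three tasks of this job.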
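Consider two users sharing one cluster:
User A submits a large MapReduce job.
User B submits a smaller Spark job.
With the Fair Scheduler, the two jobs converge toward an equal share of the cluster, so the Spark job does not have to wait for the MapReduce job to finish.
With the Capacity Scheduler, each job draws its resources from the queue assigned to its user.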
Benefits of YARN job scheduling:
Efficient resource utilization across the cluster.
Fair allocation for multi-user environments.
Scalability for large enterprise workloads.
Flexibility with multiple scheduling policies.
YARN job scheduling allocates cluster resources among competing applications according to the policy the cluster is configured with. With the FIFO, Capacity, and Fair schedulers to choose from, YARN can be adapted to different organizational needs. Understanding these scheduling mechanisms is critical for optimizing Hadoop performance in real-world scenarios.