Apache Hadoop Architecture

Updated: 10/20/2022 by Computer Hope

This HDFS tutorial is designed as an all-in-one guide that answers common questions about the Hadoop architecture.

How many components are present in the Hadoop 1.0 architecture?

  • Name Node
  • Secondary Name Node
  • Data Node
  • Job Tracker
  • Task Tracker

Name Node

In Hadoop 1.0, HDFS has a single Name Node, also called the master node. The master node tracks files, manages the file system namespace, and holds the metadata for all data stored in HDFS. In particular, the Name Node records how many blocks each file has, which Data Nodes store those blocks, where the replicas are kept, and other details. Clients contact the Name Node directly.
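
To make this metadata concrete, here is a minimal in-memory sketch in Python. It is not the actual Name Node implementation: the class name `ToyNameNode`, its methods, the file path, and the block and node identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Illustrative stand-in for Name Node metadata (hypothetical, not Hadoop code)."""
    # file path -> ordered list of block IDs
    file_blocks: dict = field(default_factory=dict)
    # block ID -> set of Data Node names that hold a replica
    block_locations: dict = field(default_factory=dict)

    def add_file(self, path, blocks):
        """Record a file; `blocks` maps each block ID to the nodes storing it."""
        self.file_blocks[path] = list(blocks)
        for block_id, nodes in blocks.items():
            self.block_locations[block_id] = set(nodes)

    def locate(self, path):
        """Return (block ID, sorted replica locations) for each block of a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file("/logs/app.log", {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
})
print(nn.locate("/logs/app.log"))
```

The point of the sketch is that the Name Node holds only the mapping, while the block contents themselves live on the Data Nodes.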

Secondary Name Node

The Secondary Name Node exists only to take checkpoints of the file system metadata held by the Name Node. It is sometimes called the checkpoint node. It is a helper for the Name Node, not a standby replacement. Periodically, the Secondary Name Node requests the fsimage and edit log files from the Name Node, merges them into a compacted fsimage file, and returns that file to the Name Node.
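
The checkpoint step can be pictured as replaying a log of changes onto a snapshot. The sketch below is a simplified assumption of that idea in Python: the function `apply_edits`, the operation names `create` and `delete`, and the dictionary layout are hypothetical and do not reflect the real fsimage or edit log file formats.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged operations onto a copy of the namespace snapshot."""
    merged = dict(fsimage)  # leave the original snapshot untouched
    for op, path, value in edit_log:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Snapshot: path -> list of block IDs (toy representation)
fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}
# Changes recorded since the snapshot was taken
edit_log = [
    ("create", "/c.txt", ["blk_3"]),
    ("delete", "/a.txt", None),
]

new_fsimage = apply_edits(fsimage, edit_log)
print(new_fsimage)
```

After the merge, the edit log can start empty again, which is why checkpointing keeps the log from growing without bound.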

Data Node

A Data Node stores data as blocks. It is also known as a slave node: it holds the actual data in HDFS and serves read and write requests from clients. Data Nodes run as slave daemons. Every Data Node sends a heartbeat message to the Name Node every 3 seconds (the default interval) to signal that it is alive. If the Name Node receives no heartbeat from a Data Node for the configured timeout (10 minutes 30 seconds with default settings), it marks that Data Node as dead and starts re-replicating its blocks onto other Data Nodes.
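
The dead-node timeout is not a fixed constant. In stock HDFS it is derived from two settings as 2 × the recheck interval plus 10 × the heartbeat interval, which gives 630 seconds with the defaults. The Python sketch below illustrates that calculation and a simple liveness check. The property names in the comments are the Hadoop 2.x names; the function names, node names, and timestamps are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3   # dfs.heartbeat.interval default (seconds)
RECHECK_INTERVAL_S = 300   # dfs.namenode.heartbeat.recheck-interval default (5 minutes)


def dead_node_timeout(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a Data Node is considered dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return nodes whose most recent heartbeat is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout_s)


timeout = dead_node_timeout()
# Hypothetical timestamps (seconds) of the last heartbeat seen from each node
last_seen = {"dn1": 1000.0, "dn2": 995.0, "dn3": 300.0}
dead = find_dead_nodes(last_seen, now=1000.0, timeout_s=timeout)
print(timeout, dead)
```

Here only `dn3` has been silent for longer than the timeout, so only its blocks would be scheduled for re-replication.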

Job Tracker

The Job Tracker receives requests for MapReduce execution from the client. It asks the Name Node for the location of the data to be processed, and the Name Node responds with the metadata for that data.
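
One reason the Job Tracker needs block locations is so that work can run close to the data. The Python sketch below shows that idea under simplifying assumptions: it prefers a node that stores the block and falls back to any node with a free slot. The function `assign_task`, the slot counts, and the node names are hypothetical, and the real scheduler is considerably more involved.

```python
def assign_task(block_id, block_locations, free_slots):
    """Pick a node for a map task, preferring one that stores the block.

    Returns (node, is_local). Decrements the chosen node's free slot count.
    """
    # First choice: a node holding a replica that still has a free slot
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Fallback: any node with a free slot (data is then read over the network)
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


# Locations as they might be reported by the Name Node (hypothetical values)
block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn3"]}
free_slots = {"dn1": 0, "dn2": 1, "dn3": 0, "dn4": 2}

first = assign_task("blk_1", block_locations, free_slots)
second = assign_task("blk_2", block_locations, free_slots)
print(first, second)
```

The first task lands on a node that already holds its block, while the second has to run remotely because the only node storing `blk_2` has no free slot.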

Task Tracker

The Task Tracker is the slave node for the Job Tracker. It accepts tasks from the Job Tracker, along with the code to run, and applies that code to the file. The code applied to the file in this step is known as the Mapper.
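
As an illustration of what "applying code to the file" means, the Python sketch below runs a word-count style map function over a few lines of input. It is a plain-Python analogy, not the Hadoop MapReduce API: the names `word_count_mapper` and `run_map_task` and the sample input are hypothetical.

```python
def word_count_mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def run_map_task(lines, mapper):
    """Apply the mapper to each line of an input split and collect the pairs."""
    pairs = []
    for line in lines:
        pairs.extend(mapper(line))
    return pairs


# A hypothetical input split handed to one task
split = ["Hadoop stores data", "Hadoop processes data"]
pairs = run_map_task(split, word_count_mapper)
print(pairs)
```

Each task sees only its own split, so many such tasks can run in parallel on different nodes before their outputs are combined in a later reduce step.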


Figure: Architecture of Hadoop 1.0

How many components are present in the Hadoop 2.0 architecture?

  • Name Node
  • Secondary Name Node
  • Data Node
  • Resource Manager
  • Node Manager

In Apache Hadoop 2.0, HDFS is still used for storage, and YARN runs on top of HDFS as the resource management layer. YARN allocates cluster resources to applications through the Resource Manager and the per-node Node Managers, and keeps their work running.

Figure: Architecture of Hadoop 2.0
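
The way YARN hands out resources can be sketched as a placement problem: applications request containers of a given size, and the Resource Manager places them on Node Managers that have capacity. The Python sketch below uses a simple first-fit rule purely as an assumption for illustration. It is not one of YARN's actual schedulers, and the function name, application names, and memory figures are hypothetical.

```python
def allocate_containers(requests, node_free_mb):
    """First-fit placement of (application, memory_mb) requests onto nodes.

    Mutates `node_free_mb` to reflect the memory that remains on each node.
    """
    placements = []
    for app, mem in requests:
        for node in sorted(node_free_mb):
            if node_free_mb[node] >= mem:
                node_free_mb[node] -= mem
                placements.append((app, node))
                break
        else:
            # No node has enough free memory; the request stays pending
            placements.append((app, None))
    return placements


# Free memory reported by each Node Manager (hypothetical values, in MB)
nodes = {"nm1": 4096, "nm2": 2048}
requests = [("mapreduce-job", 3072), ("spark-job", 2048), ("storm-topology", 2048)]

placements = allocate_containers(requests, nodes)
print(placements)
print(nodes)
```

Because the allocator only sees resource requests, applications built on different programming models can share the same cluster.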

Differences between Hadoop 1.0 and Hadoop 2.0

  • Hadoop 2.x supports multi-tenancy; Hadoop 1.x does not.
  • Hadoop 1.x supports a maximum of 4,000 nodes per cluster, whereas Hadoop 2.x supports more than 10,000 nodes per cluster.
  • Hadoop 1.x supports only one namespace for managing the HDFS file system, whereas Hadoop 2.x supports multiple namespaces.
  • Hadoop 1.x supports only one programming model, MapReduce. With the YARN component, Hadoop 2.x supports multiple programming models, such as MapReduce, interactive, streaming, and graph processing, and frameworks such as Spark and Storm.
  • Hadoop 1.x does not support Microsoft Windows, whereas Hadoop 2.x added support for it.

Conclusion

Hadoop is open source software developed by the Apache Software Foundation (ASF). You can download Hadoop directly from the project website at http://hadoop.apache.org. Cloudera is a company that provides support, consulting, and management tools for Hadoop. Cloudera also offers a software distribution called Cloudera’s Distribution Including Apache Hadoop (CDH).
This article gave a high-level summary of the architecture of the Hadoop ecosystem. Each component has a distinct function in supporting the processing, analysis, and storage of big datasets in a distributed setting. Depending on the use case, different components can be added to build a Hadoop cluster that meets specific needs.