HDFS Read and Write Operations Explained – Hadoop Tutorial
[Figure: HDFS write operation pipeline showing a client writing data blocks to DataNodes with replication]
The Hadoop Distributed File System (HDFS) is the backbone of Hadoop, enabling the storage and processing of massive datasets across distributed clusters. To fully understand how Hadoop works, it’s important to learn how HDFS read and write operations function. These operations define how data is stored, replicated, and retrieved in a reliable and fault-tolerant way.
In this Hadoop tutorial, we will explain step-by-step how HDFS read and write operations are performed, along with their architecture and workflow.
The write operation in HDFS is designed to ensure reliability and fault tolerance. Instead of writing an entire file to a single machine, HDFS breaks it down into blocks (default size: 128 MB) and distributes them across multiple DataNodes with replication (default: 3 copies).
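As a quick illustration, the sketch below (assuming the standard Hadoop client library, e.g. the hadoop-client Maven artifact, is on the classpath) reads these two settings, dfs.blocksize and dfs.replication, from the client-side configuration. The fallback values passed as second arguments mirror the defaults mentioned above, and the class name is purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath if present.
        Configuration conf = new Configuration();

        // Block size and replication factor; the second argument is the
        // fallback used when the property is not set in the cluster config.
        long blockSize = conf.getLong("dfs.blocksize", 128 * 1024 * 1024L);
        int replication = conf.getInt("dfs.replication", 3);

        System.out.println("Block size (bytes): " + blockSize);
        System.out.println("Replication factor: " + replication);
    }
}
```

Both settings can also be overridden per file when it is created, so individual datasets may use a larger block size or a different replication factor than the cluster-wide defaults.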
The write pipeline proceeds in the following steps:
1. Client Request: The client contacts the NameNode to create a new file in HDFS. The NameNode checks whether the file already exists and whether the client has the necessary permissions.
2. Block Allocation: The NameNode allocates DataNodes to store the blocks of the file, and the client receives the list of DataNodes for the first block.
3. Data Writing: The client writes the first block of data to the first DataNode. That DataNode immediately pipelines the block to the second DataNode, which then forwards it to the third DataNode, ensuring replication.
4. Acknowledgement: Once all replicas are written successfully, acknowledgements flow back to the client. The process then repeats for subsequent blocks until the file is fully written, as the client-side sketch below illustrates.
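To see this pipeline from the application's point of view, here is a minimal write sketch using Hadoop's public FileSystem API. The NameNode address (hdfs://namenode:8020) and the target path (/user/demo/sample.txt) are placeholders for your own cluster, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at your cluster; hostname and port here
        // are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // Hypothetical target path used for illustration.
        Path target = new Path("/user/demo/sample.txt");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(target, true)) {
            // The application only writes to a stream; splitting the data into
            // blocks and pipelining replicas across DataNodes happens underneath.
            out.write("Hello HDFS write pipeline\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Notice that the application simply writes to an output stream; splitting the data into blocks, choosing DataNodes, and pipelining replicas all happen inside the HDFS client and the DataNodes, exactly as described in the steps above.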
The read operation in HDFS is optimized for high throughput and parallel processing. When a client requests a file, HDFS retrieves its data blocks from the DataNodes based on the metadata stored in the NameNode.
1. Client Request: The client sends a request to the NameNode to read a file.
2. Block Location Retrieval: The NameNode looks up the metadata and returns the list of DataNodes storing the file's blocks; the client is directed to the nearest replica to optimize performance.
3. Data Retrieval: The client contacts those DataNodes directly to fetch the required data blocks. If a DataNode is unavailable, the client retrieves the data from another replica.
4. Data Assembly: The client assembles the blocks into the complete file, as in the read sketch below.
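The read path looks similar from the client's side. The sketch below reuses the same placeholder address and assumes the file written in the earlier example exists; it streams the file's contents to standard output.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        // Hypothetical path written by the earlier example.
        Path source = new Path("/user/demo/sample.txt");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(source)) {
            // Copy the file to stdout; each block is fetched from a DataNode
            // and served to the caller as one continuous stream.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

The input stream hides block boundaries: as the client reads, the HDFS client library locates each block (preferring nearby replicas) and presents the reassembled file as one continuous stream.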
In summary, here is how the two operations compare:

| Feature | Write Operation | Read Operation |
|---|---|---|
| Interaction | Client → NameNode → DataNodes | Client → NameNode → DataNodes |
| Data Flow | Client → DataNode1 → DataNode2 → DataNode3 | DataNodes → Client |
| Replication | Ensures 3 replicas of each block (by default) | Reads from the nearest available replica |
| Acknowledgement | Flows from the DataNodes back to the client | Requested blocks are sent directly to the client |
Together, these read and write mechanics give HDFS its key strengths:
- Fault Tolerance: Replication ensures data availability even if a DataNode fails.
- Scalability: HDFS can handle petabytes of data across thousands of nodes.
- High Throughput: The design is optimized for batch processing and large datasets.
- Data Locality: Clients fetch data from the nearest DataNode to reduce latency; the sketch below shows how block locations can be inspected.
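To make data locality concrete, the following sketch (same placeholder address and a hypothetical file path) asks the NameNode which hosts hold each block's replicas. This is the same location metadata that lets clients and job schedulers favor a nearby DataNode when reading.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block, listing the DataNodes holding replicas.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```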
The HDFS read and write operations are fundamental to Hadoop’s ability to store and process big data efficiently. By splitting files into blocks, replicating them across DataNodes, and using the NameNode for metadata management, HDFS ensures fault tolerance, scalability, and high throughput. Understanding these operations is essential for Hadoop developers and big data engineers.