HDFS Read and Write Operations Explained – Hadoop Tutorial
[Figure: HDFS write operation pipeline showing a client writing data blocks to DataNodes with replication]
The Hadoop Distributed File System (HDFS) is the backbone of Hadoop, enabling the storage and processing of massive datasets across distributed clusters. To fully understand how Hadoop works, it’s important to learn how HDFS read and write operations function. These operations define how data is stored, replicated, and retrieved in a reliable and fault-tolerant way.
In this Hadoop tutorial, we will explain step-by-step how HDFS read and write operations are performed, along with their architecture and workflow.
The write operation in HDFS is designed to ensure reliability and fault tolerance. Instead of writing an entire file to a single machine, HDFS breaks it down into blocks (default size: 128 MB) and distributes them across multiple DataNodes with replication (default: 3 copies).
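As a quick illustration, the sketch below (assuming the standard Hadoop client library, e.g. the hadoop-client Maven artifact, is on the classpath) reads these two settings, dfs.blocksize and dfs.replication, from the client-side configuration. The fallback values passed as second arguments mirror the defaults mentioned above, and the class name is purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath if present.
        Configuration conf = new Configuration();

        // Block size and replication factor; the second argument is the
        // fallback used when the property is not set in the cluster config.
        long blockSize = conf.getLong("dfs.blocksize", 128 * 1024 * 1024L);
        int replication = conf.getInt("dfs.replication", 3);

        System.out.println("Block size (bytes): " + blockSize);
        System.out.println("Replication factor: " + replication);
    }
}
```

Both settings can also be overridden per file when it is created, so individual datasets may use a larger block size or a different replication factor than the cluster-wide defaults.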
The write pipeline proceeds in the following steps:
1. Client Request: The client contacts the NameNode to create a new file in HDFS. The NameNode checks whether the file already exists and whether the client has the necessary permissions.
2. Block Allocation: The NameNode allocates DataNodes to store the blocks of the file, and the client receives the list of DataNodes for the first block.
3. Data Writing: The client writes the first block of data to the first DataNode. That DataNode immediately pipelines the block to the second DataNode, which then forwards it to the third DataNode, ensuring replication.
4. Acknowledgement: Once all replicas are written successfully, acknowledgements flow back to the client. The process then repeats for subsequent blocks until the file is fully written, as the client-side sketch below illustrates.
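To see this pipeline from the application's point of view, here is a minimal write sketch using Hadoop's public FileSystem API. The NameNode address (hdfs://namenode:8020) and the target path (/user/demo/sample.txt) are placeholders for your own cluster, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at your cluster; hostname and port here
        // are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // Hypothetical target path used for illustration.
        Path target = new Path("/user/demo/sample.txt");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(target, true)) {
            // The application only writes to a stream; splitting the data into
            // blocks and pipelining replicas across DataNodes happens underneath.
            out.write("Hello HDFS write pipeline\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Notice that the application simply writes to an output stream; splitting the data into blocks, choosing DataNodes, and pipelining replicas all happen inside the HDFS client and the DataNodes, exactly as described in the steps above.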
The read operation in HDFS is optimized for high throughput and parallel processing. When a client requests a file, HDFS retrieves its data blocks from the DataNodes based on the metadata stored in the NameNode.
1. Client Request: The client sends a request to the NameNode to read a file.
2. Block Location Retrieval: The NameNode looks up the metadata and returns the list of DataNodes storing the file's blocks; the client is directed to the nearest replica to optimize performance.
3. Data Retrieval: The client contacts those DataNodes directly to fetch the required data blocks. If a DataNode is unavailable, the client retrieves the data from another replica.
4. Data Assembly: The client assembles the blocks into the complete file, as in the read sketch below.
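The read path looks similar from the client's side. The sketch below reuses the same placeholder address and assumes the file written in the earlier example exists; it streams the file's contents to standard output.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        // Hypothetical path written by the earlier example.
        Path source = new Path("/user/demo/sample.txt");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(source)) {
            // Copy the file to stdout; each block is fetched from a DataNode
            // and served to the caller as one continuous stream.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

The input stream hides block boundaries: as the client reads, the HDFS client library locates each block (preferring nearby replicas) and presents the reassembled file as one continuous stream.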
In summary, here is how the two operations compare:

| Feature | Write Operation | Read Operation |
|---|---|---|
| Interaction | Client → NameNode → DataNodes | Client → NameNode → DataNodes |
| Data Flow | Client → DataNode1 → DataNode2 → DataNode3 | DataNodes → Client |
| Replication | Ensures 3 replicas of each block (by default) | Reads from the nearest available replica |
| Acknowledgement | Flows from the DataNodes back to the client | Requested blocks are sent directly to the client |
Together, these read and write mechanics give HDFS its key strengths:
- Fault Tolerance: Replication ensures data availability even if a DataNode fails.
- Scalability: HDFS can handle petabytes of data across thousands of nodes.
- High Throughput: The design is optimized for batch processing and large datasets.
- Data Locality: Clients fetch data from the nearest DataNode to reduce latency; the sketch below shows how block locations can be inspected.
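To make data locality concrete, the following sketch (same placeholder address and a hypothetical file path) asks the NameNode which hosts hold each block's replicas. This is the same location metadata that lets clients and job schedulers favor a nearby DataNode when reading.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block, listing the DataNodes holding replicas.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```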
The HDFS read and write operations are fundamental to Hadoop’s ability to store and process big data efficiently. By splitting files into blocks, replicating them across DataNodes, and using the NameNode for metadata management, HDFS ensures fault tolerance, scalability, and high throughput. Understanding these operations is essential for Hadoop developers and big data engineers.