Hadoop Tutorial For Beginners

Updated: 01/20/2023 by Computer Hope

First developed by Doug Cutting and Mike Cafarella in 2005, Hadoop is licensed under the Apache License 2.0.
Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. HDFS divides each large file into fixed-size blocks, and those blocks are then distributed and stored across the slave machines in the cluster.
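The block-splitting idea can be sketched in plain Python. This is a conceptual illustration, not actual Hadoop code; the 128 MB block size used here is HDFS's default in recent versions:

```python
# Conceptual sketch: how HDFS splits a file into fixed-size blocks.
# Illustrative Python only, not Hadoop code.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                   # 3
print(blocks[-1] // (1024 * 1024))   # 44
```

Each of these blocks is what HDFS spreads across the slave machines.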
Cloudera Hadoop, Cloudera's open-source platform, is one of the most widely used distributions of Hadoop and related projects.

What is big data?

Big Data is a collection of data that is huge in volume and keeps growing exponentially over time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently. To handle it, we can use frameworks such as Apache Hadoop or distributions like Cloudera Hadoop.



Types of big data

  • Structured – data that can be stored in rows and columns, like relational data sets
  • Unstructured – data that cannot be stored in rows and columns, such as video and images
  • Semi-structured – data such as XML that can be read by both machines and humans
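To make the first two categories concrete, here is a small Python sketch (the record values are made up for illustration) that reads the same fact first as structured CSV and then as semi-structured XML, using only the standard library:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns, like a relational table.
csv_data = "name,age\nAlice,30\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))
print(rows[0]["age"])  # 30

# Semi-structured: XML carries its own tags, readable by machines and humans.
xml_data = "<person><name>Alice</name><age>30</age></person>"
person = ET.fromstring(xml_data)
print(person.find("age").text)  # 30

# Unstructured data (video, images) has no such row/column or tag structure,
# so it cannot be parsed this way at all.
```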

Advantages of Hadoop

Apache Hadoop is the most important framework for working with Big Data. Its biggest strength is scalability: it can grow from a single node to thousands of nodes seamlessly.
  • Apache Hadoop stores data in a distributed fashion, which allows the data to be processed in parallel on a cluster of nodes.
  • In short, Apache Hadoop is an open-source framework, best known for its fault tolerance and high availability.
  • Apache Hadoop clusters are scalable.
  • The Apache Hadoop framework is easy to use.
  • In HDFS, fault tolerance refers to the robustness of the system in the event of failure. HDFS is highly fault tolerant: if any machine fails, another machine containing a copy of that data automatically takes over.
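The fault-tolerance point above rests on replication: HDFS keeps several copies of each block on different machines. The idea can be sketched in Python (a simplified model; real HDFS placement is rack-aware, and the default replication factor of 3 is configurable):

```python
# Simplified model of HDFS replication: each block is copied to several
# nodes, so losing one node does not lose the data.

REPLICATION_FACTOR = 3  # HDFS default


def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    """Choose nodes for a block's replicas.

    Round-robin here for illustration; real HDFS also considers racks.
    """
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]


def surviving_replicas(replicas, failed_node):
    """Replicas still readable after one node fails."""
    return [n for n in replicas if n != failed_node]


nodes = ["node1", "node2", "node3", "node4"]
replicas = place_replicas(0, nodes)
print(replicas)                               # ['node1', 'node2', 'node3']
print(surviving_replicas(replicas, "node2"))  # ['node1', 'node3']
```

Because two copies survive the failure of node2, reads continue uninterrupted while HDFS re-replicates the lost copy in the background.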

Tools you will need to install

To get started with Hadoop, you need to download and install it on your machines. You can use Apache Hadoop, or choose from distributions like Cloudera, Hortonworks, or MapR. It's important to note that the actual installation steps vary based on your operating system.

  • Download the Hadoop distribution from the official Apache Hadoop website.
  • Download and install the JDK from the Oracle website or use OpenJDK.

Conclusion:

Hadoop is a framework based on Java programming. It is designed to scale from a single server to thousands of machines, each offering local computation and storage, and it supports large data sets in a distributed computing environment. In the next article, we discuss Hadoop's components and architecture.