Q1. What are HDFS and YARN?
Ans: HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment, and it follows a master-slave topology. YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages cluster resources and provides an execution environment to the processes.
Q2. How does NameNode tackle DataNode failures?
Ans: The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly. Each DataNode also sends a block report, which contains a list of all the blocks on that DataNode. If a DataNode fails to send a heartbeat for a specific period of time, it is marked dead, and the NameNode re-replicates its blocks onto other DataNodes.
Q3. Is there any way to change the replication of files on HDFS after they are already written to HDFS?
Ans: Yes. Setting the dfs.replication value in the $HADOOP_HOME/conf/hadoop-site.xml file changes the default replication factor, but only for new content that comes in afterwards. To change the replication of files that are already written to HDFS, use the hadoop fs -setrep command, for example: hadoop fs -setrep -w 3 /path/to/file
Q6. Which of the following has replaced JobTracker from MapReduce v1?
Ans: ResourceManager
Q7. Write the YARN commands to check the status of an application and kill an application.
Ans:
yarn application -status ApplicationID
yarn application -kill ApplicationID
Q8. What are the two types of metadata that a NameNode server holds?
Ans: The two types of metadata that a NameNode server holds are:
Metadata on disk - this contains the edit log and the FsImage.
Metadata in RAM - this contains the information about the DataNodes.
Q9. How is formatting done in HDFS?
Ans: The Hadoop Distributed File System (HDFS) is formatted using the bin/hadoop namenode -format command. This command formats HDFS via the NameNode and is only used the first time a cluster is set up. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. If you execute this command on an existing file system, you will delete all the data stored on your NameNode. Formatting the NameNode does not format the DataNodes.
Q10. Write the three modes in which Hadoop can run.
Ans:
Standalone (local) mode: the default mode, in which Hadoop runs as a single Java process on the local file system without HDFS.
Pseudo-distributed mode: all Hadoop daemons run on a single machine, simulating a cluster.
Fully distributed mode: this is the production phase of Hadoop, where data is distributed across several nodes on a Hadoop cluster and different nodes are allotted as masters and slaves.
Q11. Explain rack awareness in Hadoop.
Ans: HDFS replicates blocks onto multiple machines. In order to have higher fault tolerance against rack failures (network or physical), HDFS is able to distribute replicas across multiple racks. Hadoop obtains network topology information either by invoking a user-defined script or by loading a Java class that implements the DNSToSwitchMapping interface. It is the administrator's responsibility to choose the method, set the right configuration, and provide the implementation of said method.
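The user-defined script mentioned above receives DataNode addresses as arguments and prints one rack path per address. A minimal sketch, assuming a made-up subnet-to-rack table (the RACKS mapping and rack names here are hypothetical, not from the source):

```python
#!/usr/bin/env python3
"""Sketch of a rack-topology script for Hadoop's script-based mapping.

Hadoop invokes the configured script with one or more DataNode
IPs/hostnames as arguments and expects one rack path per argument
on stdout. The subnet-to-rack table below is a made-up example.
"""
import sys

# Hypothetical mapping: the first three octets of the IP select the rack.
RACKS = {"10.0.1": "/dc1/rack1", "10.0.2": "/dc1/rack2"}
DEFAULT_RACK = "/default-rack"

def resolve(host):
    prefix = ".".join(host.split(".")[:3])
    return RACKS.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

Hosts that match no known subnet fall back to /default-rack, which mirrors Hadoop's behavior of placing unresolvable nodes in a default rack.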
Q12. Does HDFS have a single point of failure?
Ans: Even though data is distributed amongst multiple DataNodes, the NameNode is the central authority for file metadata and replication (and, as a result, a single point of failure). The configuration parameter dfs.namenode.replication.min defines the number of replicas a block must reach in order for the write to return as successful.
Q13. Explain the input and output data format of the Hadoop framework.
Ans: The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. See the flow mentioned below:
(input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, v2> -> reduce -> <k3, v3> (output)
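The key/value flow above can be walked through in plain Python with word count as the job. This is not Hadoop itself, just a toy in-memory version of the same map -> sort/group -> reduce contract (function names are illustrative):

```python
"""Toy, in-memory walk-through of the <k, v> flow using word count."""
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    # <k1, v1> = (byte offset, line) -> list of <k2, v2> = (word, 1)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # <k2, list(v2)> -> <k3, v3> = (word, total)
    return (word, sum(counts))

def run_job(lines):
    # Map phase: apply map_fn to every input record.
    intermediate, offset = [], 0
    for line in lines:
        intermediate.extend(map_fn(offset, line))
        offset += len(line) + 1
    # Shuffle/sort phase: sort by key so equal keys are adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(intermediate, key=itemgetter(0))]
```

For input ["hello world", "hello hadoop"], run_job returns the counts sorted by word, exactly as a word-count reducer would emit them.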
Q14. Which interface needs to be implemented to create the Mapper and Reducer for Hadoop?
Ans:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
(Note: in the new org.apache.hadoop.mapreduce API these are classes that you extend; in the old org.apache.hadoop.mapred API, Mapper and Reducer are interfaces that you implement.)
Q15. Explain the shuffle.
Ans: Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
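"The relevant partition" means each reducer fetches only the mapper output assigned to it by the partitioner. A minimal sketch of that assignment, mirroring the default hash-partitioning idea (Hadoop's HashPartitioner uses the key's hashCode; crc32 stands in here as a stable hash for the demo):

```python
"""Sketch of how mapper output is split into per-reducer partitions."""
from zlib import crc32

def partition(key, num_reducers):
    # Hadoop uses key.hashCode() mod numReduceTasks; crc32 is a
    # stand-in stable hash for this demo.
    return crc32(key.encode()) % num_reducers

def group_by_partition(pairs, num_reducers):
    """Place each (key, value) pair into the partition its reducer fetches."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        parts[partition(key, num_reducers)].append((key, value))
    return parts
```

The important property is that every occurrence of a key lands in the same partition, so a single reducer sees all values for that key.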
Q16. How many instances of JobTracker can run on a Hadoop cluster?
Ans: Only one.
Q17. What is fault tolerance?
Ans: Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed; there is then no chance of getting the data in that file back. To avoid such situations, Hadoop introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well (the default replication factor is three). So even if one or two of the systems collapse, the file is still available on the third system.
Q19. What is a JobTracker?
Ans: The JobTracker is a daemon that runs on the master node for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different TaskTrackers. In a Hadoop cluster there is only one JobTracker but many TaskTrackers. It is the single point of failure for the Hadoop MapReduce service: if the JobTracker goes down, all running jobs are halted. It receives heartbeats from the TaskTrackers, based on which the JobTracker decides whether an assigned task is completed or not.
Q20. What is a TaskTracker?
Ans: A TaskTracker is also a daemon, and it runs on the DataNodes. TaskTrackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the JobTracker initializes the job, divides the work, and assigns it to different TaskTrackers to perform the MapReduce tasks. While performing these tasks, each TaskTracker simultaneously communicates with the JobTracker by sending heartbeats. If the JobTracker does not receive a heartbeat from a TaskTracker within the specified time, it assumes that the TaskTracker has crashed and assigns its tasks to another TaskTracker in the cluster.
Q22. Explain the indexing process in HDFS.
Ans: The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which further points to the address where the next part of the data chunk is stored.
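Since the "index" is driven by the block size, the per-block offsets a client uses to read a file sequentially can be sketched as below (a toy calculation, not HDFS code; 64 MB matches the classic default block size):

```python
"""Sketch: compute each block's (start offset, length) for a file."""

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size

def block_layout(file_len, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block.

    Every block is full-sized except possibly the last, which holds
    whatever remains of the file.
    """
    blocks, offset = [], 0
    while offset < file_len:
        length = min(block_size, file_len - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

A 150 MB file, for example, maps to two full 64 MB blocks plus a final 22 MB block; note the last block occupies only as much space as the remaining data.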
Q24. What happens to a NameNode that has no data?
Ans: There does not exist any NameNode without data. If it is a NameNode, then it should have some sort of data in it.
Q25. What is Hadoop streaming?
Ans: The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell scripts or executables as the Mapper or Reducer.
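A minimal word-count in the streaming style, where the mapper and reducer each read lines on stdin and emit tab-separated key/value lines on stdout. The functions are written over line iterators so the sketch also runs without a cluster:

```python
"""Word-count mapper and reducer in the Hadoop Streaming style."""
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word in the input.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming delivers each reducer's keys in sorted order, so equal
    # keys are adjacent and can be summed with groupby.
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    step = sys.argv[1] if len(sys.argv) > 1 else "map"
    lines = (l.rstrip("\n") for l in sys.stdin)
    out = mapper(lines) if step == "map" else reducer(lines)
    sys.stdout.write("\n".join(out) + "\n")
```

On a real cluster this script would be submitted through the streaming jar with -mapper and -reducer options; the jar's exact path varies by distribution, so it is omitted here.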
Q26. What is a block and block scanner in HDFS?
Ans: Block - the minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64 MB (in Hadoop 1.x; 128 MB in Hadoop 2.x).
Block Scanner - the Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the DataNode.
Q27. Explain what a heartbeat in HDFS is.
Ans: A heartbeat is a signal sent periodically from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it is considered that there is some issue with the DataNode or TaskTracker.
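The bookkeeping on the master side amounts to remembering the last heartbeat time per worker and marking any worker dead once the configured timeout elapses. A minimal sketch (the function and parameter names are illustrative, and real HDFS derives its timeout from the heartbeat interval and recheck period):

```python
"""Sketch of heartbeat-timeout detection on the master side."""

def dead_workers(last_heartbeat, now, timeout):
    """Return workers whose last heartbeat is older than the timeout.

    last_heartbeat: dict of worker name -> last heartbeat timestamp (seconds)
    """
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)
```

In production the NameNode then schedules re-replication of the dead DataNode's blocks, and the JobTracker reassigns the dead TaskTracker's tasks.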
Q28. What happens when a DataNode fails?
Ans: When a DataNode fails:
The JobTracker and the NameNode detect the failure.
All tasks on the failed node are re-scheduled.
The NameNode replicates the user's data to another node.
Q29. Explain what happens in TextInputFormat.
Ans: In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
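The byte-offset keys can be demonstrated in a few lines of Python. This is only a model of the contract described above, not Hadoop's reader (which handles split boundaries and encodings as well):

```python
"""Model of TextInputFormat's (byte offset, line) records."""

def text_input_records(data: bytes):
    """Return (offset, line) pairs, one per newline-delimited line."""
    records, offset = [], 0
    for line in data.split(b"\n"):
        # A trailing newline yields a final empty piece past the data;
        # skip it, but keep genuinely empty lines inside the file.
        if line or offset < len(data):
            records.append((offset, line.decode()))
        offset += len(line) + 1  # +1 for the newline delimiter
    return records
```

So for a file beginning "hello world\nhadoop\n", the second record's key is 12: the length of the first line plus its newline.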
Q30. Explain what Sqoop in Hadoop is.
Ans: Sqoop is a tool used to transfer data between a relational database management system (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS like MySQL or Oracle into HDFS, as well as exported from an HDFS file back to an RDBMS.