Saturday, 1 July 2017

Hadoop HDFS Interview Questions

1. What is HDFS?
Hadoop Distributed File System (HDFS) is a Java-based distributed file system that stores big data across multiple nodes in a Hadoop cluster. So, if I install Hadoop, I get HDFS as the underlying storage system for storing huge data sets in a distributed environment.
2. What are the key features of HDFS?
NOTE: You should also explain the features briefly while listing different HDFS features. 
Some of the prominent features of HDFS are as follows:
  • Cost effective and Scalable: HDFS is generally deployed on commodity hardware, so it is very economical in terms of the cost of ownership of the project. Also, one can scale the cluster by adding more nodes.
  • Variety and Volume of Data: HDFS is all about storing huge amounts of data (terabytes and petabytes) of different kinds. So, I can store any type of data in HDFS, be it structured, unstructured or semi-structured.
  • Reliability and Fault Tolerance: HDFS divides the given data into data blocks, replicates them and stores them in a distributed fashion across the Hadoop cluster. This makes HDFS very reliable and fault tolerant.
  • High Throughput: Throughput is the amount of work done in a unit time. HDFS provides high throughput access to application data.
3. Explain the HDFS Architecture and list the various HDFS daemons in an HDFS cluster.
While listing the various HDFS daemons, you should also talk about their roles in brief. Here is how you should answer this question:
Apache Hadoop HDFS Architecture follows a Master/Slave topology where a cluster comprises a single NameNode (master node or daemon) and all the other nodes are DataNodes (slave nodes or daemons). The following daemons run in an HDFS cluster:
  • NameNode: It is the master daemon that maintains and manages the data blocks present on the DataNodes.
  • DataNode: DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware and is responsible for storing the data as blocks.
  • Secondary NameNode: The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. It performs checkpointing.
4. What is checkpointing in Hadoop?
Checkpointing is the process of combining the Edit Logs with the FsImage (File system Image). It is performed by the Secondary NameNode.
5. What is a NameNode in Hadoop?
The NameNode is the master node that manages all the DataNodes (slave nodes). It records the metadata regarding all the files stored in the cluster (on the DataNodes), e.g. the location of the blocks, the size of the files, permissions, hierarchy, etc.
6. What is a DataNode?
DataNodes are the slave nodes in HDFS. A DataNode is commodity hardware that provides storage for the data. It serves the read and write requests of the HDFS client.
7. Is the NameNode machine the same as the DataNode machine in terms of hardware?
Unlike the DataNodes, a NameNode is a highly available server that manages the File System Namespace and maintains the metadata information. Therefore, the NameNode requires more RAM for holding in memory the metadata corresponding to the millions of HDFS files, whereas a DataNode needs a higher disk capacity for storing huge data sets.
8. What is the difference between NAS (Network Attached Storage) and HDFS?
Here are the key differences between NAS and HDFS:
  • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides a service for storing and accessing files. Hadoop Distributed File System (HDFS), on the other hand, is a distributed file system that stores data using commodity hardware.
  • In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware.
  • HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
  • HDFS uses commodity hardware, which is cost effective, whereas a NAS is a high-end storage device that comes at a high cost.
9. What is the difference between traditional RDBMS and Hadoop?
This question seems to be very easy, but in an interview these simple questions matter a lot. So, here is how you can answer this question:
Data Types
  • RDBMS: RDBMS relies on structured data, and the schema of the data is always known.
  • Hadoop: Any kind of data can be stored in Hadoop, be it structured, unstructured or semi-structured.
Processing
  • RDBMS: RDBMS provides limited or no processing capabilities.
  • Hadoop: Hadoop allows us to process the data, which is distributed across the cluster, in a parallel fashion.
Schema on Read Vs. Write
  • RDBMS: RDBMS is based on ‘schema on write’, where schema validation is done before loading the data.
  • Hadoop: On the contrary, Hadoop follows the schema on read policy.
Read/Write Speed
  • RDBMS: In RDBMS, reads are fast because the schema of the data is already known.
  • Hadoop: Writes are fast in HDFS because no schema validation happens during an HDFS write.
Cost
  • RDBMS: Licensed software; therefore, I have to pay for the software.
  • Hadoop: Hadoop is an open source framework. So, I don’t need to pay for the software.
Best Fit Use Case
  • RDBMS: RDBMS is used for OLTP (Online Transactional Processing) systems.
  • Hadoop: Hadoop is used for data discovery, data analytics or OLAP systems.

10. What is throughput? How does HDFS provide good throughput?
Throughput is the amount of work done in a unit time. HDFS provides good throughput because:
  • HDFS is based on the Write Once Read Many model. This simplifies data coherency issues, because data written once can’t be modified, and therefore allows high-throughput data access.
  • In Hadoop, the computation is moved towards the data, which reduces network congestion and therefore enhances the overall system throughput.
11. What is Secondary NameNode? Is it a substitute or back up node for the NameNode?
Here, you should also mention the function of the Secondary NameNode while answering the latter part of this question so as to provide clarity:
A Secondary NameNode is a helper daemon that performs checkpointing in HDFS. No, it is not a backup or a substitute node for the NameNode. It periodically takes the Edit Logs (metadata file) from the NameNode and merges them with the FsImage (File system Image) to produce an updated FsImage and to prevent the Edit Logs from becoming too large.
12. What do you mean by metadata in HDFS? List the files associated with metadata.
The metadata in HDFS represents the structure of HDFS directories and files. It also includes information about those directories and files such as ownership, permissions, quotas, and replication factor.
NOTE: While listing the files associated with metadata, give a one line definition of each metadata file.
There are two files associated with metadata present in the NameNode:
  • FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
  • EditLogs: It contains all the recent modifications made to the file system with respect to the recent FsImage.
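How the two files relate during checkpointing can be illustrated with a toy model. This is only a sketch and not HDFS code: the dictionary-based namespace, the operation tuples and the function name `checkpoint` are all assumptions made for illustration.

```python
def checkpoint(fsimage, edit_log):
    """Toy checkpoint: replay edit-log operations onto a copy of the FsImage.

    fsimage  -- dict mapping path -> metadata dict (hypothetical format)
    edit_log -- list of (op, path, metadata) tuples (hypothetical format)
    Returns the merged FsImage and a fresh, empty edit log.
    """
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = dict(meta)
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "set_replication":
            new_image[path]["replication"] = meta["replication"]
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image, []


fsimage = {"/a.txt": {"replication": 3}}
edits = [
    ("create", "/b.txt", {"replication": 3}),
    ("set_replication", "/a.txt", {"replication": 2}),
    ("delete", "/b.txt", None),
]
merged, remaining_edits = checkpoint(fsimage, edits)
```

After the merge, the new image reflects every logged change and the edit log starts again from empty, which is why checkpointing keeps the Edit Logs from growing without bound.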
13. What is the problem in having lots of small files in HDFS?
As we know, the NameNode stores the metadata of the file system in RAM. Therefore, the amount of memory limits the number of files in my HDFS file system. In other words, too many files generate too much metadata, and holding all of that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes 150 bytes.
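To see what the 150-byte rule of thumb implies, here is a rough back-of-the-envelope estimate. The function name and the example file counts are assumptions chosen for illustration, and real NameNode memory usage will differ.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure for one file, block or directory


def namenode_memory_bytes(num_files, blocks_per_file=1, num_directories=0):
    """Estimate NameNode heap needed for the given namespace (rough sketch)."""
    objects = num_files + num_files * blocks_per_file + num_directories
    return objects * BYTES_PER_OBJECT


# Ten million 1 MB files, each occupying a single block.
small_files = namenode_memory_bytes(10_000_000)

# The same 10,000,000 MB of data packed into 128 MB files of one block each.
packed_files = namenode_memory_bytes(10_000_000 // 128)
```

Under these assumptions the small files need about 3 GB of NameNode memory, while the packed layout needs roughly 23 MB for the same amount of data.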
14. What is a heartbeat in HDFS?
Heartbeats in HDFS are the signals that DataNodes send to the NameNode to indicate that they are functioning properly (alive). By default, the heartbeat interval is 3 seconds, which can be configured using dfs.heartbeat.interval in hdfs-site.xml.
15. How would you check whether your NameNode is working or not?
There are many ways to check the status of the NameNode. Most commonly, one uses the jps command to check the status of all the daemons running in the HDFS. Alternatively, one can visit the NameNode’s Web UI for the same. 
16. What is a block?
You should begin the answer with a general definition of a block. Then, you should explain in brief about the blocks present in HDFS and also mention their default size. 
Blocks are the smallest contiguous locations on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. The default block size in HDFS is 128 MB (Hadoop 2.x) or 64 MB (Hadoop 1.x), which is much larger than in a Linux file system, where the block size is 4 KB. The reason for this huge block size is to minimize the cost of seeks and to reduce the metadata generated per block.
17. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will be the size of each block?
The default block size in Hadoop 2.x is 128 MB. So, a file of size 514 MB will be divided into 5 blocks (514 MB / 128 MB, rounded up), where the first four blocks will be 128 MB each and the last block will be only 2 MB. Since we are using the default replication factor, i.e. 3, each block will be stored three times. Therefore, we will have 15 blocks in total, where 12 blocks will be of size 128 MB each and 3 blocks of size 2 MB each.
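The arithmetic above can be checked with a short sketch. The function names are hypothetical and the code only mirrors the calculation; it does not talk to HDFS.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


def total_replicas(file_size, block_size=128 * MB, replication=3):
    """Number of block replicas stored in the cluster for one file."""
    return len(split_into_blocks(file_size, block_size)) * replication


blocks = split_into_blocks(514 * MB)
replicas = total_replicas(514 * MB)
```

Note that the last block occupies only 2 MB on disk, not a full 128 MB.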
18. How do you copy a file into HDFS with a block size different from the existing block size configuration?
NOTE: You should start the answer with the command for changing the block size and then explain the whole procedure with an example. This is how you should answer this question:
One can copy a file into HDFS with a different block size by using ‘-Ddfs.blocksize=block_size’, where block_size is specified in bytes.
Let me explain it with an example: Suppose I want to copy a file called test.txt of size, say, 120 MB into HDFS, and I want the block size for this file to be 32 MB (33554432 bytes) instead of the default (128 MB). So, I would issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs
Now, I can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Alternatively, I can browse the HDFS directory in the NameNode web UI.
19. Can you change the block size of HDFS files?
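Yes, I can change the block size of HDFS files by changing the default block size parameter (dfs.blocksize) in hdfs-site.xml. The new value applies only to files written after the change; files already stored in HDFS keep the block size they were written with. But, I will have to restart the cluster for this property change to take effect.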
20. What is a block scanner in HDFS?
Block scanner runs periodically on every DataNode to verify whether the data blocks stored are correct or not. The following steps will occur when a corrupted data block is detected by the block scanner:
  • First, the DataNode will report about the corrupted block to the NameNode.
  • Then, NameNode will start the process of creating a new replica using the correct replica of the corrupted block present in other DataNodes.
  • The corrupted data block will not be deleted until the replication count of the correct replicas matches with the replication factor (3 by default).
This whole process allows HDFS to maintain the integrity of the data when a client performs a read operation. One can check the block scanner report using the DataNode’s web interface: localhost:50075/blockScannerReport
21. HDFS stores data using commodity hardware, which has higher chances of failure. So, how does HDFS ensure the fault tolerance of the system?
NOTE: Basically, this question is regarding replication of blocks in Hadoop and how it helps in providing fault tolerance.  
HDFS provides fault tolerance by replicating the data blocks and distributing the replicas among different DataNodes across the cluster. By default, the replication factor is set to 3, and it is configurable. So, if I store a file of 1 GB in HDFS with the default replication factor of 3, it will finally occupy a total space of 3 GB because of the replication. Now, even if a DataNode fails or a data block gets corrupted, I can retrieve the data from the other replicas stored on different DataNodes.
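Both points, the extra space and the protection it buys, can be shown with a toy model. The block IDs, DataNode names and function names below are made up for illustration; this is a sketch of the idea, not of how HDFS tracks replicas internally.

```python
def raw_storage(logical_size, replication=3):
    """Raw cluster space consumed by a file once every block is replicated."""
    return logical_size * replication


def readable_blocks(placement, failed_nodes):
    """Return the block IDs that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, nodes in placement.items() if set(nodes) - failed}


# Hypothetical placement of two blocks, three replicas each.
placement = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
after_one_failure = readable_blocks(placement, ["dn2"])
after_three_failures = readable_blocks(placement, ["dn1", "dn2", "dn3"])
```

With one DataNode down, both blocks remain readable. A block becomes unavailable only when every node holding one of its replicas has failed, as happens to blk_1 in the second case.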
22. Replication causes data redundancy and consumes a lot of space, so why is it pursued in HDFS?
Replication is pursued in HDFS to provide fault tolerance. And, yes, it will lead to the consumption of a lot of space, but one can always add more nodes to the cluster if required. By the way, in practical clusters, it is very rare to have free space issues, as the very first reason to deploy HDFS was to store huge data sets. Also, one can change the replication factor to save HDFS space, or use one of the codecs provided by Hadoop to compress the data.
23. Can we have a different replication factor for the existing files in HDFS?
NOTE: You should always answer this type of question with an example to provide clarity.
Yes, one can have a different replication factor for the files existing in HDFS. Suppose I have a file named test.xml stored within the sample directory in my HDFS with the replication factor set to 1. Now, the command for changing the replication factor of the test.xml file to 3 is:
hadoop fs -setrep -w 3 /sample/test.xml
Finally, I can check whether the replication factor has been changed by using the following command:
hadoop fs -ls /sample
or 
hadoop fsck /sample/test.xml -files
24. What is a rack awareness algorithm and why is it used in Hadoop?
The Rack Awareness algorithm in Hadoop ensures that the replicas of a block are not all stored on a single rack. Considering a replication factor of 3, the Rack Awareness algorithm says that the first replica of a block will be stored on the local rack and the next two replicas will be stored on a different (remote) rack, each on a different DataNode within that remote rack. There are two reasons for using Rack Awareness:
  • To improve the network performance: In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks. So, Rack Awareness helps to reduce write traffic between different racks and thus provides better write performance.
  • To prevent loss of data: I don’t have to worry about the data even if an entire rack fails because of a switch failure or power failure. And if one thinks about it, it makes sense, as the saying goes: never put all your eggs in the same basket.
25. How is data or a file written into HDFS?
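The placement rule described above can be sketched in a few lines. This is a simplified, deterministic illustration under assumed names (the topology dictionary, rack and DataNode names, and `place_replicas` are hypothetical); the real placement policy also takes node load and randomness into account.

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes for a block following the rule described above.

    topology    -- dict mapping rack name -> list of DataNode names
    writer_node -- DataNode on which the writing client runs
    """
    # First replica: the writer's own node, i.e. the local rack.
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # Second and third replicas: two different nodes on one remote rack.
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas_for_dn1 = place_replicas(topology, "dn1")
```

Only one copy of the block crosses the rack boundary, yet the loss of either rack still leaves at least one replica available.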
The best way to answer this question is to take an example of a client and list the steps that will happen while performing the write without going into much of the details:
Suppose a client wants to write a file into HDFS. So, the following steps will be performed internally during the whole HDFS write process:
  • The client will divide the file into blocks and will send a write request to the NameNode.
  • For each block, the NameNode will provide the client a list containing the IP address of DataNodes (depending on replication factor, 3 by default) where the data block has to be copied eventually.
  • The client will copy the first block into the first DataNode and then the other copies of the block will be replicated by the DataNodes themselves in a sequential manner.
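The last step, where DataNodes forward the block to one another, can be pictured with a small simulation. It is a sketch only: the function name, the `storage` dictionary and the DataNode names are assumptions, and real HDFS streams the block in packets rather than as one piece.

```python
def write_block(block_id, data, pipeline, storage):
    """Store a block on the first DataNode, which forwards it down the pipeline.

    pipeline -- ordered list of DataNode names supplied by the NameNode
    storage  -- dict mapping DataNode name -> {block_id: data}
    Returns the acknowledgements in the order they travel back to the client.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, {})[block_id] = data
    # The current DataNode forwards the block to the next one in the list.
    downstream_acks = write_block(block_id, data, rest, storage)
    # Its own acknowledgement follows those of the nodes after it.
    return downstream_acks + [head]


storage = {}
acks = write_block("blk_1", b"hello hdfs", ["dn1", "dn2", "dn3"], storage)
```

The client sends the data only once; the remaining copies are made by the DataNodes themselves, and the acknowledgements flow back in the reverse order.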
26. Can you modify the file present in HDFS?
No, I cannot modify the files already present in HDFS, as HDFS follows Write Once Read Many model. But, I can always append data into the existing HDFS file.
27. Can multiple clients write into an HDFS file concurrently?
No, multiple clients can’t write into an HDFS file concurrently. HDFS follows the single writer, multiple reader model. The client which opens a file for writing is granted a lease by the NameNode. Now suppose that, in the meantime, some other client wants to write into that same file and asks the NameNode for write permission. The NameNode will first check whether the lease for writing into that file has already been granted to someone else. If the lease is held by another client that is currently writing into the file, the NameNode will reject the write request.
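The lease check can be modelled with a minimal sketch. The class and method names are hypothetical, and lease expiry and renewal are left out on purpose.

```python
class LeaseManager:
    """Toy model of the NameNode's write leases (one writer per file)."""

    def __init__(self):
        self._leases = {}  # path -> client currently holding the write lease

    def open_for_write(self, path, client):
        """Grant the lease if the file is free; reject other writers."""
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            return False
        self._leases[path] = client
        return True

    def close(self, path, client):
        """Release the lease when its holder closes the file."""
        if self._leases.get(path) == client:
            del self._leases[path]


manager = LeaseManager()
first = manager.open_for_write("/logs/app.log", "client_a")
second = manager.open_for_write("/logs/app.log", "client_b")
manager.close("/logs/app.log", "client_a")
third = manager.open_for_write("/logs/app.log", "client_b")
```

Here the second request is rejected while client_a holds the lease, and it succeeds once client_a has closed the file.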
29. Does HDFS allow a client to read a file which is already opened for writing?
Basically, the intent of asking this question is to know about the constraints associated with reading a file which is currently being written by some client. You may answer this question in the following manner:
Yes, one can read a file which is already opened for writing. But the problem in reading a file which is currently being written lies in the consistency of the data: HDFS does not guarantee that the data which has been written into the file will be visible to a new reader before the file has been closed. To get this guarantee, one can call the hflush operation explicitly. It pushes all the data in the buffer into the write pipeline and then waits for the acknowledgements from the DataNodes. Hence, all the data written into the file before the hflush operation is guaranteed to be visible to readers.
30. Define Data Integrity. How does HDFS ensure the integrity of the data blocks stored in HDFS?
Data Integrity talks about the correctness of the data. It is very important for us to have a guarantee or assurance that the data stored in HDFS is correct. However, there is always a slight chance that the data will get corrupted during I/O operations on the disk. By default, HDFS creates a checksum for all the data written to it and verifies the data against the checksum during read operations. Also, each DataNode runs a block scanner periodically, which verifies the correctness of the data blocks stored in HDFS.
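The checksum idea can be demonstrated with a short sketch. It uses Python's `zlib.crc32` as a stand-in checksum and an assumed chunk size of 512 bytes; the function names are hypothetical and this is not the exact on-disk format HDFS uses.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size assumed for this sketch


def compute_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC32 checksum per chunk of data, as done on write."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, stored_checksums, chunk_size=BYTES_PER_CHECKSUM):
    """Recompute the checksums on read and return the indexes that differ."""
    current = compute_checksums(data, chunk_size)
    return [
        index
        for index, (now, stored) in enumerate(zip(current, stored_checksums))
        if now != stored
    ]


data = bytes(range(256)) * 8          # 2048 bytes, i.e. four chunks
stored = compute_checksums(data)      # saved alongside the data at write time

corrupted = bytearray(data)
corrupted[600] ^= 0xFF                # flip one byte inside the second chunk
bad_chunks = find_corrupt_chunks(bytes(corrupted), stored)
```

Keeping one checksum per small chunk, rather than one per block, lets a reader pinpoint the damaged region and fetch that data from another replica.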
31. What do you mean by the High Availability of a NameNode? How is it achieved?
The NameNode used to be a single point of failure in Hadoop 1.x, where the whole Hadoop cluster became unavailable as soon as the NameNode went down. In other words, High Availability of the NameNode is about the need for a NameNode to always be active to serve the requests of Hadoop clients.
To solve this single point of failure problem, the HA feature was introduced in Hadoop 2.x, where we have two NameNodes in our HDFS cluster in an active/passive configuration. Hence, if the active NameNode fails, the passive NameNode can take over the responsibility of the failed NameNode and keep HDFS up and running.
32. Define Hadoop Archives. What is the command for archiving a group of files in HDFS?
Hadoop Archive was introduced to cope with the problem of increasing memory usage on the NameNode for storing metadata because of too many small files. Basically, it allows us to pack a number of small HDFS files into a single archive file, thereby reducing the metadata. The final archive file has the .har extension, and one can consider it as a layered file system on top of HDFS.
The command for archiving a group of files:
hadoop archive -archiveName Example_archive.har -p /input/location /output/location
33. How will you perform inter-cluster data copying in HDFS?
One can perform the inter-cluster data copy by using the distributed copy command given as follows:
hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>

