

Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed, scalable, Java-based file system that allows you to store large volumes of unstructured data. MapReduce, which is covered in the next chapter, is a framework for performing computations on the data stored in HDFS.
Since Hadoop is a system running inside an operating system, HDFS runs on top of the existing file system (for example, ext3). Hadoop is designed for sequential data access rather than random data access, which makes it well suited to handling large files: seeks are expensive operations, and the larger the file, the smaller the fraction of time Hadoop spends seeking for the next data location.

HDFS stores files in large blocks (usually 128 MB), as illustrated in Figure 4. This brings several advantages: [25]
- Easy replication, which makes HDFS fault tolerant. Each block can be replicated to multiple nodes, so in the event of a node failure the data can be restored from another node. The replication factor is usually set to 3.
- A file can be larger than any single disk in the network.
- If a file or a part of a file is smaller than the block size, only the needed space is used.
HDFS blocks are in turn stored as multiple operating system blocks on the underlying file system.
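To see this block layout in practice, Hadoop ships with the fsck tool, which reports the blocks (and the nodes holding their replicas) that make up a file. A minimal sketch, assuming a file /user/root/file.txt already exists in HDFS:
> hadoop fsck /user/root/file.txt -files -blocks -locations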
How to work with HDFS? [26]
HDFS can be manipulated through a Java API or through a command-line interface. In this chapter I will cover basic commands for manipulating HDFS through the command-line interface (my examples use an IBM InfoSphere BigInsights installation of Hadoop).
Before you start, make sure that Hadoop is running. You can start it with this command:
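The exact command depends on the Hadoop distribution; on a plain Apache Hadoop installation it is typically the start-all.sh script (this is an assumption, as distributions such as BigInsights keep their start scripts under the installation directory):
> start-all.sh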
All commands for manipulating HDFS through Hadoop's command-line interface begin with "hadoop fs", followed by the command name as an argument. This is called the File System Shell (fs) and is invoked as follows:
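> hadoop fs <command> [<args>]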
For example, the ls command for listing the contents of the user's home directory in HDFS looks like this:
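> hadoop fs -ls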
If you want to specify a path, FS shell commands take path URIs as arguments. The URI format is:
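> scheme://authority/path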
The scheme for the local file system is file, and for HDFS it is hdfs.
A path to a file stored in the root directory of the local file system could look like:
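> file:///root/file.txt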
And a path to a file stored in HDFS could be:
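> hdfs://localhost:9000/user/root/file.txt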
The command that copies a file from the local file system into HDFS can then look like this:
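> hadoop fs -put file:///root/file.txt hdfs://localhost:9000/user/root/file.txt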
Since the scheme and authority are optional (the defaults are taken from the configuration file core-site.xml in the conf directory of the Hadoop installation; in my case it is /opt/ibm/biginsights/hadoop-conf/core-site.xml), the previous command can be shortened to:
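> hadoop fs -put /root/file.txt /user/root/file.txt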
Basic File System commands [26]
HDFS is not a fully POSIX-compliant file system, but it supports many of the familiar commands: cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail, and more.
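For instance, creating a directory in HDFS and then listing its parent works much as it would on a local file system (a quick sketch; the directory name is only an example):
> hadoop fs -mkdir /user/root/data
> hadoop fs -ls /user/root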
There are also a few commands that are specific to HDFS, such as:
- copyFromLocal or put: both do the same thing, copying files from the local file system to a location on another file system (usually HDFS).
> hadoop fs -put file:///root/file.txt hdfs://localhost:9000/user/root/file.txt
- copyToLocal or get: the opposite of put.
> hadoop fs -get hdfs://localhost:9000/user/root/file.txt file:///root/file.txt
- getmerge: an enhanced form of get that can merge files from multiple locations into a single local file. It takes a source directory and a destination file as input and concatenates the files in the source directory into the destination local file.
> hadoop fs -getmerge hdfs://localhost:9000/user/root file:///root/merged-files.txt
Other commands, with more detailed explanations, can be found on Hadoop's File System Shell Guide page: http://hadoop.apache.org/docs/stable/file_system_shell.html. To view the File System Shell help inside the command line, type:
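> hadoop fs -help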
And for a specific command:
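> hadoop fs -help <command>
For example:
> hadoop fs -help ls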
Bibliography
[25] K. McDonald, "Hadoop Architecture," BigData University, 2011. [Online]. Available: http://www.bigdatauniversity.com/web/media/player.php?file=BD001EN/Videos/L02V01_hadooparchitecturepart1.mp4. [Accessed 20 March 2013].
[26] The Apache Software Foundation, "Hadoop's File System Shell Guide," [Online]. Available: http://hadoop.apache.org/docs/stable/file_system_shell.html. [Accessed 20 March 2013].