

Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed, scalable, Java-based file system that allows you to store large volumes of unstructured data. MapReduce, which is covered in the next chapter, is a framework for performing computations on the data stored in HDFS.
Since Hadoop is a system running inside an operating system, HDFS runs on top of the existing file system (for example, ext3). Hadoop is designed for sequential data access rather than random data access, which makes it well suited to handling large files: seeks are expensive operations, and the larger the file, the smaller the fraction of time Hadoop spends seeking for the next data location.

HDFS stores files in large blocks (usually 128 MB), as illustrated in Figure 4. This brings several advantages: [25]
- Easy replication, which makes HDFS fault tolerant. Each block can be replicated to multiple nodes, so in the event of a node failure the data can be restored from another node. The replication factor is usually set to 3.
- A file can be larger than any single disk in the network.
- If a file or a part of a file is smaller than the block size, only the needed space is used.
HDFS blocks are in turn stored as multiple operating system blocks on the underlying file system.
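To see this block layout in practice, Hadoop ships with the fsck tool, which reports the blocks (and the nodes holding their replicas) that make up a file. A minimal sketch, assuming a file /user/root/file.txt already exists in HDFS:
> hadoop fsck /user/root/file.txt -files -blocks -locations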
How to work with HDFS? [26]
HDFS can be manipulated through a Java API or through a command-line interface. In this chapter I will cover basic commands for manipulating HDFS through the command-line interface (my examples use an IBM InfoSphere BigInsights installation of Hadoop).
Before you start, make sure that Hadoop is running. You can start it with this command:
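The exact command depends on the Hadoop distribution; on a plain Apache Hadoop installation it is typically the start-all.sh script (this is an assumption, as distributions such as BigInsights keep their start scripts under the installation directory):
> start-all.sh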
All commands for manipulating HDFS through Hadoop's command-line interface begin with "hadoop fs", followed by the command name as an argument. This is called the File System Shell (fs) and is invoked as follows:
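> hadoop fs <command> [<args>]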
For example, the ls command for listing the contents of the user's home directory in HDFS looks like this:
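> hadoop fs -ls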
If you want to specify a path, FS shell commands take path URIs as arguments. The URI format is:
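> scheme://authority/path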
The scheme for the local file system is file, and for HDFS it is hdfs.
A path to a file stored in the root directory of the local file system could look like:
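> file:///root/file.txt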
And a path to a file stored in HDFS could be:
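> hdfs://localhost:9000/user/root/file.txt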
The command that copies a file from the local file system into HDFS can then look like this:
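> hadoop fs -put file:///root/file.txt hdfs://localhost:9000/user/root/file.txt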
Since the scheme and authority are optional (the defaults are taken from the configuration file core-site.xml in the conf directory of the Hadoop installation; in my case it is /opt/ibm/biginsights/hadoop-conf/core-site.xml), the previous command can be shortened to:
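> hadoop fs -put /root/file.txt /user/root/file.txt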
Basic File System commands [26]
HDFS is not a fully POSIX-compliant file system, but it supports many of the familiar commands: cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail, and more.
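For instance, creating a directory in HDFS and then listing its parent works much as it would on a local file system (a quick sketch; the directory name is only an example):
> hadoop fs -mkdir /user/root/data
> hadoop fs -ls /user/root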
There are also a few commands that are specific to HDFS, such as:
- copyFromLocal or put: both do the same thing, copying files from the local file system to a location on another file system (usually HDFS).
> hadoop fs -put file:///root/file.txt hdfs://localhost:9000/user/root/file.txt
- copyToLocal or get: the opposite of put.
> hadoop fs -get hdfs://localhost:9000/user/root/file.txt file:///root/file.txt
- getmerge: an enhanced form of get that can merge files from multiple locations into a single local file. It takes a source directory and a destination file as input and concatenates the files in the source directory into the destination local file.
> hadoop fs -getmerge hdfs://localhost:9000/user/root file:///root/merged-files.txt
Other commands, with more detailed explanations, can be found on Hadoop's File System Shell Guide page: http://hadoop.apache.org/docs/stable/file_system_shell.html. To view the File System Shell help inside the command line, type:
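> hadoop fs -help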
And for a specific command:
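> hadoop fs -help <command>
For example:
> hadoop fs -help ls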
Bibliography
[25] K. McDonald, "Hadoop Architecture," BigData University, 2011. [Online]. Available: http://www.bigdatauniversity.com/web/media/player.php?file=BD001EN/Videos/L02V01_hadooparchitecturepart1.mp4. [Accessed 20 March 2013].
[26] The Apache Software Foundation, "Hadoop's File System Shell Guide," [Online]. Available: http://hadoop.apache.org/docs/stable/file_system_shell.html. [Accessed 20 March 2013].