

What is Hadoop?
Hadoop is an open-source project that offers a platform for working with Big Data and helps to overcome the volume and variety challenges (for more information about Big Data challenges, see the article What is Big Data?). From the volume perspective, it allows massive amounts of data to be processed, stored and analyzed. From the variety perspective, it allows working with a mixture of structured and unstructured data, which doesn’t lend itself well to being placed in traditional tables. Hadoop works well in situations that require support for deep, computationally extensive analytics, and in situations where large amounts of data need to be stored and accessed in an efficient fashion.
IBM’s book Harness the Power of Big Data contains a very apt definition: “At a very high level, Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure.” [13]
Hadoop was inspired by Google’s work on its distributed file system (called Google File System) and the MapReduce project. Google developed these technologies in order to bring the power of parallel, distributed computing to its daily web-indexing operations. In 2006, Yahoo!’s employee Doug Cutting took these two software components and created a research project called Hadoop (the name comes from a toy elephant belonging to Doug’s son). After a while, Hadoop was turned over to the Apache Software Foundation and is now maintained as an open-source project with a global community of contributors (one significant contributor is IBM). [13] [23]
Hadoop is written in Java, and its early deployments include some of the most well-known and most technically advanced organizations, such as Facebook, Twitter, LinkedIn, Netflix, Yahoo!, and many others. [24]
From a technical perspective, Hadoop generally consists of two parts:
- Hadoop Distributed File System (HDFS)
- Programming paradigm (MapReduce)
HDFS is a file system that allows large amounts of data to be handled in parallel, distributed over multiple nodes. Instead of seeing individual computers, you see one extremely large volume where you can store your data. HDFS is a fault-tolerant system (because data is stored on multiple nodes) that allows you to use inexpensive commodity hardware – nodes don’t share memory or disks, so you can simply buy a bunch of commodity servers, put them in a rack, and run the Hadoop software on each one. In contrast to the traditional relational database world, with HDFS we don’t have to worry about a data model. [13]
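To make the idea of a single large storage volume concrete, below is a minimal sketch of writing and reading a file through Hadoop’s FileSystem Java API; the NameNode address and the file path are hypothetical placeholders, and a real client would normally pick them up from the cluster’s configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address of the NameNode (hypothetical host and port).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates them
        // across DataNodes without the client having to manage placement.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back; the client never needs to know which nodes hold the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```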
MapReduce is a programming paradigm built on the proven concept of divide and conquer. It breaks complex tasks down into small units of work and processes them in parallel. Rather than bringing all the data to the application (bringing data to the function), it is the application that gets moved to all of the nodes on which the data is stored (bringing the function to the data). Processing then occurs at each node simultaneously, and the results are then delivered back as a unified whole. [24] [23]
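The two functions are easiest to see in the classic word-count example. The following is a minimal sketch against the standard Hadoop MapReduce Java API: the map function emits (word, 1) pairs for its slice of the input, and the reduce function sums those pairs per word. The class names are illustrative, and each class would normally live in its own source file.

```java
// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// "Map" step: runs on the node holding a slice of the input and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```

```java
// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// "Reduce" step: aggregates the per-node results into the final count for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```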
Hadoop Environment
A typical Hadoop environment consists of these types of nodes: [24]
- Master node – several instances (at least two, to mitigate failure). The major elements of the master node are:
  - NameNode – provides the client with information on where in the cluster particular data is stored.
  - JobTracker – a process that interacts with client applications. It is also responsible for distributing MapReduce tasks to particular nodes within the cluster.
- Worker node – dozens or hundreds of instances. Each worker node runs:
  - DataNode – stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode’s address.
  - TaskTracker – a process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a JobTracker.
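As a small illustration of how a client application is pointed at these master processes, the sketch below builds a Hadoop Configuration by hand; the host names and ports are hypothetical, and the mapred.job.tracker property belongs to the classic (pre-YARN) Hadoop architecture described here. In practice these values usually come from the cluster’s configuration files rather than code.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterClientConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // NameNode address: the master process that knows where data lives in HDFS.
        conf.set("fs.defaultFS", "hdfs://master-node:9000");   // hypothetical host and port
        // JobTracker address: the master process that accepts and distributes MapReduce jobs
        // (classic Hadoop 1.x property; YARN-based clusters use different settings).
        conf.set("mapred.job.tracker", "master-node:9001");    // hypothetical host and port
        return conf;
    }
}
```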
How Hadoop Works
A user accesses data such as audio, video, social media content, documents, web logs, and the like, and sends it to HDFS. HDFS breaks the data into parts and loads them onto multiple nodes running on commodity hardware. Each part is replicated multiple times so that if a node fails, another node holds a copy. Once the data is loaded into the cluster, the MapReduce framework is used for analysis. The user submits a Map job (usually written in Java) that processes a specific set of data to the JobTracker. The JobTracker determines where the requested data is stored and sends the job to the appropriate nodes. When each node has finished processing its Map job, it stores the results locally. The user then starts a Reduce job through the JobTracker, in which the results of the Map job stored locally on the individual nodes are aggregated. When the Reduce job is done, the user can access the results, which can be loaded into one of a number of analytic environments for analysis. At this point the MapReduce job is complete and the data is ready for further analysis by data scientists. [23]
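In code, the submission step corresponds to a small driver program. The sketch below uses the Hadoop MapReduce Java API (Hadoop 2.x style) and reuses the WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths are supplied on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountReducer.class); // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        // Submits the job to the cluster and blocks until the map and reduce phases finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The driver is typically packaged into a jar and submitted to the cluster; from there the framework distributes the map and reduce tasks to the worker nodes that hold the data.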
Introduction to Hadoop on Bigdatauniversity.com
Video with transcription is available here.
Hadoop Components
Hadoop itself (meaning the Hadoop platform consisting of HDFS and MapReduce) is just a small part of the Hadoop world and is not sufficient on its own for effective data analysis. There are many other Hadoop components and Hadoop-related projects that are significantly important. The following list, with short descriptions drawn from Wikibon [23], covers the major ones:
- Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
- MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
- Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc. (a HiveQL query issued from Java is sketched after this list).
- Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
- HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. EBay and Facebook use HBase heavily.
- Flume: Flume is a framework for populating Hadoop with data. Agents are populated throughout one’s IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
- Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
- Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.
- Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
- Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.
- Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
- HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
- BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop’s sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
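To illustrate the Hive entry above, here is a minimal sketch of issuing a HiveQL query from Java through the HiveServer2 JDBC driver; the connection URL, credentials and the web_logs table are hypothetical, and Hive translates the query into MapReduce work behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (recent driver versions auto-register).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 host, port, database and credentials.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive_user", "");
             Statement stmt = conn.createStatement();
             // A HiveQL query over a hypothetical web_logs table; Hive compiles it to MapReduce.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```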
Conclusion
Hadoop allows enterprises to process and analyze large amounts of unstructured and semi-structured data in a cost- and time-effective manner. Behind this stands the concept that makes Hadoop so good at dealing with large amounts of data: “Hadoop spreads out the data and can handle complex computational questions by harnessing all of the available cluster processors to work in parallel.” [15] Rather than bringing the data to a central location and doing the computations there, processing occurs at each node simultaneously.
Although Hadoop is a very powerful and flexible platform, meaningful analysis of data requires highly specialized programming skills. Moreover, Hadoop is good for data at rest but doesn’t support real-time processing (data in motion).
Bibliography
[13] P. C. Zikopoulos, D. deRoos, K. Parasuraman, T. Deutsch, D. Corrigan and J. Giles, Harness the Power of Big Data: The IBM Big Data Platform, McGraw-Hill, 2013.
[15] F. J. Ohlhorst, Big Data Analytics: Turning Big Data into Big Money, New Jersey: John Wiley & Sons, Inc., 2013.
[23] J. Kelly, “Big Data: Hadoop, Business Analytics and Beyond,” 24 December 2012. [Online]. Available: http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond. [Accessed 1 February 2013].
[24] R. D. Schneider, Hadoop for Dummies, Special Edition, Mississauga: John Wiley & Sons, Inc., 2012.