As the world becomes more digitally orientated the amount of data being generated has exploded. This has been referred to by many as the big data revolution. Examples of such data sets are transaction data, medical information and server logs. The challenge in refining these data sets into usable information is their inherent size. The solution to this is the use of distributed computation to process the information is parallel. One framework used to perform such computation is Apache Hadoop. Hadoop is an open source, distributed computation platform created by Doug Cutting. The inspiration for this project steamed form the Nutch search engine project.
The advantages of such a platform are numerous. One of the major advantages that Hadoop provides is its distributed file system HDFS (Hadoop Distributed File System). Once the data is in HDFS Hadoop handles many of the tasks related to distributed computation. For example task allocation, data locality and node failure. This makes processing your data even easier.
Map Reduce is a computation model built on top of Hadoop and is inspired by functional programming languages such as Lisp. Map Reduce makes it very easy to process large data sets through use of lists of key value pairs. Computation is split into Map and Reduce phases. The data located in HDFS is split into chunks (single lines by default) and passed to the mapping nodes. The mappers then extract the relevant data and output it in the form of key value pairs. Once the map phase is complete reduce tasks receive the values associated with a particular key, perform whatever post processing is necessary and output a series of key value pairs. A simple example of a Map Reduce job would be a job to enumerate the number of A’s in a body of text. The map phase would consist of enumerating all the A’s in a given line and output the total e.g. <’a', 14>. The reduce phase would then consist of iterating over a list of these key value pairs and outputting their sum. This is of course a very simple example. This model allows for far more complex operations.
One downside to this model is that a new job must be written every time we want to learn something new from our data set. The solution to this is another framework built on top of Hadoop called Hive. Hive is a data warehousing solution that allows queries to be made against data located in HDFS using a language called HQL(Hive Query Language). Hive Query Language is syntactically similar to SQL. These Queries are converted to Map Reduce jobs. This cuts down on developer time and allows for single run ad hoc queries.
One of the other impressive applications built on top of Hadoop is Hbase. Hbase is a no SQL database that stores data in HDFS. Due to its distributed nature it makes Hbase incredibly scalable. Instead of the traditional database model of scaling up (Get a bigger box), Hbase allows for scaling out i.e adding more commodity machines to your Hadoop cluster. This could mean savings of tens of thousands of dollars. Some companies such as StumbleUpon are using Hive in conjunction with Hbase in order to allow them with fast random access to their data along with the facility to make complex queries.Learn more here.





