Hadoop, Map Reduce, Hive and HBase

Posted: 13th May 2012 by Arran in Cool Stuff

As the world becomes more digitally orientated, the amount of data being generated has exploded. Many refer to this as the big data revolution. Examples of such data sets are transaction data, medical information and server logs. The challenge in refining these data sets into usable information is their sheer size. The solution is to use distributed computation to process the information in parallel. One framework used to perform such computation is Apache Hadoop. Hadoop is an open source, distributed computation platform created by Doug Cutting. The inspiration for this project stemmed from the Nutch search engine project.

The advantages of such a platform are numerous. One of the major advantages that Hadoop provides is its distributed file system, HDFS (Hadoop Distributed File System). Once the data is in HDFS, Hadoop handles many of the tasks related to distributed computation, for example task allocation, data locality and recovery from node failure. This makes processing your data much easier.

Map Reduce is the computation model that Hadoop provides, and it is inspired by functional programming languages such as Lisp. Map Reduce makes it very easy to process large data sets through the use of lists of key value pairs. Computation is split into Map and Reduce phases. The data located in HDFS is split into chunks, and the records in each chunk (single lines by default) are passed to the mapping nodes. The mappers then extract the relevant data and output it in the form of key value pairs. Once the map phase is complete, reduce tasks receive the values associated with a particular key, perform whatever post processing is necessary and output a series of key value pairs. A simple example of a Map Reduce job would be a job to count the number of A’s in a body of text. The map phase would consist of counting all the A’s in a given line and outputting the total, e.g. <'a', 14>. The reduce phase would then consist of iterating over a list of these key value pairs and outputting their sum. This is of course a very simple example. The model allows for far more complex operations.
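The A-counting job described above can be sketched in plain Python. This is only a single-process illustration of the two phases, not the Hadoop API: the function names map_line, shuffle and reduce_counts are made up for this example, the sample lines are invented, and counting upper and lower case A's together is an assumption.

```python
# Toy, single-process sketch of the map and reduce phases.
# This is NOT the Hadoop API; all names here are hypothetical.
from collections import defaultdict

def map_line(line):
    """Map phase: emit one ('a', count) pair for a single line of input."""
    # Assumption: upper and lower case A's are counted together.
    return [("a", line.lower().count("a"))]

def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce phase: sum every count that was emitted for one key."""
    return (key, sum(values))

# Invented sample input, standing in for lines read from HDFS.
lines = ["A banana a day", "keeps the llama away"]

pairs = []
for line in lines:
    pairs.extend(map_line(line))

results = [reduce_counts(key, values) for key, values in shuffle(pairs).items()]
print(results)
```

On a real cluster the map calls would run on many nodes at once and the grouping step would be handled by the framework; here everything runs in one loop purely to show the data flow.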

One downside to this model is that a new job must be written every time we want to learn something new from our data set. The solution to this is another framework built on top of Hadoop called Hive. Hive is a data warehousing solution that allows queries to be made against data located in HDFS using a language called HQL (Hive Query Language). Hive Query Language is syntactically similar to SQL. These queries are converted to Map Reduce jobs. This cuts down on developer time and allows for single run ad hoc queries.

One of the other impressive applications built on top of Hadoop is HBase. HBase is a NoSQL database that stores data in HDFS. Its distributed nature makes HBase incredibly scalable. Instead of the traditional database model of scaling up (get a bigger box), HBase allows for scaling out, i.e. adding more commodity machines to your Hadoop cluster. This could mean savings of tens of thousands of dollars. Some companies such as StumbleUpon are using Hive in conjunction with HBase to give them fast random access to their data along with the facility to make complex queries. Learn more here.

The Internet in College is so Fast

Posted: 11th January 2012 by Arran in Uncategorized

Google Goggles Sudoku Solver

Posted: 1st November 2011 by Arran in Cool Stuff

Google Goggles is a neat little app for Android that allows you to search Google using images. This works great for barcodes and product logos, but its new feature is really mind blowing. If you take a picture of a Sudoku puzzle using Google Goggles, it will recognize the puzzle and attempt to solve it. This does require an Internet connection but nevertheless is really cool! Here is a demo from one of the developers that worked on the project.

Die Hard

Posted: 30th October 2011 by Arran in Funny

Python Intro

Posted: 9th October 2011 by Arran in Tutorial

This is a very basic intro to Python. Python is a super powerful scripting language that can be run interactively in the Python interpreter or saved and run as scripts. Python has great object orientation and is lightning fast to write. Most importantly, it is lots of fun!

Get Python

The first step to learning Python is to install it if you haven’t already. Python is available for Windows, Linux and OS X. It can be downloaded at Python.org, or if you are running Linux like me you can run

#Ubuntu/Debian
sudo apt-get install python
#RedHat/Fedora
sudo yum install python

Let’s Get Started

Now that we have Python installed we can get started. Let’s not break with programming tradition. Our first line of Python will be “Hello World”. Start up the Python interpreter.

You should now see the interpreter’s >>> prompt, which means it is waiting for commands. To print “Hello World” to the screen, all we need to do is enter

print "Hello World"

Easy, right? Let’s try a little more. This time let’s assign the value “Hello World” to a variable called greeting. Python is strongly typed but also dynamically typed, meaning there is no need to state a variable’s type at declaration.

greeting = "Hello World"
print greeting
#returns
Hello World
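A small sketch of what this means in practice: the same name can be bound to values of different types without any declaration, yet Python will not silently mix types. The variable names first_type, second_type and result are only for illustration. The print(...) form with a single argument is used here because it behaves the same in the Python 2 used in this tutorial and in Python 3.

```python
greeting = "Hello World"
first_type = type(greeting).__name__    # the name currently holds a string
print(first_type)

greeting = 42
second_type = type(greeting).__name__   # the same name now holds an integer
print(second_type)

# Strong typing: adding a string and an integer raises an error
# instead of guessing what we meant.
try:
    result = "Hello " + 42
except TypeError:
    result = "TypeError"
print(result)
```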

You may have noticed that there are no semi-colons at the end of each line. Python uses indentation to determine structure in your code. This can be seen when we use a conditional statement.

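number = 5

# print already puts a space between comma-separated items,
# so the message itself should not start with one
if number >= 6:
    print number, "Is Greater Than or Equal to Six"
else:
    print number, "Is Less Than Six"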

As you can imagine, this code will print out “5 Is Less Than Six”. Notice that there is a colon at the end of the if and else lines and that all code contained in each block is indented.

Let’s imagine that we intend to make this kind of evaluation many times in our code. It would make sense to have a function to perform this task. You can define functions in the interpreter as well. Let’s place this code into a function called “evaluate”.

def evaluate(number):
    # print already puts a space between comma-separated items
    if number >= 6:
        print number, "Is Greater Than or Equal to Six"
    else:
        print number, "Is Less Than Six"

Now we have a function evaluate() that takes number as a parameter. We can call our new function like so.

evaluate(7)
#returns
7 Is Greater Than or Equal to Six

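There we have it. But a function that prints its result can only ever be used for printing, so it is better practice to return the value and let the caller decide what to do with it. Let’s modify our definition so that we return the string instead of printing it.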

def evaluate(number):
    if number >= 6:
        return str(number) + " Is Greater Than or Equal to Six"
    else:
        return str(number) + " Is Less Than Six"

The str() function converts number to a string, which allows us to concatenate it with the rest of the message.
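As a quick sketch of why returning is more flexible than printing, the returned string can be stored and reused before anything is shown on screen. The variable names message and shout are only for illustration, and print(...) with a single argument is used so the snippet behaves the same in Python 2 and Python 3.

```python
def evaluate(number):
    if number >= 6:
        return str(number) + " Is Greater Than or Equal to Six"
    else:
        return str(number) + " Is Less Than Six"

message = evaluate(7)      # keep the result instead of printing it right away
shout = message.upper()    # the caller is free to transform it further
print(message)
print(shout)
```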

Now that we have an evaluate function, let’s try evaluating all the numbers in a given range. We can do this by calling our new function in a for loop.

for i in range(0,9):
    print evaluate(i)

This code will call evaluate() for all the values from 0 to 8.
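The reason the loop stops at 8 is that range() includes its start value but excludes its end value. A minimal check, wrapping range() in list() so the result looks the same in Python 2 and Python 3 (the variable name numbers is only for illustration):

```python
numbers = list(range(0, 9))
print(numbers)                      # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# The start value defaults to 0, so range(9) gives the same numbers.
print(list(range(9)) == numbers)    # True
```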

This may be a useless function, but we have now seen all the basics: variables, conditional statements, functions and loops. These are the building blocks for any program in any language. In the next tutorial we will progress to writing scripts and the use of lists in Python.