Sunday, 22 February 2015

What is Apache Storm? How is it related to potato chips?

Logo of Apache Storm project

Previously, we saw what Big Data is; now let us look at a framework that lets you work with big data. You probably already know what a framework is: it is nothing but something that provides you the classes and interfaces to do a particular job easily. The word "easily" is very important here. A framework hides all the details you don't need to worry about by taking that burden on itself.

Now, let us come to Storm. In nature, a storm comes and goes at blazing speed, perhaps in seconds. That is the speed at which we need to process big data, and Apache Storm does exactly that. But what it does and how it does it are worth discussing.

Apache Storm processes a stream of data, i.e. the data here is continuous. In simple terms, the data keeps on flowing.

Understanding Storm with a potato chip factory

Processing the potatoes in a factory
See how the potatoes are processed here

Consider a potato chip factory. First, the potatoes are unloaded from the truck and sent into the factory. The potatoes then go through several stages:
1. The potatoes are cleaned.
2. Next, they are tested to check whether they are good or not.
3. Next, they are peeled.
4. Next, they are cut into the desired shape.
5. Next, they are fried.
6. Next, a flavor is added to them.
7. Finally, they are packed into packets.

The same potatoes go through all these stages. In a similar way, the data goes through several operations, and each operation is called a Bolt in Storm. The potatoes came from the truck, which means the truck is the source of the potatoes; likewise, the source of the data is called a Spout in Storm. After the first stage is completed, the potatoes move on to the second stage, so the first stage acts as the source (the spout, so to speak) for the second stage.
All the spouts and bolts together are called a topology in Storm.

The important point to remember is that different machines process these potatoes: the machine that cleans a potato is different from the machine that tests it, the peeler is different again, and so on. But they are all connected to each other. Also remember that these machines keep on running. They never stop, because the potatoes keep on coming.
In a similar way, the data keeps on flowing and the operations are applied to it. The programs that perform these operations keep on running; you only give them the data. They are not started when you provide the data and terminated when the operation is complete. If there is no data, a program simply sits idle, i.e. it stays in memory but does no work. So the topology is always running, and whenever you give it data, the operations are performed.

The birth of Storm

Apache Storm was originally introduced by Twitter and is now an Apache project under incubation. Even so, it is used in production environments by many companies such as Twitter, Yahoo, Yelp, Groupon, etc.

Since Apache Storm is a big data framework, the bolts are executed in parallel, i.e. the bolts are distributed across several machines, so processing happens with sub-second latency. There is some Storm terminology you need to master in order to understand it. Let us go through it.

Apache Storm terminology

Bolt: A program that performs an operation on the data. There can be any number of bolts in a topology, and each bolt does one operation. The operations can be anything, from modifying the data to performing calculations, logging, or storing it in a database.

Spout: The source of the data is called a spout. It feeds data to the bolts for processing. A spout can be backed by almost anything: an HTTP endpoint, a messaging queue, and so on. For example, if you want to process tweets, you could listen on an HTTP port that receives those tweets, and the tweets become your stream of data.

Tuple: The unit of data that is processed by a bolt. As we discussed before, Storm processes a stream of data, i.e. the data keeps on flowing, and a stream of data is a collection of many individual data items.
For example, if tweets are a stream of data that keeps on flowing, then every tweet is called a tuple. This tuple is processed by a bolt. The spout produces a series of tuples.

Topology: Bolts and spouts together are called a topology. The topology contains the spouts and the bolts and also the connections between them. In the factory example above, there is a sequence of processes; the potatoes cannot go through those processes in any order they like. For example, they cannot be fried until they are peeled and cut, so there is a flow of execution. In a similar way, you specify which bolt should execute first and which one next, because one bolt can depend on another.

Stream: A sequence of tuples with no defined end; a continuous flow of tuples.

Fields: Every tuple has one or more fields. A field is nothing but a key-value pair. As we have seen before, a tuple is a data item processed by a bolt, and it can hold anything, from a simple String to a Student object.
Consider, for example, that we need to process student information. A student has many attributes, such as sno, sname and age. The fields are then sno, sname and age, each holding its corresponding value.
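
To make the idea of fields concrete, here is a minimal sketch of a bolt that works with exactly those fields. It assumes the Storm 0.9.x-era API (the backtype.storm packages); the class name StudentFieldsBolt and the sno, sname and age fields are hypothetical, taken only from this Student illustration.

// A minimal sketch (Storm 0.9.x-era backtype.storm API, hypothetical class name).
// It reads the sno, sname and age fields of an incoming tuple by name and
// declares the same three fields for the tuples it emits.
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class StudentFieldsBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Each field is read by its key, just like a key-value pair.
        Integer sno = tuple.getIntegerByField("sno");
        String sname = tuple.getStringByField("sname");
        Integer age = tuple.getIntegerByField("age");

        // Pass the same fields downstream (values in the declared order).
        collector.emit(new Values(sno, sname, age));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The keys of every tuple this bolt emits.
        declarer.declare(new Fields("sno", "sname", "age"));
    }
}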

There is more terminology to explore, but I don't want to clutter this post with all of it. These are the basic terms.

A simple practical example

Consider that I want to store student information. The student gives only a name and an age, and I need to generate the sno. So I have a spout and two bolts, as follows.

Spout: HTTP source
Bolt 1: To generate the sno
Bolt 2: To store the student object in the database

The spout emits student objects, where every object is a tuple and each tuple contains only sname and age, since the sno is to be generated by bolt 1.
Bolt 1 generates a new sno. We then need to add the sno to the existing tuple and send it on to bolt 2.
So, we add an additional field named sno with the generated value, say 101.
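
To show how these pieces fit together, here is a minimal sketch of the whole topology, again assuming the Storm 0.9.x-era API (backtype.storm packages). The spout, bolt and field names (StudentSpout, GenerateSnoBolt, StoreStudentBolt, sname, age, sno) are hypothetical, and the spout emits a couple of hard-coded students instead of listening on an HTTP port.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class StudentTopology {

    // Spout: stands in for the HTTP source and emits (sname, age) tuples.
    public static class StudentSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] names = { "Alice", "Bob" };
        private final int[] ages = { 20, 22 };
        private int next = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            if (next < names.length) {
                collector.emit(new Values(names[next], ages[next]));
                next++;
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sname", "age"));
        }
    }

    // Bolt 1: generates the sno and adds it as a new field on the tuple.
    public static class GenerateSnoBolt extends BaseBasicBolt {
        private int nextSno = 101;

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String sname = tuple.getStringByField("sname");
            Integer age = tuple.getIntegerByField("age");
            collector.emit(new Values(nextSno++, sname, age));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sno", "sname", "age"));
        }
    }

    // Bolt 2: "stores" the student; printing stands in for the database write.
    public static class StoreStudentBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("Storing student: " + tuple.getValues());
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: it emits nothing further.
        }
    }

    public static void main(String[] args) throws Exception {
        // The wiring below is the topology: spout -> bolt 1 -> bolt 2.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("student-spout", new StudentSpout());
        builder.setBolt("generate-sno", new GenerateSnoBolt()).shuffleGrouping("student-spout");
        builder.setBolt("store-student", new StoreStudentBolt()).shuffleGrouping("generate-sno");

        // Run it in-process for demonstration.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("student-topology", new Config(), builder.createTopology());
    }
}

Submitted this way, the topology keeps running even when no students arrive; the bolts simply sit idle until the spout emits the next tuple, which is exactly the always-running behaviour described above.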

Monday, 9 February 2015

What is Big data? Is it worth worrying about?

Big data, the buzz these days.

Big data has become a popular term in the market these days. In fact, it is not a term that was introduced in the past year or even a few years ago; it has existed for a long time. When we come across this term, we often think it is something strange. But no: big data means nothing but data that is big, simply huge amounts of data (say thousands of petabytes, or even more than that).


Big data existed before, but the term has only recently become popular. Take Google, for example: we all know that Google holds a lot of information about web pages, and it runs roughly 2,000,000 servers (approx.), which suggests it has a very, very large amount of data. Not just Google; most companies have large sets of data.

But what will you do with all the data?

Data exists so that we can perform operations on it and get the output we need. For example, when you type something into Google, you see millions of results, and you can also see the number of seconds Google took to give them to you.

Google search results and time taken to search them

You get about 3,990,000 results in 0.16 seconds. Isn't that cool? Isn't that impressive? Processing that many records and getting the best results to the top within sub-second latency is cool.
This is the essential point to understand about big data.

When data is big, processing it obviously takes more time. But what we need is the ability to perform any number of operations on that data within seconds.

Now, try copying a 10 MB file from your F:\ drive to your C:\ drive and see how much time it takes. At least a few seconds? Yes. So how do you think Google processes petabytes of data in 0.16 seconds? Google is said to use about 200 algorithms to filter search results, and all of those algorithms must process petabytes (more than that, of course) of data. How is that possible in 0.16 seconds? Can we do it too? Yes, absolutely, with the right tools!

What is Hadoop?

When people talk about big data, they talk about Hadoop, and less often about MapReduce. I am not going to cover everything about Hadoop here; however, I will give you a brief introduction to what it is.

Earlier you saw that petabytes of data can be processed within 0.16 seconds. When you have such huge amounts of data, you also need to deliver results in roughly that same time, so there must be a way to do this. Is writing simple programs and making efficient use of variables going to achieve it? No; this is a matter of software design. You need a way to ensure that the system gets the work done within a much smaller time frame.

Now, you know that Google has a lot of servers. Does Google store all its data on one server?
When you type something into Google, your query obviously goes to some server, but is that server going to contain all the data you are trying to find? Absolutely not. Is that one server going to do all the work?

Is it going to contain all the code needed to process the data? No.
Say your query is transferred to another server; is that server going to execute all of Google's programs? No.

Even if it did, how much RAM would it need to perform operations on petabytes of data? How many processors would it need to process that huge amount of data in 0.16 seconds?

Think about it: what is the largest amount of RAM you have ever come across? How many processors can fit in a single server?

The answer lies right there. The key is that one server is not going to do the entire job; instead, the job is split across multiple servers. When Google has to perform a lot of operations, it gives each operation to a server and then combines the results of all those operations and gives them to you. Thereby it not only makes efficient use of resources but also produces results faster.

It doesn't matter how much data you have; as long as you have enough servers to process it, you can always produce the results within a smaller time frame. If one server had to do all of that, it would take many days, even with a high-end configuration.

To do this, we have MapReduce (developed by Google). It is a programming paradigm in which the job is split into smaller tasks, each of which produces key-value pairs (the map phase). After all those small tasks are completed, the key-value pairs in the map are combined, or reduced, into the final result (the reduce phase). That is why it is called Map-Reduce. To know more about MapReduce, click here.
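
To give a feel for the paradigm, here is a minimal word-count sketch written against the classic Hadoop MapReduce API (the standard introductory example, not Google's actual code). The class names and the input/output paths passed on the command line are hypothetical; it only illustrates the split-into-map-tasks, then-reduce-the-key-value-pairs idea described above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each task gets a chunk of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // key-value pair: (word, 1)
                }
            }
        }
    }

    // Reduce phase: all values for the same key are combined into one result.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input directory, args[1] = output directory (hypothetical paths)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each mapper runs in parallel on a different block of the input (typically stored in HDFS across many machines); Hadoop then groups all pairs with the same key and hands each group to a reducer, and that grouping-and-combining step is the "reduce" in Map-Reduce.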

Apache Hadoop is a framework that lets us process big data in less time. It contains an implementation of MapReduce and also HDFS (a filesystem that manages the data by distributing it across multiple servers). Learn more about HDFS here.

Now, decide for yourself whether it is worth worrying about or not.

Wednesday, 4 February 2015

Quick Introduction to Maven

Maven logo
Maven is a dependency management tool.

The term dependency management might be new to you, but there is nothing really new here. A dependency, in general, is a jar file (a collection of classes). Each jar file relates to a particular framework or API, and management simply means managing those jar files. I say jar files (plural) because your project will probably contain multiple jar files, since it is likely to use more than one framework. Even if your project does not use more than one framework, it may still contain multiple jar files, because the framework you are using might depend on classes in other jar files. So it is necessary to include those jar files for the framework to work.
For example, when writing the usual Hello World program in Java, you use System.out.println(), where the System class resides in the java.lang package. This class is present in rt.jar, which sits in your JRE folder.
Without this rt.jar file, you cannot compile the program, because the java.lang package would not be available.

Fine, so what does Maven do?

Maven manages these dependencies (jars) for you. When your project uses a particular framework, Maven downloads all the jar files that the framework needs, so you don't have to. All you need to do is tell Maven which frameworks your project uses, and Maven takes care of every resource (jar file) that each framework needs in order to work. So no matter how many frameworks you use, just declare what you use; what those frameworks use internally is something you need not care about.

Where do I specify which frameworks my project uses?

There is a file called pom.xml, where POM stands for Project Object Model, which defines the model of the project. We write a <dependencies> tag in which we specify each dependency. Remember that a dependency means a jar file. There are many other tags in pom.xml; however, we are only going to work with the few that let us include dependencies.

Where are these jar files stored?

These jar files are stored in the Central Repository (http://search.maven.org). It holds all the jar files from which Maven pulls. You can also download the jar files from there independently.

groupId, artifactId and version

Every dependency has these properties. The groupId is the ID of the group, which is unique to the project; the artifactId is the name of the jar file; and the version is the version of the jar file (frameworks get updates, so there are different versions).
You might wonder about the difference between groupId and artifactId. It is simple: the groupId identifies the project as a whole, while the artifactId names an individual jar within it. Let me give you some examples so you can see the difference.

groupId - org.springframework
artifactId - spring-core

groupId - org.springframework
artifactId - spring-beans

Writing your first pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd"
>
<modelVersion>4.0.0</modelVersion>

<groupId>com.javademos.blogspot</groupId>
<artifactId>maven-project</artifactId>
<version>1.0</version>

</project>
 
See, <project> is the root element, which contains several tags inside it. Don't worry about the schema locations if you are not familiar with them. The .xsd files contain the definitions for tags like <groupId>, <artifactId>, etc.; without those xsd files the tags are not recognized and the pom.xml will contain errors. Think of it like importing packages when we use a class in Java.

When we create a project, it must have a name, which is the groupId; there must also be a name given to the jar file (the artifactId) and a version. The groupId should follow package naming conventions (i.e. the way Java packages are named).
Now comes the main part: including dependencies.
 
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd"
>
<modelVersion>4.0.0</modelVersion>

<groupId>com.javademos.blogspot</groupId>
<artifactId>maven-project</artifactId>
<version>1.0</version> 
 
   <dependencies>
          <dependency>
              <groupId>org.springframework</groupId>
              <artifactId>spring-core</artifactId>
              <version>4.1.4.RELEASE</version> 
          </dependency> 
   </dependencies> 
</project>
 
See, you can include as many dependencies as your project uses. All you need to do is add a <dependency> tag inside the <dependencies> tag. Note that the version can also contain letters, as you can see.

You might ask how you are supposed to know all the groupIds, artifactIds and versions you want to use. In fact, most of the time you won't know the groupId, but you will be familiar with the artifactId. Just go to http://search.maven.org and search for the artifactId, for example spring-core. That's it. By looking at the groupId you will be able to tell which spring-core you need to use. Click on the latest version (if that is what you use); otherwise, go to "all versions", where you will see a long list of versions, and choose the one your project uses.

On the left side you will find a Dependency Information panel, from which you can copy the <dependency> snippet and paste it into your pom.xml. You don't even have to write it yourself!