Sunday, 22 February 2015

What is Apache Storm? How is it related to potato chips?

Logo of Apache Storm project

Previously, we saw what big data is. Now let us look at a framework that lets you work with big data. You may already know what a framework is: it provides the classes and interfaces you need to do a particular job easily. The word "easily" is important here. A framework hides the details you don't need to worry about by taking on that burden itself.

Now, let us come to Storm. A storm comes and goes at blazing speed, perhaps in seconds. We need to process big data at a similar speed, and Apache Storm is built to do that. What it does and how it does it are worth discussing.

Apache Storm processes a stream of data, i.e. the data is continuous. In simple terms, the data keeps on flowing.

Understanding Storm with a potato chip factory

Processing the potatoes in a factory
See how the potatoes are processed here

Consider a potato chip factory. First, the potatoes are unloaded from the truck and sent into the factory. There they go through several stages.
1. The potatoes are cleaned.
2. Next, they are tested to see whether they are good or not.
3. Next, they are peeled.
4. Next, they are cut into the desired shape.
5. Next, they are fried.
6. Next, a flavor is added to them.
7. Finally, they are packed into packets.

The same potatoes go through all of these stages. In a similar way, data goes through several operations, and each operation is called a bolt in Storm. You got the potatoes from the truck, so the truck is the source of potatoes. Likewise, the source of the data is called a spout in Storm. After the first stage is completed, the potatoes move to the second stage, so the first stage acts as the source for the second stage, much as a spout does (in Storm it is still a bolt; it simply emits its output to the next bolt).
All the spouts and bolts together are called a topology in Storm.
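
The factory analogy can be sketched in a few lines of plain Python. This is only an illustration of the idea and does not use the Storm API: the function names (truck_spout, clean_bolt, test_bolt, peel_bolt, run_topology) and the potato records are made up for this post, and only three of the seven stages are shown.

```python
def truck_spout(potatoes):
    """Spout: the source, handing over one potato at a time."""
    for potato in potatoes:
        yield potato

def clean_bolt(stream):
    """Bolt: mark every potato as cleaned."""
    for potato in stream:
        yield {**potato, "clean": True}

def test_bolt(stream):
    """Bolt: pass along only the good potatoes."""
    for potato in stream:
        if potato["good"]:
            yield potato

def peel_bolt(stream):
    """Bolt: mark every potato as peeled."""
    for potato in stream:
        yield {**potato, "peeled": True}

def run_topology(potatoes):
    """Topology: the spout plus the bolts, connected in order."""
    stream = truck_spout(potatoes)
    for bolt in (clean_bolt, test_bolt, peel_bolt):
        stream = bolt(stream)
    return list(stream)

# A hypothetical truck load: potato 2 is bad and should be dropped.
load = [
    {"id": 1, "good": True},
    {"id": 2, "good": False},
    {"id": 3, "good": True},
]
processed = run_topology(load)
```

Each bolt only knows about the stream it receives, which is what lets the stages be separate machines in the factory and separate programs in Storm.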

The important point to remember is that separate machines process these potatoes. The machine that cleans a potato is different from the machine that tests it, the peeler is different again, and so on. But they are all connected to each other. Remember also that these machines keep on running. They never stop, because the potatoes keep on coming.
In a similar way, the data keeps on flowing and you perform operations on it. The programs that perform these operations keep on running; you only give them the data. They are not started when you provide the data and terminated when the operation is complete. If there is no data, the program is idle, i.e. it stays in memory but does no work. So the topology is always running, and the operations are performed whenever you give it data.
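
To make the "always running, idle when there is no data" idea concrete, here is a rough sketch in plain Python, not Storm code. A worker thread stands in for a bolt and simply waits on a queue until data arrives. The name uppercase_bolt and the None shutdown signal are assumptions for this demo; a real topology would keep running and would not be stopped this way.

```python
import queue
import threading

def uppercase_bolt(inbox, outbox):
    """Keeps running; sits idle (blocked) while the inbox is empty."""
    while True:
        item = inbox.get()       # waits here when there is no data
        if item is None:         # demo-only shutdown signal
            break
        outbox.put(item.upper())

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=uppercase_bolt, args=(inbox, outbox))
worker.start()                   # the bolt is running, with nothing to do yet

# Data arrives later; the already-running bolt picks it up.
for word in ["storm", "bolt"]:
    inbox.put(word)

inbox.put(None)                  # stop the demo so the script can exit
worker.join()
results = [outbox.get() for _ in range(outbox.qsize())]
```

The program is started once and then fed data, which is the opposite of a batch job that is launched per input and exits when it finishes.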

The birth of Storm

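Apache Storm was originally created at BackType and open-sourced by Twitter after Twitter acquired BackType. It later entered the Apache Incubator and graduated to a top-level Apache project in September 2014. It is used in production environments by many companies, such as Twitter, Yahoo, Yelp and Groupon.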

Because Apache Storm is a big data framework, the bolts are executed in parallel, i.e. they are distributed among several systems. This is what lets processing happen with low latency, often under a second. Storm has its own terminology, which you need to master to understand it. Let us go through it.

Apache Storm terminology

Bolt: A program that performs an operation on the data. There can be any number of bolts in a topology. Each bolt does one operation. The operations can include anything, from modifying the data to performing calculations, logging or storing in the database.

Spout: The source of data is called a spout. It supplies data to the bolts for processing. The spout can be anything, such as an HTTP request or a message queue. For example, if you want to process tweets, you could listen on an HTTP port that receives those tweets, and the tweets form the stream of data.

Tuple: The unit of data that is processed by a bolt. As discussed before, Storm processes a stream of data, i.e. the data keeps on flowing. A stream of data is a collection of many individual data items.
For example, if tweets are a stream of data that keeps on flowing, then every tweet is a tuple. This tuple is processed by the bolt. The spout produces a series of tuples.

Topology: Bolts and spouts together are called a topology. The topology contains the spouts and bolts and also the connections between them. In the potato chip example above, there is a sequence of processes, and the potatoes cannot go through them in just any order. For example, they cannot be fried until they are peeled and cut. So there is a flow of execution. In a similar way, you specify which bolt should execute first and which one next, because one bolt can depend on another.
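
One way to picture "components plus connections" is as a small dependency graph. The sketch below is plain Python (3.9 or later, for the standard graphlib module), not the Storm API; Storm wires spouts and bolts with its own builder, which is not shown here. The component names come from the potato chip stages and are only labels.

```python
from graphlib import TopologicalSorter

# Each component maps to the set of components it receives data from.
# "truck" is the spout; everything else is a bolt.
topology = {
    "clean": {"truck"},
    "test": {"clean"},
    "peel": {"test"},
    "cut": {"peel"},
    "fry": {"cut"},
    "flavor": {"fry"},
    "pack": {"flavor"},
}

# Work out an order in which every stage comes after the stage it depends on.
order = list(TopologicalSorter(topology).static_order())
```

Because the connections say that frying depends on cutting and cutting on peeling, the resulting order can never place "fry" before "peel" or "cut".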

Stream: An unbounded sequence of tuples. It is a continuous flow of tuples.

Fields: Every tuple has one or more fields. A field is a named value, so you can think of it as a key-value pair. As we have seen before, a tuple is a data item processed by the bolt. A tuple can carry anything, from a simple String object to a Student object.
Consider, for example, that we need to process student information. The student will have several fields, such as sno, sname and age. The fields are then sno, sname and age, and each holds its corresponding value.
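
As a small illustration of fields and their values, here is the student tuple written with Python's namedtuple. This is not how Storm itself represents a tuple; it only shows the idea of field names paired with values. The student name and values are invented for the example.

```python
from collections import namedtuple

# The field names are declared once; every tuple then carries one value per field.
Student = namedtuple("Student", ["sno", "sname", "age"])

student = Student(sno=101, sname="Asha", age=20)  # made-up sample values

field_names = Student._fields   # the declared field names
as_pairs = student._asdict()    # the same tuple viewed as key-value pairs
```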

There is more terminology to explore, but I don't want to clutter this post with all of it. These are the basic terms.

A simple practical example

Suppose I want to store student information. The student gives only a name and an age, so I need to generate the sno. The topology therefore has one spout and two bolts, as follows.

Spout: HTTP source
Bolt 1: Generates the sno
Bolt 2: Stores the student object in the database

The spout emits student objects, where every object is a tuple and each tuple contains only sname and age, since sno is to be generated by bolt 1.
Bolt 1 generates a new sno. We then need to add the sno to the existing tuple and send it to bolt 2.
So we add an additional field named sno with the generated value, say 101.
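
Here is a rough sketch of this example in plain Python, again without the Storm API. Everything in it is an assumption made for illustration: the HTTP source is replaced by a function that yields two made-up students, the sno values simply count up from 101, and the database is an in-memory SQLite table named student.

```python
import itertools
import sqlite3

def http_spout():
    """Stand-in for the HTTP source: tuples carry only sname and age."""
    yield {"sname": "Asha", "age": 20}
    yield {"sname": "Ravi", "age": 22}

def sno_bolt(stream, start=101):
    """Bolt 1: add a generated sno field to each incoming tuple."""
    counter = itertools.count(start)
    for student in stream:
        yield {**student, "sno": next(counter)}

def store_bolt(stream, conn):
    """Bolt 2: store each complete student tuple in the database."""
    for student in stream:
        conn.execute(
            "INSERT INTO student (sno, sname, age) VALUES (:sno, :sname, :age)",
            student,
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (sno INTEGER PRIMARY KEY, sname TEXT, age INTEGER)"
)

# Wire the pieces together: spout -> bolt 1 -> bolt 2.
store_bolt(sno_bolt(http_spout()), conn)
rows = conn.execute("SELECT sno, sname, age FROM student ORDER BY sno").fetchall()
```

Bolt 1 never touches the database and bolt 2 never generates numbers, so each bolt keeps to one operation, as described in the terminology above.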
