Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Consider social-networking sites like Facebook or Twitter. Billions of users post comments, update their status, upload photos, and so on. Imagine how large such data would be. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
The three Vs (volume, velocity and variety) are commonly used to characterize different aspects of big data. Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it. McKinsey has also published a research report on big data.
OK, now we have big data. What can be done with it?
We can extract insight and intelligent information from an immense volume, variety and velocity of data in context, beyond what was previously possible.
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target; as of 2012 they ranged from a few dozen terabytes to many petabytes of data in a single data set. Because of this difficulty, a new platform of “big data” tools has arisen to handle sensemaking over large quantities of data, such as the Apache Hadoop Big Data Platform.
Some instances of big data:
- In total, the four main detectors at the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010 (13,000 terabytes)
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress
- Facebook handles 40 billion photos from its user base
- The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide
- The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates
- Decoding the human genome originally took 10 years to process; now it can be achieved in one week
- Computational social science: Tobias Preis et al. used Google Trends data to show that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than for information about the past. The findings suggest there may be a link between online behavior and real-world economic indicators. The authors examined Google query logs from Internet users in 45 different countries in 2010 and calculated the ratio of the volume of searches for the coming year (‘2011’) to the volume of searches for the previous year (‘2009’), which they call the ‘future orientation index’. Comparing this index with the per capita GDP of each country, they found a strong tendency for countries in which Google users enquire more about the future to have a higher GDP. The results hint at a possible relationship between the economic success of a country and the information-seeking behavior of its citizens, as captured in big data.
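The ‘future orientation index’ from the last example is just a ratio, so it is easy to sketch. The snippet below is only an illustration: the country names and search volumes are made up, and the function name is my own, not something taken from the study.

```python
def future_orientation_index(future_searches, past_searches):
    """Ratio of search volume for the coming year to search volume for the previous year."""
    if past_searches <= 0:
        raise ValueError("past_searches must be positive")
    return future_searches / past_searches


# Hypothetical search volumes, invented purely for illustration.
search_volumes = {
    "Country A": {"2011": 1200, "2009": 1000},
    "Country B": {"2011": 800, "2009": 1000},
}

indices = {
    country: future_orientation_index(volumes["2011"], volumes["2009"])
    for country, volumes in search_volumes.items()
}

for country, index in indices.items():
    print(country, round(index, 2))
```

An index above 1 means users in that country searched more for the coming year than for the previous one.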
Consider a big organization with some data to process. You set up a cluster to process that data. Soon the data grows in volume. You can add more nodes to the cluster, but this can be very expensive, and on top of that there is the cost of maintenance. And how far can you keep growing your cluster? A better alternative would be to rent a cluster on a time basis to process your data. Once you are done, you can stop using it. Later, whenever you need it, you can rent one again on demand. This is exactly what Amazon Web Services provides.
Amazon Web Services (abbreviated AWS) is a collection of remote computing services (also called web services) that together make up a cloud computing platform, offered over the Internet by Amazon.com. The most central and well-known of these services are Amazon EC2 and Amazon S3. Apparently, you can get an instance running on the Amazon cloud for as little as 1 rupee per hour!
Map-Reduce is a simple data-parallel programming model designed for scalability and fault-tolerance.
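To get a feel for the model, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, chosen only to mirror the usual map, shuffle and reduce steps.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster, each document (or block) would be mapped on a different node.
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data moves fast"])
print(counts)
```

Because each map call looks at only one record and each reduce call looks at only one key, the work can be spread over many machines, and a failed piece can simply be run again. That is roughly where the scalability and fault tolerance come from.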
There are two types of Hadoop clusters:
- Cloud Cluster
- Local Cluster
For learning, we can get started with a single-node local cluster on our personal machines. In the next post, I will tell you how to set up a single-node Hadoop cluster.