Jahangir Mohammed provided the most detailed response to our question on Quora (September 2, 2012). We wanted to find out whether there is a systematic process or method for the collection, analysis and presentation of big data. Here is Jahangir’s answer:
“I am inclined to say the approach is definitely systematic, but there are lots of options and one needs to figure out what is the best implementation for their specific use case.
There are various distributed data collection and aggregation frameworks, like Flume, Chukwa and Scribe, which can be leveraged to collect and aggregate data in real time from many servers.
If the data already sits in some form in an RDBMS, one can use Sqoop to transfer it between the RDBMS and a big-data framework like Hadoop (that is, HDFS).
Hadoop is a well-known framework that allows distributed processing and analysis of big data. There are a couple of other frameworks, like Cascalog, Storm (stream processing), some MPI frameworks, some BSP frameworks (like Apache Hama) and an open-source implementation of Dremel (currently being worked on), all of which were created to crunch big data. There are also Amazon’s EMR and Google’s BigQuery from a cloud perspective, but to be explicit, nothing stops one from running any of the open-source implementations in the cloud.
Presentation can range from a home-grown solution to a commercial product. Some of the offerings out there, like Datameer and BigQuery, do offer some visualizations, dashboards, Excel capabilities and so forth.”
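To make the three stages in the answer (collection, analysis, presentation) a little more concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the flow, not how Flume, Sqoop or Hadoop work internally; the server names, log lines and function names are all invented for the example.

```python
from collections import Counter

# Hypothetical sample data: log lines from two made-up servers.
SAMPLE_LOGS = {
    "web-1": ["GET /index 200", "GET /missing 404"],
    "web-2": ["GET /index 200", "POST /login 500"],
}

def collect(sources):
    """Collection: merge lines from many servers into one stream of records."""
    for server, lines in sources.items():
        for line in lines:
            yield server, line

def analyze(records):
    """Analysis: count records per status code (the last field of each line)."""
    counts = Counter()
    for _server, line in records:
        status = line.split()[-1]
        counts[status] += 1
    return counts

def present(counts):
    """Presentation: render the counts as a small plain-text table."""
    rows = [f"{status:<8}{n}" for status, n in sorted(counts.items())]
    return "\n".join(["status  count"] + rows)

report = present(analyze(collect(SAMPLE_LOGS)))
print(report)
```

In a real deployment each function would be replaced by one of the tools named above: a collector such as Flume in place of collect, a processing framework such as Hadoop in place of analyze, and a dashboard or commercial product in place of present.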
Feel free to add your views in the comments section.
Special thanks to Jahangir Mohammed and Vijay Kamath who both took time out to provide answers to our question.