Recently, I’ve been spending some time working with Big Data and Hadoop distributions, and I was trying to come up with a “useful” side project to play around with the technology. What bigger event is there on Twitter than the annual #happynewyear tweets as they fly around the world at the dawn of 2014?
I connected to Twitter’s streaming API using a simple Node.js client. The open source node package, appropriately named Twit, by Tolga Tezel does all the heavy lifting for me in a few lines of code. I aggregated over 6 million tweets in 24 hours – averaging roughly 60 tweets per second. According to Twitter’s documentation, the streaming API gives you access to about 1% of the full Twitter firehose at any one time, and judging by the geographic spread of the tweets, I suspect it is sympathetic to where in the world you connect from – I was running out of the Windows Azure data center in Dublin.
Processing the data
Now on to the data crunching. I uploaded all the tweets in multiple 20MB text files to Windows Azure Blob Storage and spun up an 8-node HDInsight Hadoop cluster to process the data. Storing the tweets in blob storage, decoupled from the cluster itself, gave me the flexibility to spin up the cluster for only a couple of minutes. I aggregated all the tweets that had a place associated with them and extracted the latitude and longitude coordinates.
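One way to express that extraction step – a hypothetical stand-in for the job I ran, not the exact code – is a Hadoop Streaming-style mapper: read one tweet JSON object per line from stdin and emit a tab-separated latitude/longitude pair for every tweet that carries coordinates. The `--map` flag guard is my own convention here:

```javascript
// Hypothetical Hadoop Streaming mapper for extracting tweet coordinates.
function toLatLon(line) {
  try {
    var tweet = JSON.parse(line);
    if (tweet.coordinates && tweet.coordinates.coordinates) {
      var c = tweet.coordinates.coordinates; // GeoJSON order: [lon, lat]
      return c[1] + '\t' + c[0];             // emit as "lat\tlon"
    }
  } catch (e) {
    // Skip malformed lines rather than failing the whole job.
  }
  return null;
}

// Run as a streaming mapper only when invoked with --map,
// e.g. hadoop jar hadoop-streaming.jar -mapper "node mapper.js --map" ...
if (process.argv[2] === '--map') {
  var readline = require('readline');
  var rl = readline.createInterface({ input: process.stdin, terminal: false });
  rl.on('line', function (line) {
    var out = toLatLon(line);
    if (out) console.log(out);
  });
}
```

Because each tweet sits on its own line, Hadoop can split the 20MB input files across the 8 nodes without any custom input format.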
Visualizing the results
I used the open source WebGL Globe from Google’s Chrome Experiments to showcase the results in an interactive, 360-degree visualization of the data. You’ll need to be running a WebGL-enabled browser when you connect to the website.
Open Source tools – power to the people
This experiment cost me absolutely zip to conduct! All the code and technologies I used were open source – Node.js, Hadoop and WebGL Globe. The cloud compute time also came free of charge thanks to my MSDN subscription.
The source code is available here – may the source be with you, and #happynewyear!