So here is the thing: how are people sharing on Twitter around the world? Which devices or services do they usually use to share their check-ins, photos, videos, or updates on Twitter?
This is a really simple analysis I did on data that I’ve been gathering for almost 2 months now (around 97 million tweets from the US and 5.5 million tweets from France at the time of this study) to get some answers to the question above using Hadoop batch processing.
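To make the idea concrete, here is a minimal sketch of the kind of map and reduce steps such a job could use. It is not the actual job I ran. It assumes the export holds one JSON tweet per line, and it relies on the tweet's `source` field, which Twitter fills with an HTML anchor such as `<a href="..." rel="nofollow">Twitter for iPhone</a>`. The function names (`source_name`, `map_tweet`, `reduce_counts`, `count_sources`) and the sample lines are made up for illustration.

```python
import json
import re
from collections import Counter

# Matches the label inside an HTML anchor such as
# <a href="..." rel="nofollow">Twitter for iPhone</a>
ANCHOR_TEXT = re.compile(r"<a[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def source_name(raw_source):
    """Return the anchor label, or the raw value if it is not an anchor."""
    match = ANCHOR_TEXT.search(raw_source)
    return (match.group(1) if match else raw_source).strip()


def map_tweet(line):
    """Map step: yield (source, 1) for one JSON line, skipping unusable lines."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return  # not valid JSON
    if not isinstance(tweet, dict):
        return
    source = tweet.get("source")
    if isinstance(source, str) and source.strip():
        yield source_name(source), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each source."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def count_sources(lines):
    """Run the map and reduce steps over an iterable of JSON lines."""
    return reduce_counts(pair for line in lines for pair in map_tweet(line))


if __name__ == "__main__":
    # Made-up sample lines, only to show the call.
    iphone = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
    sample = [
        json.dumps({"source": iphone}),
        json.dumps({"source": "web"}),
        json.dumps({"source": iphone}),
    ]
    for name, total in count_sources(sample).most_common():
        print(name, total)
```

With Hadoop Streaming the same two functions would live in separate mapper and reducer scripts that read from standard input. Keeping them as plain functions here just makes them easy to try on a small sample first.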
I have 4 EC2 instances up and running 24×7 to track tweets (from the Twitter Public Streaming API) and store them in a MongoDB Replica Set. One of the nodes is an application server that I built on the Node.js stack to process and visualise the stream in real time as it comes into the system. Currently I get an average of 100 tweets/s, a minimum of 30-40/s, and a maximum of 180-220/s. I use more than one Twitter account at the same time to track tweets by location and by different keywords. That’s why I sometimes get more than 1% of the entire stream!
There are a few batch processing jobs that I’d like to run on exported files on S3 using an Amazon EMR cluster rather than with MongoDB aggregations or Redis operations, no matter how fast they are, simply because of their unpredictable behaviour in a long-running processing job. Besides, firing up and destroying EMR clusters automatically and analysing a big amount of data from S3 is so easy and so cheap that risking a long-running MapReduce job or aggregation in a production environment is just out of the question, not that my MongoDB Replica Set instances are that big or good enough for production anyway. In my case the Hadoop MapReduce jobs take about 20 minutes, plus 4 to 5 minutes for the cluster to start up and shut down, so a run takes around 30 minutes. That will cost 5 (m2.xlarge instances) * $0.5 (price per hour) / 2 (half an hour) = 5 * 0.5 / 2 = $1.25, which is nothing compared to what we would have to pay to beef up three EC2 instances!
UPDATE1: I was wrong about the price. I based my calculation on the EC2 instance price alone, when it should have been based on the EC2 price plus the EMR price. Don’t worry, it is still cheap enough!
So here is the price for the same cluster: (5 * ($0.09 + $0.495)) / 2 = $1.46! So if you fire up the cluster once a day you’ll end up paying around 44 bucks a month. Amazon provides a calculator here that you can play with.
UPDATE2: You always pay for the entire hour in AWS! There is no such thing as paying for 30 minutes, so optimise your workload to use the entire hour because you will pay for it anyway 🙂
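To see what the two updates do to the numbers, here is a small sketch of the arithmetic. The instance count (5) and the two hourly rates ($0.09 and $0.495 per instance) are the ones quoted above. The function name `emr_run_cost` and its parameters are made up for this example, and real AWS prices change over time, so treat it as a way to redo the sum rather than as a price reference.

```python
import math


def emr_run_cost(instances, hourly_rates, minutes, whole_hours=True):
    """Estimate the cost of one cluster run.

    hourly_rates: per-instance hourly rates that get added together
                  (here the two rates quoted in the post).
    whole_hours:  if True, round the run time up to full hours,
                  as described in UPDATE2.
    """
    hours = minutes / 60
    if whole_hours:
        hours = math.ceil(hours)
    return instances * sum(hourly_rates) * hours


if __name__ == "__main__":
    rates = (0.09, 0.495)  # per-instance hourly rates quoted in the post

    # Pro-rated half hour, as in the UPDATE1 calculation.
    pro_rated = emr_run_cost(5, rates, 30, whole_hours=False)

    # Same run billed as a whole hour, as in UPDATE2.
    billed = emr_run_cost(5, rates, 30)

    print(f"pro-rated: {pro_rated:.4f}  whole hour: {billed:.4f}")
    print(f"30 daily runs, whole hours: {30 * billed:.2f}")
```

With these inputs the pro-rated figure is the $1.46 from above, while billing the whole hour doubles it to roughly $2.93 per run, or close to 88 bucks for 30 daily runs.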
I will do more complex analytics on the source of tweets to see how frequently each device and service is used during the day and at night. That’d be cool!
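As a rough idea of how that split could start, here is a sketch that buckets tweets into "day" and "night" by their `created_at` timestamp. Twitter sends it in UTC in a format like `Wed Aug 27 13:08:45 +0000 2008`. Everything else is an assumption made for illustration: the function names, the 06:00 to 18:00 definition of "day", and the single fixed UTC offset standing in for local time.

```python
from collections import Counter
from datetime import datetime, timedelta

# Format of the created_at field, e.g. "Wed Aug 27 13:08:45 +0000 2008"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def day_or_night(created_at, utc_offset_hours=0, day_start=6, day_end=18):
    """Label a timestamp "day" or "night" after shifting it by a fixed offset.

    The offset and the 06:00-18:00 window are assumptions for this sketch.
    """
    utc_time = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    local_time = utc_time + timedelta(hours=utc_offset_hours)
    return "day" if day_start <= local_time.hour < day_end else "night"


def count_by_period(tweets, utc_offset_hours=0):
    """Count (period, source) pairs from an iterable of (created_at, source)."""
    counts = Counter()
    for created_at, source in tweets:
        period = day_or_night(created_at, utc_offset_hours)
        counts[(period, source)] += 1
    return counts
```

A single offset per country is obviously crude for the US, which spans several time zones, so a real job would need a better way to get local time for each tweet.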