Streaming Twitter API, Big Data and AC Milan vs FC Barcelona
These days we are working on a project based on monitoring tweets through the Twitter Streaming API. This API allows you to open a connection to Twitter and start receiving tweets that meet your search criteria, in our case tweets containing certain keywords. Using this API we can receive the tweets that match our criteria as they are published on Twitter. The standard Search API does not return every tweet, is rate limited and is not truly real time.
At this stage of project development, we need to perform several tests, mainly to assess whether the system we’re designing is capable of processing large amounts of data (tweets) per second.
The Champions League match AC Milan vs FC Barcelona looked like a good opportunity to monitor different keywords associated with the game. From 30 minutes before the match until 30 minutes after it, we kept a streaming connection to the Twitter API open to read all the tweets with these keywords:
milanbarça, milanbarcelona, forçabarça, forzamilan, milan-barça, milan-barcelona, milan-barca, milan-fcb, milan vs barça, milan vs barcelona, milan vs barca, messi, ibrahimovic
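As a rough illustration of how a keyword list like this selects tweets, here is a minimal Python sketch. It assumes the commonly documented behaviour of keyword tracking, where a phrase matches when all of its terms appear in the tweet, in any order and ignoring case. The `tokens` and `matches` helpers are hypothetical, and the tokenization is a deliberate simplification, not Twitter's actual matching rules.

```python
import string

KEYWORDS = [
    "milanbarça", "milanbarcelona", "forçabarça", "forzamilan",
    "milan-barça", "milan-barcelona", "milan-barca", "milan-fcb",
    "milan vs barça", "milan vs barcelona", "milan vs barca",
    "messi", "ibrahimovic",
]

def tokens(text):
    """Lowercase, split on whitespace and strip punctuation from word edges.

    Simplified tokenization (an assumption): '#ForzaMilan!' becomes
    'forzamilan', while inner hyphens such as 'milan-barça' are kept.
    """
    words = {word.strip(string.punctuation) for word in text.lower().split()}
    return words - {""}

def matches(tweet_text, keywords=KEYWORDS):
    """Return the keywords whose terms are all present in the tweet."""
    present = tokens(tweet_text)
    return [kw for kw in keywords if tokens(kw) <= present]

print(matches("Great goal by Messi! #ForzaMilan"))
print(matches("Barça vs Milan tonight"))
```

Note that "barca" and "barça" are different terms in this sketch, which is why the keyword list carries both spellings.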
To perform this monitoring, we developed a small console application written in C# and based on this code written by Shannon Whitley (@swhitley). We stored the tweets in a Microsoft SQL Server database. For this test we decided not to use MSMQ queues.
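The read-and-store loop can be sketched as follows. This is not our C# application: it is a minimal Python stand-in that uses the standard `sqlite3` module in place of SQL Server, and it assumes the stream arrives as one JSON message per line, with blank keep-alive lines and occasional non-tweet messages. The table layout and the `open_store` and `store_stream` names are hypothetical.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open a SQLite database and create an illustrative tweets table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, screen_name TEXT, created_at TEXT, text TEXT)"
    )
    return db

def store_stream(lines, db):
    """Parse newline-delimited JSON messages and store the tweets.

    Returns the number of tweet messages read. Duplicate ids are ignored
    by the database, so the table may hold fewer rows than that.
    """
    seen = 0
    for line in lines:
        line = line.strip()
        if not line:                 # blank keep-alive line
            continue
        message = json.loads(line)
        if "text" not in message:    # not a tweet (for example a delete notice)
            continue
        db.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
            (message["id"], message["user"]["screen_name"],
             message["created_at"], message["text"]),
        )
        seen += 1
    db.commit()
    return seen
```

In a real run, `lines` would be the lines read from the open streaming connection. Writing straight to the database, as here, mirrors our decision not to put a queue between the reader and the storage for this test.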
The result was great: we stored 83,582 tweets in our database during the 172 minutes that the connection to the Twitter API was open, which means an average of about 8.1 tweets per second.
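For reference, the throughput arithmetic is worked through below in a short Python sketch. The totals come from this post, including the busiest minute (1,791 tweets) reported in the facts further down. The variable names are only illustrative.

```python
TOTAL_TWEETS = 83_582        # tweets stored during the test
MINUTES_OPEN = 172           # minutes the streaming connection was open
PEAK_MINUTE_TWEETS = 1_791   # tweets received in the busiest minute

average_per_second = TOTAL_TWEETS / (MINUTES_OPEN * 60)
peak_per_second = PEAK_MINUTE_TWEETS / 60

print(f"average: {average_per_second:.2f} tweets/s")      # average: 8.10 tweets/s
print(f"peak minute: {peak_per_second:.2f} tweets/s")     # peak minute: 29.85 tweets/s
print(f"peak/average: {peak_per_second / average_per_second:.1f}x")  # peak/average: 3.7x
```

The gap between the average and the peak is the number that matters for sizing: a system that only handles the average rate would fall behind during the busiest minutes.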
The next phase of testing will look at how to manage the gathered information so that it can be displayed in different ways on a website, a mobile app or a tablet. We must decide whether a relational database, denormalized for better performance, is sufficient, or whether it is better to choose a platform better prepared to work with big data, such as Hadoop.
Another challenge of this project will be managing access to the standard Twitter API to obtain additional information about each user once he or she has posted a first tweet, because of the rate limit restrictions.
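One possible way to stay within a rate limit is to look up each user only once and to release lookups in batches that respect a call budget. The Python sketch below illustrates the idea. The `UserLookupQueue` class is hypothetical, and the budget values are placeholders, not Twitter's actual limits.

```python
from collections import deque

class UserLookupQueue:
    """Queue user lookups so that at most max_calls happen per time window."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls            # placeholder budget, not a real limit
        self.window_seconds = window_seconds
        self.pending = deque()                # users waiting for a lookup
        self.known = set()                    # users already queued or fetched
        self.call_times = deque()             # timestamps of recent lookups

    def saw_tweet_from(self, user_id):
        """Queue a lookup the first time a user appears; ignore repeats."""
        if user_id not in self.known:
            self.known.add(user_id)
            self.pending.append(user_id)

    def next_lookups(self, now):
        """Return the user ids that can be fetched at time `now` (in seconds)."""
        # Forget calls that have fallen out of the current window.
        while self.call_times and now - self.call_times[0] >= self.window_seconds:
            self.call_times.popleft()
        batch = []
        while self.pending and len(self.call_times) < self.max_calls:
            batch.append(self.pending.popleft())
            self.call_times.append(now)
        return batch

queue = UserLookupQueue(max_calls=2, window_seconds=60)
for user in ["a", "b", "a", "c"]:
    queue.saw_tweet_from(user)
print(queue.next_lookups(now=0))    # ['a', 'b']
print(queue.next_lookups(now=30))   # []
print(queue.next_lookups(now=60))   # ['c']
```

With 54,284 unique users in a single match, the pending queue can grow much faster than it drains, so user information will necessarily lag behind the tweets themselves.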
Although the analysis of the tweets was not the purpose of this test, here we show you some interesting facts:
- Distribution of tweets per minute during the match. The game started at 18:45 (GMT +0:00), half time ran from 19:30 to 19:45 (GMT +0:00) and the game ended at 20:35 (GMT +0:00).
- “Golden minute”: 20:37 (GMT +0:00), just after the game, with 1,791 tweets. That is almost 30 tweets per second.
- Unique users: 54,284. An average of 1.5 tweets per person.
- User with the most tweets (excluding RTs): @Rossonerosemper, 73 tweets.
- Most retweeted tweet (during the game): 69 retweets, posted by ACMilan.com.br (@acmilanoficial) on March 28, 2012.
- Geolocated tweets: 1,177, only 1.4% of the total.
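Figures like the ones above can be derived from the stored tweets with a few counters. The following Python sketch shows one way to do it. The record layout (`user`, `created_at` and `geo` keys) and the `summarize` function are assumptions for illustration, not the schema of our database.

```python
from collections import Counter
from datetime import datetime

def summarize(tweets):
    """Compute per-minute and per-user statistics.

    Each tweet is a dict with 'user' (str), 'created_at' (datetime)
    and 'geo' (bool). This layout is hypothetical.
    """
    tweets = list(tweets)
    if not tweets:
        raise ValueError("no tweets to summarize")
    # Truncate each timestamp to the minute to build the distribution.
    per_minute = Counter(
        t["created_at"].replace(second=0, microsecond=0) for t in tweets
    )
    per_user = Counter(t["user"] for t in tweets)
    golden_minute, golden_count = per_minute.most_common(1)[0]
    top_user, top_user_count = per_user.most_common(1)[0]
    geolocated = sum(1 for t in tweets if t["geo"])
    return {
        "golden_minute": golden_minute,
        "golden_minute_tweets": golden_count,
        "unique_users": len(per_user),
        "tweets_per_user": len(tweets) / len(per_user),
        "top_user": (top_user, top_user_count),
        "geolocated_share": geolocated / len(tweets),
    }

# Made-up sample records, only to show the call.
sample = [
    {"user": "fan1", "created_at": datetime(2012, 3, 28, 20, 37, 5), "geo": False},
    {"user": "fan2", "created_at": datetime(2012, 3, 28, 20, 37, 40), "geo": True},
    {"user": "fan1", "created_at": datetime(2012, 3, 28, 20, 36, 59), "geo": False},
    {"user": "fan1", "created_at": datetime(2012, 3, 28, 20, 37, 59), "geo": False},
]
print(summarize(sample))
```

Applied to the real totals, the same ratios give 83,582 / 54,284 ≈ 1.54 tweets per user and 1,177 / 83,582 ≈ 1.4% geolocated tweets.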
If anyone is interested in the data, here you can download a tab-delimited text file with the 83,582 tweets.