With this article, I am starting a small blog series called “Twitter analytics”. My goal is to see whether it is possible to discover meaningful social media insights by applying data gathering and data analytics to the social network Twitter. This is the first part of the series, where I talk about how fetching and storing data from Twitter works and why it is way harder than I thought.
Twitter data sources
How do you get a good data set of tweets? This was the first question that I had to ask myself at the beginning of this project. The solution seemed quite simple: use an already existing data set to dodge the hassle of fetching and cleaning up the data myself.
That was easier said than done. I spent a few days looking through different data sets and sources, and not a single one fit my needs. They were either outdated, only usable for one specific kind of analysis, or they did not have the features that I needed. So I decided to get my data right from the source by asking Twitter itself.
Creating a tweet downloader
Getting data from the Twitter API was also quite a hassle. First of all, I had to work with the rate limits, which meant that I could only build my data set step by step over multiple requests. Furthermore, there is a distinction between retrieving recent tweets (from a week ago until today) and retrieving archived tweets (reaching back all the way to 2006). It was pretty clear to me that I was not able to handle those restrictions manually, so I spent a week building a framework for downloading Twitter data.
The downloader automatically splits my queries into a set of requests and then fetches the data from Twitter step by step. I made sure to append the results of every single request immediately to a CSV file. By doing that, I was able to work with the already fetched tweets even if the download ran into an error halfway through. This also allowed me to run my analytics code on the data set while it was still downloading. I would often check on a download at 5000 tweets to see whether the data set fit my needs and then decide to continue or stop the download. This is the frontend that I created for the downloader:
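The core of such a download loop can be sketched roughly like this. Note that this is a simplified illustration, not my actual implementation: `fetch_page` stands in for the real Twitter API call (which needs authentication and pagination tokens), and the column names are made up for the example.

```python
import csv
import time

def download(fetch_page, out_path, max_tweets, delay=2.0):
    """Fetch tweets page by page and append every page to a CSV file
    immediately, so partial results survive a crash mid-download.

    fetch_page(token) must return (list_of_tweet_dicts, next_token),
    with next_token == None when there are no more results.
    """
    fetched = 0
    next_token = None
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["id", "text", "likes", "retweets", "replies"]
        )
        writer.writeheader()
        # May overshoot max_tweets by up to one page, which is fine here.
        while fetched < max_tweets:
            page, next_token = fetch_page(next_token)
            for tweet in page:
                writer.writerow(tweet)
            f.flush()  # the file is usable even if a later request fails
            fetched += len(page)
            if next_token is None:  # no more results available
                break
            time.sleep(delay)  # crude way to stay under the rate limit
    return fetched
```

Because every page is flushed to disk right away, an analysis script can read the CSV while the download is still running.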
The big table on the bottom gives an overview of the data sets that I previously downloaded. This is quite handy, since I can easily see and compare the metadata of every data set. The keyword, the number of tweets in the data set, the timeframe and the minimum engagement rate are the key metrics that determine the content and quality of my data sets. I will talk more about them in the next part.
Parameters that I used for downloading
On the top left, you can see the UI for creating a new data set. It allows me to set the parameters “keyword”, “minimum engagement” and the “amount of tweets”.
- The “keyword” parameter is pretty simple: it specifies that only tweets containing the keyword should be fetched. I use this parameter to set the topic that the tweets should be about.
- By using the “minimum engagement” parameter, I can set a minimum amount of likes, retweets and replies a tweet must have to be stored in the data set. This is a key metric because it allows me to cut out the huge pool of tweets that nearly no one interacted with. The tweets that got no or only very few interactions are not very interesting for my analysis, so storing them is just unnecessary API and memory usage. Furthermore, I can capture longer timeframes with fewer API calls by setting a high minimum engagement number.
- Lastly, I used the “amount of tweets” parameter to tell my downloader when the data set is complete and it can stop issuing new requests.
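To make the three parameters concrete, here is a small sketch of how a tweet could be checked against them after fetching. This is only an illustration with made-up field names; in practice, the keyword filter can also be pushed into the API query itself, so that unwanted tweets never get downloaded.

```python
def passes_filters(tweet, keyword, min_engagement):
    """Decide whether a fetched tweet belongs in the data set.
    "Engagement" here means the sum of likes, retweets and replies."""
    engagement = tweet["likes"] + tweet["retweets"] + tweet["replies"]
    return (
        keyword.lower() in tweet["text"].lower()
        and engagement >= min_engagement
    )
```

The “amount of tweets” parameter then simply acts as a stop condition for the download loop once enough tweets have passed this filter.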
Implementing analytics into the framework
Originally, I intended to only implement the downloader and perform the data analysis in a Jupyter notebook. But since I had already created a nice dashboard, I tried to integrate the analytical process into it. As you can see in the previous screenshot, my dashboard contains a function called “Run”. Clicking on this button sends the selected CSV to a Python script, which then performs different analytical functions on the data and stores the results as JSON files. I will dive deeper into the analytics code in the next part of this series.
Implementing visualization into the framework
Third time's the charm: the last thing I needed in my framework was a way to visualize the results of my Python scripts. Therefore, I created a small frontend, which reads the results out of the JSON files and displays them nicely. The two main visualization techniques I used were HTML tables and diagrams created with the Plotly library.
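For the HTML-table side of the visualization, the idea can be sketched with nothing but the standard library: read a JSON result file and turn each key/value pair into a table row. This is an illustrative sketch, not the actual frontend code.

```python
import json
from html import escape

def results_to_table(json_path):
    """Render an analytics result file as a simple HTML table."""
    with open(json_path, encoding="utf-8") as f:
        results = json.load(f)
    rows = "".join(
        f"<tr><td>{escape(str(key))}</td><td>{escape(str(value))}</td></tr>"
        for key, value in results.items()
    )
    return (
        "<table><tr><th>Metric</th><th>Value</th></tr>" + rows + "</table>"
    )
```

The Plotly diagrams work along the same lines: the frontend loads a JSON result file and passes the values to a chart instead of a table.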
Technical overview of my framework
Last but not least, here is a full overview of the framework that I have created. I am very much looking forward to using this framework in the future for some interesting Twitter analytics – so stay tuned for part 2!