Data blog: October 2016

I've been meaning to post an update for a while, and here is one! I attended some meetups recently, and decided to do some Twitter data-scraping and analysis. It is all really easy to do: you just sign up for a Twitter developer account, get your API keys, and voila! And what better subject to analyse than US politics...

Wrong!

As I'm mostly a Python guy, I used the amazingly simple twitter library to connect to the streaming API. This API outputs a sample of what is currently being tweeted (so no messing with history etc.). It lets you filter by content, so I left a machine running for 40 hours (with an upper limit of 500GB in case there were a LOT of tweets), searching for tweets containing 'debate' (the match is case-insensitive and includes @debate and #debate, though not #DebateNight). I started the data collection the morning before the debate (9:45am LA time on October 18th) and finished it the night after (1:45am on October 20th).
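To make the collection step concrete, here is a minimal sketch of such a loop. It is not my actual script: the function name `collect` and the fake stream are made up for illustration, and the loop accepts any iterable of tweet dictionaries so it can run without API keys. It writes one JSON object per line and stops at a time limit or a size cap, whichever comes first.

```python
import io
import json
import time


def collect(stream, out, max_seconds, max_bytes):
    """Write tweets from `stream` to `out`, one JSON object per line.

    Stops when the stream ends, the time limit passes, or the size cap
    would be exceeded. Returns the number of tweets written.
    """
    deadline = time.monotonic() + max_seconds
    written = 0
    count = 0
    for tweet in stream:
        if time.monotonic() >= deadline:
            break
        # Streams can also carry non-tweet messages (e.g. delete notices); skip them.
        if "text" not in tweet:
            continue
        line = json.dumps(tweet) + "\n"
        size = len(line.encode("utf-8"))
        if written + size > max_bytes:
            break
        out.write(line)
        written += size
        count += 1
    return count


# Stand-in for the live stream (hypothetical data), so the sketch runs offline.
fake_stream = [
    {"text": "Watching the #debate"},
    {"delete": {}},
    {"text": "Debate night!"},
]
buffer = io.StringIO()
n_saved = collect(fake_stream, buffer, max_seconds=40 * 3600, max_bytes=500 * 10**9)
```

With the twitter library, the live stream would be, as far as I recall its interface, something like `TwitterStream(auth=OAuth(token, token_secret, consumer_key, consumer_secret)).statuses.filter(track="debate")`, and `out` would be a file opened for writing instead of an in-memory buffer.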

The raw data file ended up being about 12.5GB, or just over 2M tweets, so it still fits into the 'medium data' category (i.e. a number-crunching server that is 4 years old can comfortably fit it in memory). I was hoping to use tools like Spark, and maybe I will try to convert my code into that, but there was no need. You can check out (!) my code here.

What can we do with Twitter data? NLTK (Natural Language ToolKit) now has a TweetTokenizer class which can tokenise hashtags, @ mentions, emoji and unicode pretty well (though mostly in English). Then we can count frequent terms and, for example, generate these word clouds (using wordcloud) of the most frequent words following 'trump' or 'clinton':

Guess which one is which.
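For the curious, counting the words that follow a name can be sketched roughly like this. The helper name `words_following` and the sample tweets are made up for illustration, and the regex fallback is only there so the snippet runs without NLTK installed; it is much cruder than TweetTokenizer.

```python
import re
from collections import Counter

try:
    from nltk.tokenize import TweetTokenizer

    # preserve_case=False lowercases tokens, so 'Trump' and 'trump' match.
    tokenize = TweetTokenizer(preserve_case=False).tokenize
except ImportError:
    def tokenize(text):
        # Crude stand-in: words, optionally prefixed with # or @.
        return re.findall(r"[#@]?\w+", text.lower())


def words_following(texts, target):
    """Count the tokens that come directly after `target` in each text."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


# Made-up sample tweets, not taken from the collected data.
sample = ["Trump wins", "trump wins again", "Clinton speaks", "Did Trump interrupt"]
after_trump = words_following(sample, "trump")
```

The resulting counter can then be handed to the wordcloud package, for example via `WordCloud().generate_from_frequencies(after_trump)`.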

A slightly fancier thing would be to perform sentiment analysis on the tweets. I initially stumbled upon Marco Bonzanini's series of posts about Twitter analytics, but quick testing showed that I couldn't find a good seed dictionary for calculating sentiment. After some searching I found this tool for visualising sentiment of tweets by Christopher G. Healey and R. Siddarth Shankar (from North Carolina uni, how fitting), who in turn use sentiment data gathered on MTurk and presented in this paper (available to download).

What I thought was quite smart is the way Healey et al. average sentiment. The database of sentiment mentioned above contains a mean and standard deviation of valence (positive/negative affect) for each word, estimated from many measurements. They take both of these and evaluate the normal distribution pdf at the mean, which works out to 1/(sd * sqrt(2*pi)), so words whose ratings have a smaller standard deviation get a larger value. This gives a measure of certainty about valence for each word, which is then used as a weight in averaging the sentiment for the whole tweet.
I only analysed sentiment for tweets with >= 2 words in the database, which excluded ~200k tweets (10%) from analysis.
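Here is a minimal sketch of that weighted average as I understand it, including the cut-off of at least 2 rated words. The tiny lexicon and its numbers are invented for illustration (the real database is far larger), and the function name `tweet_valence` is my own.

```python
import math

# Hypothetical lexicon: word -> (mean valence, standard deviation).
# These numbers are made up for illustration, not taken from the real database.
VALENCE = {
    "happy": (8.0, 1.0),
    "win": (7.0, 2.0),
    "liar": (2.0, 1.0),
}


def tweet_valence(tokens, lexicon=VALENCE, min_words=2):
    """Certainty-weighted mean valence, or None if too few words are rated."""
    rated = [lexicon[t] for t in tokens if t in lexicon]
    if len(rated) < min_words:
        return None
    # Normal pdf evaluated at its own mean: 1 / (sd * sqrt(2 * pi)).
    # Assumes every standard deviation is strictly positive.
    weights = [1.0 / (sd * math.sqrt(2 * math.pi)) for _, sd in rated]
    weighted_sum = sum(w * mean for w, (mean, _) in zip(weights, rated))
    return weighted_sum / sum(weights)


score = tweet_valence(["happy", "win", "tonight"])
skipped = tweet_valence(["happy", "tonight"])
```

In this toy example 'happy' has half the standard deviation of 'win', so it counts twice as much and pulls the score towards 8. The second tweet has only one rated word, so it is left out (returns `None`).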

We can see that:

Both candidates are mentioned about equally frequently during the debate (grey highlight on the right), though Donald has a few more mentions in the night (Twitter wars?)

The spike in mentions of the candidates seems to be during the debate, though the peak in tweets mentioning both at the same time comes a few hours after (comparisons?).

The sentiment of tweets which mention the candidates is lower (more 'angry'/'bad'/'racist') than in tweets not mentioning them (note: the line is a smooth curve fitted by ggplot2, using a Generalized Additive Model):