tag:blogger.com,1999:blog-81730816566928995402024-03-14T09:13:29.504+00:00Data blogLukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.comBlogger5125tag:blogger.com,1999:blog-8173081656692899540.post-2386456787650633432016-10-21T18:19:00.002+01:002016-10-21T18:19:58.377+01:00[5] Tweeting the debateI've been meaning to post an update for a while, and here is one! I attended some meetups recently and decided to do some Twitter data-scraping and analysis. It is all really easy to do - you just sign up for a <a href="https://dev.twitter.com/" target="_blank">Twitter developer account</a>, get your API keys, and voilà! And what better subject to analyse than US politics...<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEH9N8xjPsQp6g23UTfHw-Lqurc9Q_f7RU_OJnDnFxaIhqW_MADfbqKzJ8E3VxRngGdiUgTuqo-k4VQOBrqg1xvOmrV64VqpzxGkVudRkLSzTBQBg1C1NGNQqrEQmo3EjZEwaT3vtrZiAl/s1600/dVf2RdQ+-+Imgur.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="179" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEH9N8xjPsQp6g23UTfHw-Lqurc9Q_f7RU_OJnDnFxaIhqW_MADfbqKzJ8E3VxRngGdiUgTuqo-k4VQOBrqg1xvOmrV64VqpzxGkVudRkLSzTBQBg1C1NGNQqrEQmo3EjZEwaT3vtrZiAl/s320/dVf2RdQ+-+Imgur.gif" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Wrong!</td></tr>
</tbody></table>
As I'm mostly a Python guy, I used the amazingly simple <a href="https://github.com/sixohsix/twitter" target="_blank">twitter</a> library to connect to the streaming API. This API outputs a sample of what is being tweeted right now (so no messing with history etc.). It lets you filter by content, so I left a machine running for 40 hours (with an upper limit of 500GB in case there were a LOT of tweets), searching for tweets containing '<b>debate</b>' (the match is case-insensitive and includes @debate and #debate, though not #DebateNight). I started collecting data on the morning of the day before the debate (9:45 LA time on October 18th) and finished the night after it (1:45am on October 20th).<br />
<br />
The raw data file ended up being about 12.5GB, or just over 2M tweets, so it still fits into the 'medium data' category (i.e. a number-crunching server that is 4 years old can comfortably fit it in memory). I was hoping to use tools like <a href="http://spark.apache.org/" target="_blank">Spark</a>, and maybe I will try to convert my code into that, but there was no need. <a href="https://bitbucket.org/ktokolwiek/debate_twitter" target="_blank">You can check out (!) my code here</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
What can we do with twitter data? <a href="http://www.nltk.org/" target="_blank">NLTK (Natural Language ToolKit)</a> now has a <a href="http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual" target="_blank">TweetTokenizer class</a> which can tokenise hashtags, @ mentions, emoji and unicode pretty well (though in English mostly). Then we can count frequent terms, and for example generate these word clouds (using <a href="https://github.com/amueller/word_cloud/" target="_blank">wordcloud</a>), for most frequent words following 'trump' or 'clinton':<br />
<span id="goog_252216906"></span><span id="goog_252216907"></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkRzN0VZHRSsFCBbENgWf31yz5XD6VtN5VwbE_f7paBY_pp_4EezL85wtV-XTz8Rd6VG3qNIsnHyzWtZpByQlKj7yHRl9gP-20skxoYD_iTKWkR28UVNeDzW1nCisXO-xiBallEXYLAxPg/s1600/trump.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="265" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkRzN0VZHRSsFCBbENgWf31yz5XD6VtN5VwbE_f7paBY_pp_4EezL85wtV-XTz8Rd6VG3qNIsnHyzWtZpByQlKj7yHRl9gP-20skxoYD_iTKWkR28UVNeDzW1nCisXO-xiBallEXYLAxPg/s320/trump.png" width="320" /></a></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibDgIk9gVDBuKxEIEPCinbDsvh0I19vHpZ1G4uY7wHHNa1MOk46ho1M46O9pkvn_9_nwIsAXWABcV1QzLoKfDgUEPP7pMVtqDYRlHgX8ZDrmCyz7PpPAe5MlsP-JVmd7xzgRndYgLyuwQS/s1600/clinton.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibDgIk9gVDBuKxEIEPCinbDsvh0I19vHpZ1G4uY7wHHNa1MOk46ho1M46O9pkvn_9_nwIsAXWABcV1QzLoKfDgUEPP7pMVtqDYRlHgX8ZDrmCyz7PpPAe5MlsP-JVmd7xzgRndYgLyuwQS/s320/clinton.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Guess which one is which.</td></tr>
</tbody></table>
<div>
<br /></div>
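Here is a minimal sketch of the "most frequent words following a name" count behind clouds like these. It is not the code from the repository: the regex tokenizer is a simplified stand-in for NLTK's TweetTokenizer, and the function name and sample tweets are made up for illustration.

```python
import re
from collections import Counter

# Simplified stand-in for a tweet tokenizer: keeps #hashtags and @mentions whole.
TOKEN_RE = re.compile(r"[#@]?\w+")


def words_following(tweets, target):
    """Count the tokens that appear directly after `target` (case-insensitive)."""
    counts = Counter()
    for text in tweets:
        tokens = TOKEN_RE.findall(text.lower())
        # Pair each token with its successor and keep the successors of `target`.
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


# Hypothetical tweets, only to show the shape of the result.
sample = [
    "Trump said wrong again #debate",
    "Clinton said a lot, Trump said little",
    "trump interrupted the #debate",
]
print(words_following(sample, "trump").most_common(2))
```

A frequency table like this should be usable as input to wordcloud's `generate_from_frequencies`, though in practice you would probably want to drop stop words first.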
A slightly fancier thing would be to perform sentiment analysis on the tweets. I initially stumbled upon <a href="https://marcobonzanini.com/2015/05/17/mining-twitter-data-with-python-part-6-sentiment-analysis-basics/" target="_blank">Marco Bonzanini's series of posts about Twitter analytics</a>, but after some quick testing I couldn't find a good seed dictionary for calculating sentiment. After some searching I found <a href="https://www.csc.ncsu.edu/faculty/healey/tweet_viz/" target="_blank">this tool for visualising sentiment of tweets by Christopher G. Healey and R. Siddarth Shankar</a> (from North Carolina State University, how fitting), who in turn use sentiment data gathered on MTurk and presented in <a href="http://link.springer.com/article/10.3758%2Fs13428-012-0314-x," target="_blank">this paper</a> (available to download).<br />
<br />
What I thought was quite smart is the way Healey et al. average sentiment. The database of sentiment mentioned above contains a mean and standard deviation of valence (positive/negative affect) for each word, estimated from many ratings. For each word they evaluate the normal distribution pdf at the mean itself, which works out to 1/(σ√(2π)) and so depends only on the standard deviation: the more the raters agreed, the larger the value. This gives a measure of certainty about the valence of each word, which is then used as a weight when averaging the sentiment over the whole tweet.<br />
I only analysed sentiment for tweets with >= 2 words in the database, which excluded ~200k tweets (10%) from analysis.<br />
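The weighting can be sketched in a few lines. This is my reading of the description above, not Healey et al.'s actual code; the lexicon layout (word mapped to a mean and standard deviation), the function names and the toy values are assumptions for illustration.

```python
import math


def certainty_weight(sd):
    """Normal pdf evaluated at its own mean: 1 / (sd * sqrt(2 * pi))."""
    return 1.0 / (sd * math.sqrt(2.0 * math.pi))


def tweet_valence(tokens, lexicon, min_words=2):
    """Certainty-weighted mean valence, or None if too few words are rated.

    `lexicon` maps a word to (mean valence, standard deviation); this layout
    is assumed here, the real database may be organised differently.
    """
    rated = [lexicon[t] for t in tokens if t in lexicon]
    if len(rated) < min_words:
        return None  # mirrors the ">= 2 words in the database" rule
    weights = [certainty_weight(sd) for _, sd in rated]
    weighted_sum = sum(w * mean for w, (mean, _) in zip(weights, rated))
    return weighted_sum / sum(weights)


# Toy lexicon with made-up values: "good" is rated more consistently than "bad".
lexicon = {"good": (7.0, 1.0), "bad": (2.0, 2.0)}
print(tweet_valence(["good", "bad", "debate"], lexicon))
```

Because "good" has half the standard deviation of "bad", it gets twice the weight, so the result sits closer to 7 than a plain average (4.5) would.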
<br />
We can see that:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG2FEf6JoL1fYUJzqRY68WQ8mi5evdyJuxOGOIYhKib6LSD34gxYNuegOBoObg62uvzg8nyGu79bRgXud4qm-BKSVlkqIBz-8qOMKarDHHjGGerv8NSdGPCup32o7jJgjP8dofMXDkXs0O/s1600/hillary_freq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG2FEf6JoL1fYUJzqRY68WQ8mi5evdyJuxOGOIYhKib6LSD34gxYNuegOBoObg62uvzg8nyGu79bRgXud4qm-BKSVlkqIBz-8qOMKarDHHjGGerv8NSdGPCup32o7jJgjP8dofMXDkXs0O/s320/hillary_freq.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMsZD5QTVVPC3wWD39D7JWHlmULLgMQrbT1cTF_nfj6KWHfOjoj_SwNWCEwbZuwTYlojjs22D9kryqK4wqf7rem2jweKsY-lGRFTr5zpoMW9mJlGGdOXJ5K6vP1XjOXYeWAUSIz1FebUxp/s1600/donald_freq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMsZD5QTVVPC3wWD39D7JWHlmULLgMQrbT1cTF_nfj6KWHfOjoj_SwNWCEwbZuwTYlojjs22D9kryqK4wqf7rem2jweKsY-lGRFTr5zpoMW9mJlGGdOXJ5K6vP1XjOXYeWAUSIz1FebUxp/s320/donald_freq.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
Both candidates are mentioned about equally often during the debate (grey highlight on the right), though Donald gets a few more mentions during the night (Twitter wars?).</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicUjJmuz46gLjARdvcNzp1_w_h1uI8uxyyo_zS5lnNhWn54F1atbSYcdI_RgBezepgdQnot4twEx_diS-Gav9lkBNonwbJzF88LT3X2WriDViXUsJANkmELj1OqaUUkXKzMEPvGj8i-goH/s1600/both_freq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicUjJmuz46gLjARdvcNzp1_w_h1uI8uxyyo_zS5lnNhWn54F1atbSYcdI_RgBezepgdQnot4twEx_diS-Gav9lkBNonwbJzF88LT3X2WriDViXUsJANkmELj1OqaUUkXKzMEPvGj8i-goH/s320/both_freq.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The spike in mentions of either candidate seems to come during the debate, though mentions of <b>both at the same time</b> peak a few hours later (comparisons?).</div>
<div class="separator" style="clear: both; text-align: left;">
The sentiment of tweets which mention candidates is lower (more 'angry'/'bad'/'racist') than in tweets not mentioning them (note: the line is a smooth curve fitted by <a href="http://docs.ggplot2.org/current/geom_smooth.html" target="_blank">ggplot2</a>, using a <a href="https://cran.r-project.org/web/packages/gam/gam.pdf" target="_blank">Generalized Additive Model</a>):</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg22eGNKlTc4PVH1MWV92uxVfXKFAThge42KKp_KSCuRGRRinAprj5ft5JGzCl5fr2gzoRukFJWzHTh0SaGhGJJhDfN7TSMq-kY6gvQgloat0HLVFUD35xHY9LIyjIJgiEfhPpGsnReEOwz/s1600/hillary_sentiment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg22eGNKlTc4PVH1MWV92uxVfXKFAThge42KKp_KSCuRGRRinAprj5ft5JGzCl5fr2gzoRukFJWzHTh0SaGhGJJhDfN7TSMq-kY6gvQgloat0HLVFUD35xHY9LIyjIJgiEfhPpGsnReEOwz/s320/hillary_sentiment.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXnG-tlb2b9n37GU0P6PL9GBVGO49BhjvvJKxt9ItfImhJuI4jcv5gC9A-vgihOT_9lHkNe_z82paZOqSGkGLZa6uRKu87fIhy39RTB8DU1V3r-at2Ag2msh17l4r1_velun1P5Y8CGFqe/s1600/donald_sentiment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXnG-tlb2b9n37GU0P6PL9GBVGO49BhjvvJKxt9ItfImhJuI4jcv5gC9A-vgihOT_9lHkNe_z82paZOqSGkGLZa6uRKu87fIhy39RTB8DU1V3r-at2Ag2msh17l4r1_velun1P5Y8CGFqe/s320/donald_sentiment.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
Not sure if the rise in sentiment for tweets mentioning Trump is a blip, or maybe <a href="https://www.thestar.com/news/world/2016/10/19/donald-trump-said-19-false-things-on-tuesday-oct-18.html" target="_blank">it's because he lied only a tiny bit that day</a>?</div>
<div class="separator" style="clear: both; text-align: left;">
However, it seems that the more people mention any candidate, the worse they feel about them. Even more so if they mention both candidates:</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGvHroJKaL_lRsuu3fWkMXgyYwrgraM-Bp5w2LQUX8hcSmzxW0XUuwOP2EghyyVfUiYOJTcxP57_YT_ylS5wacRo8SV23ZG5J9RlT64JjbOIHaQJ6FgqTO6zW5xuxAPKwm781lHtA8zKvT/s1600/both_sentiment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGvHroJKaL_lRsuu3fWkMXgyYwrgraM-Bp5w2LQUX8hcSmzxW0XUuwOP2EghyyVfUiYOJTcxP57_YT_ylS5wacRo8SV23ZG5J9RlT64JjbOIHaQJ6FgqTO6zW5xuxAPKwm781lHtA8zKvT/s320/both_sentiment.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
But hey, I can see some hombres on the horizon!</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnlZ_rIvYhPUrUKjD9xcaAsqbAl9qKA7gboBP-9EcT7vkGm_xdMxAoSGgMYp1HJFmar_ZIj-BqugGVp37_-u3vaq6Iir1qr16RlN4DAHCAl40pAZJ2-TXoRmYm-d0LG061xDkTafV9cPQl/s1600/hombres_freq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnlZ_rIvYhPUrUKjD9xcaAsqbAl9qKA7gboBP-9EcT7vkGm_xdMxAoSGgMYp1HJFmar_ZIj-BqugGVp37_-u3vaq6Iir1qr16RlN4DAHCAl40pAZJ2-TXoRmYm-d0LG061xDkTafV9cPQl/s320/hombres_freq.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
And they are bad! <b>Tremendously</b> bad!</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfSf4O3mSqjjEU_zF12YL0MUr7lqWi0BMzQuCZu6tTC5_RdraADhwTBRPTfLYU2X5hNvaImNnWnwgBHybmc9n5uhdZDQPMujXYbj_aZSD7hMJxbwS9eZn-Pkuv5wZ2WuC2POtp7DCY2yZO/s1600/hombres_sentiment.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfSf4O3mSqjjEU_zF12YL0MUr7lqWi0BMzQuCZu6tTC5_RdraADhwTBRPTfLYU2X5hNvaImNnWnwgBHybmc9n5uhdZDQPMujXYbj_aZSD7hMJxbwS9eZn-Pkuv5wZ2WuC2POtp7DCY2yZO/s320/hombres_sentiment.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
Comments? Leave a comment, check out my source code, give me a shout!Lukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.com0tag:blogger.com,1999:blog-8173081656692899540.post-26649655925349292912015-05-08T16:06:00.002+01:002015-05-08T16:06:58.370+01:00[4] FPTP bias - who's gained and lost the most in GE2015 due to FPTP?The First-Past-The-Post method (as currently used e.g. in the UK parliamentary elections) has been heavily criticised (<a href="http://link.springer.com/chapter/10.1007%2F978-0-387-09720-6_3">a book</a>, <a href="https://youtu.be/s7tWHJfhiyo">a very accessible video from CGP Grey</a>), and understandably so, as it creates various problems with tactical voting, and the views of people not being accurately represented. There is a <a href="http://www.electoral-reform.org.uk/">society</a> which campaigns for electoral reform in the UK.<br />
<br />
But let's look at the data: which party gained or lost the most due to FPTP, compared with proportional representation, in the 2015 General Election in the UK? What follows is my quick-and-dirty analysis.<br />
<br />
<a href="http://lukaszkopec.com/upload/fptp.zip">[download the data and code]</a><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9wy1A3aViy5giSb-C_XHrCBH3Z0zCxQ03rwq1fRCNKw3MBxS3lYGngzTYwWcbWrg0RUaOHRqYfCuZhq8RAh0AmD2TwNiqnOHf0aC7ShuB5P0ZaGGa0On0GQ30aOcIiyEOpkKhAiRLEDTo/s1600/loss_gain.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9wy1A3aViy5giSb-C_XHrCBH3Z0zCxQ03rwq1fRCNKw3MBxS3lYGngzTYwWcbWrg0RUaOHRqYfCuZhq8RAh0AmD2TwNiqnOHf0aC7ShuB5P0ZaGGa0On0GQ30aOcIiyEOpkKhAiRLEDTo/s1600/loss_gain.png" /></a></div>
The method I used is simple: I first calculated the proportion of seats each party won in the last election. Then I subtracted the party's share of votes from it. To make the result comparable across parties, I divided that difference by the share of votes.<br />
<br />
UKIP is the biggest loser here, getting just one Commons seat with a 12.7% vote share (~98.8% loss). The Democratic Unionist Party (NI) won more than twice as many seats as its vote share would suggest (~105.1% gain).<br />
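The calculation can be written out as a short sketch. The function name is made up, and the inputs are partly assumptions: 650 is the total number of Commons seats, the UKIP figures are the ones quoted above, and the DUP figures (8 seats on roughly 0.6% of the vote) are filled in from memory rather than taken from the downloadable data.

```python
def fptp_bias(seats_won, total_seats, vote_share):
    """Relative gain (+) or loss (-) of seat share compared with vote share."""
    seat_share = seats_won / total_seats
    return (seat_share - vote_share) / vote_share


TOTAL_SEATS = 650  # seats in the House of Commons in 2015

# UKIP: 1 seat on a 12.7% vote share (figures quoted in the post).
ukip = fptp_bias(1, TOTAL_SEATS, 0.127)
# DUP: 8 seats on about 0.6% of the vote (assumed, see the note above).
dup = fptp_bias(8, TOTAL_SEATS, 0.006)

print(f"UKIP: {ukip:+.1%}")
print(f"DUP:  {dup:+.1%}")
```

A value of -1 would mean a party got votes but no seats at all, 0 means perfectly proportional, and anything above +1 means more than double the proportional seat share.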
<br />
There is no obvious relation between the vote share (or seat share) and how much a party gains or loses, so no obvious advantage to small / large parties. However, small regional parties (vide: SNP victory this election) seem to benefit the most.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_GFAbRQbv-CCFmvQb__pVFSY5pJgaKM2GP78m8-DwiII9hY0cYOVkXCHYZfFLU4M-WqDtL1jZoylfe2fw1omB8QLbRIYo7kshpb9gTw2hVOewnH_8K8j4qn_0IQB2MJ3rnqpLSl-aAZnj/s1600/seats_votes_loss_gain.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_GFAbRQbv-CCFmvQb__pVFSY5pJgaKM2GP78m8-DwiII9hY0cYOVkXCHYZfFLU4M-WqDtL1jZoylfe2fw1omB8QLbRIYo7kshpb9gTw2hVOewnH_8K8j4qn_0IQB2MJ3rnqpLSl-aAZnj/s1600/seats_votes_loss_gain.png" /></a></div>
<br />
Still, the number of votes cast seems to be a good predictor of the number of seats.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgExk5ftZqiQub6qJYUX_-zYgL1S-qHppsvloemE5l0lk77RbrBLp248RJbY89A_NsmNAVTZHOANOg_yCwKnAIo28K7TlN8ogjPnQOFNHuzFOnJFvw4svZNW_rYLNEr1BNU4DYA3qoh2IHv/s1600/poor+ukip.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgExk5ftZqiQub6qJYUX_-zYgL1S-qHppsvloemE5l0lk77RbrBLp248RJbY89A_NsmNAVTZHOANOg_yCwKnAIo28K7TlN8ogjPnQOFNHuzFOnJFvw4svZNW_rYLNEr1BNU4DYA3qoh2IHv/s1600/poor+ukip.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Unless you are UKIP. We don't like UKIP.</td></tr>
</tbody></table>
Lukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.com0tag:blogger.com,1999:blog-8173081656692899540.post-15031935918333814072014-11-10T00:58:00.000+00:002014-11-10T00:58:28.042+00:00There is a house...<div class="separator" style="clear: both; text-align: justify;">
There is a church in Reykjavík which is shaped like a (slightly choppy) approximation of the logistic distribution...</div>
<div class="separator" style="clear: both; text-align: justify;">
<a href="http://en.wikipedia.org/wiki/Hallgr%C3%ADmskirkja">[wiki]</a></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVfvC0Mqd-sX0BCXC6iDu4IgH7UK9G0luzVkOdJSsUvViNFEq0Hbj3NRRnA74D2gQ0zYD3yrstWoSlyGVtvVo5Rm8bRo5batkSiqvpjb0OqIm1V9OxAr32RojYt0e7SbaEbRQvl7tDQOH-/s1600/Screen+Shot+2014-11-10+at+00.48.45.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVfvC0Mqd-sX0BCXC6iDu4IgH7UK9G0luzVkOdJSsUvViNFEq0Hbj3NRRnA74D2gQ0zYD3yrstWoSlyGVtvVo5Rm8bRo5batkSiqvpjb0OqIm1V9OxAr32RojYt0e7SbaEbRQvl7tDQOH-/s1600/Screen+Shot+2014-11-10+at+00.48.45.png" height="333" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I said approximation, ok?</td></tr>
</tbody></table>
<br />Lukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.com0tag:blogger.com,1999:blog-8173081656692899540.post-14468956294653348002014-10-09T18:15:00.001+01:002014-10-09T18:15:42.416+01:00[2] #BellogateSo, we had this happen to UCL e-mails <a href="http://www.bbc.co.uk/news/blogs-trending-29555344" target="_blank">[BBC Link]</a>:<br />
<br />
That landed me with (ca.) 3167 e-mails (which I managed to redirect to my spam folder). It started with somebody sending an e-mail from (supposedly) 'provost@ucl.ac.uk' to 'all-students@ucl.ac.uk', saying 'bello!'. Then, when people started replying to all and realising that anything can get through, the situation spiralled out of control.<br />
<br />
Now, I've downloaded the e-mails in Python (<a href="http://www.voidynullness.net/blog/2013/07/25/gmail-email-with-python-via-imap/" target="_blank">this blog post</a> was helpful), and started crunching some numbers.<br />
<br />
And yes, they came in large numbers. The volume picked up over time, died down after midnight, then picked up a bit again in the morning (when the problem was fixed by UCL's IT):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJfFdp-9c8Uy_3CBMFNWshy3KEbkbKl8NHiPSbgKmyQuAxueXaLlTsXi7fSg8mrKJbU8lYpgnp7AxkcWpXNgt9SWs9gzenveatzT8P9mbyGq-qPRs62vyMsapgh9KzlYX8U9bdF9u_Jnrn/s1600/blah.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJfFdp-9c8Uy_3CBMFNWshy3KEbkbKl8NHiPSbgKmyQuAxueXaLlTsXi7fSg8mrKJbU8lYpgnp7AxkcWpXNgt9SWs9gzenveatzT8P9mbyGq-qPRs62vyMsapgh9KzlYX8U9bdF9u_Jnrn/s1600/blah.png" height="240" width="320" /></a></div>
<br />
Most of the peak around 11pm was due to e-mails sent from UCL addresses:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhubwbTQdSGqI9r1nWtxrLcOtzfEKYxR4EC4IRVrFhdy6eLjHcwFbDb2uOtzBL6z8steaPFh67r2ITXI2mlUBec-LCYtkrbzc_CPWf4HuIAO2LksBaQDYzCZTUd0Q37IBrEdWPsJle2oC9j/s1600/from_ucl.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhubwbTQdSGqI9r1nWtxrLcOtzfEKYxR4EC4IRVrFhdy6eLjHcwFbDb2uOtzBL6z8steaPFh67r2ITXI2mlUBec-LCYtkrbzc_CPWf4HuIAO2LksBaQDYzCZTUd0Q37IBrEdWPsJle2oC9j/s1600/from_ucl.png" height="240" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Proportion of addresses '@ucl.ac.uk' which sent the e-mail, by the time sent.</td></tr>
</tbody></table>
<br />
It started with most of the e-mails having 'bello' in the title, but soon I had to block all messages sent to all-students, because filtering by title stopped making sense:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYOKbF6Qx4Y4FoGuEq5eVWud9brruEszKC3tBQOnZXfRAmxdAnODQjwPZVDeO8j70eINo5oV0hO244Qe4-r_Liz-_l-lc_H2xVgSlJYKhjvJbYOtN8dLCkXnrPOr-WvFWNwlyyOgUW2yY4/s1600/bello_percent.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYOKbF6Qx4Y4FoGuEq5eVWud9brruEszKC3tBQOnZXfRAmxdAnODQjwPZVDeO8j70eINo5oV0hO244Qe4-r_Liz-_l-lc_H2xVgSlJYKhjvJbYOtN8dLCkXnrPOr-WvFWNwlyyOgUW2yY4/s1600/bello_percent.png" height="240" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Proportion of e-mails with 'bello' in 'Subject' field</td></tr>
</tbody></table>
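The three plots above boil down to grouping messages by the hour they were sent. Here is a minimal sketch of that grouping, assuming each message is available as a mapping with 'Date', 'From' and 'Subject' headers (a dict here, though a parsed email message should behave the same way). The function name and the sample messages are invented for illustration.

```python
from collections import defaultdict
from email.utils import parseaddr, parsedate_to_datetime


def hourly_stats(messages):
    """Per hour: message count, share sent from @ucl.ac.uk, share with 'bello' in the subject."""
    buckets = defaultdict(lambda: {"total": 0, "from_ucl": 0, "bello": 0})
    for msg in messages:
        sent = parsedate_to_datetime(msg["Date"])
        hour = sent.replace(minute=0, second=0, microsecond=0)
        _, address = parseaddr(msg["From"])
        bucket = buckets[hour]
        bucket["total"] += 1
        # Booleans count as 0 or 1 when added to an int.
        bucket["from_ucl"] += address.lower().endswith("@ucl.ac.uk")
        bucket["bello"] += "bello" in msg["Subject"].lower()
    return {
        hour: {
            "total": b["total"],
            "share_ucl": b["from_ucl"] / b["total"],
            "share_bello": b["bello"] / b["total"],
        }
        for hour, b in buckets.items()
    }


# Hypothetical messages, only to show the shape of the input and output.
sample = [
    {"Date": "Wed, 08 Oct 2014 22:58:10 +0100",
     "From": "Provost <provost@ucl.ac.uk>", "Subject": "bello!"},
    {"Date": "Wed, 08 Oct 2014 23:05:00 +0100",
     "From": "a.student@ucl.ac.uk", "Subject": "Re: bello!"},
    {"Date": "Wed, 08 Oct 2014 23:40:00 +0100",
     "From": "someone@example.com", "Subject": "Please stop replying to all"},
]
for hour, stats in sorted(hourly_stats(sample).items()):
    print(hour, stats)
```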
Now I'm trying to connect the data in a graph... Who knows, maybe I'll end up doing some Enron-style <a href="http://bailando.sims.berkeley.edu/enron_email.html" target="_blank">discovery</a>?Lukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.com0tag:blogger.com,1999:blog-8173081656692899540.post-55856487305024504042014-09-09T13:13:00.000+01:002014-09-09T13:13:00.202+01:00Life hacking using gas consumption dataSo my flat is in a split terraced house, and somehow both gas meters ended up in the neighbours' flat (I have the dubious pleasure of having both electricity meters). I'm moving out soon, so I need to submit the final readings to the gas company, but my neighbours are moving out tomorrow, so I won't be able to just pop in to check the readings. The landlord will check the meters anyway, so it's not a big deal, but hey, why should I trust him?<br />
Fear not, for, as a geek, I begot a spreadsheet, in which collected are all figures detailing my hydrocarbon usage, at least in the gaseous phase! So I can trust my own math instead of the landlord! Cue plot:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtfTa-hp4rVDpjkDpY6_gJTWeJYsbf14D4U1bdfRnU0fxh2Y0OQh1McPBJru7ZCPr-VZtHTfluQtdqgEKksRelSY6QAb52PdNUb7D3gP9nhTs0nlqmqArsPtF3HmYiWwNswET802BPH1Vh/s1600/gasusage.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtfTa-hp4rVDpjkDpY6_gJTWeJYsbf14D4U1bdfRnU0fxh2Y0OQh1McPBJru7ZCPr-VZtHTfluQtdqgEKksRelSY6QAb52PdNUb7D3gP9nhTs0nlqmqArsPtF3HmYiWwNswET802BPH1Vh/s1600/gasusage.png" height="210" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Hello, Mr. Putin!</td></tr>
</tbody></table>
You can see quite clearly when the winter started getting cold, when I was away for Christmas, or when I just got lazy and did not take readings as often. I hear that the police use electricity consumption to locate weed plantations. Does the BBC use gas consumption to find people for the Great British Bake Off?<br />
Anyway, assuming a linear interpolation between the previous meter reading in July and when I move out, I shall use about 1.75 cubic feet extra on top of today's meter reading. I doubt the utility companies use anything fancier to predict their bills, though they usually err on the side of more consumption. Let's see how far off I'll be.Lukasz Kopechttp://www.blogger.com/profile/02190627141723131040noreply@blogger.com0
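A minimal sketch of that straight-line estimate: take the average daily usage between the two most recent readings and carry it forward to the move-out date. The function name, dates and readings below are made up for illustration and are not the values from my spreadsheet.

```python
from datetime import date


def extrapolate_reading(prev_date, prev_reading, last_date, last_reading, target_date):
    """Project a meter reading forward, assuming usage continues at the recent average rate."""
    daily_rate = (last_reading - prev_reading) / (last_date - prev_date).days
    return last_reading + daily_rate * (target_date - last_date).days


# Hypothetical readings: 7 units used over the 70 days since July.
estimate = extrapolate_reading(
    date(2014, 7, 1), 1000.0,   # previous reading
    date(2014, 9, 9), 1007.0,   # today's reading
    date(2014, 9, 19),          # move-out date
)
print(estimate)
```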