data science, hackday

On parliamentary language analysis

winner

picture credits: Tracy Green

If you know me, you also know that I never miss the UK Parliament Hackday. This year it turned into a joint event between Parliament, the National Audit Office and the Office for National Statistics: AccHack14 ran over two days at the superbly located NAO offices in Victoria. And this year, I won a prize for “Best Parliamentary App”!

I have long been fascinated by Hansard, the archive of Parliamentary debates.

I’m obsessed by Hansard. Hansard keeps me awake at night.

With these words, I opened my presentation. The app I developed is a tool to search and analyse Hansard in an uncommon way, using an n-gram viewer. N-grams are, roughly speaking, sequences of N consecutive words. 1-grams are single words (like “fox”), 2-grams are two words in sequence (like “quick fox”) and so on. The tool I developed, Parli-N-Grams, lets the user search for n-grams in the Hansard corpus, and is inspired by the Google Books Ngram Viewer.
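To make the definition concrete, here is a minimal sketch of n-gram extraction in Python. The function name `ngrams` and the whitespace tokenisation are assumptions made for illustration; this is not the code Parli-N-Grams runs.

```python
def ngrams(text, n):
    """Return every sequence of n consecutive words in text, in order."""
    words = text.lower().split()  # naive tokenisation: split on whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("the quick brown fox", 1))  # ['the', 'quick', 'brown', 'fox']
print(ngrams("the quick brown fox", 2))  # ['the quick', 'quick brown', 'brown fox']
```

A text with fewer than N words simply yields no n-grams.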

You might ask: why is this a good way to search?

The best kind of search is a search that lets you discover by the simple action of searching.

Searching is about finding a result – in this case, finding a certain debate where a word or sequence of words was mentioned. However, by showing graphically the distribution of those n-grams over the years, Parli-N-Grams also lets the user navigate through the data to discover more:

  • how language evolves
  • how topics become more important in certain historical periods
  • how certain words replace certain other words

and so on.
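The data behind such a chart can be sketched as a count of each n-gram per year. The snippet below is an illustration only: the `(year, text)` input shape, the function name `yearly_counts` and the sample sentences are all made up, and the real harvesting was done differently (in PHP).

```python
import re
from collections import Counter, defaultdict


def yearly_counts(debates, n=1):
    """Map each n-gram to a Counter of {year: number of occurrences}."""
    counts = defaultdict(Counter)
    for year, text in debates:
        words = re.findall(r"[a-z']+", text.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])][year] += 1
    return counts


# Invented sample data, not real Hansard text.
debates = [
    (1944, "The war effort continues. The war must be won."),
    (1983, "The party opposite has no policy on benefits."),
    (2003, "This war divides the party."),
]
counts = yearly_counts(debates)
print(sorted(counts["war"].items()))  # [(1944, 2), (2003, 1)]
```

Plotting the per-year values for one or more n-grams gives the kind of chart discussed here.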

MPs love to talk about benefits

Try plotting the word benefits:

[Chart: frequency of “benefits” in Hansard over the years]

MPs are seemingly using the word benefits in debates more and more. For comparison, plot both benefits and welfare:

[Chart: frequency of “welfare” vs “benefits” in Hansard over the years]

You see how they started at pretty much the same level and then slowly diverged over the years? This is utterly fascinating. Another great example is the word war:

[Chart: frequency of “war” in Hansard over the years]

War abruptly enters the political discourse in the late 1930s, as the Second World War became increasingly inevitable, peaks in 1944 and then slowly declines. What is really interesting is that not even the repeated military efforts in Iraq (see the peaks in 1991 and 2003) have made the frequency of the word return to the levels seen in the 1940s. Are MPs consciously keeping the word war away from the debates? You will remember how heated the debate became over whether the Iraq War should be defined as a “peacekeeping mission”. But it’s not just Iraq: not even the Falklands War (1982) gets a very high peak. If you’re curious, see also “terrorism”.

The way MPs refer to politics is seemingly changing, too. See what happens when plotting the word party:

[Chart: frequency of “party” in Hansard over the years]

Here we have a massive spike in 1983, a General Election year. Is it maybe because of the Labour-SDP split? After peaking in 1998 (the year after Labour got back into power) the word seems to decline all the way until it peaks again in 2009, just before the General Election that returned a hung parliament. Another great use I can envision for Parli-N-Grams is to analyse the evolution of language. See, for example, the distribution of frequencies for basically:

[Chart: frequency of “basically” in Hansard over the years]

How many other such patterns could we discover?

I need to fix a couple of things…

Parli-N-Grams is intended for social and political historians, and passionate language researchers, but it’s not a stable product yet.

I’m working on it, but be aware that:

  • the scripting is a bit rusty, so the website might crash every now and then; reload the page if things don’t show up
  • my harvesting procedures were done in a rush, in PHP, without being particularly efficient: as a consequence, at the moment Parli-N-Grams only works for 1-grams (i.e. single words)
  • during the demo I also showed a nice visualizer for the debate transcripts, but I’ve now disabled this feature; I will re-enable it as soon as I decide on a way to efficiently search through the files (ideally using Elasticsearch or a similar product)
  • I haven’t normalised the results (Google’s Ngram Viewer, by contrast, plots each n-gram as a share of all the n-grams for that year), and I’m still wondering whether it’s more interesting this way or not; I’ll blog about it soon
  • more bad stuff that likely escaped me (if you like sed, please have a look at the filter and laugh at me).
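On the normalisation point, the difference can be sketched in a few lines: divide each year's raw count for a word by the total number of words recorded that year. The function name and all the numbers below are invented for illustration; they are not Hansard figures.

```python
def relative_frequency(word_counts, totals):
    """Divide each year's raw count by the total number of words that year."""
    return {
        year: word_counts.get(year, 0) / totals[year]
        for year in sorted(totals)
        if totals[year] > 0  # skip years with no recorded words
    }


# Invented numbers: the raw count doubles, but the corpus grows fourfold.
raw = {1990: 50, 2010: 100}
totals = {1990: 1_000_000, 2010: 4_000_000}
rel = relative_frequency(raw, totals)
print(rel[1990] > rel[2010])  # True: the word became relatively rarer
```

Raw counts answer "how often was this said?", while relative frequencies answer "how much of what was said was this?". The two can point in opposite directions when the volume of debate changes over time.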

Would you like to know which party mentions “benefits” more?

As I’ve said, I’m working to make Parli-N-Grams stable and usable, so that it can be enjoyed by historians, journalists, and whoever shares my obsession with Hansard. The roadmap is as follows:

  1. get Parli-N-Grams to be stable, reactive, working on most browsers
  2. add 2-grams, 3-grams, 4-grams and 5-grams (I’ll likely stop here)
  3. add optional segmentation controls; for example, split by political party, to answer questions like which party mentions “benefits” more?
  4. make the harvesting an ongoing procedure; I would like Parli-N-Grams to update automatically every time we receive new transcripts from Hansard.
  5. add an API to allow people to embed the charts.

 

If you have any comment/request/idea, please get in touch. Meanwhile, these are the relevant links:

Let me conclude by thanking Nick (National Audit Office), Tracy (UK Parliament) and Matt (Office for National Statistics): you’ve done a great job :) Big credit to the super-smart folks at DXW, who won the overall “Best In Show” prize with a very smart and elegant website investigating housing data, Right-to-Buy-Bye. They’ve also blogged about their experience.

gov, hackday, my projects, open data, Work in IT

chaMPion

I never thought hackdays could be so much fun that I would end up attending not just one but two in about ten days, catching the flu in between. Oh, and that my team would end up winning the overall Best in show award over 27 other hacks and almost 100 people! Which is what this blog post is about…

First of all: credit where credit is due

The folks from Rewired State deserve a massive thank you for setting up such events, and for showing me that, no matter the age and background of the people you work with, there is room for great results, because geeks more often than not work well together in teams, despite what the stereotypes say.

ChaMPion: what is it?

The idea behind chaMPion is rather simple: you want to find MPs who care about what you care about. Often their “declared interests” are not particularly meaningful or up to date, so we decided we would mine the content of their speeches.

ChaMPion is a tool that allows the user to enter a given topic and returns a list of MPs who have spoken about that topic, ranked by relevance.

How does it work?

In easy steps:

  1. we downloaded the extract of the Commons debates for all the sessions of Parliament since the first sitting in May 2010 following the General Election to the latest in November 2012
  2. we parsed these extracts and aggregated the speeches by MP – as a result we obtained a map associating any given MP with all of his or her speeches
  3. for each MP we ran an algorithm that calculates their keyword distribution; specifically we used Topia.Termextract which, given a text, determines its important terms and their strength
  4. we calculated a ratio for each word over the total of terms extracted for that MP and used this as a basis for our rank
  5. we built an API that searches by keyword and a captivating UI that displays the results graphically, together with other data for the MP and his or her constituency harvested from other sources.
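The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the chaMPion code: `extract_terms` is a naive stand-in (word counts minus a tiny stopword list) for what Topia.Termextract did, the function names are hypothetical, and the MPs and speeches are invented.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not from the original hack).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "must"}


def extract_terms(text):
    """Stand-in term extractor: counts every word that is not a stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def build_index(speeches):
    """speeches: iterable of (mp, text) pairs. Returns {mp: {term: ratio}}."""
    by_mp = defaultdict(list)  # step 2: aggregate speeches by MP
    for mp, text in speeches:
        by_mp[mp].append(text)
    index = {}
    for mp, texts in by_mp.items():
        terms = extract_terms(" ".join(texts))  # step 3: terms per MP
        total = sum(terms.values())
        # step 4: each term's share of all terms extracted for that MP
        index[mp] = {t: c / total for t, c in terms.items()} if total else {}
    return index


def search(index, keyword):
    """Step 5: MPs who mentioned the keyword, highest ratio first."""
    hits = [(mp, ratios[keyword]) for mp, ratios in index.items() if keyword in ratios]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented sample speeches.
speeches = [
    ("MP A", "Phone hacking is a scandal. Phone records must be released."),
    ("MP B", "Housing and transport matter. The phone network matters too."),
    ("MP A", "The phone companies knew."),
]
index = build_index(speeches)
print([mp for mp, _ in search(index, "phone")])  # ['MP A', 'MP B']
```

Ranking by ratio rather than by raw count means an MP who speaks rarely but always on one topic can outrank a frequent speaker who only mentions it in passing.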

Did you find anything interesting?

Yes! For example, if you search for phone the winner is Tom Watson; if you search for rape, it’s Caroline Flint.

Why didn’t you use X, Y, Z?

YES, you are right, this is not perfect, but it was meant to be just a quick hack that received much more interest than we were anticipating :)

For example, using Topia.Termextract was not my first choice. For a semantic analysis of this kind, a beautiful mathematical tool called Latent Dirichlet Allocation (LDA) is generally the natural choice. LDA runs a statistical analysis over a corpus of text, assuming that each document is about a mixture of topics. It then returns the distribution of those topics for each document. It’s not difficult to understand. For example, it might say that a speech by Tom Watson is 30% about phones, 30% about news and 40% about crime.
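For the curious, here is a toy sketch of how LDA can be fitted with collapsed Gibbs sampling, in plain Python. It is not what chaMPion used (the hack relied on term extraction), and it is far too slow for a real corpus; the function name, the hyperparameter defaults (`alpha`, `beta`) and the sample documents are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of word lists; returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # words of doc d in topic k
    topic_word = [dict() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                     # all words in topic k
    assignments = []
    for d, doc in enumerate(docs):                     # random initial topics
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]                  # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # P(topic t | everything else), up to a constant factor
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k                  # record the new assignment
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


# Invented mini-corpus: each "speech" is already split into words.
docs = [
    "phone hacking phone news".split(),
    "news press news phone".split(),
    "crime police crime court".split(),
]
dists = lda_gibbs(docs, num_topics=2)
for dist in dists:
    print([round(p, 2) for p in dist])  # one share per topic, summing to 1
```

Each printed row is the kind of "30% phones, 30% news, 40% crime" breakdown described above, except that the topics come out as unlabelled numbers that a human still has to interpret.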

Unfortunately, I didn’t manage to find a library for LDA that worked on my laptop.

Will you keep developing it?

Given we received some pretty heart-warming feedback, the answer is yes. For example, I’m going to try and find (or develop) an LDA library to finally have a proper topic model.

We also plan to introduce more statistics, possibly at a single MP level, and to try and work out a temporal component as well, in order to display how interests change over time. This might not make sense for all the MPs, as most of them will give a speech very rarely, but there is certainly a subset for which this analysis is meaningful.

Starting next week, the website will be updated with data from the upcoming sittings.

Code?

The code for this hack is all on my GitHub account. Feel free to download it, modify it, run your services on top of it. I’ll keep uploading changes and the most recent stable version will always be found running at http://www.champion.puntofisso.net. Feedback is also very welcome, but beware that the code is very dirty until I manage to tidy it up a little. Requests for functionality are encouraged and will be considered :)

Another round of thanks

To wrap up, I gave Mark the input of “look there’s an interesting hackday” but I will never thank him enough for actually taking me seriously, setting up one of the best teams ever, and facilitating our conversations and work. Lewis has been a great partner in crime, giving his best on a simple but effective UI which has certainly been überimportant in conveying our idea and letting us win.

Sharon has provided invaluable knowledge of the workings of Parliament and some incredibly good mock-ups of the final interface, while Hadley has helped with a great understanding of the datasets.

Together with our chats with Glyn, Sheila and Brett, we had some good fun discussing ideas and saving ourselves the burden of having to go through a set of certainly wrong hacks during the day.

A big recommendation: Cards Against Humanity is the best team-building tool ever conceived.
