gov, hackday, my projects, open data, Work in IT

chaMPion

I never thought hackdays could be so much fun that I would end up attending not just one but two in about ten days, getting flu in between. Oh, and that my team would end up winning the overall Best in show award over 27 other hacks and almost 100 people! Which is what this blog post is about…

First of all: credit where credit is due

The folks from Rewired State deserve a massive thank you for setting up such events, and for showing me that no matter the age and background of the people you work with there is room for great results bacause geeks more than often work well together, in teams, despite what stereotypes like to say.

ChaMPion: what is it?

The idea behind chaMPion is rather simple: you want to find MPs who care about what you care. Often their “declared interests” are not particularly meaningful or up to date, so we decided we would mine the content of their speeches.

ChaMPion is a tool that allows the user to enter a given topic and returns a list of MPs who have spoken about that topic, ranked by relevance.

How does it work?

In easy steps:

  1. we downloaded the extract of the Commons debates for all the sessions of Parliament since the first sitting in May 2010 following the General Election to the latest in November 2012
  2. we parsed these extract and aggregated the speeches by MP – as a result we obtained a map associating any given MP to all of his or her speeches
  3. for each MP we run an algorithm that calculates their keywords distribution; specifically we used Topia.Termextract which, given a text, determines its important terms and their strength
  4. we calculated a ratio for each word over the total of terms extracted for that MP and used this as a basis for our rank
  5. we built an API that searches by keyword and a captivating UI that displays the results graphically, together with other data for the MP and his or her constituency harvested from other sources.

Did you find anything interesting?

Yes! For example, if you search for phone the winner is Tom Watson; if you search for rape, it’s Caroline Flint.

Why didn’t you use X, Y, Z?

YES, you are right, this is not perfect, but it was meant to be just a quick hack that received much more interest than we were anticipating :)

For example, using Topia.Termextractor was not my first choice. For a semantic analysis of this kind a beautiful mathematical tool called Latent Dirichlet Allocation (LDA) is generally the natural choice. LDA runs a statistical analysis over a corpus of text, assuming that a document is about a collection of topics. It then returns the distribution of such topics. It’s not difficult to understand. For example, it might say that a speech by Tom Watson is 30% about phones, 30% about news and 40% about crime.

Unfortunately, I didn’t manage to find a library for LDA that worked on my laptop.

Will you keep developing it?

Given we received some pretty heart-warming feedback the answer is yes. For example, I’m going to try and find (or develop) an LDA library to have finally a proper topic model.

We also plan to introduce more statistics, possibly at a single MP level, and to try and work out a temporal component as well, in order to display how interests change over time. This might not make sense for all the MPs, as most of them will give a speech very rarely, but there is certainly a subset for which this analysis is meaningful.

Starting next week, the website will be updating with data from the coming sittings.

Code?

The code for this hack is all on my GitHub account. Feel free to download it, modify it, run your services on top of it. I’ll keep uploading changes and the most recent stable version will always be found running at http://www.champion.puntofisso.net. Feedback is also very welcome, but beware that the code is very dirty until I manage to tidy it up a little. Requests for functionality are encouraged and will be considered :)

Another round of thanks

To wrap up, I gave Mark the input of “look there’s an interesting hackday” but I will never thank him enough for actually taking me seriously, setting up one of the best teams ever, and facilitating our conversations and work. Lewis has been a great partner in crime, giving his best on a simple but effective UI which has certainly been überimportant in conveying our idea and let us win.

Sharon has provided invaluable knowledge of the works of the Parliament and some incredibly good mock-ups of the final interface, while Hadley has helped with a great understanding of the datasets.

Together with our chats with Glyn, Sheila and Brett, we had some good fun discussing ideas and saving ourselves the burden of having to go through a set of certainly wrong hacks during the day.

A big recommendation: Cards Against Humanity is the best team-building tool ever conceived.

Standard
computer science, my projects, Work in IT

Production systems can’t be beta

Warning: this is just one of those ranting, whinging, blog posts you all developers like.

Yesterday I have been struggling almost half an hour with a python script. A very simple one: connect to a system, download an XML. This XML is a paginated list containing the number of the current page, the next one, and the last. The script should have simply got to the next page, read the “next” number, downloaded such page, and terminate when current == last.

Easy, right?

Except I spent half an hour trying to understand why the script was going into an infinite loop. I am an experienced programmer, but not a massively confident one: when things don’t work, I check my code. You can call it coding modesty if you prefer.

It turns out the problem was in the XML: whatever the page, it always contains “this = 1” and “next = 2”. This is supposed to be a production system, at its version 3, for which the institution I work for pays a huge amount of money.

This is quite a big bug on a basic function of what is supposed to be a production system. Which prompts me the obvious question: have they ever tested it?

Standard
HCI, my projects, Web 2.0, Work in IT

Aggregated values on a Google Map

UPDATE 27/08/09: The functionalities of my version of MarkerClusterer have been included in the official Google code project, you can find it gmaps-utility-library-dev. The most interesting part was the so called MarkerClusterer.

Imagine you need to show thousands of markers on a map. There may be many reasons for doing so, for example temperature data, unemployment distributions, and the like. You want to have a precise view, hence the need for a marker in every town or borough. What Xiaoxi and other developed, is a marker able to group all the markers in a certain area. This is a MarkerClusterer. Your map gets split into clusters (of which you can specify the size – but hopefully more fine grained ways of defining areas will be made available) and you show for every cluster a single marker, which is labelled with the total count of markers in that cluster.

I thought that this opened a way to get something more precise and able to make reasoning over map data. Once you have a ClusterMarker, wouldn’t it be wonderful if you had the possibility of displaying some other data on it, rather than the simple count? For example, in the temperatures distribution case, I would be interested in seeing the average temperature of the cluster.

That’s why I developed this fork of the original class (but I’ve applied to get it into the main project – finger crossed!) that allows you to do what follows:

  • create a set of values to tag the locations (so that you technically attach a value to each marker)
  • define a function that is able to return an aggregate value upon the values you passed, automatically for each cluster

That’s all. The result is very simple, but I believe it is a good way to start thinking about how the visualization of distributed data may affect the usability of a map and the understanding of information it carries. Here’s a snapshot of the two versions, the old on the left (bearing just the count) and the new on the right (with average data). Data here refer to NHS Hospital Death Rates, as published on here. If you want to see the full map relating to this example, click here.

Standard
Work in IT

The hunt for a Google job

The first time I got in touch with a Google recruiter was more or less a week after I’d decided to enrol for a PhD. Apparently this – very kind, I must say – recruiter was browsing uni pages and found my profile. At the time, apart from telling her that I was due to start a PhD in some months, I was very interested in Systems Administration. She did all her best to convince me to apply as a Developer. Weird, but probably there’s some rules here. Of course after a couple of interviews in which I told them that I was not interested in moving to Switzerland and that I wanted to do a PhD, they decided not to pursue with my profile.

Now, again, a recruiter has contacted me. A week after having started a new job. This time, I’ve moved onto being a developer. Guess what? She wants me to apply for a Systems Administration position.

Google, you have knowledge of everything on the network. What about trying to tune your timing and select me for something I’m actually skilled for? :-)

Standard