data science, hackday

On parliamentary language analysis

[Image: winner]

picture credits: Tracy Green

If you know me, you also know that I never miss the UK Parliament Hackday. This year it turned into a joint event between Parliament, the National Audit Office and the Office for National Statistics: AccHack14 ran over two days at the superbly located NAO offices in Victoria. And this year, I won a prize for “Best Parliamentary App”!

I have long been fascinated by Hansard, the archive of Parliamentary debates.

I’m obsessed by Hansard. Hansard keeps me awake at night.

With these words, I opened my presentation. The app I developed is a tool to search and analyse Hansard in an uncommon way, using an n-gram viewer. N-grams are sequences of N consecutive words: 1-grams are single words (like “fox”), 2-grams are two words in sequence (like “quick fox”) and so on. The tool I developed, Parli-N-Grams, allows the user to search n-grams in the Hansard corpus, inspired by the Google Books Ngram Viewer.
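To make the definition concrete, here is a minimal sketch of n-gram extraction in Python. It is not the code behind Parli-N-Grams (the harvesting there was done in PHP and sed); the function name `ngrams` and the crude regex tokeniser are assumptions made purely for illustration.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as space-joined strings."""
    # Crude tokeniser (an assumption): lowercase, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of size n over the word list; too-short texts yield an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick fox jumps over the lazy dog"
one_grams = ngrams(sentence, 1)   # single words
two_grams = ngrams(sentence, 2)   # pairs of adjacent words

# Counting how often each 2-gram occurs is then a one-liner.
pair_counts = Counter(ngrams("the fox and the dog and the fox", 2))
print(two_grams)
print(pair_counts.most_common(2))
```

Counting these per year of debate, rather than over a single sentence, is what produces the curves shown in the charts.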

You might ask: why is this a good way to search?

The best kind of search is a search that lets you discover by the simple action of searching.

Searching is about finding a result – in this case, finding a certain debate where a word or sequence of words was mentioned. However, by showing graphically the distribution of those n-grams over the years, Parli-N-Grams also lets the user navigate through the data to discover more:

  • how language evolves
  • how topics become more important in certain historical periods
  • how certain words replace certain other words

and so on.

MPs love to talk about benefits

Try plotting the word benefits:

[Chart: benefits]

MPs are seemingly using the word benefits in debates more and more. For comparison, plot both benefits and welfare:

[Chart: welfare vs benefits]

You see how they started pretty much at the same level and then slowly diverged over the years? This is utterly fascinating. Another great example is the word war:

[Chart: war]

War abruptly enters the political discourse in the late 1930s, when the Second World War became increasingly inevitable, peaks in 1944 and slowly declines. What is really interesting is that not even the repeated military efforts in Iraq (see the peaks in 1991 and 2003) have made the frequency of the word return to the levels seen in the 40s. Are MPs consciously keeping the word war away from the debates? You will remember how heated the debate became as to whether the Iraq War should be defined as a “peacekeeping mission”. But it’s not just Iraq: not even the Falklands War (1982) gets a very high peak. If you’re curious, see also “terrorism”.

The way MPs refer to politics is seemingly changing, too. See what happens when plotting the word party:

[Chart: party]

Here we have a massive spike in 1983, a General Election year. Is it maybe because of the Labour-SDP split? After peaking in 1998 (the year after Labour got back into power) the word seems to decline all the way until peaking again in 2009, just before the General Election returned a hung parliament. Another great use I can envision for Parli-N-Grams is to analyse the evolution of language. See, for example, the distribution of frequencies for basically:

[Chart: basically]

How many other such patterns could we discover?

I need to fix a couple of things…

Parli-N-Grams is intended for social and political historians, and passionate language researchers, but it’s not a stable product yet.

I’m working on it, but be aware that:

  • the scripting is a bit rusty, so the website might crash every now and then; reload the page if things don’t show up
  • my harvesting procedures were done in a rush, in PHP, without being particularly efficient: as a consequence, at the moment Parli-N-Grams only works for 1-grams (i.e. single words)
  • during the demo I also showed a nice visualizer for the debate transcripts, but I’ve now disabled this feature; I will re-enable it as soon as I decide on a way to efficiently search through the files (ideally using ElasticSearch or a similar product)
  • I haven’t normalised the results, Google’s Ngram Viewer doesn’t either, and I’m still thinking if it’s more interesting this way or not; I’ll blog about it soon
  • more bad stuff that likely escaped me (if you like sed, please have a look at the filter and laugh at me).
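On the normalisation point in the list above: the usual approach would be to divide each year's raw count by the total number of words recorded that year, so that years with more (or longer) debates don't dominate the chart. A rough sketch follows, with entirely made-up numbers; `relative_frequency` and both dictionaries are hypothetical and not part of Parli-N-Grams.

```python
def relative_frequency(word_counts, total_counts):
    """Turn raw yearly counts into a share of all words spoken in each year."""
    return {
        year: word_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0  # skip years with no transcripts to avoid dividing by zero
    }


# Invented figures, for illustration only: the raw count doubles between the
# two years, but the amount of recorded debate quadruples.
raw = {1950: 200, 2000: 400}
totals = {1950: 1_000_000, 2000: 4_000_000}

freq = relative_frequency(raw, totals)
for year in sorted(freq):
    print(year, raw[year], f"{freq[year]:.6f}")
```

With these invented figures the raw curve goes up while the normalised one goes down, which is exactly why the choice between the two views changes the story a chart tells.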

Would you like to know which party mentions “benefits” more?

As I’ve said, I’m working to make Parli-N-Grams stable and usable, so that it can be enjoyed by historians, journalists, and whoever shares my obsession with Hansard. The roadmap is as follows:

  1. get Parli-N-Grams to be stable, reactive, working on most browsers
  2. add 2-grams, 3-grams, 4-grams and 5-grams (I’ll likely stop here)
  3. add optional segmentation controls; for example, split by political party, to answer questions like which party mentions “benefits” more?
  4. make the harvesting an ongoing procedure; I would like Parli-N-Grams to update automatically every time we receive new transcripts from Hansard.
  5. add an API to allow people to embed the charts.
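As an idea of what the segmentation in point 3 could look like, here is a small sketch that counts a word per party and per year. It assumes each speech is available as a `(year, party, text)` tuple, which is a hypothetical structure chosen for the example and not how the Hansard files are actually laid out; the party names and sentences are invented.

```python
import re
from collections import Counter, defaultdict


def mentions_by_party(speeches, word):
    """Count occurrences of word, grouped by party and then by year."""
    target = word.lower()
    counts = defaultdict(Counter)
    for year, party, text in speeches:
        # Same crude tokeniser assumption as before: letters and apostrophes only.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[party][year] += tokens.count(target)
    return counts


# Invented sample data, for illustration only.
speeches = [
    (2013, "Party A", "Benefits must be reformed. These benefits cost too much."),
    (2013, "Party B", "We will protect benefits."),
    (2014, "Party A", "Welfare is a safety net."),
]

result = mentions_by_party(speeches, "benefits")
for party, per_year in sorted(result.items()):
    print(party, dict(per_year))
```

Each party would then get its own line on the chart, answering the “which party mentions benefits more?” question directly.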

 

If you have any comment/request/idea, please get in touch. Meanwhile, these are the relevant links:

Let me conclude by thanking Nick (National Audit Office), Tracy (UK Parliament) and Matt (Office for National Statistics): you’ve done a great job :) Big credit to the super-smart folks at DXW, who won the overall “Best In Show” prize with a very smart and elegant website investigating housing data, Right-to-Buy-Bye. They’ve also blogged about their experience.

open data

On the open ended nature of Openness

Some days ago I was joking with a friend about making t-shirts with an “Open Data is my mission” slogan. The problem with that mission is that its object is not particularly well defined.

I was involved in a couple of interesting discussions via Twitter about this, with a couple of people whose opinion I really value. On the day TfL announced their new api.tfl.gov.uk I tweeted my happiness about their Open Data licence. My happiness was not shared by Adrian Short:

[Image: adrianshort]

Adrian suggested their API was anything but an open one; that as it had restrictions, especially the requirement to register for a key, it could not be linked to the adjective “Open”. (The whole conversation can be accessed here.)

A quick exchange with Aral Balkan went in a similar direction:

[Image: aral]

These two conversations highlight a curious problem in the “Open” communities, whether they are -Source, -Data, -Whatever: we’re talking about some very loosely defined concepts. As I say in my response to Aral: Open Data is just a phrase – what really matters is the licence attached to the data.

Openness is measured on a continuous scale. If there is a threshold below which we shouldn’t call some data “open”, that threshold has not been defined yet. It’s relative (to the data, to the context, to the country, to the user), it’s flexible, it’s got several possible meanings.

My personal position is to call open data whatever comes with no use restrictions (i.e.: you can use the data for whatever purpose you like). In legal terms, however, this gets complicated because we need to assign a licence to the data. When working with ODUG, for example, I always make a point of not accepting data releases with anything less than an Open Government Licence (or its Creative Commons / Open Database Licence equivalents).

Furthermore, in the not-so-public sector (which is what I generally call TfL), things are clearly complex, especially given the expectation (which I do not personally agree with, but this has no bearing on this discussion) that a transport agency in a metropolis should be profit-making. TfL’s licence is probably not the best worded ever, but it is an Open Licence:

[Image: tfl]

Yes, as Adrian notes, TfL can revise the Licence at any point. But until they do, they allow anyone to copy, adapt and exploit the information, with attribution as the only requirement. This is not much different from the OGL.

Does the requirement to sign up for an API key justify the critique? This is clearly a complication that comes from the real-time nature of this data. A system with such a huge amount of data generated in a short time needs provisioning, and the best provisioning comes from knowing how many users can access the system. In this case I don’t think that having to register for a key is affecting the openness of the data because there is no restriction on who can register. Of course an improvement would be to have the possibility of anonymous registrations and I would support this; however, the SLA might still give priority to users who are not anonymous, simply because it knows more about their requirements. Openness is a compromise, one that comes from opposing needs clashing.

The non-real-time datasets could be distributed without registration, and this is where I agree with Adrian. But I don’t think this can justify the negativity against this data release, a step that goes in the right direction. Does anyone want to bring this up with TfL?

On a similar note, Aral initiated a somewhat long and inflamed thread about a similar issue: the use of Open Data and the expectation that something open should give rise to something equally open. In this case the focus was on my friends at @transportapi, who I think are doing a great job of showing how Open Data can create business.

[Image: transportapi]

(Full conversation here).

Some interesting questions emerge from the thread:

  • Aral Balkan: “How’s this not closing off open data via a proprietary system only to license it commercially via an API?”
  • Emer Coleman: “We are DaaS provider. 1,000 hits a day for free then charge per hit with SLA’s once exceeded but also don’t have any IP on downstream products or services and more open licensing”.

The thread goes on and on with similarly opposing views. The question emerging is one: is there a (moral) obligation for Open Data users to be as open as the starting data?

I will take the “risk” of being seen as an Open Data moderate: my view is that this question doesn’t have a straight answer. It all depends on the product and on the level of maturity of the Open Data movement in that specific context. Once again, as in TfL’s case, we’re talking about a considerable amount of real-time data. In this specific case, the data is heavily modified by Transport API to make it cleaner. It is a significant chunk of work. It would be unsustainable to provide it for free, and without registration the service level would soon degrade. Hence, once again we need a compromise. Building sustainable businesses on top of Open Data is still something new. But sticking to the legal side: the licence does not place limitations on the use of the data. This can be open enough for some and not for others. “Open Data is a broad church”, says Jonathan Raper in the same thread. Sustainable Open Data-powered businesses create a virtuous circle that encourages more data releases, and I think we should welcome it.

One final note: we should probably stop capitalising the words “open data” and accept that multiple views will always be possible. Once again, open data is a compromise, as this debate shows. By keeping the debate going we can make that compromise produce useful results and advance the openness agenda.

gov, open data, policy, Web 2.0

Making Open Data Real, episode 3: the corporation

I have submitted my views to the Public Data Corporation consultation. Here are the answers.

Charging

Q1 How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability? Please provide evidence to support your answer where possible.

I strongly believe that the Government should do its best to keep as much data free as possible. In all honesty, I believe that all data should be kept free as there are two possible situations:

– data are already available, or refer to processes that already produce data, in which case the cost of publishing can be kept relatively low;

– data are not available, in which case one should ask why this dataset is required.

In the second case, I would suggest that the agency releasing such a dataset could gain in efficiency, justifying the release of the data for free to the public.

There is also a consideration of what a data-based business model should look like. I think companies and individuals using public data as a basis for their business are finding it very hard to generate ongoing profit based on data only. Which brings me to the idea that charging for such data might actually make such companies lose their interest in using them, with a loss of business and service to the community. 

A good example of this is real-time transport-related mobile apps: they provide, often for a very low price, an invaluable service to the public. These are data that are already available to some agencies, as they are generated by a process of driving the transport business to higher efficiency and effectiveness by knowing the location of the transport agents (buses, trains, etc.). Although in some cases this requires costs for servers to support a high demand, in absolute and relative terms we are talking about limited resources. Such limited resources create a great service to the public, effectiveness for the transport company, and possibly some profit for the entity releasing the software. The wider benefit of the release of these data for free is much more important than the recovery of costs through a charge. That’s why I question in the first place the need for a Public Data Corporation, if its goal is just that of charging for access to data.

 Q2 Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and re-use? Who would these benefit and how? Please provide evidence to support your answer where possible.

Surely, transport and location-based datasets are the most important: they allow careful planning by the public and, as a result, a more efficient society. But I would not talk about specific datasets. I would rather suggest that the Government maintain an ongoing relationship with the data community: hear what developers, activists, volunteers and charities ask for, and see if such requests can be satisfied by issuing an appropriate dataset.

Q3 What do you think the impacts of the three options would be for you and/or other groups outlined above? Please provide evidence to support your answer where possible.

 As I outlined in Question 1, I think data should be kept free. Hence, the best option is Option 1, provided that there is a genuine commitment to release more data for free. As I said the real question is whether data are available or not. When data are available, publishing and managing their update is a marginal cost to the initial process. When data are not available, the focus should be moved to understanding whether their publication can improve ongoing processes.

The freemium model works on the assumption that there is a big gap between a basic version of the data and a more advanced service. I do not believe that this assumption holds for most of the datasets in the public domain.

Q4 A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?

I think that organisations involved in the PDC should keep to their public task. 

The risk in letting them develop commercial data products outside the public task is that the quality of the free portion of the data would plummet.

Q5 Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible. 

I cannot see any other viable alternative, unless we consider the very unpopular idea of asking the developers for part of their profit, if any, in a way that shadows the mobile apps market. However, I think that the overhead of setting up such a system would not be worth it.

 

Licensing

Q1  To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC? 

I think that realistically developers and other people interested in getting access to public data want to have clear and simple terms and conditions. I am not a legal expert and cannot possibly comment on the content of such a licensing regime, but I would like it to be clear, short, and understandable to people who are not lawyers. The Open Government Licence, and any Creative Commons derivative, is a good example.

Q2  To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.

Once again, I would like to stress the fact that the Open Government Licence is the ideal licence for any open data. This would suit Option 3: creating a single PDC licence agreement, with a simple, clear, short licence to cover all situations. Option 2, an overarching PDC licence agreement that groups all commonalities of a number of licences, is possibly a second best, but it comes with a great risk of complexity and confusion.

Option 1, a use-based portfolio of standard licences, would possibly make sense in terms of clarity, but it greatly complicates the management of legal issues for the licensees. The consultation highlights that “rights and associated charges [would be] tailored to specific markets”, making it very difficult to understand such licences.

Naturally, if these licences need to be more restrictive than the Open Government Licence, I still think that a single restrictive licence, on the model of what the State of Queensland in Australia has done, would be the best idea for maintaining clarity and simplicity.

Q3 What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments

It’s very hard to tell at this stage, but I think that overcomplicated licences would greatly slow down access to the data and, consequently, delay the development of services to the community and the possibility of creating sustainable business. That’s why my choice goes to a single PDC licence agreement, possibly the Open Government Licence itself, in order to get services quickly developed and available. 

 Q4 Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?

I reckon there will be situations in which changing the models will have a positive impact as well as some cases in which there will be a local negative impact. We need to look at the overall benefit to society.

 

Oversight

Q1  To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?

I would say the current regulatory environment is appropriate and ready to deliver the vision for a PDC, having already produced a very effective OGL. The problem is not in delivering the PDC, it is rather in questioning the need for the corporation tout court.

 Q2 Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?

The only oversight activity needed at this stage is a deep analysis questioning the need for a PDC. I would strongly recommend questioning the need for charging and for using licences other than the OGL. A PDC charging for data risks destroying the thriving open data ecosystem and depriving the community of great services. The development of a rich ecosystem will generate, at some point, an income for the Government through taxation. It’s just not the moment to think about directly charging for data.

 Q3 What would be an appropriate timescale for reviewing a PDC or its constituent parts public task(s)?

I would recommend an ongoing review, to be held no more often than every 7-8 months and no less often than every 18 months.

gov, open data

Making Open Data Real, episode 1: the gathering

It’s not every day that you get the opportunity to attend an event at the Cabinet Office. Moreover, it’s not every day that they invite you to an event you actually care about. Hence, here I am at 22 Whitehall for a discussion about the Open Data Consultation with their Transparency Team.

The people attending this kind of event usually belong to the following tribes:

  • developers: they want data to be released as quickly as possible and have in mind possible applications/visualisations/uses of the data; they tend not to care much about the legal implications
  • openness campaigners: they push for data to be released no matter whether they can be useful or not; their only concern is transparency (“you’ve got nothing to hide, right?”)
  • privacy campaigners: they are not necessarily against data release, but are over-worried about big-brotheresque implications (where Big Brother is in this case your car insurer, rather than the Government)
  • policymakers, which is a cool description of the average Civil Servant involved in this: they support data being released in moderation, and are usually worried. And they don’t know about what.

Such a diverse gathering easily runs the risk of over-generalising the discussion, which is technically what has happened. However, I guess this was exactly the goal of the Cabinet Office Transparency Team: see how these different people tend to perceive the Open Data issue, and what common ground can be found. Necessarily, such common ground is generic and tends to involve a discussion about fears, hopes, effects of data releases, and what they want from each other.

The workshop was highly interactive and helped each person interact with others and come into contact with, sometimes, a completely different point of view. Evidently there are many hopes about Open Data: that it can be better, quicker, machine readable, and most importantly linked. Many people attending the workshop also stressed they would like the process of data release to be more transparent. Some fears were also made explicit, especially about the possibility of low-quality, meaningless data being released.

I think, however, that the two most important points made in this respect were

  • sustainability of the data infrastructure: we don’t want Open Data to be released and go offline the day after because the server is engulfed by excessive demand; sustainability also in the sense that we want the agency releasing the data to define a process for updating the data.
  • engagement: the agency releasing Open Data needs to set up a way to interact with developers and campaigners to respond to their queries about the data released, and possibly some kind of “customer service” structure.

I strongly believe these two points to be the key to making the Open Data movement successful and I was frankly surprised to hear someone dismiss them with “we just want the data”. Although I agree that some data is better than no data, we shouldn’t be driving the system to the frustrating situation in which we can’t affect the Open Data release process because such a process hasn’t been defined properly. Moreover, although sometimes low-quality data is acceptable if there’s no alternative, I wouldn’t push the agencies to release data whose quality hasn’t been assessed: we don’t want to drive the whole quality down.

I fully understand that in the view of the Government and of some campaigners Open Data release can be a way to deal with Freedom Of Information requests in a more automatic way, and this surely means that data must be released as and when available. However, we have a historical chance to define the way data should be made public and what kind of added value we expect from them. This is an opportunity not to be missed.

Some interesting points were made when discussing what to expect from the Government and from the other actors. For example, the idea of re-sharing seems to be finally part of the common culture of data: most users are ready to be both users and consumers of open data, and push for everyone to make their data available. These data can be, in turn, derivative data from the original agency: a process that can enrich and empower the final users.

I do not particularly agree with those saying that Government should set the data release and step out of the game: I think that there is a need for a central assessment of the quality of data in order to prevent “crap data” from becoming mainstream and I can’t see many alternatives to a central agency, as Ofcom is for communications or Ofsted for education. What the Government needs to do is to make such procedures simple, to help other actors release Open Data through simple legislation, and to extend access to procurement for SMEs who currently struggle to satisfy the financial requirements even though they might offer better services than bigger companies. I believe that the Government should maintain its regulatory powers in this context in order to make data more relevant, accessible, democratic, genuinely open.

There is some concern about privacy, of course. One of the main points is that once you start releasing data you don’t know how these will be used and by whom. Worryingly, data don’t need to refer directly to a person to identify them. Identification is not a binary function. The classical example is how a car insurance company (yes, I pick on them easily!) can alter its prices after analysing crime rates data. This is something they couldn’t do before. In a way, where I live now identifies me more strongly than before, and the car insurer can amend their behaviour towards me because of Open Data although they don’t have perfectly identifiable information about me.

Should this prevent crime data from being released? I don’t think so. I would rather call for more regulations and for punishing this kind of behaviour, but I also think this concern shouldn’t be part of the Open Data movement: we only need to care about transparency and, in my case, efficiency of the systems that will be used to release the data. Concerns about privacy need to be addressed, but abuse of data is a widespread problem that does not affect only the Open Data context, so it should be tackled by another, more general, task-force.

I will be commenting about the points of the Open Data Consultation in a following post. For the time being, I would recommend reading what Chris Taggart has written about his response to the ODC.

geo, gov, mobile, my projects, open data, open source, policy

Outreach and Mobile: opening institutions to their wider community

[Disclaimer: this post represents my own view and not that of my employer. As if you didn’t know that already.]

Do the words “mobile portal” appeal to you?

I have been working extensively with a small team since last January to launch St. George’s University of London’s mobile portal, after we decided to go down the road of a web portal rather than that of a mobile app. The reason for this choice is pretty clear: despite the big, and growing, success of mobile apps, we didn’t want to be locked in to a given platform or to waste resources on developing for more platforms. Being a small institution it’s very difficult to get resources to develop on one platform, let alone multiple ones. We also wanted to reach more and more users, and a mobile portal based on open, accessible, resources made perfect sense.

Like many of the London-based academic institutions, St. George’s needs to account for two different driving forces: the first is that as an internationally renowned institution it needs to approach students and researchers all over the world; the second is that being based in a popular borough it is part of the local community for which it needs to become a reference point, especially in times of crisis. Being a medical school, based in a hospital and a quality NHS health care structure, greatly emphasises the local appeal of this institution.

This idea of St. George’s as an important local institution was one of the main drivers behind our mobile portal development. We surely wanted to provide a good, alternative, service to our staff and students, by letting them access IT services when on the move. However, the idea of reaching out to people living and working around us, to get St George’s better known and integrated within its own local community, led us to a thriving experience developing and deploying this portal. “Can we provide the people living in Tooting, Wandsworth, and even London, with communication tools to meet their needs, while developing them for people within our institution?” we asked ourselves. “Can we help people find more about their local community, give them ideas for places to go, or show them how to access local services?”

This coalition government had among its flagship policies that of a “Big Society”, with the aim “to create a climate that empowers local people and communities”. Surely a controversial topic, but nonetheless a helpful one for rediscovering a local role for institutions like ours and getting them back in touch with their own local community, which in some cases they had completely forgotten.

In any London borough there are hospitals, universities, schools, societies, authorities. No matter their political affiliation, if each of these could do something, they would improve massively the lives of the people living within their boundaries. Can IT be part of this idea? I think so. I believe that communication in this century can and does improve quality of life. If I can now just load my mobile portal and check for train and tube times, that will help me get home earlier and spend more time with my family. If I can look up the local shops, it will make my choices more informed. It might get me to know more local opportunities, and ultimately to get me in touch with people.

Developing this kind of service doesn’t come without effort. It required work and technical resources. We thought that if we could do this within the boundaries of something useful to our internal users, that effort would be justified, especially if we tried to contain the costs. With this view in mind, we looked for free, open-source, solutions that we might deploy. Among many frameworks, we came across Mollyproject, a framework for the rapid development of information and service portals targeted at mobile internet devices, originally developed at Oxford University for their own mobile portal. When we tried it for the first time, it was still very unstable and could not run properly on our servers. But we found a developer community with very similar goals to ours, willing to serve their town and their institution. We decided to contribute to the development of the project. We provided documentation on how to run the Molly framework on different systems, and became contributors of code. Molly was released with its version 1 and shortly afterwards we went live.

Inter-academic collaboration has been a driving force of this project: originally developed for one single institution, with its peculiar structure and territorial diffusion, it was improved and adapted to serve different communities. The great developments in the London Open Data Store allowed us to add live transport data to the portal, letting us have enthusiastic reactions from our students, and these were soon integrated in the Molly project framework with great help from the project community. I think this is a good example of how institutions should collaborate to get services running. A joint effort can lead to a quality product, as I believe the Molly project is.

The local community is starting to use and appreciate the portal, with some great feedback received and the Wandsworth Guardian reporting about a “site launched to serve the community”. I’m personally very happy to be leading this project as it is confirming my idea that the collaborative and transparent cultures of open source and open data can lead to improved services and better relationships with people around us, all things that will benefit the institutions we work for. The work is not complete and we are trying to extend the range of services we offer to both St. George’s and external users; but what we really care about and are happy with is that we’re setting an example to other institutions of how localism and a mission to provide better services can meet to help build better communities.
