Getting data is hard to do: Clean data on House of Commons Members

Data on Members of Parliament is notoriously non-standardised and difficult to sift through. That’s why Dr. Larissa Peixoto Gomes is sharing her database on the House of Commons.

Bjørn Erik Pedersen [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]

Getting the data

Data collection is one of the most delicate steps of research, partly because collection also means clean-up: making the data ready for use, standardised and categorised. For political scientists, this is incredibly important, as we replace words with numbers, creating dummy variables in order to conduct our analyses. If the data is not standardised, automatic categorisation may treat small differences, for instance in capitalisation, as real distinctions and create two different categories for what should be one.
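To make the capitalisation problem concrete, here is a minimal sketch in Python. The party labels are invented for illustration and are not taken from the House of Commons files; the `standardise` helper is likewise hypothetical.

```python
from collections import Counter

# Hypothetical raw labels, as they might look before any clean-up.
raw_parties = ["Labour", "labour", "Conservative", "LABOUR ", "conservative"]

# Naive categorisation: every distinct string becomes its own category.
naive_counts = Counter(raw_parties)


def standardise(label):
    """Trim surrounding whitespace and ignore capitalisation."""
    return label.strip().casefold()


# After standardising, variants of the same party collapse into one category.
clean_counts = Counter(standardise(p) for p in raw_parties)

print(len(naive_counts), "categories before clean-up")
print(len(clean_counts), "categories after clean-up")
```

Without the clean-up step, each spelling variant would end up as its own dummy variable.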

While I was developing the database for my PhD dissertation, I was struck by how difficult it was to work with non-standardised House of Commons data on MPs’ names, parties, and gender. This affects those of us who attempt to use big data and optimistically believe that the data will be ready for use; but data is made at different points, by different people, under varying constraints. I do not mean to disparage the hard work of House of Commons clerks, who provide us with enormous amounts of excellent quality data and information. These things slip through the cracks, especially when people on the providing side can’t really imagine or foresee how we would use that data.

Methodology

For each type of legislative action I looked into, names appeared in different formats, for instance: LASTNAME, NAME; NAME, LASTNAME; TITLE (lady, sir, baroness, viscount); HONORIFIC + LASTNAME (Mr, Miss, Ms, Mrs, Dr). Moreover, the format is not consistent for each person. One example is MP Dan Poulter (Conservative), who sometimes appears as “Dan Poulter”, sometimes as “Dr Poulter”, sometimes as “Poulter, Dan”, and sometimes as “Poulter, Dr”. John Thurso (LibDem) appears as “Viscount Thurso”, “Thurso, Viscount”, “John Thurso”, and “Thurso, John”. In addition, these may all appear within the same file, although I did not check if they were standardised per year. There are also many, many missing values, specifically MPs without parties assigned to them. For those, over 200 MPs in all, I did a Google search and filled in the blanks.
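I did this clean-up by hand in Excel, but the name variants above can also be reduced to a common key in code. The following Python sketch is only an illustration: the `parse_name` function and the `TITLES` set are hypothetical, and it assumes that a comma always means the surname comes first, which would not hold for entries written as NAME, LASTNAME.

```python
# Titles and honorifics listed in the text; extend as needed.
TITLES = {"mr", "miss", "ms", "mrs", "dr", "sir", "lady", "baroness", "viscount"}


def parse_name(raw):
    """Return (surname, given_name) in lower case; given_name may be empty.

    Assumption: "X, Y" means X is the surname. Without a comma,
    the last word is taken to be the surname.
    """
    raw = raw.strip()
    if not raw:
        return "", ""
    if "," in raw:
        surname, rest = (part.strip() for part in raw.split(",", 1))
    else:
        parts = raw.split()
        surname, rest = parts[-1], " ".join(parts[:-1])
    # Drop titles and honorifics so "Dr Poulter" and "Dan Poulter" share a surname key.
    given = " ".join(
        token for token in rest.split()
        if token.rstrip(".").casefold() not in TITLES
    )
    return surname.casefold(), given.casefold()


variants = ["Dan Poulter", "Dr Poulter", "Poulter, Dan", "Poulter, Dr"]
for v in variants:
    print(v, "->", parse_name(v))
```

Entries that carry only a title or honorific come back with an empty given name, so they still need to be matched on surname and checked by hand, since several MPs can share a surname.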

There is also only one file with data on the MPs: the members file. This means that any other dataset you download or create (for example in R, with the hansard package created by Evan Odell) will probably not include party and gender.

Accessing the data

After cleaning up the data using Excel and its handy formulas, I put together a file with MPs, parties, and gender to use as a base, and I’ve decided to share that file on my personal website and on the PSA blog. Anyone is welcome to use it. There are 881 MPs compiled here, but since my own research only runs up to 2017, recent switchers are still categorised within their old parties. Anyone who switched parties within the 2000-2017 timeframe is categorised as the party they were in the longest or the party they were in when in Parliament. Following common usage, men are coded 0 and women 1.

After downloading this, you can use an OR formula in Excel to categorise your data. Note that Excel will only accept 255 arguments per function (at least in the 2016 version), so you’ll have to do it in chunks (e.g. =OR(A2=$B$2; A2=$B$3; A2=$B$4; A2=$B$5)=TRUE). I’m sure there’s a faster way to do it in R, though. My only hope is that this helps other researchers out there!
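For readers who would rather script this step, the chunked OR formula amounts to asking whether a name appears in the base file, and a dictionary lookup answers that without any argument limit. The sketch below uses Python rather than R. The column names (`name`, `party`, `gender`) and the two sample rows are assumptions made for illustration; the layout of the shared file may differ.

```python
import csv
import io

# Illustrative stand-in for the base file; the real column names may differ.
BASE_CSV = """name,party,gender
Dan Poulter,Conservative,0
John Thurso,LibDem,0
"""

# Index the base file by lower-cased name so capitalisation does not matter.
base = {
    row["name"].casefold(): row
    for row in csv.DictReader(io.StringIO(BASE_CSV))
}


def lookup(name):
    """Return (party, gender) for a name, or (None, None) if it is not in the base file."""
    row = base.get(name.strip().casefold())
    if row is None:
        return None, None
    return row["party"], int(row["gender"])


# Hypothetical names from another dataset that lacks party and gender.
records = ["Dan Poulter", "john thurso", "Unknown Member"]
matched = [lookup(name) for name in records]
print(matched)
```

To use a downloaded copy, replace `io.StringIO(BASE_CSV)` with an open file handle. Names that come back as `(None, None)` are the ones that still need manual checking.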

Download the data here.

Dr Larissa Peixoto Gomes is a political scientist with a PhD from the Federal University of Minas Gerais, Brazil. She researches legislative institutions, political representation, gender, and ethnic minorities. Follow her on Twitter: @larissapolitics

This post was originally published on the PSA blog, and is reposted with the permission of the author.