Open Data: Sesame, or Pandora's Box?

Natalie Nelissen

9th July 2018


Natalie, our evidence and evaluation lead, blogs about the potential of open data and how we can optimise its use.

Open Knowledge International (OKI) define openness in relation to data and content as follows: ‘Open means anyone can freely access, use, modify, and share for any purpose.’ This echoes the meaning of, and is closely linked to, ‘open source’ software, which gives you methods to analyse those data, and ‘open access’, which means sharing your analysis results.

Examples of open data in our everyday lives include the use of official weather data, not just to find out if we should go for a run tonight but also to buy a house that is unlikely to flood. The government, NHS and public health organisations already offer an ever growing collection of open data, but often depend on third parties to analyse and present the data, as the BBC does with its NHS trackers, like this one.

The magical phrase ‘Open Sesame’ resulted in treasure and a happy ending for Ali Baba, if not so much for the 40 thieves. On the other hand, when Pandora opened a promising-looking jar, she unintentionally released previously trapped evils into the world. Where does the ever growing vault of open data sit on this spectrum?

Sesame

Publishing its data can make an organisation more transparent and accountable, increasing public trust. This is especially the case when the data was acquired, directly or indirectly, with taxpayers’ money. Also, if data is provided by individuals or groups, it seems only natural that they can see their own data. An example of the latter is human genes, nature’s instructions for how to build a person. Up until 2013, it was possible to patent naturally occurring human genome sequences in the USA, meaning that scientists or doctors would need to pay fees in order to help cure diseases linked to those sequences.

In addition to enabling and accelerating research, engaging with open data can empower communities, for example neighbourhood initiatives acting on crime rates. It can also enable individuals to make better choices, such as where to live and send children to school. Open data can improve decision making since there are more eyes looking at the same data, leveraging the power of crowd-sourcing. A lot of non-expert eyes may beat the expert, as demonstrated by Foldit players beating scientists at designing proteins.

In addition to social value, open data has the potential to generate significant economic benefits, both for the organisations releasing it and the companies using it. For example, data released by the Canada Revenue Agency helped expose $3.2 billion in tax evasion by illegally operating charities. In 2015, there were 270 UK companies using, producing or investing in open data, amounting to an annual turnover of over £92 billion. McKinsey estimates that open data’s global economic potential could be over $3 trillion per year.

Pandora

One of the greatest concerns surrounding open data is the potential threat to privacy. For example, when combining multiple sources or for rare conditions, there may be enough information to identify individuals. Having more data, more details and more links between different datasets offers more context and the greatest potential benefit for society, but at the same time also the greatest risk to an individual’s privacy. Furthermore, increased knowledge is not always desirable. For example, if the exact location of an endangered plant or animal is known, the increase in curious visitors could make the situation worse.
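To make the re-identification risk concrete, here is a minimal sketch of the kind of check a publisher might run before release: count how many records share each combination of details and flag the combinations that are too rare. The records, column choices and threshold below are all invented for illustration and do not come from any real dataset.

```python
from collections import Counter

# Hypothetical, made-up records: (age band, postcode district, condition).
records = [
    ("30-39", "LS1", "asthma"),
    ("30-39", "LS1", "asthma"),
    ("30-39", "LS1", "diabetes"),
    ("70-79", "LS2", "rare condition X"),
]


def risky_groups(rows, k=2):
    """Return each combination of values shared by fewer than k records.

    A combination held by only one or two people is a candidate for
    re-identification, especially once other datasets are linked in.
    The threshold k is an assumption; publishers choose their own.
    """
    counts = Counter(rows)
    return {combo: n for combo, n in counts.items() if n < k}


for combo, n in risky_groups(records).items():
    print(combo, "appears", n, "time(s)")
```

The more columns are included in each combination, the smaller the groups become, which is the trade-off between detail and privacy described above.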

Another big issue is associated costs. Firstly, there are usually one-off costs to get data ready for publication, such as anonymising, creating metadata, formatting and uploading. Secondly, there are ongoing costs related to hosting and managing, such as generating publicity and keeping the data up to date. Who should pay for these processes? Also, if public funding was used, is it OK for companies to make a profit, or for the benefits to be realised by only a small proportion of the population?

While open data is by definition open to everyone, in practice most data can only be interpreted and analysed by technical specialists, or those who have the resources to employ them. For example, to analyse a biomedical dataset, you usually need data mining experience as well as medical knowledge in order to pose relevant questions, check data quality and interpret results. For complicated datasets or analyses, you may also need specialist software or someone to write custom code. This restricts access and could cause a data divide alongside the existing digital and economic divides.

Next Steps

Even if we agree that open data at least has the potential for good, its implementation is still far from optimal. People, including data mining or content experts, often don’t know which data is available, or where to find it. For example, the UK’s wealth of open health data is scattered across multiple locations, as recently compiled by ODI.

From discussions with NHS data analysts, it is clear that even they are often not aware of all of these locations. One possible solution is a dedicated team and online repository to curate these datasets: listing and updating them, signposting similar datasets and linking to outcomes such as reports or software.

As previously mentioned, currently most data sets require experts to understand and use them. Good metadata, providing a full description of the dataset, can open up the data to more than just content experts. For example, abbreviations and units in medical datasets need to be clarified to allow interpretation by non-biomedical people. Also, in order to interpret data and its quality, it is important to understand how it was collected and processed. Furthermore, to reduce the need for programming knowledge or commercial software, providers should offer standard file types and tools, such as APIs, that allow downloading and interacting with data.
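As a rough illustration of what such metadata could look like, the sketch below pairs a small data dictionary with a check that every column in a file is documented. The column names, units and sample file are hypothetical, chosen only to show the idea, and do not describe any real NHS dataset or API.

```python
import csv
import io

# Hypothetical data dictionary: names, labels and units are invented
# for illustration only.
DATA_DICTIONARY = {
    "sbp": {"label": "Systolic blood pressure", "unit": "mmHg"},
    "bmi": {"label": "Body mass index", "unit": "kg/m^2"},
}

# Made-up sample file with one column ("hr") missing from the dictionary.
SAMPLE_CSV = "sbp,bmi,hr\n120,24.5,60\n"


def undocumented_columns(csv_text, dictionary):
    """List the CSV columns that have no entry in the data dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [name for name in reader.fieldnames if name not in dictionary]


def describe(column, dictionary):
    """Spell out an abbreviated column name together with its unit."""
    entry = dictionary[column]
    return f"{entry['label']} ({entry['unit']})"


print(undocumented_columns(SAMPLE_CSV, DATA_DICTIONARY))
print(describe("sbp", DATA_DICTIONARY))
```

A check like this could run each time a dataset is updated, so that new abbreviations do not slip through without an explanation.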

Once the data is in one place and easy to use, people still have to be made aware of this location and possibly incentivised to start using the data. This can be implemented by allocating funding to go towards publicity, community engagement, training and resources. For example, competitions or hack-days could set the challenge to find the best way to analyse a given data set. The real potential of open data can only be unlocked by efficient crowd-sourcing, allowing both a range of experts and non-experts to contribute a multitude of different fresh perspectives.

Take home message

While open data promises increased transparency and tremendous social and economic value, it also threatens privacy and requires a continued investment of finance and resources. For better or worse, the vault of open data has been opened, but relatively few have found it or ventured inside what currently resembles a dusty, cluttered (and ever growing) attic. One important issue to tackle to realise the full potential of open data is to make it accessible to as wide an audience as possible. This includes providing sufficient information and background about the data as well as ensuring that more people can find and analyse the data. The NHS is currently preparing a spring clean of its open data collection, so keep an eye out for changes here.

Natalie Nelissen

Research Fellow