18. Data Infrastructure

Issues in Open Data - Data Infrastructure

Key Points

  • Understanding data infrastructure by using an analogy to physical infrastructure like roads and rail helps us to see the range of components that make up data infrastructure - from datasets and servers to standards, policies, rules, and governance mechanisms.
  • Just as the quality of physical infrastructure, and the capacity of governments or the private sector to maintain it, varies across the world, access to high quality data infrastructure is unequally distributed. This leaves some countries much less able to secure the benefits of open data.
  • Open standards, identifiers, and registers (reference data) are the essential building blocks of data infrastructure, yet they often lack investment and see limited adoption.
  • In the future, data infrastructures will need to accommodate both open data and more restricted datasets. It will be important to build both trust and openness into our data infrastructures to maximise their social and economic value.

How to cite this chapter

Dodds, L. & Wells, P. (2019) Issues in Open Data - Data Infrastructure. In T. Davies, S. Walker, M. Rubinstein, & F. Perini (Eds.), The State of Open Data: Histories and Horizons. Cape Town and Ottawa: African Minds and International Development Research Centre.

Print version DOI: 10.5281/zenodo.2677811

Data Infrastructure

Introduction

Infrastructure powers our societies. It provides the fundamental services and systems that enable our economies to function, allow us to communicate, and support our daily activities. When we think of infrastructure, we first think of roads, bridges, water supplies, and electrical grids, but infrastructure also takes less tangible forms, such as ideas, basic research, and the internet.1 In this chapter, we discuss why data should be treated as infrastructure. It is a public good that enables the creation of a wide range of products and services. All sectors of our economies, at the local, national, and global level, rely on it. Roads help us to navigate to a destination; data helps us to navigate to a decision.

Just as societies and sectors strategically plan, invest in, and protect the physical infrastructure they rely on, we must also begin to do the same with our data infrastructure. Having a strong data infrastructure will become more vital as our populations grow and our economies and societies become ever more reliant on getting value from data to meet a range of needs.2

Good infrastructure is there when we need it, but, at the moment, too much of our data infrastructure is unreliable, inaccessible, siloed, or can only be used if you can afford access. Data innovators struggle to get hold of data and to work out how they can best use it, while individuals do not feel that they are in control of how data about them is used or shared.

Data infrastructure should be as easy to use as our road networks. The time and effort that goes into working around poor data infrastructure, due to the equivalents of potholes, toll booths, and missing intersections, would be better spent building services that improve our lives. To maximise value from data, we need data infrastructure to be high quality, and as open as possible, while protecting privacy, national security, and commercial confidentiality.

The rapid development and adoption of technologies that help us to collect and process data in new ways, and in ever-increasing volumes, is creating new data infrastructure in sectors such as health, transport, and agriculture. It is important that we make conscious decisions about how these sector-specific data infrastructures are designed,3 so that, over time, they are interoperable and form a connected data infrastructure (with appropriate measures in place to build and maintain trust). We must ensure that data is used ethically, that people are engaged in decisions about the data that impacts them, and that there is equitable access to, and benefits from, that data.

Defining data infrastructure

Describing data as infrastructure is more than just an analogy. In his exploration of the social value of infrastructure, Brett Frischmann defines infrastructural resources as resources that are consumed non-rivalrously, are required as inputs to support downstream activities, and are used to create a variety of goods and services.4 This definition clearly applies to data. Consumption of data is non-rivalrous: the same data can simultaneously be used by many different users. Data is a necessary input for making better decisions and a raw material for new products and services. Data is also used to create a wide variety of goods and services from commercial products5 to art.6

By understanding data as infrastructure in this way, we can begin to apply existing analysis and other insights that have been developed for “traditional” infrastructure to the management and supply of data. For example, we can start to acknowledge the level of planning, investment, and maintenance that is required to support that infrastructure in order to deliver benefits for the public good. We can also better appreciate that data infrastructure should be designed to support and enable a variety of uses, just as our roads support multiple modes of transport, in order to maximise benefits for all.

However, we also need to identify the individual building blocks of data infrastructure. Road infrastructure consists of more than physical assets. In addition to the physical assets like roads and traffic signals, the infrastructure also includes the standardised road markings and symbols that help us to safely and fairly use the roads, the policies and guidance that define how and where roads are built, and the organisations that maintain them. Similarly, data infrastructure also consists of more than just physical assets like servers and networks. A useful definition of data infrastructure must recognise more than just the technical infrastructure that we use to collect and store data.

The quality and connectivity of road infrastructure across different countries and even within countries can vary considerably (e.g. roads may be well-developed networks in cities and of poorer quality in rural areas). We should be able to assess the quality of our data infrastructure and define ways to strengthen it.

We provide a suggested definition of data infrastructure below. This broad definition reflects that used by the geospatial data community in their definition of spatial data infrastructure.7,8

Defining data infrastructure

Data infrastructure consists of:

  • Data assets, such as datasets, identifiers, and registers.
  • Standards and technologies used to curate and provide access to data assets.
  • Guidance and policies that inform the use and management of data assets and the data infrastructure itself.
  • Organisations that govern the data infrastructure.
  • The communities involved in contributing to or maintaining it, and those who are impacted by decisions that are made using it.

Infrastructures may be local, regional, national, or international. For example, to deliver the Sustainable Development Goals (SDGs), numerous sector-specific networks have been working to develop standards, guidance, and organisational structures that can capture comparable indicator data to measure progress toward the goals. This involves identifying existing sources of data that might be brought within the SDG “data infrastructure”, establishing new flows of data and thinking about how they will be maintained.

Data infrastructure at all levels might be maintained or used by government, private, or third-sector organisations. In developing countries, road infrastructure is being built through foreign investment, which raises questions about who benefits from that infrastructure. Similarly, we should understand who controls our data infrastructure9 and recognise that there is an evolving variety of governance models. The right model for the governance of a given element of data infrastructure might vary across sectors, nations, or communities.

Aspects of data infrastructure

Identifying and describing data infrastructure

Reflecting on the nature of data infrastructure can help the open data community to achieve a number of goals that are outlined below.

Firstly, our definition of data infrastructure clearly highlights the importance of the variety of actors that play a role in designing, managing, and governing data infrastructures, or in creating value from the data assets that they provide. This systems-related thinking can help us to publish data in ways that will enable a variety of long-term impacts. For much of the last decade, open data initiatives focused on encouraging the creation of data portals to help data stewards publish data and to support people in finding the data they need to create things. OpenDataSoft now lists over 2 600 open data portals around the world10 with the highest concentration in Europe and North America, where cities, government agencies, and national governments have created their own portals. Where portals are being used as a means of creating ecosystems of applications and services, we should invest appropriately in making them sustainable,11 ensure they are appropriately governed, and encourage the adoption of open standards to provide access to data.

Secondly, defining data infrastructure effectively can help us to recognise and describe existing gaps or deficiencies in data infrastructure.12 We can ask meaningful questions about whether that data infrastructure is as open as possible and advocate for practical interventions to help unlock additional value, such as through the creation of open standards or improved governance.

Examples of infrastructure: CrossRef and ORCID

Over the last decade, the research community has improved the data infrastructure for scholarly research. This has involved the creation of a range of data assets, organisational models, and policies mandating open access and licensing to enable better discovery and use of research outputs.

CrossRef and DataCite are not-for-profit membership organisations that provide unique identifiers for papers and datasets. Another not-for-profit, ORCID, provides similar identifiers for researchers. CrossRef is currently leading projects to create identifiers for other parts of the research ecosystem, including organisations and grants. They have also negotiated the licensing of open data from social media platforms13 to support analysis of the debate around scholarly research.

Each of these organisations plays a role in the research data infrastructure. They each support a wide variety of organisations, including established publishers, startups, and research institutions in developing a range of applications and services that support the research ecosystem.

There are many existing examples of data infrastructure. Some are provided by governments, while others are supported by sector or community initiatives. For example, CrossRef and ORCID (see box, Examples of infrastructure: CrossRef and ORCID) support the research and publishing sectors, while Europeana14 is enabling the cultural heritage sector to exchange and archive data, and initiatives like MusicBrainz15 and Discogs16 are providing data about music to support a variety of commercial and non-commercial projects.17 Projects like Wikidata18 and OpenStreetMap (see box, Examples of infrastructure: OpenStreetMap) are also best understood and evaluated as examples of data infrastructure.

Efforts to describe and document government information infrastructures predate the open data movement, often focusing on the internal interoperability of data within government. Open data initiatives have helped to refocus this work, so that government data is recognised as part of a wider open data infrastructure that is of benefit to businesses and society, as well as to government for its internal use. However, many early open data initiatives and advocacy campaigns encouraged governments to release a standard list of discrete datasets and to upload them to central data portals,19 rather than adopting tailored approaches to discussing, releasing, and supporting the use of data that met the needs of local communities.

Adopting more of a systems-thinking approach to documenting and mapping20 existing infrastructures helps us to better understand the specific challenges and pressures they face. For example, the increasing collection of data by commercial organisations is impacting our national and global data infrastructures for geospatial,21 transport,22 and weather data.23 Similarly, the increasing use of commercial data sources in national statistics24 or local government policy-making25 raises questions about the quality of those data assets and how well they represent the communities impacted by their use.

Finally, by understanding the shape and design of existing data infrastructures, we can identify common patterns that will help us to identify principles26,27 to inform the design of stronger infrastructure. New data infrastructure is being created in a number of sectors. Examples include the Open Banking Initiative in the United Kingdom (UK)28 and Mexico,29 OpenActive,30 GODAN in agriculture,31 and the DFID Digital Strategy in global development.32

In 2018, the Open Data Charter updated its strategy, moving from a focus on “high value datasets” to a focus on “publishing with purpose”.33 In line with this shift, it will be important to supplement our existing understanding of how data supports change with insights into successful patterns for delivering better services using data,34,35 and further, how these patterns can be replicated in other contexts nationally and internationally when the right data infrastructure exists to support them.

Examples of infrastructure: OpenStreetMap

OpenStreetMap (OSM) illustrates how an open, collaborative process can produce a global data infrastructure. Over the last 12 years, OSM has grown from a project supported by only individual contributors to one that now routinely receives contributions from a wide range of community groups, as well as commercial organisations like Mapbox and Facebook in addition to government agencies.

Supported by the provision of aerial imagery from a number of commercial sources, the OSM community has created more than just a global geospatial dataset. They have also constructed the necessary governance framework that supports its use in the creation of a wide variety of products and services. These uses include community-led mapping projects to support the investigation of land rights, humanitarian mapping, as well as a number of commercial applications.

For more detail, see Chapter 8: Geospatial.

The role of standards, identifiers, and registers

Open standards for data are reusable agreements that make it easier for people and organisations to publish, access, share, and use better quality data.36 When data is published using open standards, it can be used with off-the-shelf tools that support those standards, reducing the time and effort required to unlock value from the data.

Early open data efforts focused on “raw data”, and then pushed more for use of open file format standards (e.g. CSV rather than PDF). Where data has become available in standard formats, emphasis is now increasingly being placed on the creation or publication of data that uses common open standards for field names and definitions, identifiers, and classifications. The challenge this presents should not be underestimated as the vast majority of data published through open data portals makes limited use of open standards beyond file format, and, in many countries, data is still predominantly only available in PDF format.

Open standards are created using open, collaborative processes37 that enable many different stakeholders to create a useful agreement. Standards for data are essential whenever we want to consistently collect and exchange data. Some standards help to define shared vocabularies, while others define how we exchange data or capture best practices and guidance. We can also combine simple standards to create complex standards and workflows.

Creation and adoption of open standards can generate a variety of technical, economic, and policy impacts. Examples include the Open Contracting Data Standard, which is helping to create transparency around public spending in Paraguay and Nigeria,38 and the General Transit Feed Specification, which is supporting the publication of transport data for cities around the world in a form that can be consumed using a variety of tools and services.39
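To make the benefit of a shared standard concrete, the following is a minimal sketch, not a full implementation, of reading stop data published in the General Transit Feed Specification format. The column names (stop_id, stop_name, stop_lat, stop_lon) come from the specification's stops.txt file; the sample rows and the read_stops helper are invented for illustration.

```python
import csv
import io

# Illustrative sample shaped like a GTFS stops.txt file.
# The two stops below are invented, not taken from a real feed.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,51.5000,-0.1000
S2,Market Square,51.5100,-0.1200
"""


def read_stops(text):
    """Parse stops.txt content into a dict keyed by stop_id (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["stop_id"]: {
            "name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    }


stops = read_stops(SAMPLE_STOPS)
print(stops["S1"]["name"])
```

Because every publisher that follows the standard uses the same field names, the same few lines should work for any city's feed without per-publisher adjustments.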

Open standards for data that have been developed, however, are often not adopted. There are many reasons for this, including a lack of engagement with stakeholders and potential users, which can often lead to poorly defined standards, as well as difficulties in finding existing standards.40 Governments can also overlook their ability to encourage adoption by building requirements to support standards into the procurement and delivery of public services. As of 2018, the Open Data Institute (ODI) and others are working to develop new guidance and tools to support the creation and adoption of standards.41

To enable the consistent exchange and use of data, we often need to standardise specific elements within our data assets. For example, we may need standard identifiers for the organisations, places, and products referenced in our data assets. Standard identifiers help to link together datasets from multiple sources. They are the means through which we create the “junctions” between and within data infrastructures. Simply listing valid identifiers is seldom useful. Registers42 are datasets that provide consistent, accurate, and up-to-date lists of information that can also help to improve the quality of data collection and linking. Registers typically consist of identifiers and some basic reference data. Examples of registers include lists of countries and government departments, companies, and food types (see box, An example: UK government registers).
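The way identifiers and registers act as junctions between datasets can be sketched in a few lines. This is an illustrative example under stated assumptions: the register contents, the two datasets, their values, and the link_records helper are all hypothetical, with two-letter country codes standing in for whatever identifier scheme a real register would define.

```python
# Hypothetical register: an identifier plus basic reference data.
COUNTRY_REGISTER = {"GB": "United Kingdom", "FR": "France", "MX": "Mexico"}

# Two invented datasets from different sources that share the identifier.
contracts = [
    {"country": "GB", "contracts": 12},
    {"country": "MX", "contracts": 7},
    {"country": "UK", "contracts": 3},  # not a valid identifier in the register
]
portals = [
    {"country": "GB", "portals": 5},
    {"country": "MX", "portals": 2},
]


def link_records(register, left, right, key="country"):
    """Join two datasets on a shared identifier, validating against the register."""
    right_by_id = {row[key]: row for row in right}
    linked, rejected = [], []
    for row in left:
        identifier = row[key]
        if identifier not in register:
            # Unknown identifiers are set aside instead of being silently linked.
            rejected.append(row)
            continue
        merged = {"name": register[identifier], **row, **right_by_id.get(identifier, {})}
        linked.append(merged)
    return linked, rejected


linked, rejected = link_records(COUNTRY_REGISTER, contracts, portals)
print(linked)
print(rejected)
```

The register does two jobs here: it supplies the reference data (the country name) and it catches the non-standard code before it can corrupt the linked result, which is the data quality benefit described above.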

Openly licensed standards, identifiers, and registers are vital building blocks for data infrastructure.43 There is still a great deal of work to be done to support the adoption of existing standards, and to create new standards to help improve data collection and use in specific sectors. If the open data community is to create impact through the purposeful release of specific datasets, we must ensure that we are considering opportunities to improve our data infrastructure by creating openly licensed reference data.

Figure 1: What can we standardise? Source: http://standards.theodi.org/introduction/

An example: UK government registers

The UK government has long been a leader in open data and digital government. The UK’s Government Digital Service (GDS) recognises the importance of making better use of data, while the National Infrastructure Commission has noted data emerging as a form of infrastructure.44 In 2015, GDS started exploring how to build a series of registers: authoritative lists that people can trust.45

The list of registers published by GDS includes the list of countries recognised by the UK government and lists of local authorities, schools, and job centres. The content of the registers varies with what is being listed. For example, the list of local authorities includes two name variants and the start and end date for the entry, while the register of schools includes an address and the name of the head teacher. Each register is governed by a single government employee known as the custodian.

The GDS registers are driven primarily by the needs of government data users, but they are open for anyone to use. External organisations, such as MySociety and Transparency in Supply Chains (TISC), use them in their work.

The documentation for registers, including the list of registers, how to use them, how they are governed, and who maintains them, is publicly available. This is in line with the GDS commitment to working in the open.46 There are currently 34 live registers with 36 more in the process of being developed.
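As an illustration of why register entries carry start and end dates, the sketch below selects the local authority entries that were current on a given day. It is a hedged example only: the field names, the two entries, and the current_entries helper are assumptions made for illustration and are not taken from the GDS registers themselves.

```python
from datetime import date

# Invented entries shaped like the local authority register described above:
# two name variants plus a start and end date. Field names are assumptions.
ENTRIES = [
    {
        "key": "LA1",
        "name": "Example Borough",
        "official-name": "Example Borough Council",
        "start-date": date(1974, 4, 1),
        "end-date": None,  # still current
    },
    {
        "key": "LA2",
        "name": "Old District",
        "official-name": "Old District Council",
        "start-date": date(1974, 4, 1),
        "end-date": date(2009, 3, 31),  # abolished, but kept for historical data
    },
]


def current_entries(entries, on):
    """Return the entries that were valid on the given date (hypothetical helper)."""
    return [
        entry
        for entry in entries
        if (entry["start-date"] is None or entry["start-date"] <= on)
        and (entry["end-date"] is None or on <= entry["end-date"])
    ]


print([entry["key"] for entry in current_entries(ENTRIES, date(2019, 1, 1))])
```

Keeping ended entries in the list, instead of deleting them, means older datasets that reference a former authority can still be resolved against the register.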

Governance and trust

With the ever-increasing use of data come questions about governance and trust. People, communities, businesses, and governments must have trust in data if we are to maximise its social and economic value. This is particularly important for personal data. If we do not have trust in how personal data is used, then people will refuse to consent to data collection or withdraw from initiatives that share and use data. Much of open data is derived from personal data. For example, data collected by national censuses is used to produce official statistics, and data collected on passengers’ use of transport is used to create timetables. Meanwhile, some societies have decided that some personal data should be published openly, such as the names and addresses of the directors of corporations.

To increase levels of trust, the whole data ecosystem will need to build ethical considerations into how data is collected, managed, and used in order to ensure equity around who can access and use data and how the benefits are distributed. This will only be possible through meaningful engagement with the people and organisations potentially affected by the data. Within the context of data infrastructure, some of these activities will be performed by data stewards, who may create tools and services using data or draw on data to make decisions. Good data governance mechanisms are needed at every stage.

Data ethics is a branch of ethics that evaluates data practices with the potential to adversely impact people and society – through data collection, sharing, and use.47 Ethical issues are not limited to personal data. A number of organisations have published tools to help organisations make more ethical decisions about data,48 while others are publishing practices for ethical service design.49 Complying with legal obligations (e.g. supporting the rights granted by the European Union General Data Protection Regulation) is just one aspect of treating data ethically. Data-related activities can be unethical but still lawful; therefore, putting good regulations in place also helps to create ethical social norms.

The World Wide Web Foundation, in a review of low- and middle-income countries, has found that progress on introducing appropriate safeguards and regulations is mixed, and affected communities are not always engaged in shaping these interventions.50 As more countries and sectors contribute to data infrastructure, we must ensure that people are actively engaged and can contribute to both the data and its governance.

It will be particularly critical in the coming years to address issues of equity, ethics, and cultural diversity emerging from cross-border data transfers and the increasing use of data within service delivery. Many low- and middle-income countries lack the ecosystems to use data and build competitive data-enabled services like those provided by firms from high- and upper-middle income countries such as the United States, the UK, and China. Making data infrastructure as open as possible with the appropriate safeguards can help create a fair, competitive, and more level playing field,51 but it will also need other interventions, such as capacity building and better tax regimes,52 to create an effective systemic response and avoid either a future dominated by data monopolies or a future where our data infrastructure has only limited utility.

People, processes, and progress

It is also worth reflecting on the work required to develop and maintain data infrastructures, as well as the people and processes required to make them successful. Data and information infrastructures have historically been developed by a range of organisations, including governments and the private sector. Key infrastructure components like data policies, legislation, and standards for data exchange have been developed by national and international bodies that convene expert groups to help with the necessary work.

Participation in these processes has often been limited to those people and organisations that can invest time and money in contributing or can see the immediate value in contributing. Like many aspects of infrastructure development, the significant investment and work required is easily overlooked. This can lead to lack of investment in the creation of the infrastructure required to deliver on policy goals or to help to create more open markets. The need for significant investment can also unintentionally exclude some organisations and communities from participation in the creation of data infrastructure. This may lead to the creation of infrastructure that is less optimal because of the lack of consideration of different or diverse perspectives and needs.

The ability to work online, using a variety of agile collaboration tools, is creating more opportunities for communities to collaborate in the creation of standards and other aspects of data infrastructure. The adoption of open government initiatives, like open policy-making and the OpenStand principles,53 reflects a conscious effort to work more openly, include more views, and produce better outputs.

Greater sharing of insight through case studies, peer networking, and other activities will also help to highlight the investment required to develop data infrastructure, to build skills, and to cement an understanding of successful approaches. Initiatives like the Interoperability Data Collaborative54 and the African Regional DataCube,55 both part of the Global Partnership for Sustainable Development Data, reflect this approach to creating stronger data infrastructure.

The rise of Wikipedia and OpenStreetMap reflects another aspect of how data infrastructures are evolving. For these examples, the work of curating and maintaining data infrastructure falls to a community of volunteers and supporting organisations, reflecting a more networked, collaborative approach to the stewardship of data. We are at the early stages of understanding how to develop and maintain data infrastructure in this way. These initiatives offer opportunities for communities around the world to participate in the collection and maintenance of data about their local area and lives; however, there is also a risk that a lack of resources, skills, engagement, and investment may lead to the creation of infrastructure which excludes or does not reflect the needs of specific communities that are unable to participate.

Future directions

Data standards only made it onto the agenda of the International Open Data Conference in 2015, and, as yet, there has been little focus on the other aspects of data infrastructure. The open data community should recognise that after more than a decade of open data initiatives, we are not just releasing datasets, we are actively creating data infrastructure. The choices we make about licensing, technology, access, and governance will help to shape the ecosystems that this infrastructure enables.

Data exists across a spectrum. Not all data can be open. We need to design data infrastructures that support both open data and data shared in more restricted ways. The organisations that contribute to data infrastructure and use it will come from the public, private, and voluntary sectors, so we need to design for, and engage with, this mixed economy much more than we are doing now. We need to avoid treating open data as a special case by designing openness into any and all broader approaches to data sharing.

We will also need a step change in investment and in the level of effort put into creating data infrastructures. This requires a commitment to longer-term projects and support for the key infrastructural work required to make them successful. Embracing systems thinking, and exploring existing work on the social and economic impacts of infrastructure, as well as focusing on commons-based approaches, will help us move forward on key questions related to measurement, impact, and adoption.

We must also recognise the role that other organisations working on other aspects of the open movement have in helping to create stronger data infrastructures. In the last few years, the open source movement has also been discussing their outputs as infrastructure,56 particularly in recognition of how vulnerable large parts of the open source ecosystem are when key components are not maintained. We need to work with those promoting open cultures and digital rights, because they can help us understand how to retain trust while creating openness. Working together as a community, and in collaboration with other parts of the open movement, will help us maximise our impact.

We must continue to emphasise the importance of open standards, identifiers, and registers as the building blocks of data infrastructure, but we also need to ensure that we are placing enough emphasis on governance and the policies and legislation that help to create ethical and equitable access to the benefits of data. We need both trust and openness in our data infrastructure to maximise its social and economic value. If we get this right, then when we write about the state of open data and data infrastructure ten years from now, we will be able to tell the story of sound theory put effectively into practice.

Leigh Dodds

The Open Data Institute (ODI)

Leigh Dodds is an open data practitioner with experience working with a variety of sectors and organisations, developing and promoting the adoption of best practices for publishing and consuming data.

Peter Wells

The Open Data Institute (ODI)

Peter Wells is focused on data policy and helping people, organisations, and communities around the world to use data to make better decisions while being protected from harmful impacts.
