I’ve heard this sentiment during user research sessions, but it also matches my own experience of working on various data cataloguing projects. On the surface, it looks straightforward: you compile a list of the data assets your organisation holds and make it available in a searchable form. There are even plenty of tools available, including commercial offerings such as Informatica and Microsoft Purview, and open source solutions like DataHub and CKAN. These offer technical solutions, and some can even automate metadata capture, but this hides the fact that the real challenge is a cultural one. In this blog post, we explore some of the reasons why.
Mismatched goals
The goals for engaging with a data catalogue differ between those looking for data (data consumers), those who own the data (data publishers), the central data team who run the catalogue (service owners or admins), and senior members of the organisation (Chief Data Officer, Chief Digital Officer, etc.).
Data consumers want to use the catalogue to help them discover data for a specific purpose. User research for the Central Digital and Data Office’s Data Marketplace found that, without an accessible data catalogue, finding data in government was challenging: searches sometimes took over six months and often turned up nothing. When searching a catalogue, you rely mostly on detailed textual descriptions of each data asset that give the context of why the data was gathered and what it contains. Once consumers have identified potentially viable data assets from their descriptions, they want to be able to assess the freshness, quality, and structure of the data.
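As a toy illustration of why those descriptions matter, here is a deliberately naive keyword search of the sort a catalogue performs under the hood (real catalogues use full-text indexes and ranking, so treat this as a sketch): an asset with a missing or empty description can never be found, however relevant the underlying data.

```python
def search_catalogue(records: list[dict], query: str) -> list[dict]:
    """Naive keyword search over catalogue descriptions.

    An asset with a missing or empty description never matches,
    no matter how relevant the underlying data is.
    """
    terms = query.lower().split()
    return [
        r for r in records
        if all(t in r.get("description", "").lower() for t in terms)
    ]

# Hypothetical catalogue entries, purely for illustration.
assets = [
    {"title": "road_traffic_2023.csv", "description": ""},
    {"title": "Traffic counts",
     "description": "Hourly road traffic counts for major A-roads, "
                    "collected to model congestion."},
]
print(search_catalogue(assets, "road traffic"))  # only the described asset is found
```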
Data publishers is a term that covers the team that provides the data, i.e. the data owners and their stewards. Their role in data cataloguing is to ensure their data are accurately described in the catalogue. There are generally no incentives for them to make their data more discoverable and reusable by others, and they often worry about people misusing the data or finding errors in it. As a result, they are rarely willing to put in the time and resources required to provide the metadata that makes data discoverable, i.e. the detailed descriptions, because those descriptions are not needed for their own use of the data.
The central data team wants a catalogue to support them in ensuring data governance is applied appropriately, that data is of good quality and conforms to standards, and that they can trace its lineage. While all of these goals support the data discovery journey, they require metadata at a more detailed level, which increases the burden on data publishers. The data team also supports the needs of senior members of the organisation by providing insights into how data is utilised, and by ensuring there is an accountable owner for every data asset.
Auto scanning
Auto scanning to generate metadata is often seen as the panacea of data cataloguing. What could be better than giving the catalogue system access to your data storage and magically getting a list of all your data assets to allow consumers to search for data?
Data cataloguing tools such as Microsoft Purview, DataHub, and Informatica provide functionality to scan your data estate and create catalogue entries for your data. However, this only captures the technical metadata, e.g. a file’s name, last modified date, and format. While this reduces the burden on the publisher, automatically scanned metadata alone does not make the data discoverable.
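To make that concrete, here is a minimal sketch of the kind of record an auto-scan produces, written with Python’s standard library rather than any particular catalogue tool (the field names are illustrative, not a real Purview or DataHub schema):

```python
from pathlib import Path
from datetime import datetime, timezone

def scan_data_estate(root: str) -> list[dict]:
    """Walk a directory tree and capture technical metadata only.

    This mirrors what an auto-scanner records: facts it can read
    straight from the file system, with no business context at all.
    """
    entries = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        entries.append({
            "name": path.name,
            "format": path.suffix.lstrip(".") or "unknown",
            "size_bytes": stat.st_size,
            "last_modified": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
        })
    return entries
```

Nothing in that record tells a would-be consumer what the data is about or whether they are allowed to use it.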
Discovery relies on the business metadata that provides the context for the data: a description of what the data contains, why it was collected, its thematic coverage, and usage rights. The automatically scanned fields do help the consumer assess a data asset, e.g. a recent last modified date gives them confidence that it is being kept up to date. You can think of this a bit like the difference between having a good README in your Git repository and checking the commit history to confirm that the codebase is still actively maintained. Of course, some automatically generated metadata can be used for discovery, such as tagging columns that contain certain types of identifiers (e.g. National Insurance numbers), which is a useful attribute for linking data.
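Putting the two halves together, a single catalogue entry might look something like the hypothetical record below. The split into technical and business fields is my own framing, loosely inspired by DCAT-style metadata rather than any specific product’s schema; the point is that a scanner can fill in the first half, but everything a searcher actually matches on has to be written by someone who understands the data.

```python
# A hypothetical catalogue record; field names and values are illustrative.
catalogue_record = {
    # Technical metadata: an auto-scanner can fill this in.
    "name": "hosp_adm_2024.parquet",
    "format": "parquet",
    "size_bytes": 48_230_114,
    "last_modified": "2025-01-14T09:32:00Z",

    # Business metadata: only the publisher can supply this,
    # and it is what free-text search actually runs against.
    "title": "Hospital admissions, England, 2024",
    "description": "Monthly counts of emergency and elective "
                   "admissions by provider, collected to monitor "
                   "winter pressures on acute services.",
    "themes": ["health", "hospital activity"],
    "licence": "Open Government Licence v3.0",
    "contact": "data-owner@example.gov.uk",
}
```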
Culture and incentives
For data cataloguing to be successful, there has to be a culture that supports and nurtures it. It should be seen as part of the greater good rather than an additional burden. This sometimes requires a culture shift from “This is my precious data and I may let others use it if they are nice to me” to “Hey, I’ve got some interesting data and I want as many people as possible to know about it and make use of it.” It is a move from a fear of being exposed for poor data quality to one of collective improvement.
In the early days of data.gov.uk, there was a culture of ensuring that data descriptions were added to the catalogue and maintained, with buy-in at senior levels to make sure assets were published. However, priorities changed, leading to a drop in both the rate of publishing and the quality of descriptions; now only mandated data assets are being added. Hopefully lessons have been learnt from this experience and are being applied to the new cross-government Data Marketplace.
In academia, cataloguing the data produced by research projects has been achieved by making it a requirement of grant funding. When applying for funds, a researcher needs to identify where the data will be archived at the end of the project. As part of the archiving process, a description of the data and of the project that collected it is provided, with the aim of helping others to Find, Assess, and Reuse the data.
There is a good understanding of what is needed to make data discoverable, although this may not always be straightforward once you consider versions and editions of a data asset. There are technical solutions out there that, to greater or lesser extents, support publishers and consumers in meeting their needs. But the real reason cataloguing projects fail is that data publishers are not incentivised and supported to publish, and then maintain, high quality descriptions of their data. Data cataloguing needs to be recognised as a core part of the publisher’s job role, with objectives linked to it. We need to move to a culture of data as a product, where data can be utilised by many across interoperable systems. This will lead to cost savings by removing duplication.
The publishing challenge cannot be solved by technology alone, even if you deploy a cataloguing solution that can scan your data estate and hook it up to the leading GenAI tools. Data publishers still need to curate their records: adding the detailed descriptions, linking related data entries, forming them into data products, and ensuring the catalogue holds a complete list of their data. Data publishers need to invest their time to make their data discoverable. They need to be rewarded and recognised for this effort, and to be in the mindset that their data can benefit others and should be shared where it can. Getting this right will bring cost savings and opportunities to design and deliver better public services.