Before LLMs, metadata quality was something often maligned but rarely invested in. Let me draw a distinction between metadata and data here - there has been investment in data quality.
VC investment has been plentiful in the broader data quality space. Money has flowed into data observability tools like Metaplane, Monte Carlo and Acceldata, and new data observability startups are still being founded and funded. Investment has also flowed into data contracts tooling such as Avo, Syft, Iteratively (now part of Amplitude) and Gable, as well as more novel approaches in the space, like Schemata.
Investment has flowed this way because of recent history in the data space. 11 years ago, "The Sexiest Job of the 21st Century" was that of the data scientist. This was short-lived, as people soon realised that data scientists needed data: data that existed, and data pipelines that flowed - pipelines that cleaned and fixed that data. This naturally became the era of the data engineer, which we are still in.
More recently, the role of analytics engineer has superseded the data scientist as "The Sexiest Job of the 21st Century". You could argue that analytics engineers are data engineers focused on a specific part of the data engineering workflow. They are also more focused on metadata management and quality than any data role has been in the past, and they are often the folks responsible for data catalogs and semantic layers in their orgs.
We have automated a lot of the data engineering workload related to infrastructure. It is much more common today to buy best-in-breed SaaS tools that save labour costs and time, and we no longer need a team of data engineers looking after a "Big Data" stack. During this era of the data engineer, Snowflake was founded, launched, grew and finally floated in the biggest software IPO ever - almost no-one uses Hadoop any more.
However, data quality remains an issue...
Many consider data contracts a solution to data quality issues, and, if adopted by an organisation, they could indeed improve things. Data quality isn't purely a technical problem; it is a process and organisational one too. It can't be solved by infrastructure alone, which is the main reason it persists as an issue. If we could have solved it by throwing money and engineers at it, we would have solved it by now.
Data contracts improve metadata quality as well as data quality. There are varying perspectives on what should be in a data contract, but everyone agrees that data types should be. Data types, however, are only a tiny part of metadata in general. Metadata provides context on what data means and how it should be used, and it has been neglected because human data teams compensate for poor metadata by carrying the missing context in their heads.
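To make the distinction concrete, here is a minimal sketch in Python of what a data contract could hold. The structure and names (FieldSpec, DataContract, owner, pii) are assumptions for illustration, not any particular data contract specification; the point is that the data type is the only part everyone agrees on, while the descriptive fields carry the context that usually lives in people's heads.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal shape for a data contract - illustrative only,
# not any real data contract standard.

@dataclass
class FieldSpec:
    name: str
    dtype: str              # the part everyone agrees belongs in a contract
    description: str = ""   # what the field actually means
    owner: str = ""         # who to ask about it
    pii: bool = False       # a hint about how it may be used

@dataclass
class DataContract:
    dataset: str
    fields: list[FieldSpec] = field(default_factory=list)

orders_contract = DataContract(
    dataset="orders",
    fields=[
        FieldSpec("order_id", "string", "Unique identifier for a customer order", owner="checkout-team"),
        FieldSpec("amount_gbp", "decimal", "Order total in GBP, including VAT", owner="checkout-team"),
        FieldSpec("customer_email", "string", "Email entered at checkout", owner="checkout-team", pii=True),
    ],
)
```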
The argument comes back full circle to data modelling - data modelling done up-front during product engineering. If this happens, much of the metadata present during product engineering is preserved for use by analytics engineers later on. This metadata is incredibly rich: how the data relates to entities in the real world, how it connects together, what it means, how to measure outcomes… all of it is preserved. It also makes data contracts non-contentious and obvious to derive from the data model being created or changed.
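As a hedged sketch of what "deriving a contract from the data model" could look like, suppose the product team attaches descriptions to their application model - here via Python dataclass field metadata. The Order model and derive_contract helper are hypothetical; the point is only that a draft contract can fall out of the model mechanically rather than being negotiated after the fact.

```python
from dataclasses import dataclass, field, fields

# Hypothetical application data model, with meaning attached at modelling time.
@dataclass
class Order:
    order_id: str = field(metadata={"description": "Unique identifier for a customer order"})
    amount_gbp: float = field(metadata={"description": "Order total in GBP, including VAT"})
    customer_email: str = field(metadata={"description": "Email entered at checkout", "pii": True})

def derive_contract(model: type) -> dict:
    """Turn an application data model into a draft data contract."""
    return {
        "dataset": model.__name__.lower() + "s",
        "fields": [
            {
                "name": f.name,
                "dtype": f.type if isinstance(f.type, str) else f.type.__name__,
                **f.metadata,
            }
            for f in fields(model)
        ],
    }

print(derive_contract(Order))
```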
It is hard to measure the impact of poor metadata when humans have done a varyingly good job of compensating for it in their heads. If having poor metadata meant that humans couldn't use data at all, we would have dedicated the time and investment to improve it...
In this coming era of AI and LLMs, metadata quality will be as important as data quality. LLM applications need rich, high-quality metadata in order to use data; they can't reliably use data without it.
It is not practical to include missing or corrected metadata specific to your org in LLM pre-training. It is possible, through prompt engineering, to provide metadata that compensates for poor-quality base metadata, but this is an inefficient path: it results in a need to fine-tune the LLM to prioritise which metadata is correct. Which is correct - the base metadata retrieved via RAG, or the ancillary metadata injected later? Even where you are providing metadata that doesn't exist anywhere else, you are now maintaining metadata in the wrong place.
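A minimal sketch of that compensating pattern, with entirely hypothetical names (base_metadata, patched_metadata, build_prompt): one description of a column is retrieved from the catalog, another is hand-maintained in the prompt layer, and the model is left to arbitrate between them.

```python
# Hypothetical sketch of compensating for poor base metadata in the prompt.
# Nothing here is a real API.

base_metadata = {       # what retrieval pulls from the (poor-quality) catalog
    "orders.amount": "numeric column",
}

patched_metadata = {    # corrections maintained separately, in the prompt layer
    "orders.amount": "Order total in GBP, including VAT, after discounts",
}

def build_prompt(question: str) -> str:
    catalog_context = "\n".join(f"- {col}: {desc}" for col, desc in base_metadata.items())
    patches = "\n".join(f"- {col}: {desc}" for col, desc in patched_metadata.items())
    # The model now has to decide which description of orders.amount to trust,
    # and the corrected metadata lives only here, not in the catalog.
    return (
        f"Catalog context:\n{catalog_context}\n\n"
        f"Additional notes (prefer these if they conflict):\n{patches}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What was total order revenue last month?"))
```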
Having complete, clear and correct metadata in a centrally-accessible location, like a catalog or semantic layer, allows both humans and LLMs to benefit. It also means there is only one place to store and maintain the metadata, and when there is a single source of truth for metadata, you can invest more time in improving and maintaining it.
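Continuing the sketch above with equally hypothetical names, the same centrally maintained entry can be rendered once for human documentation and once for an LLM prompt, so any improvement lands in one place and benefits both consumers.

```python
# Hypothetical central metadata store - a stand-in for a catalog or semantic layer.
semantic_layer = {
    "orders.amount_gbp": {
        "description": "Order total in GBP, including VAT, after discounts",
        "owner": "checkout-team",
        "unit": "GBP",
    },
}

def render_for_docs(entry_id: str) -> str:
    """Human-facing documentation line, built from the central entry."""
    e = semantic_layer[entry_id]
    return f"{entry_id} ({e['unit']}) - {e['description']} [owner: {e['owner']}]"

def render_for_llm(entry_id: str) -> str:
    """Prompt context for an LLM application, built from the same entry."""
    e = semantic_layer[entry_id]
    return f"Column {entry_id}: {e['description']}. Unit: {e['unit']}."

print(render_for_docs("orders.amount_gbp"))
print(render_for_llm("orders.amount_gbp"))
```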
Metadata is more "active" than it has ever been. It has been considered a kind of data wealth accruing as a byproduct of data operations, but the value of that wealth has yet to be realised. LLMs may be the way we refine metadata into a consumable product.
Great blog as always, David, reading my mind and bringing a lot of good food for thought.
It'd be great to hear your expanded thoughts on
"a centrally-accessible location, like a catalog or semantic layer"
Do you see catalogs and semantic layers as two separate things, or two services of the same thing? Do you need both? Are they competing or collaborating? Could semantic layers just be generated out of a catalog? Is a semantic layer just an "edge" feature of a "core" catalog? Is a semantic layer a "bespoke" view of a "universal" catalog? Can you have a semantic layer without a catalog, and if so, is it managed differently? Is "semantic steward" an emerging data role? I'm working through some ideas in this space; I think it's definitely an area of interest.