If you’re not on Bluesky yet, I’d really recommend it. More of us have joined since I wrote a post about it a couple of weeks ago, and it’s a really great place for data folks at the moment. I am @jayatillake.bsky.social there.
I think it’s going to be a really helpful place to find inspiration for new posts. There was a particularly interesting thread yesterday that I thought was worth delving into further:
is right - there is a decent amount of confusion around catalogs. Jacob, Arynn and I work for data vendors - so if we aren’t clear, then how can we expect data folks working in industry to understand what’s going on?!

The problem is driven by the fact that any tool in the data stack which generates or harvests metadata acts as a catalog, whether or not it markets these features or even makes them easily accessible.
Anything in the production path for data: connector, transformation, data warehouse/lakehouse, semantic layer, BI tool, orchestrator, reverse ETL… generates metadata to describe the objects it controls and the actions they perform or are used to perform.
Then you have purpose-built data catalogs and observability tools, which harvest metadata from all of these tools - but in the process of harvesting it and making it accessible, they end up generating metadata too!
Observability tools run heuristic and probabilistic data tests, and those tests themselves become objects whose actions generate metadata: the data asset has a test, and the test has a run and an outcome. It’s impossible to do anything with data without generating metadata.
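To make that concrete, here is a tiny, purely illustrative Python sketch - none of these names come from any particular observability tool - of how a single freshness test already produces metadata about an asset, a test, a run and an outcome:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DataTest:
    asset: str   # the data asset under test, e.g. "analytics.orders"
    name: str    # e.g. "freshness_under_6h" or "not_null_order_id"
    kind: str    # "heuristic" or "probabilistic"

@dataclass
class TestRun:
    test: DataTest
    started_at: datetime
    outcome: str                         # "pass" | "warn" | "fail"
    observed_value: float | None = None  # e.g. hours since last load

freshness = DataTest(asset="analytics.orders", name="freshness_under_6h", kind="heuristic")
run = TestRun(
    test=freshness,
    started_at=datetime.now(timezone.utc),
    outcome="pass",
    observed_value=2.4,
)

# The run is metadata about metadata: an object, an action and an outcome.
print(run)
```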
Even purist data catalogs generate metadata of their own - user activity and sync logs, at a minimum. Some of this metadata, like user searches, isn’t necessarily connected to any data asset at all, but it is still really valuable for knowing what data products/assets are wanted and needed.
There are also catalogs that aren’t really intended for human use, like Iceberg’s, where the catalog is read by a query engine to determine an efficient query plan. Let’s call this ‘metadata in a system catalog’. All databases and data warehouses store this kind of metadata too, for the same purpose. Again, you can see how it could become valuable to users and engineers in conjunction with other metadata. Generically, this metadata is stored and maintained atomically within the system it describes.
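A lakehouse or Iceberg catalog is much richer than this, but a completely self-contained way to see ‘metadata in a system catalog’ is SQLite, which keeps every table and index definition in sqlite_master and lets you watch the planner use that metadata:

```python
import sqlite3

# The system catalog describes the objects the engine controls...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

for row in conn.execute("SELECT type, name, sql FROM sqlite_master"):
    print(row)

# ...and the planner shows how it uses that metadata for a given query.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"):
    print(row)
```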
For example, if you knew the access volume, patterns and performance for data assets (and you can if you want to), you could combine this metadata with the partitioning and indexing metadata from a system catalog to know whether you need to add, change or maintain partitions and indexes over time. In fact, this kind of work is a core part of data and analytics engineering today.
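A hedged sketch of that kind of check is below; the inputs are hypothetical aggregates you might pull from a query-history view and a system catalog, and the thresholds are purely illustrative rather than recommendations:

```python
# Hypothetical access metadata, e.g. aggregated from a query-history view.
access_metadata = {
    "asset": "analytics.orders",
    "scans_per_day": 1200,
    "most_filtered_column": "order_date",
    "p95_scan_seconds": 45.0,
}

# Hypothetical layout metadata, e.g. read from a system catalog.
catalog_metadata = {
    "asset": "analytics.orders",
    "partitioned_by": ["customer_region"],
    "indexed_columns": [],
}

def suggest_layout_changes(access: dict, catalog: dict) -> list[str]:
    suggestions = []
    hot_column = access["most_filtered_column"]
    if hot_column not in catalog["partitioned_by"]:
        suggestions.append(f"consider partitioning {catalog['asset']} by {hot_column}")
    if access["p95_scan_seconds"] > 30 and hot_column not in catalog["indexed_columns"]:
        suggestions.append(f"consider indexing or clustering on {hot_column}")
    return suggestions

print(suggest_layout_changes(access_metadata, catalog_metadata))
```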
I made the diagram above to show the span and focus of scope for catalogs, based on where they sit in the stack and the persona they serve. Traditional data catalogs have a strength in connecting to the whole stack, provided they have sufficient connectors for any given org. However, this is a real challenge, and they end up needing more connectors than Fivetran to do it. They also have the problem that they aren’t for any particular persona - they want to serve them all in order to be a viable product to buy, but in doing so they don’t serve any of them that well.
If you think about the personas, their own tools actually serve them better as metadata catalogs. For Data Engineers, the orchestrator is actually the best place1 for a metadata catalog2. Data Engineers collect, move and transform data and put these pipelines into production. The orchestrator contains every step of their workflow, because it is the tool they use to put their pipelines into production. For any given pipeline, they can see every step that happens in their orchestrator, especially if that orchestrator offers good transparency into the actions that happen in each step - the way Dagster offers deep insight into the individual dbt models that may be run as part of one node in a DAG.
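As a rough sketch of how that looks with the dagster-dbt integration (the manifest path and project layout here are assumptions for illustration), each dbt model in the manifest becomes its own asset with its own events, logs and lineage, even though they all run as one step in the pipeline:

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# Assumed location of the dbt manifest, produced by `dbt parse` or `dbt compile`.
DBT_MANIFEST = Path("target/manifest.json")

@dbt_assets(manifest=DBT_MANIFEST)
def my_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # One orchestrator step runs `dbt build`, but Dagster streams the events
    # back per model, so each model shows up as its own asset with its own
    # materialisation history.
    yield from dbt.cli(["build"], context=context).stream()
```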
For data analysts and business consumers, and data consumers in general, having metadata about assets they shouldn’t or can’t consume is not really helpful - it’s mostly noise. It’s only signal when it educates them on the provenance of the data they are consuming. Some might say it could also be helpful for them to know what data exists so they can request that data become consumable, but often this leads to red herrings where some stale old, misnamed data exists in storage that can’t be made available. Data consumers don’t need to know about the whole contents of a data lake/swamp.
Data consumers often have useful metadata made available to them in the places where they consume data, like BI tools, reverse ETL tools or bespoke data applications made for them. These metadata stores often show them what exists to consume, which is helpful but often leads to inconsistency across these many consumption sources. Large enterprises often have many of these tools, and many of the same type of tool, too. This leads to confusion for data consumers about which data assets to use, due to a lack of clarity and simplicity in the metadata describing them and to the siloing of that metadata.
The only viable solution to this that I have seen is to govern and control access to consumable data from a semantic layer. There, data consumers can see the core entities they have access to, what they mean, how they are defined, their provenance (for analysts), and how these entities, with their attributes and measures, are consumed elsewhere.
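Here’s a purely hypothetical sketch of what one semantic catalog entry might expose to a consumer - none of these names come from a real semantic layer, it just shows the shape of the information:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticEntity:
    name: str                                              # e.g. "customer"
    definition: str                                        # plain-language meaning
    attributes: list[str] = field(default_factory=list)
    measures: list[str] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)    # upstream models/tables (for analysts)
    consumed_in: list[str] = field(default_factory=list)   # dashboards, reverse ETL syncs, apps

customer = SemanticEntity(
    name="customer",
    definition="A person or organisation with at least one completed order.",
    attributes=["customer_id", "region", "signup_date"],
    measures=["customer_count", "lifetime_value"],
    provenance=["analytics.dim_customers"],
    consumed_in=["Revenue dashboard", "CRM reverse ETL sync"],
)

print(customer.name, "->", customer.consumed_in)
```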
A general data catalog could serve this purpose, but it would essentially be recreating a semantic catalog through scoping, and the provider would have to deeply support the abstractions and access control that the semantic layer governs… which is unlikely, in all honesty. In the ZIRP era, now an increasingly distant memory, budgets could have accommodated both a data catalog and a semantic layer (which contains a catalog). In this era, given that the data catalog is not in the production flow and the semantic layer is, it makes more sense to just use the catalog in the semantic layer and save money and complexity.
In short, I think the ideal solution is to allow the data user to use a catalog where they already work. For the data engineer, this makes sense to be in their orchestrator, but it would be incredibly helpful for that orchestrator to expose downstream consumption of the assets/products the data engineers maintain3. This could end at a semantic layer in cases where an organisation uses one, but otherwise would need to support the many consumption tools used by the org.
For data analysts, business users and other data consumers, the semantic layer is the ideal place for their catalog due to simplicity, consistency, clarity and limited scope. They can find out about an entity and then know where data assets about that entity exist, how popular they are and jump straight to them without needing to wade through the Night King’s legion of dead dashboards.
Analytics engineers are probably the only data users who need to look across many catalogs, including the orchestrator, system catalogs in source systems like Salesforce, table catalogs like Iceberg, data warehouses/lakehouses, source databases, plus the semantic catalog 😅4. There is a good reason that analytics engineers are sometimes referred to as data librarians.
I used to think that the transformer was a good place for a catalog, as with dbt docs, but the truth is that when I saw dbt docs for the first time, I hadn’t ever seen a data catalog outside of a Google doc. Transformers can’t get lineage from connectors unless the connectors inject it, as with the Fivetran dbt packages, nor can they connect to BI tools. Good orchestrators can now absorb the full metadata of a transformer, so an orchestrator is a better place for a catalog than a transformer in any setup that uses one.
Apart from during an incident, when this switches to the observability tool; ideally, the metadata from observability tools would feed into the orchestrator.
If you don’t know something is in production, it’s easy to bring down production.
You might say that the data engineer also needs to look at all of these catalogs, but it’s the analytics engineer who has the business and product context, so they are the ones who need to request what they need from source systems based on those systems’ metadata catalogs. They need to request that data be stored in a certain way for future modelling and exposure, because they understand the patterns of how this will happen in a way that a data engineer may not. I fully understand that often both of these roles are done by the same person, which has both positive and negative trade-offs: the positive mostly being the elimination of information lost up and down the chain; the negative being that these unicorns are hard to hire.
Honored to be food for thought for your blog. 🙂