Data app marketplaces seem like a natural progression from the previous discussions of unbundling and rebundling. If you missed Tristan, Benn and me discussing this topic, here’s the link:
Benn talks about creating a complete analytics backend, in the way that Rails or Django is for web app development, and rightly suggests that we don’t have a complete solution yet.
Reading through this documentation, I can see why a CTO I worked with said dbt reminded him of Django.
If you think about how things are defined in data teams that have adopted the Modern Data Stack:
Terraform (or other IaC or scripts) contains (a rough sketch follows this list):
Creation of most assets other than relations
Role and user creation
Permissions
Other database-specific object creation, like stages and pipes in Snowflake
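For concreteness, here’s roughly what that Terraform side can look like, assuming Snowflake as the warehouse. The resource and attribute names follow older releases of the Snowflake-Labs provider and are indicative rather than exact (the grant resources in particular have been renamed across versions):

```hcl
# Illustrative only: roles, users and grants kept in Terraform,
# entirely outside dbt's view. Resource names follow older releases
# of the Snowflake-Labs provider and may differ in current versions.

resource "snowflake_role" "transformer" {
  name = "TRANSFORMER"
}

resource "snowflake_user" "dbt_user" {
  name         = "DBT_USER"
  default_role = snowflake_role.transformer.name
}

# Attach the role to the user dbt runs as
resource "snowflake_role_grants" "transformer_to_dbt" {
  role_name = snowflake_role.transformer.name
  users     = [snowflake_user.dbt_user.name]
}

# Permissions on the schema dbt builds into
resource "snowflake_schema_grant" "marts_usage" {
  database_name = "ANALYTICS"
  schema_name   = "MARTS"
  privilege     = "USAGE"
  roles         = [snowflake_role.transformer.name]
}
```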
dbt contains (a rough sketch follows this list):
Relation creation and dependencies (models, refs, seeds, sources, materialisation)
Relation assertion (tests)
Functions (macros)
Semantic definitions (metrics, entities, exposures, analyses)
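And a minimal sketch of the dbt side, using the YAML metrics spec roughly as it stood around dbt 1.3. The model, column and metric names are made up, and the exact keys differ between dbt versions (the newer semantic layer defines metrics differently):

```yaml
# models/marts/schema.yml -- illustrative only; key names vary across dbt versions
version: 2

models:
  - name: orders               # the relation itself is built from models/marts/orders.sql
    description: One row per order
    columns:
      - name: order_id
        tests:                 # relation assertions
          - unique
          - not_null

metrics:
  - name: total_revenue        # a semantic definition served to downstream tools
    label: Total revenue
    model: "ref('orders')"
    calculation_method: sum
    expression: amount
    timestamp: ordered_at
    time_grains: [day, week, month]
    dimensions: [status, country]
```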
During my time at Lyst, my team were able to deliver data while operating almost entirely in this space, save for some things in LookML, which can be replaced by dbt semantic definitions.
So why don’t we have a complete backend here?
The Terraform and dbt realms here are more or less entirely separate. dbt doesn’t really know anything about roles and grants; the role dbt runs as is granted these by Terraform. dbt simply assumes it has the permissions needed to execute its tasks, resulting in DWH errors when it doesn’t.
With the web backends mentioned earlier, very fine-grained access control is possible, including at row and field level. This data backend needs to be able to understand, from the accessor key, what permissions the accessor has. This could potentially be some kind of mapping from API accessor key to a DWH or IAM role, but that feels a bit clunky. I’m sure someone much better at engineering than me could make, or has made, a more elegant solution.
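To make the clunky version concrete, here’s a rough sketch of that mapping, assuming Snowflake as the warehouse. The key store and role names are invented; snowflake.connector is real and its connect() call does accept a role argument, so the warehouse, not the app layer, ends up enforcing the permissions:

```python
# Rough sketch of mapping an API accessor key to a warehouse role.
# The key store and role names are hypothetical; snowflake.connector is real
# and connect() accepts a `role` argument, so the warehouse enforces access.
import os

import snowflake.connector

# Hypothetical mapping from a data app's accessor key to a DWH role
ACCESSOR_ROLES = {
    "key_marketing_app": "MARKETING_READER",
    "key_finance_app": "FINANCE_READER",
}


def connection_for(accessor_key: str):
    role = ACCESSOR_ROLES.get(accessor_key)
    if role is None:
        raise PermissionError("Unknown accessor key")
    # The session runs with only the privileges granted to that role,
    # so the grants defined in Terraform still apply here.
    return snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role=role,
    )
```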
Bart Vandekerckhove, co-founder at Raito (automated security for analytical infrastructure), provided me with the following snippet:
The Modern Data Stack is considerably complicating the risk surface. Whether it is Data Apps, reverse ETL, or even ELT, the data lineage is becoming longer and more complex, and to make things more complicated, many of the transformations are decentralised and elusive.
You’ll need a solution that matches the user of a Data App with the user on the DWH. In an ideal scenario, this is solved with SSO, but this is not always the case, so some kind of user resolution is necessary.
Additionally, you’ll have to dynamically adapt your access controls to the different stages in the data’s lifecycle. As data is being transformed, merged, and/or aggregated its sensitivity changes considerably, and it will be impossible to manually configure the access controls at every step. This is something we want to solve with Raito, so personally, I hope the metrics layer will have atomic metadata.
Currently, this data backend is primarily focused on serving metrics, which is logical, as this is the number-one use case for getting value out of data. However, true backends cover a much bigger world of components and their relations.
What if every entity we could want to reference or access, from an analytics viewpoint, was accessible from this data backend? In a web/OLTP backend, this could be: users, categories, orders… However, a web/OLTP backend aims to serve these requests at an atomic level and in the current state, with as low a latency as possible (milliseconds or better).
I feel like this is where there are some key differences (a rough sketch follows the list below):
A data backend wouldn’t offer atomic data, eg pull a single user’s address - that’s not the point of it. If you wanted this you would use the existing web/OLTP backend
It would provide, based on the metadata from modelling:
Aggregated data - eg metrics
Grouped and deduplicated categorical data - eg unique taxonomies, hierarchies
Enriched data - eg entity resolution pre-calculated or on the fly if real-time is needed, leads with conversion, user segments, predictions
Data of a certain quantity - eg the required amount of events to run a product funnel analysis
Data of a specific state - eg our orders as of a specific analysis date in the past
Data relationships and dependencies
Storage metadata - indexes, partitions, file location or the abstraction of these to ensure efficient access
Entity metadata - eg dimensions, last event, cohorts
Latency isn’t as critical as with a web backend - if the data backend’s latency were a few seconds, it usually wouldn’t matter
A data backend would provide atomic metadata - eg lineage of one field, dimensions of one metric
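To make that list a bit more concrete, here’s a purely hypothetical sketch of what asking such a backend for aggregated, point-in-time data plus atomic metadata could look like. Nothing here is a real library - the class and method names are invented for illustration:

```python
# Purely hypothetical: an invented client for the kind of data backend described
# above. A real implementation would compile these requests to warehouse SQL
# using the modelled metadata (refs, metrics, snapshots) rather than return stubs.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MetricResult:
    values: dict                                 # e.g. {("GB", "2022-06"): 1234.5}
    lineage: list = field(default_factory=list)  # atomic metadata for this request


class DataBackend:
    """Stand-in for the framework discussed above, not a real package."""

    def metric(self, name, grain, dimensions, as_of=None) -> MetricResult:
        # Stubbed response; latency of a few seconds would be acceptable here.
        return MetricResult(values={}, lineage=[f"{name} <- orders.amount"])


backend = DataBackend()
revenue = backend.metric(
    name="total_revenue",       # aggregated data, not a single user's row
    grain="month",
    dimensions=["country"],
    as_of=date(2022, 12, 31),   # data of a specific state: orders as of a past date
)
print(revenue.lineage)          # atomic metadata: the lineage of one metric
```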
This might seem like my wishlist of what I would want from a data backend framework, and it kind of is, but it’s also some of what I believe such a framework needs in order to function and to set it apart from a web/OLTP backend.
I’d be glad to update this with additional things the community discourse surfaces!