Data app marketplaces seem like a natural progression from the previous discussions of unbundling and rebundling. If you missed Tristan, Benn and me discussing this topic, here’s the link:
Benn talks about creating a complete analytics backend, in the way that Rails or Django is for web app development, and rightly suggests that we don’t have a complete solution yet.
Reading through this documentation, I can see why a CTO I worked with said dbt reminded him of Django.
If you think about how things are defined in data teams that have adopted the Modern Data Stack:
Terraform (or other IaC or scripts) contains (a rough sketch follows this list):
Creation of most assets other than relations
Role and user creation
Permissions
Other database-specific object creation, like stages and pipes in Snowflake
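For concreteness, here’s roughly what that Terraform side can look like, assuming Snowflake as the warehouse. The resource and attribute names follow older releases of the Snowflake-Labs provider and are indicative rather than exact (the grant resources in particular have been renamed across versions):

```hcl
# Illustrative only: roles, users and grants kept in Terraform,
# entirely outside dbt's view. Resource names follow older releases
# of the Snowflake-Labs provider and may differ in current versions.

resource "snowflake_role" "transformer" {
  name = "TRANSFORMER"
}

resource "snowflake_user" "dbt_user" {
  name         = "DBT_USER"
  default_role = snowflake_role.transformer.name
}

# Attach the role to the user dbt runs as
resource "snowflake_role_grants" "transformer_to_dbt" {
  role_name = snowflake_role.transformer.name
  users     = [snowflake_user.dbt_user.name]
}

# Permissions on the schema dbt builds into
resource "snowflake_schema_grant" "marts_usage" {
  database_name = "ANALYTICS"
  schema_name   = "MARTS"
  privilege     = "USAGE"
  roles         = [snowflake_role.transformer.name]
}
```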
dbt contains (a rough sketch follows this list):
Relation creation and dependencies (models, refs, seeds, sources, materialisation)
Relation assertion (tests)
Functions (macros)
Semantic definitions (metrics, entities, exposures, analyses)
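And a minimal sketch of the dbt side, using the YAML metrics spec roughly as it stood around dbt 1.3. The model, column and metric names are made up, and the exact keys differ between dbt versions (the newer semantic layer defines metrics differently):

```yaml
# models/marts/schema.yml -- illustrative only; key names vary across dbt versions
version: 2

models:
  - name: orders               # the relation itself is built from models/marts/orders.sql
    description: One row per order
    columns:
      - name: order_id
        tests:                 # relation assertions
          - unique
          - not_null

metrics:
  - name: total_revenue        # a semantic definition served to downstream tools
    label: Total revenue
    model: "ref('orders')"
    calculation_method: sum
    expression: amount
    timestamp: ordered_at
    time_grains: [day, week, month]
    dimensions: [status, country]
```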
During my time at Lyst, my team were able to deliver data while operating almost entirely in this space, save for some things in LookML, which can be replaced by dbt semantic definitions.
So why don’t we have a complete backend here?
The Terraform and dbt realms here are more or less entirely separate. dbt doesn’t really know anything about roles and grants; the role dbt runs as is granted these by Terraform. dbt simply assumes it has the permissions needed to execute its tasks, resulting in DWH errors when it doesn’t.
With the web backends mentioned earlier, very fine-grained access control is possible, including at row and field level. This data backend needs to be able to understand, from the accessor key, what permissions the accessor has. This could potentially be some kind of mapping from API accessor key to a DWH or IAM role, but that feels a bit clunky. I’m sure someone much better at engineering than me could make, or has made, a more elegant solution.
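To make the clunky version concrete, here’s a rough sketch of that mapping, assuming Snowflake as the warehouse. The key store and role names are invented; snowflake.connector is real and its connect() call does accept a role argument, so the warehouse, not the app layer, ends up enforcing the permissions:

```python
# Rough sketch of mapping an API accessor key to a warehouse role.
# The key store and role names are hypothetical; snowflake.connector is real
# and connect() accepts a `role` argument, so the warehouse enforces access.
import os

import snowflake.connector

# Hypothetical mapping from a data app's accessor key to a DWH role
ACCESSOR_ROLES = {
    "key_marketing_app": "MARKETING_READER",
    "key_finance_app": "FINANCE_READER",
}


def connection_for(accessor_key: str):
    role = ACCESSOR_ROLES.get(accessor_key)
    if role is None:
        raise PermissionError("Unknown accessor key")
    # The session runs with only the privileges granted to that role,
    # so the grants defined in Terraform still apply here.
    return snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role=role,
    )
```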
Bart Vandekerckhove, co-founder at Raito (automated security for analytical infrastructure), provided me with the following snippet:
The Modern Data Stack is considerably complicating the risk surface. Whether it is Data Apps, reverse ETL, or even ELT, the data lineage is becoming longer and more complex, and to make things more complicated, many of the transformations are decentralised and elusive.
You’ll need a solution that matches the user of a Data App with the user on the DWH. In an ideal scenario, this is solved with SSO, but this is not always the case, so some kind of user resolution is necessary.
Additionally, you’ll have to dynamically adapt your access controls to the different stages in the data’s lifecycle. As data is being transformed, merged, and/or aggregated its sensitivity changes considerably, and it will be impossible to manually configure the access controls at every step. This is something we want to solve with Raito, so personally, I hope the metrics layer will have atomic metadata.
Currently, this data backend is primarily focused on serving metrics, which is logical, as this is the number-one use case for getting value out of data. However, true backends cover a much bigger world of components and their relations.
What if every entity we could want to reference or access, from an analytics viewpoint, was accessible from this data backend? In a web/OLTP backend, this could be: users, categories, orders… However, a web/OLTP backend aims to serve these requests at an atomic level and in the current state, with as low a latency as possible (milliseconds or better).
I feel like this is where there are some key differences (a rough sketch follows the list below):
A data backend wouldn’t offer atomic data, eg pull a single user’s address - that’s not the point of it. If you wanted this you would use the existing web/OLTP backend
It would provide, based on the metadata from modelling:
Aggregated data - eg metrics
Grouped and deduplicated categorical data - eg unique taxonomies, hierarchies
Enriched data - eg entity resolution pre-calculated or on the fly if real-time is needed, leads with conversion, user segments, predictions
Data of a certain quantity - eg the required amount of events to run a product funnel analysis
Data of a specific state - eg our orders as of a specific analysis date in the past
Data relationships and dependencies
Storage metadata - indexes, partitions, file location or the abstraction of these to ensure efficient access
Entity metadata - eg dimensions, last event, cohorts
Latency isn’t as critical as with a web backend - if the data backend’s latency were a few seconds, it usually wouldn’t matter
A data backend would provide atomic metadata - eg lineage of one field, dimensions of one metric
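To make that list a bit more concrete, here’s a purely hypothetical sketch of what asking such a backend for aggregated, point-in-time data plus atomic metadata could look like. Nothing here is a real library - the class and method names are invented for illustration:

```python
# Purely hypothetical: an invented client for the kind of data backend described
# above. A real implementation would compile these requests to warehouse SQL
# using the modelled metadata (refs, metrics, snapshots) rather than return stubs.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MetricResult:
    values: dict                                 # e.g. {("GB", "2022-06"): 1234.5}
    lineage: list = field(default_factory=list)  # atomic metadata for this request


class DataBackend:
    """Stand-in for the framework discussed above, not a real package."""

    def metric(self, name, grain, dimensions, as_of=None) -> MetricResult:
        # Stubbed response; latency of a few seconds would be acceptable here.
        return MetricResult(values={}, lineage=[f"{name} <- orders.amount"])


backend = DataBackend()
revenue = backend.metric(
    name="total_revenue",       # aggregated data, not a single user's row
    grain="month",
    dimensions=["country"],
    as_of=date(2022, 12, 31),   # data of a specific state: orders as of a past date
)
print(revenue.lineage)          # atomic metadata: the lineage of one metric
```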
This might seem like my wishlist of what I would want from a data backend framework, and it kind of is, but it’s also some of what I believe such a framework needs in order to function and to set it apart from a web/OLTP backend.
I’d be glad to update this with additional things the community discourse surfaces!