Over-engineering (or overengineering, or overkill) is the act of designing a product or providing a solution to a problem in an overly complicated manner, when a simpler solution can be demonstrated to exist with the same efficiency and effectiveness as the original design.
In software engineering, the received wisdom has long been that you shouldn't over-engineer: it leads to slower delivery, unnecessary complexity and higher build and maintenance costs.
This principle often doesn't hold with data.
As a data person, you will have experienced the “game of requirements ping pong” with a stakeholder:
1. Stakeholder asks for X
2. You make X and share it
3. Stakeholder says it's not quite what they wanted and gives you more requirements
4. You make Y and share it
5. Go back to step 3 and iterate a few times until you're exasperated
6. Stakeholder finally has something they're happy with, or gives up
You can ask what decision needs to be made, nail down the requirements in a Jira ticket, have a synchronous conversation with the real requestor… but the above process will still happen to a greater or lesser extent.
Experienced data people, who know their stakeholders and organisations well, anticipate extra requirements. If they're asked for a trend of revenue, they may well group it by a few other dimensions so they can split and filter afterwards.
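As a minimal sketch of that habit, assuming a hypothetical orders table with order_ts, revenue, country and channel columns, the requested revenue trend can be built one level more generic than asked for, so follow-up questions become a filter rather than a rebuild:

```python
import pandas as pd

# Hypothetical orders extract: one row per order.
orders = pd.read_parquet("orders.parquet")  # assumed columns: order_ts, revenue, country, channel

# Asked for: weekly revenue trend.
# Built: weekly revenue by country and channel, so "can you split that by market?"
# or "is it paid or organic?" doesn't require going back to the source data.
weekly_revenue = (
    orders
    .assign(week=orders["order_ts"].dt.to_period("W").dt.start_time)
    .groupby(["week", "country", "channel"], as_index=False)["revenue"]
    .sum()
)

# The originally requested view is just an aggregation of the generic one.
trend_only = weekly_revenue.groupby("week", as_index=False)["revenue"].sum()
```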
You know that one question leads to five, so you try to work in as generic a way as you can to cover a large question space. This is partly due to the nature of data requests: they are often exploratory. Engineering resource is closely guarded by product managers, who usually know what they are building and why. Stakeholders who make data requests usually aren't sure what they need; they start where they think makes sense and then gradually zone in.
An element of this work will always be exploratory; analytics work, in particular, is not guaranteed to be clear from the outset. A product and problem-solving mindset helps here, making the intended impact and use of the work explicit up front. The work can then also be prioritised by its impact and value to the organisation.
Data requirements should be built alongside product and commercial requirements. If you're going to execute in a certain way, how will you measure what success looks like? If you know the metrics up front, data people can help ensure that the necessary data is captured from the outset. They may also guide you towards a better metric, but that's a blog post of its own. Data needs to be part of the roadmap, not an afterthought.
There is a clear technical reason for this, too. Generally speaking, if you break software, you can fix it with no ill effects (a contributor to the "move fast and break things" culture in the tech industry). This is not true for data! Often when data is broken, it can't be recovered, because you can't change the past. If no events are tracked in the application, you can't go back in time to track them. If you don't store the properties you need, or even the ID of the entity they belong to, you can't go back and fill them in. Sometimes you may have log files to fall back on, but even these are often incomplete and aren't stored for long. I recall spending days reconstituting events from CloudFront logs, in combination with other data sources!
This is also why it is often a good idea to over-engineer: store more data than you need, at more points than you need, immediately. This is especially true where the events capture a decision by a user or customer, or are control lever events such as those in marketing and CRM. You may not want to answer a question about that event now, but you probably will at some point - if you haven't stored the data, you've limited your future options.
It can also make you go "big" early when choosing architecture - choosing something which may seem overly complex at the time. If I were going to build a new product tomorrow, I would implement a CDP (customer data platform) like Snowplow or RudderStack as soon as the product starts to be built. This makes it easier to store event data regularly and consistently, piped to a useful place. It may seem heavy-handed, but as I've said above, you can't go back in time to fix your data, and you might as well implement something that will grow with you. I don't recommend building your own event collection system; it's too critical to get wrong, and the SDKs a good system needs are hard to build well.
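To illustrate what "store more than you need, immediately" looks like at the call site, here is a hypothetical tracking call. The client, import and field names are assumptions (loosely in the style of CDP SDKs), not any specific vendor's API:

```python
# Hypothetical analytics client - check your CDP's own SDK for real method
# names and signatures; this is an illustration of the payload, not the API.
from my_cdp_sdk import AnalyticsClient  # assumed/hypothetical import

analytics = AnalyticsClient(write_key="...")

# A "control lever" event: a discount was offered to a customer. Snapshot
# everything known at the moment of the decision, including properties
# nobody has asked to analyse yet.
analytics.track(
    user_id="cust_123",
    event="discount_offered",
    properties={
        "campaign_id": "spring_sale_2024",
        "discount_pct": 15,
        "basket_value": 84.50,
        "product_ids": ["sku_1", "sku_9"],
        "customer_segment_at_event": "lapsing",  # as it was at this moment
        "experiment_variant": "control",
    },
)
```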
Why haven't we over-engineered in the past?
Analytics engineering, which enables these requests to be answered, often involves building data models. These data models define entities with associated metrics and dimensions. When building them, it's possible to build only the dimensions and metrics that were requested.
This was common practice before cloud data warehouses were the norm, when storage on the databases used for analytics was expensive. To make queries on the data model performant, BI developers would strip dimensions away from event tables. Event tables contained only the timestamp, the event type and the numeric fields needed for measurement, plus foreign keys to the entities involved in those events. If you wanted to know anything about the entities, you had to join to the dimension table that stored this data. This makes sense where storage is expensive and you want to minimise redundant storage of data; compute is fixed and may as well be fully utilised.
This approach made dealing with slowly changing dimensions particularly difficult. If the transactional table only holds a foreign key to something like a product, a property like price can change over time, so you need to store the product's price-change history separately. Should you want to know the price at the time of the transaction, finding it from that history is slow and expensive.
Modern data warehouses are not constrained by servers with limited storage and compute. Storage is cheap, and compute is relatively unlimited but expensive. You would rather have properties about the event stored in the event. If the event was about a product, you would rather have all the details about the product at that time on the event, e.g. stock status, price, colour, variant. That way you don't have to join a state history of products on product_id to the event table, which is slow and therefore expensive.
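To make the contrast concrete, here is a minimal pandas sketch (all table and column names are assumed) of the two layouts: a point-in-time join against a price history versus the price captured on the event itself:

```python
import pandas as pd

# Kimball-era layout: events hold only keys; product price lives in a history table.
events = pd.DataFrame({
    "event_ts": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "product_id": ["p1", "p1"],
    "quantity": [2, 1],
})
price_history = pd.DataFrame({
    "valid_from": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "product_id": ["p1", "p1"],
    "price": [10.0, 12.0],
})

# Point-in-time lookup: for each event, find the latest price change before it.
# On billions of warehouse rows, this join is the slow, expensive part.
joined = pd.merge_asof(
    events.sort_values("event_ts"),
    price_history.sort_values("valid_from"),
    left_on="event_ts",
    right_on="valid_from",
    by="product_id",
)

# Modern layout: the price (and stock status, colour, variant, ...) is written
# onto the event at capture time, so no join is needed at analysis time.
events_denormalised = pd.DataFrame({
    "event_ts": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "product_id": ["p1", "p1"],
    "quantity": [2, 1],
    "price_at_event": [10.0, 12.0],
})
```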
You may additionally want an entity state history which can tell you, at any analysis date, what the values of its properties were. This enables use cases unrelated to the events above. For example, with a product state history (see the sketch after this list):
Have we started to increase inventory with products that are more likely to be out of stock?
Has our coverage, in terms of choice of sellers, declined over time?
Have we progressively increased our ability to offer our customers a saving by having price choice on more products?
How well are we doing at matching products which are the same, but from different sellers, over time?
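As a sketch of the kind of question a state history unlocks, assuming a hypothetical daily product snapshot table (one row per product per day, with assumed column names), the first two questions above might start like this:

```python
import pandas as pd

# Hypothetical daily snapshot of every product's state.
snapshots = pd.read_parquet("product_state_daily.parquet")
# assumed columns: snapshot_date, product_id, in_stock (bool), seller_count, price

# Out-of-stock rate over time, as a first cut at the inventory question.
oos_rate = (
    snapshots
    .groupby("snapshot_date")["in_stock"]
    .apply(lambda s: 1 - s.mean())
    .rename("out_of_stock_rate")
    .reset_index()
)

# Average sellers per product over time, for the choice-of-sellers question.
coverage = (
    snapshots
    .groupby("snapshot_date")["seller_count"]
    .mean()
    .rename("avg_sellers_per_product")
    .reset_index()
)
```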
This kind of analysis isn't easy to deliver as a data practitioner. It's often some of the most valuable and insightful work for informing strategy, but making state-change data available in self-serve is not something I have seen work well.
Immutability as a guardrail
Where storing a dimension on an event would prevent the event from being immutable, reverting to Kimball is probably a good idea. If a dimension related to a foreign key can change over time, independently of the event, such that its value stored on the event would need updating, that's a good indicator it should be stored in a dimension table.
Where an entity group is small, and joins to it are therefore cheap, this makes even more sense. For example, most companies do not have more than 10m customers, so a customer table with cohorts, segments and milestone dates is cheap to join to an event table. These dimension fields are not specific to an event; they are specific to the customer entity. Entity resolution on users and customers (especially probabilistic rather than heuristic) can retrospectively change which cohorts they are in - storing this on an event is actually unhelpful, as it means you would need to update it retrospectively and your events would no longer be immutable!
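A minimal sketch of that pattern (assumed table and column names): keep customer-level attributes in a small dimension and join at query time, rather than freezing them onto each event.

```python
import pandas as pd

# Small customer dimension (most companies: well under 10m rows) - cheap to join.
customers = pd.read_parquet("dim_customer.parquet")
# assumed columns: customer_id, acquisition_cohort, segment, first_order_date

# Large immutable event table: only the customer_id foreign key is stored on it.
events = pd.read_parquet("fct_events.parquet")
# assumed columns: event_ts, event_type, customer_id, revenue

# If entity resolution later reassigns a customer to a different cohort, only the
# dimension row changes; the events themselves never need to be rewritten.
enriched = events.merge(customers, on="customer_id", how="left")
```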
Inevitably, you may compromise in the consumption layer, as you will want to present your events for consumption with certain dimensions available. You may join this data in afterwards, knowing either that the enriched events become mutable and are therefore expensive to keep up to date, or that you accept the enriched events becoming "more wrong" over time. This is often because the entity group is of a similar size to the event table, making it prohibitive to join on the fly for every consumption query. Anonymous users/devices represented by cookies in web data are a good example of such a group.