I’m in San Francisco from 12th to 17th October and would love to catch up if you’re around. I’m speaking at Cube Rollup at The Pearl SF on the 15th - feel free to sign up if you’d like to come along.
I used to play a lot of real-time strategy (RTS) games when I was a teenager. In these games, you’d usually start on one part of the map where you have a base. From here you could develop and expand your base, and also go explore parts of the map. The unexplored parts of the map would be covered by the “Fog of War” - you would have to send a scout to go and see what was there in terms of resources, terrain and other players’ units/structures. Explored parts of the map, which were significantly far away from any of your structures or units, would become shrouded, as you didn’t know what was going on there - just the shape of the territory.
About a month ago, there were some posts in the data space about what had gone awry at Nike and how it may or may not have been due to being overzealously “data-driven”. I want to drill into the difference between the map and the territory, and what it means to me.
The RTS map behaviour provides a good model for an analogy.
Your base on the RTS map is where business-as-usual happens for your organisation. These are the metrics that matter: the numbers your company looks at very regularly, the numbers where small unexpected changes trigger big discussions. This is the domain that your gold dataset covers in your medallion architecture. This is where a semantic layer should be used comprehensively to codify and govern the meaning of your data. You want to be able to measure the standard activities: unit production, resource gathering, health of buildings, upgrades unlocked, worker capacity and utilisation… ARR, CAC, LTV, the seventeen flavours of revenue etc. This is where you know the territory inside out - you live on it. It’s not just a part of the map you’ve seen before, it’s what your business controls and operates.
I’ve covered the benefits of the semantic layer in this context in many posts before, but I always meant that it should be used where your base is. There are exceptions to this, which I’ll cover below.
The part of the map where you are scouting is very new to you. You can see potential in terms of visible resources and helpful terrain, but that’s about it. You can also discover traces of other players’ activity or current occupation. In data terms, this could be thought of as the scoping work that analysts often do. They see if there is data available to measure something, then assess its quality. Sometimes, it could be a data engineer considering what data should be captured and what modeling method could be used to describe the business activities with data.
It’s not even possible to have a semantic layer here - there may not be any data at all, let alone useful data. Things can go wrong when an organisation tries to be “data-driven” in this context. Whatever data is available or can be bought is treated as best endeavours and is then used to analyse the situation and evaluate options. The truth is that reasoning things through and doing qualitative research is a better idea here than using very flawed, irrelevant or incomplete data. It isn’t impossible to use data in this context, it just requires a huge amount of care to understand whether the data is representative of the situation. I’ve had to go see the oracle (the Stats PhD on my team) to understand what I can reasonably infer about a distribution etc. I promise the answer always starts with “it depends…”
Sometimes scouting leads to greater action. The scout sees an enemy mine, and you send troops and an engineer from your base to capture and secure it. This is similar to where you might decide to build a new feature for your product to win more business, or where you market to a new channel. Again, it’s not usually feasible to be “data-driven” here - you likely don’t have the data yet. This is where you plan to collect the data and determine what you might measure with it. You work with the product or marketing team to explain which data model would allow a sufficiently rich representation of activities. Based on what stakeholders want to measure, you can define a new part of the semantic layer that can expose the new data model. To reiterate, at the time of the decision to take action, there is probably no data to use to help you make the decision. You plan to collect data to measure outcomes from the decision.
Where we’re talking about the map and territory covered by the Fog of War, unless you can buy data that you’re confident models the activity there, you can’t do data work - there is no data. It’s better here to rely upon that executive you hired who has been there before. Sometimes data teams get asked to use data from an adjacent occupied/known territory to simulate what it might be like in the hidden territory. The output of this kind of work should carry a big health warning! Just because there is mountainside and valley on the known side doesn’t mean it will be the same on the hidden side - it might be a cliff into a ravine that’s occupied by bandits.[1]
Shrouded areas are not great because things have probably changed since you were last there, but could be OK with a bit of luck. You don’t actually operate here on a regular basis. This is like where someone extended the semantic layer to power a fairly niche dashboard that should have been a notebook. Running that dashboard six months later may not make any sense. The data quality may not even be good enough, because there have been breaking changes upstream that weren’t flagged, as this part of the semantic layer isn’t considered prod. This is exactly where semantic layers are usually pruned over time, and arguably shouldn’t have been used in the first place.
Often, when data teams are asked about shrouded areas, this work should probably be deprioritised. It might be an executive whose work stream is misaligned with the business - after all, no business resources have been dedicated to this space and it was previously explored and left alone. Sometimes these executives really believe there is value here, “if we chop down those six trees, we’d have some wood to use!” and want to save face from a previous exploration they commissioned with great fanfare… that led to nothing.
Above, when I described data work which is less well-defined, and requires more care, it was not a suggestion that this work is less valuable or shouldn’t be done.
Often this work can be the most valuable and strategic work data folks can do. Looking back over your career, you can probably count the projects of this kind that yielded value on one or two hands.
Analytics work that isn’t building reporting from a fully-known data model is research and investigation. You don’t know what you’ll find, or even what is possible to know, until you look.
People often mistake this work for production-type work, which is better defined. They just want a “quick data pull” - it’s only quick if it’s known and terraformed territory. I think that’s part of why many analysts have become analytics engineers - the output of that work is more predictable in outcome and quantity - not necessarily because they prefer this work.
That high of finding a new insight that changes your business, where the ship changes course, is why I’m in data at all. I’ve had to do loads of engineering to get to those insights, like building a whole cost modeling system (which took a year), but that’s why I’m here. Sure, I enjoy building the machines too, but they are for this end.
These insights can sometimes come from your base and it’s easier when they do, but less likely. As you and many others use this semantic layer a lot, others often find these insights instead of the analytics team. They find them quickly and easily without a “project” happening, too. They find them without you and don’t even tell you; they just get on with their jobs. Often analytics teams find these insights when they leave the base, when they go off-piste, because that’s where new information is to be found, in the big wide world outside of the base. These teams need to be supported and provisioned to do this work, knowing there is a real possibility that they find nothing at all.
Semantic layers empower analytics teams to do this work, as well-known territory has roads - you don’t need to be an explorer to wander around the base. This allows your explorers, your analysts brave enough to investigate and do research, to leave the base and see what’s out there.
Only focusing on knowable outcomes is a mistake. I think this is one of the big pitfalls of companies being “data-driven”. They stop being able to see that there is territory to be had beneath the fog. It can make them timid. Product folks can become afraid to punch out into the unknown and focus only on iterative changes or optimisations. With a company like Nike, this is a disaster - Nike’s following has come from their courage to make new, exciting products that may or may not be hits.
A good data team should help their business be daring enough to venture into the unknown, by providing a construct of how to measure and observe the territory when they get there. You don’t need a map to find new territory.
[1] Like how some people have thought that moving from normal financial services, like payments and banking, into crypto would be similar.