Last week I wrote about Excel and why I think it’s so popular, but also the problems data folks have with it.
I mentioned scalability, version control, lack of shared semantics, security, data cleanliness and consistency as the main issues to be solved.
Scalability
Part of the reason Excel isn’t scalable is its row limit - but it has this row limit to ensure it can provide a good user experience with the resources most machines have available. Excel doesn’t seem to have the concept of pagination combined with lazy evaluation, where only the data and calculations visible on screen are held in memory. Having this ability would let Excel cope with much larger datasets. It’s possible the experience wouldn’t be snappy enough, but plenty of other applications use this pattern very well, even with remote API calls; therefore, with local API calls, it should theoretically be possible.
We’ve seen what is possible on large datasets, even locally, using technologies like DuckDB. I have recently been working on a 20GB dataset using it, with ease and near zero latency on my machine. I’m pretty sure, from other people’s benchmarking, that it would scale to 2TB without too much of a problem.
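To make that concrete, here is a minimal sketch of the kind of local, larger-than-memory aggregation DuckDB handles comfortably. The file layout (orders/*.parquet) and column names are hypothetical.

```python
import duckdb

# Hypothetical ~20GB of Parquet files on a laptop. DuckDB streams the scan
# in parallel rather than loading everything into memory, so only the small
# aggregated result is materialised.
con = duckdb.connect()

monthly = con.sql("""
    SELECT date_trunc('month', order_date) AS month,
           SUM(amount)                     AS revenue
    FROM read_parquet('orders/*.parquet')   -- hypothetical dataset
    GROUP BY 1
    ORDER BY 1
""").fetchall()

print(monthly[:5])
```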
Recently, Jordan Tigani of MotherDuck (previously GCP BigQuery) wrote this piece about the end of Big Data. I do believe there is a segment of companies who regularly have to deal with data much larger than is possible on a local machine, but with the technologies available today, this segment is smaller than ever. His point that, really, only the most recent data is queried heavily also rings very true. Usually, when historical data needs to stretch further back, it is aggregated to the level you would forecast at - and that data is small.
Imagine hosting a DuckDB backend for our new Excel on a modestly large AWS instance, just while users are actively working on it. You could handle some pretty large datasets.
You would need to be careful about how storage is accessed: not every function in Excel is an OLAP-type one. INDEX MATCH, VLOOKUP() and plain old cell references are really OLTP queries, returning the first result found. It is now possible to use an OLTP database like SQLite and an OLAP engine like DuckDB against the same storage, so you would need to route these functions to be executed on the right type of compute.
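A minimal sketch of that routing idea, assuming a hypothetical workbook backed by both a SQLite file and Parquet data; the table, column and file names are made up for illustration.

```python
import sqlite3
import duckdb

# Point lookups (VLOOKUP-style) go to the OLTP engine; aggregations
# (SUMIFS-style) go to the OLAP engine, over the same logical data.
oltp = sqlite3.connect("workbook.db")   # row store: fast single-row reads
olap = duckdb.connect()                 # column store: fast scans and aggregates

def vlookup(order_id):
    # Excel's VLOOKUP returns the first match it finds - an OLTP access pattern.
    row = oltp.execute(
        "SELECT amount FROM orders WHERE order_id = ? LIMIT 1", (order_id,)
    ).fetchone()
    return row[0] if row else None

def sumifs(month):   # month like '2013-01'
    # Aggregations over many rows belong on the OLAP engine.
    return olap.execute(
        "SELECT SUM(amount) FROM read_parquet('orders/*.parquet') "
        "WHERE strftime(order_date, '%Y-%m') = ?", [month]
    ).fetchone()[0]
```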
Where many cells have the same formula copied with shifting cell references, it can be the case that one window function could generate the values for all of those cells. A smart compiler would therefore have to know to aggregate these formulas into fewer OLAP queries and then split the results appropriately afterwards.
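As a sketch of what that compilation could look like: a column of running-total cells might collapse into a single window-function query, with the dataset and columns again hypothetical.

```python
import duckdb

con = duckdb.connect()

# A column of copied cells like =SUM(B$2:B2), =SUM(B$2:B3), ... is really one
# running total. A compiler could collapse the whole column into a single
# window-function query and scatter the results back to the cells.
running_totals = con.sql("""
    SELECT order_date,
           amount,
           SUM(amount) OVER (ORDER BY order_date
                             ROWS UNBOUNDED PRECEDING) AS running_total
    FROM read_parquet('orders/*.parquet')   -- hypothetical dataset
    ORDER BY order_date
""").fetchall()
```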
I’m not saying that this is a perfect workable solution right now - this is just an idea of something that could be built with similar tech.
Data Cleanliness
Some of the issues that Excel somewhat unhelpfully hid before (like possible fan-outs) could be exposed to the user to resolve. One formula producing many results could raise an explicit error or warning.
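One hedged sketch of how such a check might work, assuming the same hypothetical orders dataset: flag any lookup key that matches more than one row before a lookup-style formula silently picks the first.

```python
import duckdb

con = duckdb.connect()

# Rather than silently returning the first match (as VLOOKUP does), the
# engine could check whether a lookup key fans out to multiple rows and
# surface a warning. Table and column names are hypothetical.
fan_outs = con.sql("""
    SELECT order_id, COUNT(*) AS matches
    FROM read_parquet('orders/*.parquet')
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()

if fan_outs:
    print(f"Warning: {len(fan_outs)} lookup keys match more than one row (possible fan-out)")
```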
Version Control & Consistency
Obviously, the current method - with files being punted around - is broken. However, what Google Sheets does in creating a new version after every change is unhelpful, too. The workflow you want is one where someone makes changes, purposefully saves/commits them, and then merges or discards those changes ahead of sharing/publishing.
You could see a whole lineage of files showing the original shared version, with children branching off from it. A DAG would make it possible to know, at a glance, what each version was, who touched it along the way, and whether it was more or less recent than another.
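A very rough sketch of that lineage as a data structure, with entirely made-up hashes and authors: each saved version records its parents, its author and a content hash, so "who touched it" and "which came first" are a single traversal away.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    content_hash: str
    author: str
    parents: list = field(default_factory=list)   # hashes of parent versions

lineage = {}

def commit(content_hash, author, parents=()):
    # Register a new workbook version in the DAG.
    lineage[content_hash] = Version(content_hash, author, list(parents))
    return content_hash

root = commit("a1f3", "finance-team")                      # original shared version
branch = commit("b07c", "alice", parents=[root])           # Alice's working copy
merged = commit("c9d2", "alice", parents=[root, branch])   # merged back for publishing
```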
Sharing files could work in this regard, using hashing to know that something was a specific version, but it’s more likely that a SaaS tool will host all the versions. This is already the case with Google Sheets and all the new entrants I discussed last time, and it is far superior for security: no data is moved, and access is granted and can be controlled. Excel was already moving away from the desktop and shared files through Office 365.
Version-controlling large data, byte for byte, isn’t feasible, so hashing could be used to establish that the data has stayed the same. An equivalent of Snowflake’s zero-copy cloning could be used between versions that share the same data. That way, if the data were updated by ELT runs, the hash would change for every version pointing at it, so they would all still be considered to share the same base data. The user could be informed that their base data has moved since they last used the workbook, or they could choose to force the data to remain in its previous state and have partition drift from production data - either way, it’s explicit.
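A minimal sketch of the hashing side of this, assuming the base data lives in a hypothetical folder of Parquet files and each workbook version stores the hash it was built against:

```python
import hashlib
from pathlib import Path

# Hash the underlying files and compare against the hash recorded with a
# workbook version. The path and stored hash here are hypothetical.
def dataset_hash(path):
    h = hashlib.sha256()
    for f in sorted(Path(path).glob("*.parquet")):
        h.update(f.read_bytes())
    return h.hexdigest()

stored_hash = "9c2e..."                   # recorded when the version was saved
current_hash = dataset_hash("orders/")

if current_hash != stored_hash:
    print("Base data has moved since this workbook version was last used.")
```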
Shared Semantics
With a full semantic layer, i.e. one including entities, it could be much easier, safer and more secure to pull data into Excel.
If all you want is a metric or metric series, you could do something like =semantic('Revenue', 'Month', '2013') and it would pull revenue by month for 2013. You wouldn’t have to find it in some data warehouse or in a CSV among your inbox attachments. You wouldn’t have to dig up some old workbook and recall whether it was the right one to use. Better yet, you could do =AI("Show me revenue by month for 2013"). It’s actually more verbose this way, but it’s also easy to use for anyone who doesn’t know how to make the semantic query formula work.
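As a sketch of what could sit behind such a formula, here is a tiny, made-up registry of metric and dimension definitions standing in for a real semantic layer; every name in it is hypothetical.

```python
import duckdb

con = duckdb.connect()

# The formula names a metric, a grain and a filter; a (hypothetical) semantic
# layer maps those onto governed definitions instead of letting the
# spreadsheet redefine them.
METRICS = {"Revenue": ("read_parquet('orders/*.parquet')", "SUM(amount)")}
DIMENSIONS = {"Month": "date_trunc('month', order_date)"}

def semantic(metric, dimension, year):
    source, measure = METRICS[metric]
    dim = DIMENSIONS[dimension]
    return con.sql(f"""
        SELECT {dim} AS {dimension.lower()},
               {measure} AS {metric.lower()}
        FROM {source}
        WHERE year(order_date) = {int(year)}
        GROUP BY 1
        ORDER BY 1
    """).fetchall()

semantic("Revenue", "Month", "2013")
```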
The big benefit of metrics and dimensions being taken from the semantic layer is that they aren’t being redefined; they are just being reused, which was the point of the semantic layer in the first place!
Of course, you could also start a pivot table from a semantic layer… its OG use case. However, this is where Excel is starting to be used as a BI tool, rather than the exploratory tool it is better at being. Pivoting data that you have curated yourself in Excel is still very much part of that exploration, but pivoting data that is somewhere else isn’t. For things like comparisons or overall composition it can still be helpful.
Analytical freedom is severely limited. Since data from the metrics API is already aggregated, there’s no way to drill into specific records, or create on-the-fly dimensions and measures to slice the data in a way that might reveal new insight. You can request other dimensions from the metrics layer, but what if the dimension you’re interested in doesn’t exist yet?1
The reason I say “full semantic layer, with entities” is that often a user is trying to use Excel to explore an idea for themselves. They often want to take some rows of data that haven’t been aggregated or shaped into a metric query yet, and then do their own analysis on them. Data scientists often repeat the mantra that you need to spend time looking at your data, doing EDA. This is probably a similar activity, where the user wants to know the data they are using.
Taking entities from the semantic layer is much safer than taking them directly from a data source. The data is much more likely to be clean and free of issues like duplicates; it’s much more likely to be safe to share, as it will have PII stripped out, etc. It’s also much more likely to genuinely reflect the entity the user wanted to explore… your data team has said this is what it defines as a “customer” or an “order”… it’s not perfect, but much better than guessing. It can also become a powerful way for a user to pull a cohort or segment without having to know where to get it or how to define it - they can then use it for their own exploration.
What we don’t want to change… the User Interface
What we can tell from Excel being so much more popular than all other analytics tooling combined is how many users prefer its user interface to those other tools. They like its flexibility and that they are empowered to do things for themselves, with transparency. This is the part I wouldn’t change, and it is what everything I’ve proposed above preserves.
I wonder if Excel resembling your maths workbook at school is very much on purpose - familiarity. If you didn’t like maths at school, you probably aren’t going to be an Excel (or BI for that matter) fan anyhow, so why not lean in to this? Yes, cells naturally form a grid, but there were probably other ways to build it, too.
You might say: “Well, why not use the LLM to generate the whole workbook?” I don’t think this is the right way to go - part of why people like Excel is that they can explore and understand for themselves. If you’re generating a whole workbook, you really should be generating a slide deck or doc.
As I mentioned earlier, there is value in using an LLM to acquire the entry data and then explore from there - this is where semantic layers are helpful. AtScale already integrates with Excel today - imagine if you could pull a metric series, or just a set of entities like a cohort or segment, and start working on them. Don’t know where to look or how to get the data? You just ask your friendly neighbourhood LLM service, which would translate your request into a “semantic layer query”. However, this doesn’t change the grid - the grid is here to stay.
1. https://carlineng.com/?postid=semantic-layer#blog
Great series, David. I can't tell you how many IT folks I've heard complain about Excel and their plans to kill it. What they don't understand is that the cell-based design of a spreadsheet like Excel makes it an easy-to-use, universal tool for analytics. By adding a semantic layer to Excel, we can avoid many of IT's "wild west" complaints while still allowing users to pick their analytics tool of choice (9 times out of 10: Excel!).
I wonder if there ever will be a better interface. Is it beloved because of its familiarity? Or so widely used because it simply is the best way to self-serve data exploration?
Every BI tool I've ever seen has eventually had to embrace the grid/table component. And a download-to-Excel capability.