I recently did some consulting and deployed what I felt was the best stack I could, with the information I had at the start of the project.
We’ve since added the likes of Fivetran for ancillary data sources that we can’t CDC from with Streamkap, and may even use Portable for one of the long-tail connectors that aren’t widely supported. (I strongly recommend that if you choose a CRM any time soon, you don’t choose Emarsys - use one that everyone uses, like Salesforce or HubSpot. You want your CRM to connect to everything, and therefore you want it to be popular.)
I attended a meeting where we discussed progress and future requirements from data, and all was well with the progress expected for the project. At the end of the meeting, I was introduced to a company advisor who showed me some of the modelling he’d been doing in Excel. He had been extracting data from the BI tool we had deployed as part of the project. I asked him why he needed to do his modelling in Excel, as, if he needed extra processing of the data into something useful, this is something we could build for him, and he said: “Oh, but we’ll always want to put data into Excel to play around with it.”
As someone who has worked in data for a long time, I have spent a great deal of time and energy trying to move processes and work out of Excel and into the data warehouse. Part of why I have gone this way is that I have previously been the person who built far too much in Excel! I’ve built a forecast model with so many formulas and array formulas that it required a workstation to recalculate. My solution for sharing the forecast, for those with regular computers, was to make a macro that generated a “flat” version of the forecast with no formulas.
In the era where I was trying hard to move processes out of Excel, it was being used to do far too much. That included a lot of data cleaning and transformation inside of ELT processes, plus being used as a BI tool - Power BI originates from Excel. I think a lot of data teams have aimed to reduce Excel use to as little as possible in their organisations, and even considered the measure of their performance to be inversely correlated with Excel use.
Let’s recap the pros and cons of Excel in the data space to explain this - I’ll start with the cons, as it’s natural to explain first why data teams want to avoid it:
Scalability - Excel runs on a local machine and is limited by its ~1,048,576-row sheet limit and the individual machine’s resources.
In an era where a million rows is pretty small, this is a very tight limit.
Any process which runs in Excel today may not fit tomorrow.
In the past I’ve ended up “sharding” across multiple Excel files, but avoid this at all costs.
One of Sigma’s main selling points is addressing this issue: it lets you push Excel-like formulas down to your data warehouse.
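To make the ceiling concrete, here is a minimal sketch (in Python, with invented data) of the idea behind formula push-down: an Excel-style SUMIF expressed as a streamed aggregation, so the data never has to fit into a grid at all - this is the shape of the computation, not Sigma’s actual mechanism.

```python
# Excel's grid caps out at 1,048,576 rows per sheet; a pushed-down
# aggregation has no such ceiling.
EXCEL_ROW_LIMIT = 1_048_576
N_ROWS = 2_000_000  # more rows than one Excel sheet can hold

def rows():
    # Stream of (region, amount) pairs standing in for a warehouse table.
    for i in range(N_ROWS):
        yield ("EMEA" if i % 2 == 0 else "AMER", 1.0)

# Equivalent of =SUMIF(region_range, "EMEA", amount_range),
# computed without ever materialising a "sheet" in memory.
emea_total = sum(amount for region, amount in rows() if region == "EMEA")
print(emea_total)  # 1000000.0
```

The point is that the filter-and-sum runs where the data lives; only the single result needs to come back to anything resembling a spreadsheet.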
Version control - Excel doesn’t have a good version control system, and since you can’t express formulas etc. as plain-text code, you can’t use git. Therefore, you end up with final_v3, final_final, final_David…
Lack of shared semantics - anyone can create a metric in Excel to mean what they want it to mean.
There is no central control or governance about how to define a metric.
It’s a great amount of freedom, but at a cost - and data teams pay the majority of that cost in an org, in pain and time.
I understand that MDX mitigates this, but it is not widely adopted.
Security - once something is in an Excel file, it’s effectively an Infosec risk.
Try as security teams may to prevent Excel leakage, it’s too easy for Excel files to leave an org.
If you ban the file extensions, you can just zip and rename to some other extension which is allowed, like .doc or .txt and get your recipient to change afterwards. I’ve done this to distribute reports internally, when our Infosec team banned Excel files from being emailed. I even built a macro in Excel to generate, zip, change extension and email the files en masse. Life finds a way…
Data cleanliness - there are differences in how Excel works vs SQL.
Vlookups aren’t the same as joins.
It’s not possible to have fan-outs in Excel - a lookup returns only the first match - and this can hide data issues such as duplicates.
The fact that SQL is less forgiving than Excel actually forces us to address data quality issues.
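The difference is easy to demonstrate. Below is a small sketch in Python (the data is invented): a VLOOKUP-style lookup silently takes the first match over a duplicated key, while a SQL-style join fans out on the duplicate and makes the problem visible.

```python
# A duplicated key in the lookup table - the classic hidden data-quality issue.
prices = [("apple", 1.00), ("apple", 1.25), ("pear", 0.80)]
orders = ["apple", "pear"]

def vlookup(key, table):
    """Mimic Excel's VLOOKUP: the FIRST match wins, so duplicates stay hidden."""
    for k, v in table:
        if k == key:
            return v
    return None  # Excel would show #N/A

vlookup_total = sum(vlookup(o, prices) for o in orders)  # 1.00 + 0.80

# A SQL-style join matches EVERY row, so the duplicate "fans out":
# three result rows instead of two, and a visibly inflated total.
join_rows = [(o, v) for o in orders for k, v in prices if k == o]
join_total = sum(v for _, v in join_rows)  # 1.00 + 1.25 + 0.80
```

The join’s wrong-looking total is the feature: it forces someone to go and deduplicate the source, whereas the VLOOKUP quietly papers over it.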
Consistency - this goes beyond the lack of shared semantics and version control.
You can have what appears to be the same Excel file, with a slight tweak that completely changes the output.
I’ve been in many situations where it was difficult to know which file was correct - was it final_final or final_final_copy?
One report can be produced in Excel that actually spawns tens, as recipients edit the report to do what they want - they may not even rename it 😱.
Pros and “Why does Excel succeed?”:
Clarity in lineage - Excel makes it easy for non-technical users to understand the lineage of any transformation or pass-through of information. It is powerful in this regard: you can refer to specific cells or ranges inside the same sheet, in other sheets, or in other workbooks entirely. SQL has no equivalent of a cell-level reference - at scale, the same thing has to be expressed as a join.
Ease of use - Excel also allows non-technical users to build their own abstractions to achieve what they want, very quickly. There is more guidance on Stack Overflow, and the internet in general, for making something work in Excel than for any programming language or SQL - perhaps more than all of them combined. You can do ELT, visualisation, research… the whole data stack in one file, on local compute, with no help from anyone.
Composability - This is perhaps also a con in the complexity it can generate, but it is very powerful.
You can very easily refer to work already done in other workbooks or sheets by simple point and click cell references.
You can easily just build on top of an existing workbook to do what you want.
In principle, this is like the ref() function in dbt that is the bedrock of its success.
Did dbt actually bring some Excel-like functionality to SQL, as well as SWE principles? Is working in Excel a form of SWE? I’m not going to answer these controversial questions - I’ll let you form your own opinions 😉. There are probably DAGs in Excel that are as complex as any in dbt.
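The parallel is easy to sketch. Below, a toy dependency graph (model names invented for illustration) is ordered with `graphlib` from the Python standard library - the same topological ordering that dbt performs over `ref()` calls, and that Excel’s recalculation engine performs over chained cell references.

```python
from graphlib import TopologicalSorter

# Each model lists its upstream dependencies, as dbt models do via ref().
models = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "revenue_report": ["orders_enriched"],
}

# static_order() yields each node only after all of its dependencies -
# the guarantee that makes a DAG of references safe to build in order.
build_order = list(TopologicalSorter(models).static_order())
print(build_order)
```

Whether the nodes are dbt models or workbook cells, the underlying structure - and the ordering problem - is the same.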
Type safety and testing - Excel is type safe where other systems aren’t. If you change a data type in Excel to something which will break the workbook, you know immediately.
You don’t need a command-line run or build step to watch for errors, or a CI process.
Excel can make it easy to do unit/integration/e2e testing, precisely because everything is in one place and changes to formulas and inputs take immediate effect. However, you have to choose to do this testing, as in any other context.
The zen flow that SWEs look for - building and testing in the same place, with short iteration and feedback loops - has always existed in Excel.
I’ve built RAG status sheets which tested assumptions that should hold for the workbook to be correct: testing the output of complex formulas (unit), testing that data flowed correctly between workbooks and data sources (integration), and testing that overarching assumptions about the end outputs from the raw data were true (e2e).
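A RAG sheet of this kind can be sketched in a few lines of Python (the checks, thresholds, and data below are all invented for illustration): each assertion collapses to a Red/Amber/Green status, exactly as a status cell would.

```python
def rag(ok: bool, warn: bool = False) -> str:
    """Collapse a check result into Red/Amber/Green, like a RAG status cell."""
    if ok:
        return "Green"
    return "Amber" if warn else "Red"

# Toy data standing in for a workbook's inputs.
rows = [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 80.0}]
total = sum(r["amount"] for r in rows)

checks = {
    "unit: every amount is positive":        rag(all(r["amount"] > 0 for r in rows)),
    "integration: row count matches source": rag(len(rows) == 2),
    "e2e: total within expected bounds":     rag(100 <= total <= 300),
}
for name, status in checks.items():
    print(f"{status:5s} {name}")
```

The workbook version is the same idea: formulas that evaluate assumptions and conditional formatting that turns the cell red the moment one fails.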
Excellent learning curve - Most people can get started in Excel and gradually learn more until they can do what they need to.
This may mean stopping at sum() formulas, or continuing to learn about conditional logic, Vlookups, Index Match, Pivot Tables, Array Formulas and VBA.
There is an army of people out there with advanced Excel skills: accountants who write moderately complex software in VBA, dedicated VBA developers (OH in the City of London/Wall Street: at their peak, VBA developers made up 10% of trading floor headcount 😂), and anyone who has ever worked in management consulting.
I recently taught an older family member how to use PivotCharts: it took about 15 minutes and then a bit of practice on their own. Imagine the equivalent of getting them to use the pivot function in SQL, or to learn a BI tool! They already had the fundamentals - knowing how to use Excel and how to get the data they needed into it - so it was just one additional skill on top, rather than a whole new domain to learn before they could use that one new skill. Most people will take the shorter route.
You can also extend Excel with SQL, as I have done in the past, using it to template SQL queries and push the results into the start of a process. I’ve built a billing system that invoiced millions of euros and dollars to customers, using Excel, SQL and VBA.
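The templating pattern is easy to sketch outside VBA too. Here it is in Python with an in-memory SQLite database standing in for the warehouse (the table, columns, and parameter values are invented); the “cells” a spreadsheet user would fill in become bind parameters rather than strings pasted into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer_id INT, invoice_date TEXT, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", [
    (42, "2024-01-10", 100.0),
    (42, "2024-01-20", 250.0),
    (7,  "2024-01-15", 80.0),
])

# The inputs a spreadsheet user would type into cells, passed as
# bind parameters (which avoids injection and quoting bugs).
params = {"start": "2024-01-01", "end": "2024-01-31", "customer_id": 42}

query = """
    SELECT customer_id, SUM(amount) AS billed
    FROM invoices
    WHERE invoice_date BETWEEN :start AND :end
      AND customer_id = :customer_id
    GROUP BY customer_id
"""
billed = conn.execute(query, params).fetchone()
print(billed)  # (42, 350.0)
```

In the Excel version, the parameters come from cells and the result lands back in a sheet - but the shape of the process is identical.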
Forgiving - As I mentioned above, Excel doesn’t have the problems of fan-outs or complicated, unclear error messages (Excel even has a step-by-step formula evaluator for tracing where something went wrong, which, to my recollection, no SQL data warehouse has). Yes, it may implicitly handle duplicates by taking the first value found, but often Excel users don’t mind this. They are willing to tolerate a certain level of uncleanliness in their data, and Excel enables them in this regard.
Excel use has not declined at all because of the Modern Data Stack, and perhaps in some regards it shouldn’t. If it is the de facto way for people to quickly play with some data, maybe this is something we need to support as an industry and as engineers. Could we address some of the cons to make an Excel 2.0 (philosophically)?
Count is interesting in this regard - it tries to provide some of the benefits of Excel, with a hybrid SQL/data table/chart canvas experience 😎. Let me be clear: Count isn’t trying to replace Excel, it’s trying to be a better space for analysts and their stakeholders to collaborate - almost like Miro/FigJam for data. It does try to increase clarity in lineage, as all transformations from the data sources are visible on the canvas. However, in order to really understand how they work, a stakeholder would need to be able to read SQL.
Equals sticks much more closely to Excel and provides the same interface, but with easy access to source data from a data warehouse. It also provides an easy interface to create a SQL query - even with joins between data sources - using a click and dropdown interface. This definitely makes it easier to access data and probably deals with the scalability issue, but I wonder if non-technical users would understand the implications of joins etc…
These tools are impressive, as they make progress on some of the cons I mentioned above. However, as they both rely explicitly on data warehouses (instead of abstracting the data store away, as Excel does), I don’t think either is really the ‘Excel 2.0’ successor that we need. I will repeat: they probably don’t aim to be, they are just interesting context. Nevertheless, I believe the pieces of the puzzle are available today - I’ll put them together next time.
I’m an investor