I’ve recently been helping out an organisation who, despite being a profitable and successful eCommerce retailer, are fairly immature in terms of data. Perhaps this shows that mastering the fundamentals of buying for one and selling for two is more important than trying to implement a deep learning model somewhere 🙊.
Having said that, there is a huge opportunity for data to have an impact here:
- Like any eCommerce retailer, they have many products and can optimise which ones they show on screen at any given point (recommender systems). Currently they do this with human intuition.
- They have yet to consider their website as a product in and of itself: they change it according to the inventory they have to show, rather than analytically testing which UI configurations perform best.
- They have a mixed bag of reports running off their Magento backend, a legacy full-stack BI solution - currently only used for orders data - and some direct queries on their production database!
As with any data project, it can be useful to go for the low-hanging fruit first. Some issues are costly and potentially damaging to the business, and these can be rectified as a priority. While you may end up putting intermediary solutions in place to deal with them, doing so can save the organisation money and build goodwill between consultant and organisation.
The first two bullets above are longer-term projects of their own that require good data fundamentals to be in place before they can be tackled, such as having an event tracking system in place and starting a product management discipline.
The reports being run directly from the Magento backend are a mixture: some need to be refreshed in near real-time, as trading decisions are made on their output, and some queries run very frequently (every few minutes) to feed their CRM with new customer activity and attributes.
The Magento backend uses the production OLTP database. Syncing data from this database to a DWH would therefore enable reporting to run from the DWH instead, reducing the load on the production database. Reducing that load will allow the production database to be scaled down safely, saving tens of thousands of dollars, and will prevent further website outages caused by the load becoming unmanageable at times (saving much more money in lost sales).
Because the data in the DWH needs to be kept in near real-time sync with the production database, we have opted to use CDC streaming from the production database to the DWH (using Streamkap as the CDC vendor). The data sent to the DWH is a stream of records: one is emitted for every record in the production database that has had a transaction against it (insert, update or delete). This will allow for state change analysis on things like inventory and customer status in the future, but for now I am using dbt incremental models to maintain an as-is view of the production database at the point of the dbt run.
As an aside, this kind of difficulty in replicating data is why HTAP databases are something of a holy grail, although they don’t help with tracking state change by default unless they store transaction logs.
Each table in the production database that is replicated via CDC needs its own dbt model to maintain a current state table. Each production table’s CDC stream lands in a table in the DWH, which I have declared as sources in the dbt project (codegen is a very helpful dbt package for this!). I’ve made a macro which I can reuse with little change to build these incremental models that maintain current state.
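A minimal sketch of what such a macro might look like - the macro name, the magento_cdc source name and the __source_ts_ms metadata column are illustrative assumptions (the real metadata fields depend on the Streamkap/Debezium setup), and qualify assumes a warehouse that supports it:

```sql
-- macros/cdc_current_state.sql (illustrative sketch)
{% macro cdc_current_state(source_table, unique_key) %}

select *
from {{ source('magento_cdc', source_table) }}

{% if is_incremental() %}
-- on incremental runs, only scan CDC records that arrived since the last run
where __source_ts_ms > (select max(__source_ts_ms) from {{ this }})
{% endif %}

-- keep only the latest CDC record per primary key in the scanned batch;
-- the incremental merge on unique_key then overwrites the existing row
qualify row_number() over (
    partition by {{ unique_key }}
    order by __source_ts_ms desc
) = 1

{% endmacro %}
```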
An example of one of the models using the macro:
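Something along these lines, using the hypothetical cdc_current_state macro above against Magento’s customer_entity table (table and key names are illustrative):

```sql
-- models/current_state/customers.sql (illustrative sketch)
{{
    config(
        materialized = 'incremental',
        unique_key = 'entity_id'
    )
}}

{{ cdc_current_state(source_table = 'customer_entity', unique_key = 'entity_id') }}
```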
In order to represent delete transactions, soft delete records are also emitted. If I were simply using table models instead of incremental models, I could handle this inside each model by keeping only current records where the CDC metadata field __deleted is false. However, over time this becomes ever more expensive, as the table models would always need to scan and sort the whole CDC history.
Therefore, I have opted to use incremental models without filtering deleted records, as I need to know when records already in my table, which aren’t currently deleted, become so. These records then need to be removed from the current state models after the incremental run has finished.
I had to make a tough choice: it would have been a bit cleaner to have second-step models which filter the incremental models above to only the rows where __deleted is false. However, this involves storing the same data a third time. I opted instead to use a post-hook on the incremental models to delete records where __deleted is true. The amount of data scanned is the same as if I had the secondary models, but I don’t need to store it again and can hugely reduce the number of dbt models in my DAG.
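Something like the following in dbt_project.yml (a sketch - the project and folder names are assumptions, and whether __deleted arrives as a boolean or a string depends on the CDC configuration):

```yaml
# dbt_project.yml (illustrative sketch)
models:
  my_project:
    current_state:
      # after each incremental run, remove records that have become soft deleted
      +post-hook: "delete from {{ this }} where __deleted = true"
```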
I’m thinking ahead to a point where I may have hundreds if not over a thousand dbt models, so avoiding models I don’t need feels like a good choice. It is a bit less clear what is happening, as each step isn’t a model in itself. However, given the definition of the models is “as at current state”, it’s also logical that they shouldn’t contain deleted records. The post-hook is also simple and in plain sight in the dbt_project.yml file. Some Analytics Engineers will staunchly maintain that I should have made the secondary step models, but I feel post-hooks are a part of dbt to be used as needed, while still keeping all functionality within the dbt project 🤷‍♂️.
As described earlier, the reason I needed these current state models was to allow for the equivalent queries that had been running against the Magento backend, or directly on the production database, to be repointed to the DWH.
You might say - hold on, surely the CRM use case would benefit from an event tracking system like Rudderstack sending events directly to the CRM when users complete actions like sign-up or cart abandonment? Yes, that is by far the ideal pattern, bypassing the DWH altogether, but the organisation doesn’t yet have the resources needed to implement the event tracking system. So, pragmatically, and to reduce load on the production database, we have opted to refresh the incremental current state models at a high cadence (every few minutes for the specific models needed by the CRM queries). The dbt Cloud job used to run these models executes very quickly and scans little data each time.
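One way to scope that high-cadence job is to tag the CRM-critical models so the dbt Cloud job only builds that selection - a sketch, with the folder and tag names as assumptions:

```yaml
# dbt_project.yml (illustrative sketch)
models:
  my_project:
    current_state:
      crm:
        # models needed for the CRM queries, rebuilt every few minutes
        +tags: ['crm']
```

The job can then run dbt build --select tag:crm on its own frequent schedule, while the rest of the project refreshes less often.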
Once we have had time to implement Rudderstack properly, we can move away from this process to something closer to ideal.
These CRM queries are run by cron as part of a PHP script, which executes the queries and sends the data to the Emarsys API, with the columns from the query results mapped to the field names in Emarsys. This process was originally created by some offshore engineers who are no longer available to the company, and no documentation was created… Figuring out exactly what this process does, and which queries are used in which field mapping, has been the hardest part of the whole project so far.
Once this was understood, we decided to use Hightouch to reverse-ETL the data to Emarsys after each dbt job runs.
Why Hightouch and not Census? Simply because Hightouch had Emarsys as a target (and Census didn’t), no other reason. I haven’t checked if Census has the features I describe below that Hightouch does.
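To give a flavour of the kind of model involved, here is a sketch of a new customers model (Snowflake-style SQL, with illustrative column names rather than the actual query):

```sql
-- models/crm/new_customers.sql (illustrative sketch)
-- customers created in the last day, shaped to match the Emarsys field mapping
select
    c.entity_id  as customer_id,
    c.email,
    c.firstname  as first_name,
    c.lastname   as last_name,
    c.created_at
from {{ ref('customers') }} as c
where c.created_at >= dateadd('day', -1, current_timestamp)
```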
The new customers model above didn’t exist before - previously, the query was run in the script itself and its results sent directly to the Emarsys API. In Hightouch, I could have just put the query in the sync config, but it is better for this to be a model in the dbt DAG in order to have well-understood lineage. You can also choose dbt models as the source for a sync in Hightouch, if you integrate your dbt-core repo with Hightouch.
Hightouch also has a dbt exposures sync feature, which adds exposures for Hightouch syncs into your dbt DAG. This is great - it makes lineage complete all the way to the consumption layer, without you having to write a lot of exposures manually. I originally thought these syncs would map to analyses in dbt terms, but exposures are a better fit.
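For comparison, this is roughly what one of those exposures would look like if written by hand (the sync name, owner and referenced model are placeholders):

```yaml
# models/crm/exposures.yml (illustrative sketch)
version: 2

exposures:
  - name: emarsys_new_customers_sync
    type: application
    description: Hightouch sync pushing new customers to Emarsys
    owner:
      name: Data team
    depends_on:
      - ref('new_customers')
```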
Hightouch also has an integration with dbt Cloud where a job run in dbt Cloud can trigger a Hightouch sync to happen. This is exactly what I want to happen in this instance, as the data being prepared for this sync is specifically what is in my dbt Cloud job.
Philosophically, it could be better for Hightouch to trigger the dbt Cloud job and then run the sync that follows, rather than be triggered by it - a kind of reverse orchestration. That would make the end data asset, which is the data synced to the target, the focus of the activity, and Hightouch already has the orchestration abilities to do this.
From a data consumption point of view, there isn’t much difference between something like Hightouch and something like Tableau: both take data from a DWH via a query and push it into another API. Where Metaplane exposes lineage to BI tool dashboards and charts, it is just as logical to expose lineage to syncs in Hightouch.
As Hightouch is already putting its syncs inside dbt lineage, it means that Metaplane could easily integrate with Hightouch via dbt lineage.
Kevin Hu and Erik Edelmann’s talk at Coalesce 2022, covering usage of dbt, Metaplane and Hightouch in Vendr’s data stack