Yesterday, I mentioned that I wanted to focus on getting some data from the Bluesky API endpoints using dlt - which involves learning how dlt works. You can follow along with my public repo here.
Firstly, what is dlt?
With over 1M downloads per month, dlt 1.4 is the most popular production-ready Python library for moving data. You can add dlt to your Python scripts to load data from various and often messy data sources into well-structured, live datasets. Unlike other non-Python solutions, with dlt, there's no need to use any backends or containers. We do not replace your data platform, deployments, or security models. Simply import dlt in a Python file or a Jupyter Notebook cell. You can load data from any source that produces Python data structures, including APIs, files, databases, and more.
What it feels like to me is a kind of framework for building your own connectors. So many data engineering teams end up building custom solutions to problems that are not unique to them, and dlt feels like it offers a level of standardisation of method for these kinds of solutions. It has some really helpful out-of-the-box features. For example, one thing that isn't possible to do in SQL is handling pagination from an endpoint. DuckDB and (probably) other query engines allow you to query an endpoint directly:
select * from read_json_auto('https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=jayatillake.bsky.social');
The Bluesky HTTP endpoints use cursors to paginate through larger amounts of data, with a limit of 100 records returned per request.
dlt provides a RESTClient that makes handling this pagination easy. It then also allows you to build a runnable pipeline to take data from a source, like this Bluesky API endpoint, and load it into a data store with incremental strategies and metadata stored about runs.
In 40 lines of code, including package imports and 5 line breaks, you can get your data loaded. It’s definitely easy to use. I used Cursor lightly to make some initial structures for me using the dlt and Bluesky API docs, then changed things as I wanted.
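The shape of it is roughly this - a simplified sketch against the public getFollowers endpoint rather than the exact code in the repo (the resource, pipeline and dataset names are placeholders, and the real thing covers more endpoints):

```python
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator

# Client that keeps passing Bluesky's cursor back until the API stops returning one
client = RESTClient(
    base_url="https://public.api.bsky.app/xrpc",
    paginator=JSONResponseCursorPaginator(cursor_path="cursor", cursor_param="cursor"),
)

@dlt.resource(name="followers", write_disposition="replace")
def followers(actor: str = "jayatillake.bsky.social"):
    # Each page is the list of follower records pulled out of the "followers" key
    for page in client.paginate(
        "/app.bsky.graph.getFollowers",
        params={"actor": actor, "limit": 100},
        data_selector="followers",
    ):
        yield page

pipeline = dlt.pipeline(
    pipeline_name="bluesky",
    destination="duckdb",
    dataset_name="bluesky_raw",
)

print(pipeline.run(followers()))
```

RESTClient and JSONResponseCursorPaginator are dlt’s built-in REST helpers - the paginator handles the cursor dance so the resource just yields pages.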
This was a great start, but just having the followers for one actor is only so interesting. I wanted to get who they follow too, plus the follows and followers for their follows and followers. So, I added the code below to achieve this.
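The full version is in the repo; in spirit it’s a fan-out loop like this rough sketch (get_handles here is a little helper I’m inventing for illustration, wrapping the paginated calls above - it isn’t something dlt provides):

```python
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator

client = RESTClient(
    base_url="https://public.api.bsky.app/xrpc",
    paginator=JSONResponseCursorPaginator(cursor_path="cursor", cursor_param="cursor"),
)

def get_handles(endpoint: str, key: str, actor: str) -> set[str]:
    """Hypothetical helper: paginate getFollowers or getFollows and return handles."""
    handles: set[str] = set()
    for page in client.paginate(
        endpoint, params={"actor": actor, "limit": 100}, data_selector=key
    ):
        handles |= {record["handle"] for record in page}
    return handles

def expand_graph(seed_actor: str) -> set[str]:
    """Collect the seed actor's followers and follows, then fan out one more level."""
    first_degree = get_handles("/app.bsky.graph.getFollowers", "followers", seed_actor)
    first_degree |= get_handles("/app.bsky.graph.getFollows", "follows", seed_actor)

    all_actors = {seed_actor} | first_degree
    for actor in first_degree:
        # Each of these is a full paginated crawl in its own right - thousands of requests overall
        all_actors |= get_handles("/app.bsky.graph.getFollowers", "followers", actor)
        all_actors |= get_handles("/app.bsky.graph.getFollows", "follows", actor)
    return all_actors
```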
For one actor, the original code ran quickly enough, but looping through thousands of actors to get their followers and follows using the method above proved to be slow. It would have taken many hours to collect the data this way.
I tried using Cursor to introduce concurrency, so I could run pipeline.run for many actors at the same time. It insisted on trying to use the Python asyncio package to do this and got stuck - the code it generated after many iterations worked, but didn’t actually run asynchronously. I even had to upgrade to a paid plan to get a few more iterations on a large model, but I’ve since got a refund because it didn’t work - I barely used any of the iterations, and I found a new champion…
A number of people on Bluesky, and off, had been saying that Windsurf by Codeium was a significant upgrade on what had already been a good experience in Cursor, and that it was better for workflow tasks in particular. Most data engineering tasks fit this description.
So, I opened the repo in Windsurf at the point I had got to with Cursor, before I had tried to get concurrency to work. Windsurf also tried asyncio, but after one or two iterations of it not working, it came up with a different approach using multiprocessing. It started to work after a few tweaks, but dlt.pipeline.run using DuckDB causes locking between processes… so now what?
When fed the error, it came up with a genuinely novel approach, one I hadn’t seen anywhere before. It ran the pagination steps concurrently across multiple processes and then queued the dlt.pipeline.run steps to run against DuckDB sequentially in a single process. I’ve never seen this approach in the docs or in any of the technical help conversations in their community Slack. Windsurf uses a chain-of-thought approach - they call it Cascade - to come up with better solutions, and it’s effective!
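The pattern looks roughly like this - a simplified sketch that only fetches followers and collects everything before loading, whereas the real version queues batches onto the single writer as they arrive (the pipeline and dataset names are placeholders):

```python
import multiprocessing as mp

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator

def fetch_followers(actor: str) -> tuple[str, list[dict]]:
    """Paginate one actor's followers - pure HTTP work, safe to run in parallel."""
    client = RESTClient(
        base_url="https://public.api.bsky.app/xrpc",
        paginator=JSONResponseCursorPaginator(cursor_path="cursor", cursor_param="cursor"),
        data_selector="followers",
    )
    records: list[dict] = []
    for page in client.paginate(
        "/app.bsky.graph.getFollowers", params={"actor": actor, "limit": 100}
    ):
        records.extend(page)
    return actor, records

if __name__ == "__main__":
    actors = ["jayatillake.bsky.social"]  # in reality, thousands of handles

    # Fan the API pagination out across worker processes...
    with mp.Pool(processes=8) as pool:
        results = pool.map(fetch_followers, actors)

    # ...then run the DuckDB loads sequentially in this single process, so
    # dlt/DuckDB never see concurrent writers fighting over the database lock.
    pipeline = dlt.pipeline(
        pipeline_name="bluesky", destination="duckdb", dataset_name="bluesky_raw"
    )
    for actor, followers in results:
        pipeline.run(followers, table_name="followers")
```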
This approach was massively faster than the previous one (from 10 hours or so down to 1), but I noticed that processing the dlt.pipeline.run steps sequentially still took a really long time. DuckDB is great at big batch writes, but it’s not optimised for many small writes. In addition, dlt.pipeline.run stores state and metadata in a few different places, including DuckDB - really helpful for a production use case, but unnecessary for this project.
I wondered if it would be much faster to simply write the output of the pagination steps in microbatches to parquet, which I can then easily query with DuckDB later in my SQLMesh project. As it turns out, it is indeed much faster. Windsurf started off by suggesting loading the JSON output from the pagination processes into a Pandas dataframe and then using a dataframe method to write to parquet. That felt unnecessary to me - it should be possible to write straight to parquet. I expressed this to Windsurf and it suggested using pyarrow instead, which worked great.
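The pyarrow version is pleasingly direct - something like this sketch, with a made-up records list standing in for one microbatch of API output:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One microbatch of follower records straight from the paginated API calls (made up here)
records = [
    {"handle": "alice.bsky.social", "did": "did:plc:abc123", "followers_count": 42},
    {"handle": "bob.bsky.social", "did": "did:plc:def456", "followers_count": 7},
]

# Build an Arrow table directly from the list of dicts and write it out -
# no pandas dataframe in the middle
table = pa.Table.from_pylist(records)
pq.write_table(table, "followers_batch_0001.parquet")
```

DuckDB can then query the resulting files later with read_parquet over a glob, which is all the SQLMesh project needs.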
As you can see, I had also requested some logging so I could track progress and issues while running. Windsurf had also added useful things like error handling and dealing with worker timeouts from multiprocessing. As I got into the flow of working in Windsurf, it started to feel like working with another engineer, with me acting more like an architect. I’m pretty good at knowing exactly what I want software to do, but I don’t write it often enough to be productive. If I were writing code from scratch with just a language server, I’d get stuck really quickly because I wouldn’t know how to express what I want to do.
The experience with Windsurf and its Cascade feature feels like the step beyond Copilots, where you’re actually working with it to solve problems. Cascade is able to create and change multiple files in your project in response to your requests or problems. Its actual Copilot-style autocomplete was worse than Cursor’s - a bit annoying and too quick to interfere at times - but since interacting with Windsurf is less like coding and more like supervising… this was less important. It didn’t need to be told about docs; it went and found them proactively.
I actually learned new concepts from using it - when it suggested a method or code I didn’t understand, I looked it up or just asked what it meant in Cascade. It gave great explanations and also showed examples of why certain concepts were necessary. It could make it much easier for engineering teams to take on interns and junior engineers: they can learn how to write code from the AI, and learn how to engineer well from the senior engineers on the team, while using less of the seniors’ time and learning how to do things faster.
The great thing about it being a data process is that the proof is in the pudding - it was easy to check the output and compare it to earlier methods to be certain that it was doing what I thought it was.
Finally, I realised I was only using dlt for its ability to paginate over an endpoint. It felt heavy-handed to use it just for that, so I worked with Windsurf to replace it with a simpler solution using the Requests package. After a few tweaks, this worked too, and my final solution doesn’t use dlt or DuckDB at all. It runs in 10 mins, which is about 100x faster than my first method. If I filtered out the actors that aren’t real people but have over 100k followers or follows, or the interesting, highly followed people I follow who aren’t part of the data community, the process would run in about three or four minutes.
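The Requests version of the pagination is only a handful of lines - again a sketch of the pattern rather than the exact code in the repo:

```python
import requests

def get_followers(actor: str) -> list[dict]:
    """Follow the cursor on app.bsky.graph.getFollowers until it runs out."""
    url = "https://public.api.bsky.app/xrpc/app.bsky.graph.getFollowers"
    followers: list[dict] = []
    cursor = None
    while True:
        params = {"actor": actor, "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        followers.extend(data.get("followers", []))
        cursor = data.get("cursor")
        if not cursor:
            return followers

print(len(get_followers("jayatillake.bsky.social")))
```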
I’ve had fun coding into the wee hours with Windsurf, and just like I wasn’t going back to VSCode from Cursor, I’m not going back to Cursor from Windsurf1. I did get into that flow state, and hours melted away! This is where the puck is going. I can’t see AI replacing engineers any time soon. They just don’t have the ability to know what good looks like, but to someone who does, they are an amazingly powerful tandem engineer.
I will use the DuckDB database that I created using the longer process with dlt, so that I can illustrate how dlt metadata is inherited by SQLMesh. However, it was enlightening to explore how much is possible with new AI tooling.
It may be the case that, for simpler use cases where frameworks like dlt are heavy-handed, it’s perfectly easy to just partner up with your AI programmer to create something simpler and bespoke for your needs, with fewer dependencies.
Back to SQLMesh tomorrow!
My choice to try and switch to storing the data in parquet was perhaps inspired by the big news from AWS Re:Invent - S3 Table Buckets. Essentially, buckets designed for parquet storage, with automatic metadata creation and maintenance in Iceberg!
The more I think about this announcement, the more I think that it changes things a great deal. The gravity of data has shifted from the warehouse to cloud storage... but is there really a difference any more?
I haven't seen documentation on how these special Table buckets are used, but given the metadata (Iceberg catalog) is automatically generated, I imagine that you can just push parquet to these buckets in some reasonably standard layout and partitioning pattern and it will just work. I know - a big caveat! I want to look further into how they work, but I’m sure that the other cloud providers will follow suit soon.
The consequences for data engineering are huge. Increasingly, big SaaS providers, like CRMs, offer connectors to many data warehouses. They no longer need to do this - they can just go straight to S3 Table buckets and provide schema info and code examples showing how to query the data.
The metadata in S3 Metadata and Iceberg then becomes very complete - more or less an org's whole data estate, apart from OLTP. Even with OLTP, there are now ways to separate storage from compute and store data as files in cloud storage.
If there were a universally accepted open format for storing row-based data that could work with Iceberg, or its own universal catalog standard... then we'd have metadata for the whole data estate. The need for complex data catalogs that ingest metadata from many different systems could decline.
I think it's a great time to be a Data or Analytics Engineer, and this kind of news increasingly merges the two roles. If all data is in parquet/Iceberg, both Data and Analytics Engineers need to be able to work with it there anyway.
The role distinction then becomes operating further to the left or right of the DAG... which is probably not a worthwhile distinction, except in the largest data teams. Bronze/staging work, which now means accepting Iceberg as input, can become much more standardised too.
Data Engineering becomes more focused on "librarian"-type work - the meaning of data and transforming it for use - than on platform and infra work.
1. Although the gap is smaller this time. I believe that, with Cursor’s likely head start in funding, they could work hard to catch up or leapfrog Windsurf. I’m talking about Cursor as it is today when I say I’m not going back. This is capitalism working well!