Data Platform Journal #11
Query DuckDB in AWS Lambda, SQL Unit Testing, AWS Kinesis is not Apache Kafka, Data Migrations and Software Estimations
I’m afraid I have no updates to the “How to Build a Data Platform” guide, but I should have one ready next week as it’s currently going through review. Instead, I’ll be sharing more links than usual!
This week we have:
How to query a Data Lake with DuckDB and AWS Lambda
SQL Unit Testing
How to Create a Data Product Focused Data Strategy
Amazon Kinesis is not Apache Kafka
How Documentation is Managed at Funding Circle’s Data Platform
What’s the difference between transactional databases and analytical databases?
Migrations - The Hardest Actual Problem in Computer Science
Tips for Software Development Estimations
A Serverless Query Engine from Spare Parts
A great article from Ciro Greco shows how to get high performance serverless querying of a Data Lake using just open source DuckDB, AWS S3 and AWS Lambda; no server required!
There are limitations to this (a Lambda can only run for up to 15 minutes and only supports up to 10 GB of RAM), but this looks like a very cheap way to serve smallish amounts of data (1GB and under?) in a Data Lake via an API.
What is also great about this is that if you find yourself hitting Lambda’s limits, you can port all or most of the same code over to a container or Virtual Machine, where it can outperform Spark on 100GB+ datasets. Maybe that’s all the compute you need.
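To make the shape of this concrete, here is a minimal sketch of a Lambda handler that queries Parquet files on S3 with DuckDB. It is not the code from the article: the event fields (`bucket`, `prefix`), the `parquet_glob` helper and the file layout are all assumptions made for illustration, and S3 credential setup is left out because it varies between DuckDB versions.

```python
import json


def parquet_glob(bucket: str, prefix: str) -> str:
    """Build the S3 glob pattern that DuckDB will scan (hypothetical layout)."""
    return f"s3://{bucket}/{prefix.strip('/')}/*.parquet"


def handler(event, context):
    """AWS Lambda entry point: run one aggregate query and return JSON."""
    # duckdb is a third-party package that must be bundled with the function
    # (for example in a layer or container image). It is imported here so the
    # helper above can be used without it.
    import duckdb

    con = duckdb.connect(database=":memory:")
    # Lambda only allows writes under /tmp, so point DuckDB's home there
    # before installing the extension that adds S3 support.
    con.execute("SET home_directory='/tmp'")
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Assumption: S3 credentials and region are configured separately.
    # How DuckDB picks them up depends on its version, so check the docs.

    path = parquet_glob(event["bucket"], event["prefix"])
    rows = con.execute(
        "SELECT count(*) AS row_count FROM read_parquet(?)", [path]
    ).fetchall()
    return {"statusCode": 200, "body": json.dumps({"row_count": rows[0][0]})}
```

Because the query logic is plain Python plus DuckDB, the same `handler` body could be moved into a container or VM with little change, which is the portability point made above.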
The SQL Unit Testing Landscape: 2023 (May Require Medium Paid Access)
This is a paid Medium article, which you may not have access to (maybe try another email address!), but I liked the article so much that I still wanted to share it.
SQL is great for querying data and is the default choice for doing data transformations, though I feel one of its biggest drawbacks is that there are no great ways to do Unit Testing. I know some will disagree, but I’ve always found more joy in calling SQL from PyTest on sample data, which is often also created in Python using libraries like Faker.
But Chad Isenberg, a Data Engineer at Zendesk, has written a great summary of tools that don’t directly involve using Python, highlighting SQLMesh as a compelling new approach to testing.
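For comparison, the Python-driven approach mentioned above might look something like the sketch below. It uses the standard library’s `sqlite3` as a stand-in database, so it assumes your query runs on SQLite’s dialect; the `orders` table, the query and the sample rows are all made up for illustration, and the rows are hand-written rather than generated with Faker.

```python
import sqlite3

# Query under test: revenue per customer, counting paid orders only
# (hypothetical schema, not taken from the article).
REVENUE_PER_CUSTOMER = """
    SELECT customer_id, SUM(amount) AS revenue
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_query(rows):
    """Load sample rows into a throwaway in-memory database and run the query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = con.execute(REVENUE_PER_CUSTOMER).fetchall()
    con.close()
    return result


def test_refunded_orders_are_excluded():
    sample = [
        (1, 10.0, "paid"),
        (1, 5.0, "refunded"),  # must not count towards revenue
        (2, 7.5, "paid"),
        (2, 2.5, "paid"),
    ]
    assert run_query(sample) == [(1, 10.0), (2, 10.0)]
```

PyTest will pick up any function named `test_*`, so saving this in a file such as `test_revenue.py` and running `pytest` is enough. The main weakness is the dialect gap: a query that passes on SQLite may still behave differently on your warehouse.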
How to Create a Data Product Focused Data Strategy (Sponsored)
Looking to increase the value of your data? Luke Sharma, my colleague at Oakland Group, shows how a Data Product focused Data Strategy can help achieve that mission.
Amazon Kinesis is not Apache Kafka
It’s very easy to lean too much on our pattern-matching instincts and assume two real-time streaming services are broadly the same: Senior Data Engineer and longtime content writer Bartosz Konieczny takes us through all the technical differences between Kinesis and Kafka, proving they are designed for different use cases.
OLTP vs OLAP - Transactions Vs Analytics
I know there are a few articles out there explaining the differences between analytical databases and transactional databases, but Ben Rogojan has done such a great job of explaining the difference in lots of depth that I felt it was still worth sharing.
How we Manage Documentation at Funding Circle for our Data Platform
Good data documentation is a major factor in the long-term success of a Data Platform, but it is hard to get right at scale, particularly if you have multiple teams all using different methods of documentation (if any documentation exists!).
Nikolajs Skrjabins at Funding Circle shows how they built a central documentation portal for their Data Platform, for what looks like a very low cost, using a static site generator.
This could potentially act as a low-cost alternative for Data Catalogs, though it won’t have features like data schemas or lineage out of the box.
Video: Migrations - The Hardest Actual Problem in Computer Science
While not directly on the topic of Data Platforms, I will take any quality content on migrations because, as the title points out, they can be one of the hardest tasks to perform when building Data Platforms and in Software Engineering generally.
The video, presented by Matt Ranney of DoorDash, goes through the pros and cons of a number of migration strategies, including running two databases at the same time or having a shared database.
Rules of Thumb for Software Development Estimations
Another very difficult task (some would argue an impossible task) is estimating when a software system, task or bug fix will be done.
Of course, the ideal is to do as little estimation as possible, but in reality that is often impossible, as businesses want to know how much budget to allocate to a software project.
You can forecast based on how long previous work took, but this requires lots of past data on how long tasks took to complete and is only a little better, as there are always known unknowns and unknown unknowns that can derail even the most well-planned and carefully forecasted projects.
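As a rough illustration of forecasting from past work, the sketch below resamples historical task durations many times (a simple Monte Carlo simulation) to get a likely and a cautious total. This technique is my own choice of example rather than something taken from the linked article, and the `history` figures are made up.

```python
import random


def forecast_days(past_durations, remaining_tasks, runs=10_000, seed=42):
    """Simulate total days for the remaining tasks by resampling past durations."""
    rng = random.Random(seed)  # fixed seed so the forecast is repeatable
    totals = sorted(
        sum(rng.choice(past_durations) for _ in range(remaining_tasks))
        for _ in range(runs)
    )
    # Report a likely figure (50th percentile) and a cautious one (85th).
    return {"p50": totals[runs // 2], "p85": totals[int(runs * 0.85)]}


history = [1, 2, 2, 3, 3, 5, 8, 13]  # days each past task took (made-up data)
forecast = forecast_days(history, remaining_tasks=10)
```

Quoting the range between `p50` and `p85` rather than a single date is arguably more honest, but the result is only as good as the history fed in, and it cannot see work nobody knew about.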
That said, Vadim Kravcenko, CTO of Mindnow, gives a great list of tips and tricks to help mitigate your risks around project estimation. It is worth reading the Hacker News comments on this article for more insights, and I also recommend Allen Holub’s video on why any estimation can be bad.
Sponsored by The Oakland Group, a full service data consultancy.
Cover photo from Unsplash.