Issue #33: Podcast, Practical Data Modelling, and Reviewing Databases in 2023
Plus: Automated dbt Test Generation, BigQuery Design Deep Dive and Data Exploration in Stream Processing
Hi all, this month we have:
Podcast: How Data Platforms affect AI and ML
Practical Data Modelling
RAG Using Structured Data: Overview and Important Questions
Automated dbt Test Generation
Databases in 2023: A Year in Review
I spent 6 hours understanding the design principles of BigQuery. Here's what I found.
Rethinking Stream Processing: Data Exploration
Podcast: How Data Platforms affect AI and ML
While I haven’t had time to write an article this month, I did record a podcast with MLOps Community on how Data Platforms affect AI and ML. There is a video and an audio (Apple, Spotify) version, so pick whatever format suits you.
Practical Data Modelling
Joe Reis, co-author of one of the most important books in Data Engineering, Fundamentals of Data Engineering, has started a blog on Data Modelling. It is an area I feel lacks authoritative, up-to-date material, so I'm excited to hear what Joe has to say.
Databases in 2023: A Year in Review
You could argue Andy Pavlo is one of the leading experts in databases: he is, after all, “Associate Professor of Databaseology” at Carnegie Mellon University and CEO of database tuning company OtterTune.
As expected, his thoughts on the latest database technologies and the SQL language are highly insightful.
RAG Using Structured Data: Overview and Important Questions
As mentioned above, everyone wants an “Intelligent Data Platform”, and building one probably requires using RAGs to ground LLMs in your data.
But how good are RAGs at reading structured data and outputting accurate data? Semih Salihoğlu, CEO of graph database company Kuzu and Associate Professor, reviews the latest scientific literature on the topic to give an overview.
I also highly recommend reading Semih’s sister article on using RAGs with unstructured data and the role of knowledge graphs.
Automated dbt Test Generation
While I use dbt for most of my data quality testing these days, I still miss Great Expectations' ability to profile your data for you and generate tests, saving you loads of effort in configuring the tests yourself.
Kevin McQuate has released test generation for dbt. While it only generates half a dozen types of tests for now, hopefully it will grow to match Great Expectations' profiling capability.
I spent 6 hours understanding the design principles of BigQuery. Here's what I found.
While Google has, to a certain extent, struggled for cloud market share against AWS and Azure, it is still loved by many Data Engineers and Analysts, mostly thanks to great data products like BigQuery, which more than holds its own against the best Warehouses and Lakehouses.
Vu Trinh dives deep into BigQuery internals to find out why.
Rethinking Stream Processing: Data Exploration
How do you explore data that is constantly in motion? Shi Kai Ng, Calvin Tran and Minh Nhat Nguyen from Grab, a ride-hailing and food delivery app (among many other features) that has hundreds of millions of users in Southeast Asia, try to answer that question.
Sponsored by The Oakland Group, a full-service data consultancy. Download our guide or contact us if you want to find out more about how we build data platforms!