Issue #20: Why Invest in Data Quality?
Plus Why Software Engineering Can't Solve Data's Problems, Streaming data without the BS, Declarative vs Imperative Code for Data Pipelines and SQL Templates to APIs
Why Invest in Data Quality?
This can seem like a rhetorical question: you should always invest in Data Quality! But I argue we are still not investing enough: surveys show Data Quality issues are increasing in most organisations and, on average, take up 34% of a Data Engineer’s time, time that could instead be spent creating value by adding new features. This increases to 50% in large Data Platforms.
All these Data Quality issues add up, with bad Data Quality costing organisations an average of $15 million a year.
Data Quality investment is also an investment in high-quality AI and ML, as you’ll likely get more accurate AI and ML results from improving Data Quality than from changing your AI model and code.
Having Data Quality checks in place helps reduce “data downtime” from outages and fixes, which in turn increases the overall reliability of the Data Platform. Highly reliable data leads to more trust in data and better-informed decision-making.
Better decision-making should increase the profitability, productivity, and confidence of the whole organisation, which in turn usually leads to more investment in data and, coming back to where we started, further improvements in Data Quality.
All this creates a “Virtuous Cycle” of Data Quality, constantly improving your organisation:
If Data Quality decreases, the opposite happens and a negative cycle takes hold.
Better Data Quality testing should also reduce the blast radius of an issue to a few Data Engineers rather than hundreds or thousands of users, as more issues are found earlier:
The fewer users impacted, the smaller the cost caused by the issue, which should again pay back any investment in Data Quality in large multiples.
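To make this concrete, below is a minimal sketch of the kind of check that catches an issue before it reaches users. The table, column and file names are hypothetical, and in a real Data Platform you would more likely use a testing framework (such as dbt tests or Great Expectations) than hand-rolled Python, but the idea is the same:

```python
import pandas as pd

def run_quality_checks(orders: pd.DataFrame) -> list[str]:
    """Run a few basic Data Quality checks and return a list of failures."""
    failures = []

    # Completeness: key business columns should never be null
    for column in ["order_id", "customer_id", "order_total"]:
        null_count = int(orders[column].isna().sum())
        if null_count:
            failures.append(f"{null_count} null values in {column}")

    # Uniqueness: the primary key should not contain duplicates
    duplicate_count = int(orders["order_id"].duplicated().sum())
    if duplicate_count:
        failures.append(f"{duplicate_count} duplicate order_id values")

    # Validity: order totals should never be negative
    negative_count = int((orders["order_total"] < 0).sum())
    if negative_count:
        failures.append(f"{negative_count} negative order_total values")

    return failures

# Fail the pipeline run (small blast radius) instead of shipping bad data to users
issues = run_quality_checks(pd.read_parquet("orders.parquet"))
if issues:
    raise ValueError(f"Data Quality checks failed: {issues}")
```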
This is from the upcoming Data Quality section of my in-progress guide on “How to Build a Data Platform”, where we’ll cover how to do Data Quality Testing, Data Reliability, Data Observability and Data Contracts.
Data is not a Microservice: Why Software Engineering Can't Solve Data's Problems
A wide-ranging article from Chad Sanderson (former Head of Product, Data Platform at Convoy and founder of the 7k-strong Data Quality Camp community) about how Data Engineering differs from Software Engineering and why we shouldn’t adopt Microservices patterns too closely when implementing a Data Mesh.
There are lots of interesting ideas in the post:
Microservices are less concerned with a Single Source of Truth and gaining trust in data than Data Products are.
60%+ of analytical data isn’t used. It’s very easy to pre-optimise a data solution with expensive governance and data quality checks that are never used.
Data has a higher rate of change than software code.
How the data developer lifecycle differs from the software lifecycle.
Improving(?) the Data Mesh architecture to better fit the data developer lifecycle.
Declarative & Imperative Code for Data Engineering
Matt Palmer, Developer Relations Engineer at Mage, has written about the pros and cons of declarative (SQL, YAML, No-Code) and imperative (Python) code styles for data pipelines, as well as trying to find a middle ground with Mage’s hybrid approach.
It’s a great article, though I will mention two competitors: Prefect offers “Blocks”, which let you convert imperative Python code into reusable components that can be shared in a user interface, while Dagster allows you to declare code as a reusable asset in a Data Catalog for better Data Governance.
It is quite an exciting time for Data Pipeline innovation, with three different approaches to making Data Engineering more efficient and reusable across environments and data teams.
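To illustrate the distinction with a toy example of my own (not taken from Matt’s article or Mage’s API): here is the same daily-revenue transformation written imperatively in Python and declaratively in SQL.

```python
import pandas as pd

# Imperative style: spell out *how* to compute the result, step by step
def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    completed = orders[orders["status"] == "completed"]
    grouped = completed.groupby("order_date", as_index=False)["order_total"].sum()
    return grouped.rename(columns={"order_total": "revenue"})

# Declarative style: describe *what* you want and let the engine work out how
DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(order_total) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY order_date
"""
```

The imperative version gives you full control over (and responsibility for) every step, while the declarative version hands the “how” to the engine, which is a big part of what makes it easier to reuse, document and optimise.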
Streaming data without the BS; when you need it, how you do it
Hugo Lu pours a bit of cold water on the hype of streaming data by debunking some myths.
There is a lot of innovation in the area, with big players like Confluent, Databricks and Snowflake trying to bring streaming to everyone (plus newer companies like Estuary, Ververica and RisingWave), but it still has a way to go before streaming replaces batch as the default way we process data (if it ever does).
Hubert Dulay’s Streaming Updates for the People Newsletter
If the above post hasn’t put you off streaming, then the author of Streaming Data Mesh, Hubert Dulay, has started a newsletter on the latest news in streaming.
I also found his 5-part series on the streaming ecosystem to be a great overview of streaming technology.
Building Real-time Machine Learning Foundations at Lyft
Lyft shows how hard it is to build Real-Time Machine Learning if you don’t buy off-the-shelf tooling (which will be easier to implement but may cost you more at scale).
Performance Benchmarks of Dedicated Vector Database (Qdrant) vs. PostgreSQL Vector Extension (pgvector)
I was sceptical of dedicated Vector Databases until I saw this post by Machine Learning Engineer Nirant Kasliwal, which shows Qdrant blowing pgvector out of the water when it comes to performance.
I’d still like to see how Qdrant and other Vector Databases stack up against more mature, high-performance data processing products that support vector searching, like ClickHouse and Elasticsearch.
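For anyone who hasn’t used either, here is a rough sketch of what the same nearest-neighbour query looks like in both systems. The collection, table and column names are my own, and the client calls are from memory, so check the current qdrant-client and pgvector docs before copying:

```python
import psycopg2
from qdrant_client import QdrantClient

query_vector = [0.12, 0.45, 0.33]  # toy 3-dimensional embedding

# Qdrant: a dedicated vector database with its own client and HTTP/gRPC API
qdrant = QdrantClient(url="http://localhost:6333")
qdrant_hits = qdrant.search(collection_name="documents", query_vector=query_vector, limit=5)

# pgvector: vector search added to PostgreSQL via an extension and SQL operators
# ("<->" is Euclidean distance; "<=>" would give cosine distance)
conn = psycopg2.connect("dbname=app user=postgres")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[" + ",".join(str(x) for x in query_vector) + "]",),
    )
    pg_hits = cur.fetchall()
```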
VulcanSQL: SQL Templates into APIs
This open source library converts dbt-style SQL templates into documented REST APIs, for easier sharing of data across networks.
While converting databases to REST APIs is not new (see PostgREST), I’ve yet to see any that use templated SQL to easily parametrize API requests.
It also has decent security features, though I wonder whether the REST APIs could be imported into AWS API Gateway or Azure APIM for more enterprise compliance.
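To show the general pattern rather than VulcanSQL’s actual syntax (see its docs for that), here is a rough sketch of templated SQL behind a REST endpoint using plain Flask, Jinja2 and SQLite; the table and endpoint names are made up:

```python
# A rough sketch of the "templated SQL -> REST API" pattern,
# not VulcanSQL's actual syntax or API
import sqlite3
from flask import Flask, jsonify, request
from jinja2 import Template

app = Flask(__name__)

# A dbt-style SQL template: the optional filter is only included
# when the API caller supplies the matching query parameter
ORDERS_TEMPLATE = Template("""
    SELECT order_id, customer_id, order_total
    FROM orders
    {% if customer_id %}WHERE customer_id = :customer_id{% endif %}
    LIMIT 100
""")

@app.route("/orders")
def orders():
    customer_id = request.args.get("customer_id")
    sql = ORDERS_TEMPLATE.render(customer_id=customer_id)
    conn = sqlite3.connect("warehouse.db")
    conn.row_factory = sqlite3.Row
    # The value itself is passed as a bound parameter, not rendered into the SQL
    params = {"customer_id": customer_id} if customer_id else {}
    rows = conn.execute(sql, params).fetchall()
    return jsonify([dict(row) for row in rows])
```

VulcanSQL packages this kind of pattern up with the API documentation and security features mentioned above, so you don’t have to build them by hand.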
Sponsored by The Oakland Group, a full-service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!