Issue #20: Why Invest in Data Quality?
Plus Why Software Engineering Can't Solve Data's Problems, Streaming data without the BS, Declarative vs Imperative Code for Data Pipelines and SQL Templates to APIs
Why Invest in Data Quality?
This can seem like a rhetorical question: you should always invest in Data Quality! But I argue we are still not investing enough: surveys show Data Quality issues are increasing in most organisations and, on average, take up 34% of a Data Engineer’s time, time that could instead be spent creating value by adding new features. This increases to 50% in large Data Platforms.
All these Data Quality issues add up, with bad Data Quality costing organisations an average of $15 million a year.
Data Quality investment is also an investment in high-quality AI and ML, as you’ll likely get more accurate AI and ML results from improving Data Quality than from changing your AI model and code.
Having Data Quality checks in place helps reduce “data downtime” from outages and fixes, which in turn increases the overall reliability of the Data Platform. Highly reliable data leads to more trust in data and better-informed decision-making.
Better decision-making should increase the profitability, productivity, and confidence of the whole organisation, which in turn usually leads to more investment in data and, coming back to where we started, further improvements in Data Quality.
All this creates a “Virtuous Cycle” of Data Quality, constantly improving your organisation:
If Data Quality decreases, the opposite happens and a negative cycle takes hold.
Better Data Quality testing should also reduce the blast radius of an issue to a few Data Engineers rather than hundreds or thousands of users, as more issues are found earlier:
The fewer users impacted, the smaller the cost caused by the issue, which should again pay back any investment in Data Quality in large multiples.
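To make this concrete, below is a minimal sketch of the kind of check that catches an issue before it reaches users. The table, column and file names are hypothetical, and in a real Data Platform you would more likely use a testing framework (such as dbt tests or Great Expectations) than hand-rolled Python, but the idea is the same:

```python
import pandas as pd

def run_quality_checks(orders: pd.DataFrame) -> list[str]:
    """Run a few basic Data Quality checks and return a list of failures."""
    failures = []

    # Completeness: key business columns should never be null
    for column in ["order_id", "customer_id", "order_total"]:
        null_count = int(orders[column].isna().sum())
        if null_count:
            failures.append(f"{null_count} null values in {column}")

    # Uniqueness: the primary key should not contain duplicates
    duplicate_count = int(orders["order_id"].duplicated().sum())
    if duplicate_count:
        failures.append(f"{duplicate_count} duplicate order_id values")

    # Validity: order totals should never be negative
    negative_count = int((orders["order_total"] < 0).sum())
    if negative_count:
        failures.append(f"{negative_count} negative order_total values")

    return failures

# Fail the pipeline run (small blast radius) instead of shipping bad data to users
issues = run_quality_checks(pd.read_parquet("orders.parquet"))
if issues:
    raise ValueError(f"Data Quality checks failed: {issues}")
```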
This is from the upcoming Data Quality section of my in-progress guide on “How to Build a Data Platform”, where we’ll cover how to do Data Quality Testing, Data Reliability, Data Observability and Data Contracts.
Data is not a Microservice: Why Software Engineering Can't Solve Data's Problems
A wide-ranging article from Chad Sanderson (former Head of Product, Data Platform at Convoy and founder of the 7k-strong Data Quality Camp community) about how Data Engineering differs from Software Engineering and why we shouldn’t adopt Microservices patterns too closely when implementing a Data Mesh.
There are lots of interesting ideas in the post:
Microservices are less concerned with a Single Source of Truth and gaining trust in data than Data Products are.
60%+ of analytical data isn’t used. It’s very easy to pre-optimise a data solution with expensive governance and data quality checks that are never used.
Data has a higher rate of change than software code.
How the data developer lifecycle differs from the software lifecycle.
Improving(?) the Data Mesh architecture to better fit the data developer lifecycle.
Declarative & Imperative Code for Data Engineering
Matt Palmer, Developer Relations Engineer at Mage, has written about the pros and cons of declarative (SQL, YAML, No-Code) and imperative (Python) code styles for data pipelines, as well as trying to find a middle ground with Mage’s hybrid approach.
It’s a great article, though I will mention two competitors: Prefect offers “Blocks”, which let you convert imperative Python code into reusable components that can be shared in a user interface, while Dagster allows you to declare code as a reusable asset in a Data Catalog for better Data Governance.
It is quite an exciting time for Data Pipeline innovation, with three different approaches to making Data Engineering more efficient and reusable across environments and data teams.
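To illustrate the distinction with a toy example of my own (not taken from Matt’s article or Mage’s API): here is the same daily-revenue transformation written imperatively in Python and declaratively in SQL.

```python
import pandas as pd

# Imperative style: spell out *how* to compute the result, step by step
def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    completed = orders[orders["status"] == "completed"]
    grouped = completed.groupby("order_date", as_index=False)["order_total"].sum()
    return grouped.rename(columns={"order_total": "revenue"})

# Declarative style: describe *what* you want and let the engine work out how
DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(order_total) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY order_date
"""
```

The imperative version gives you full control over (and responsibility for) every step, while the declarative version hands the “how” to the engine, which is a big part of what makes it easier to reuse, document and optimise.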
Streaming data without the BS; when you need it, how you do it
Hugo Lu pours a bit of cold water on the hype of streaming data by debunking some myths.
There is a lot of innovation in the area, with big players like Confluent, Databricks and Snowflake trying to bring streaming to everyone (plus newer companies like Estuary, Ververica and RisingWave), but it still has a way to go before streaming replaces batch as the default way we process data (if it ever does).
Hubert Dulay’s Streaming Updates for the People Newsletter
If the above post hasn’t put you off streaming, then the author of Streaming Data Mesh, Hubert Dulay, has started a newsletter on the latest news in streaming.
I also found his 5-part series on the streaming ecosystem to be a great overview of streaming technology.
Building Real-time Machine Learning Foundations at Lyft
Lyft shows how hard it is to build Real-Time Machine Learning if you don’t buy off-the-shelf tooling (which will be easier to implement but may cost you more at scale).
Performance Benchmarks of Dedicated Vector Database (Qdrant) vs. PostgreSQL Vector Extension (pgvector)
I was sceptical of dedicated Vector Databases until I saw this post by Machine Learning Engineer Nirant Kasliwal, which shows Qdrant blowing pgvector out of the water when it comes to performance.
I’d still like to see how Qdrant and other Vector Databases stack up against more mature, high-performance data processing products that support vector searching, like ClickHouse and Elasticsearch.
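For anyone who hasn’t used either, here is a rough sketch of what the same nearest-neighbour query looks like in both systems. The collection, table and column names are my own, and the client calls are from memory, so check the current qdrant-client and pgvector docs before copying:

```python
import psycopg2
from qdrant_client import QdrantClient

query_vector = [0.12, 0.45, 0.33]  # toy 3-dimensional embedding

# Qdrant: a dedicated vector database with its own client and HTTP/gRPC API
qdrant = QdrantClient(url="http://localhost:6333")
qdrant_hits = qdrant.search(collection_name="documents", query_vector=query_vector, limit=5)

# pgvector: vector search added to PostgreSQL via an extension and SQL operators
# ("<->" is Euclidean distance; "<=>" would give cosine distance)
conn = psycopg2.connect("dbname=app user=postgres")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[" + ",".join(str(x) for x in query_vector) + "]",),
    )
    pg_hits = cur.fetchall()
```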
VulcanSQL: SQL Templates into APIs
This open source library converts dbt-style SQL templates into documented REST APIs, for easier sharing of data across networks.
While converting databases to REST APIs is not new (see PostgREST), I’ve yet to see any that use templated SQL to easily parametrize API requests.
It also has decent security features, though I wonder whether the REST APIs could be imported into AWS API Gateway or Azure APIM for more enterprise compliance.
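To show the general pattern rather than VulcanSQL’s actual syntax (see its docs for that), here is a rough sketch of templated SQL behind a REST endpoint using plain Flask, Jinja2 and SQLite; the table and endpoint names are made up:

```python
# A rough sketch of the "templated SQL -> REST API" pattern,
# not VulcanSQL's actual syntax or API
import sqlite3
from flask import Flask, jsonify, request
from jinja2 import Template

app = Flask(__name__)

# A dbt-style SQL template: the optional filter is only included
# when the API caller supplies the matching query parameter
ORDERS_TEMPLATE = Template("""
    SELECT order_id, customer_id, order_total
    FROM orders
    {% if customer_id %}WHERE customer_id = :customer_id{% endif %}
    LIMIT 100
""")

@app.route("/orders")
def orders():
    customer_id = request.args.get("customer_id")
    sql = ORDERS_TEMPLATE.render(customer_id=customer_id)
    conn = sqlite3.connect("warehouse.db")
    conn.row_factory = sqlite3.Row
    # The value itself is passed as a bound parameter, not rendered into the SQL
    params = {"customer_id": customer_id} if customer_id else {}
    rows = conn.execute(sql, params).fetchall()
    return jsonify([dict(row) for row in rows])
```

VulcanSQL packages this kind of pattern up with the API documentation and security features mentioned above, so you don’t have to build them by hand.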
Sponsored by The Oakland Group, a full-service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!