Introduction
Data Transformations are the heart of any Data Platform. They combine raw operational data with business rules to produce analytical data that yields insights and supports decision making, thereby adding value to your Data Platform.
In this section, we’ll discuss the pros and cons of the two most popular programming languages used to perform these transformations, SQL and Python, and the importance of code quality in a Data Platform.
SQL or Python Transformations?
Whether to use SQL and/or Python for data transformations is one of the most controversial topics in Data Engineering, though we would say the trend is to use SQL unless it doesn’t suit the task.
We’d broadly agree with that: SQL is arguably more popular and easier to learn. That said, building a Data Platform using mostly Python transformations isn’t noticeably harder, based on our experience of doing so many times, and Python confers certain benefits over SQL if you are willing to climb the steeper learning curve. We list the benefits of both languages below:
Benefits of SQL:
SQL is designed mainly for querying data, most often in a database, and is by far the most popular language for that task.
Because of that, and because it is a declarative language, it is often the most efficient way to query data in terms of both lines of code and performance (see the sketch after this list).
SQL is often quicker for ad-hoc Business Intelligence queries. If you want to query a database / data warehouse, 95% of the time it’s best to use SQL.
It is also worth mentioning dbt, a popular library that builds data transformation pipelines from SQL (dbt itself is written in Python).
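To make the lines-of-code point concrete, here is a hedged sketch (the orders data is made up) comparing a declarative SQL aggregation with a procedural Python equivalent, using Python’s built-in sqlite3 module:

```python
import sqlite3

# Build a toy orders table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
)

# Declarative: one statement says *what* we want; the engine decides how.
totals_sql = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()

# Procedural: we spell out *how* to compute the same result, step by step.
totals_py = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_py[customer] = totals_py.get(customer, 0.0) + amount

print(totals_sql)                 # e.g. [('alice', 15.0), ('bob', 25.0)]
print(sorted(totals_py.items()))  # [('alice', 15.0), ('bob', 25.0)]
```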
Benefits of Python:
Python is a general-purpose programming language that has use cases beyond data (DevOps, Web Development, etc.) and is a very good “glue” language: excellent at joining two data or software systems that don’t have out-of-the-box integration. It’s entirely possible to build an entire Data Platform in Python.
Python is much better for Data Science use cases; the most popular Data Science libraries, like PyTorch, TensorFlow, scikit-learn and spaCy, are used from Python, though some Data Scientists prefer to do their transformations in SQL and reserve Python for the actual algorithms.
You can also use SQL for some machine learning use cases; for example, BigQuery ML lets you train and run models using SQL alone.
Python is arguably better at dynamic queries (where the tables and/or columns queried are determined by other data); dynamic SQL generally has to be written in strings, which can be difficult to debug. You should keep your data queries as static as possible for readability and easier maintainability, but sometimes dynamism is unavoidable (see the sketch after this list).
There is an argument that long, complex queries are better written in Python, as it’s arguably easier to debug a procedural language than a declarative one: you can step through the code line by line. Then again, you can use Common Table Expressions (CTEs) to break long SQL queries into named, readable steps.
Many popular Data Engineering libraries are written in Python and used from it (Apache Airflow, Great Expectations, etc.).
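As an illustration of the dynamic-query point, here is a hedged sketch (the sales table and its columns are made up) where the grouping column is chosen at runtime: the identifier is validated against an allowlist, since identifiers can’t be bound as query parameters, while the filter value is passed as a bound parameter rather than interpolated into the string:

```python
import sqlite3

# Columns we are willing to group by; identifiers can't be bound as
# parameters, so we validate them against an allowlist instead.
ALLOWED_GROUP_COLUMNS = {"region", "product", "channel"}

def sales_by(conn: sqlite3.Connection, group_column: str, min_amount: float):
    """Total sales, grouped by a column chosen at runtime."""
    if group_column not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"Unsupported group column: {group_column!r}")
    query = (
        f"SELECT {group_column}, SUM(amount) FROM sales "
        f"WHERE amount >= ? GROUP BY {group_column}"
    )
    # Values, unlike identifiers, go in as bound parameters.
    return conn.execute(query, (min_amount,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("EU", "widget", "web", 100.0), ("US", "widget", "shop", 40.0)],
)
print(sales_by(conn, "region", 50.0))  # [('EU', 100.0)]
```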
In a large, complex Data Platform you’ll likely end up using both, so it’s worth knowing SQL and Python well: SQL for simple to intermediate analytics and transformations; Python for very advanced analytics, Data Science, DataOps, or gluing together components that can’t be connected easily or cheaply with a managed service.
Some argue that because bad code can be written in Python (notebooks that only run out of order) or SQL (long, undocumented scripts), you shouldn’t use them; but any language can be written badly or well, so we don’t buy that argument.
What about other languages?
Using Java is not a bad idea for real-time streaming, as the most popular streaming solution, Kafka, is best supported in Java (though Python and SQL cover a large number of real-time use cases as well). R is widely used in Data Science, more so than Python in academic fields like Economics, but isn’t as well supported by Public Clouds and DevOps tooling as Python is.
We’ve also seen data solutions written in C, Julia and Rust for high-performance use cases, but it’s very rare for the average Data Engineer or Analyst to use these languages, as they are arguably more difficult to learn. They also show meaningful performance gains in only a small number of use cases, as many Python libraries have already moved their most performance-critical parts to a higher-performance language. A side effect of a more difficult language is that fewer people use it, so there are arguably fewer libraries to choose from.
It’s also worth mentioning that there are replacements for SQL, some with quite a bit of traction in their niches, such as Cypher and Gremlin for graph databases. There are also languages aiming to replace SQL in a more general way, such as Google’s Malloy, but as of 2023 it is very new and supports only a limited number of databases.
Why Code Quality is important to Data Platforms
Having the whole data team write code in a consistent and well-documented way makes it easier to onboard new engineers, guards against common coding issues and improves efficiency, as the code is easier to understand and change in the future. It will also likely lead to a more reliable Data Platform: misunderstanding how existing code works is one of the top reasons bugs are introduced.
There is often a tradeoff, however, between writing the perfect data transformation code, the time you have to implement it and, sometimes, performance. Tooling can help, as we describe below.
How do you create consistent code?
A good start is to look at programming style guides (Python and SQL) and adapt them to your business: feel free to change them to suit your needs. It is also worth reading up on the SOLID principles for more intermediate concepts of maintainable code, though treat these as strong guidelines rather than a set of rules: break them when it makes sense.
Simple code linters can help enforce a consistent code style, and can run automatically in a DevOps pipeline and/or commit hook. More advanced linters will also look for code “smells” (such as unused code), perform security scanning and flag functions that are too complex.
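As a hedged illustration (rule codes vary between tools), the snippet below contains two smells that a linter such as flake8 or ruff would flag out of the box:

```python
import json  # flagged: imported but never used (pyflakes rule F401)
import os


def load_config(path):
    override = os.environ.get("CONFIG_OVERRIDE")  # flagged: assigned but never used (F841)
    with open(path) as handle:
        return handle.read()
```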
Documentation
Some argue that self-documenting code is enough documentation. While it’s important to give variables, functions, queries and classes descriptive names to aid readability, self-documenting code doesn’t explain why the code exists or why it was written the way it was.
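A short, hedged example of the difference (the business rule here is invented): the function name makes the what readable, but only the comment records the why:

```python
def exclude_test_customers(orders: list[dict]) -> list[dict]:
    # Why: the sales team creates orders under the "TEST-" customer prefix to
    # check integrations, and finance asked for them to be excluded from
    # revenue reporting. Without this comment, the filter looks arbitrary.
    return [o for o in orders if not o["customer_id"].startswith("TEST-")]
```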
Documentation should also explain how to install, deploy, run, debug and uninstall the software system(s) that run the code you’ve written: code is useless if no one knows how to deploy it. This normally lives in a README file.
It is also worth documenting common issues and how to solve them, and who to ask when someone runs into an undocumented one.
Code Quality and Testing
Data Transformations should ideally be tested for both correctness and performance. It’s very hard to trust a Data Platform if you don’t test your code when you make changes. We’ll cover testing, including Data Quality, in another section.
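As a small taste, here is a hedged sketch of a correctness test: the transformation (a made-up country-code normaliser) and its expected behaviour are our own invention, and the tests would be picked up by a runner such as pytest:

```python
def normalise_country(value: str) -> str:
    """Map free-text country values onto two-letter codes."""
    aliases = {"uk": "GB", "united kingdom": "GB"}
    cleaned = value.strip().lower()
    return aliases.get(cleaned, cleaned.upper())

def test_known_aliases_map_to_gb():
    assert normalise_country(" UK ") == "GB"
    assert normalise_country("united kingdom") == "GB"

def test_unknown_codes_are_passed_through_uppercased():
    assert normalise_country("fr") == "FR"
```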
Code Quality in Notebooks
If you are using notebooks to structure your code, please at least test your notebooks by running them from top to bottom: it is easy to write notebook cells that only work when run out of order, which makes them harder for others to use and debug. We would also recommend putting all logic in functions or classes, so there is less risk of code being run out of order, as sketched below.
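A hedged sketch of that advice (the file and column names are made up): cells that only define functions carry no hidden state, and a single final cell runs the steps in an explicit order, so top to bottom is the only order that works:

```python
# Cell 1: definitions only. Defining functions has no side effects, so this
# cell can't depend on hidden state left over from earlier runs.
import pandas as pd

def load_orders(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def add_order_value(orders: pd.DataFrame) -> pd.DataFrame:
    out = orders.copy()
    out["order_value"] = out["quantity"] * out["unit_price"]
    return out

# Cell 2: a single entry point that runs the steps in an explicit order.
orders = add_order_value(load_orders("orders.csv"))
orders.head()
```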
Some data professionals would prefer never to run notebooks in production, but there are well-known examples (Google, Netflix) of doing so successfully.
How Functional Programming in Your Data Transformations can Improve Your Data Platform
A more intermediate-to-advanced software engineering concept is Functional Programming. While it’s a difficult topic to summarise, it is about making your code more repeatable and predictable, and therefore easier to test and debug. We recommend reading Maxime Beauchemin’s article on combining Data Engineering with Functional Programming for more.
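To give a flavour of the idea (the column names are made up): the first function below is pure, since the cutoff time is an explicit input and the input frame is not mutated, so the same inputs always give the same output; the second hides a dependency on the clock and mutates its argument, so reruns can silently disagree:

```python
from datetime import datetime, timezone

import pandas as pd

def tag_late_orders(orders: pd.DataFrame, as_of: datetime) -> pd.DataFrame:
    """Pure: explicit inputs, no mutation, so reruns are reproducible."""
    out = orders.copy()
    out["is_late"] = pd.to_datetime(out["due_date"], utc=True) < as_of
    return out

def tag_late_orders_impure(orders: pd.DataFrame) -> pd.DataFrame:
    # Impure: hidden dependency on the clock, plus it mutates its argument,
    # so running it twice (or tomorrow) can give different results.
    now = datetime.now(timezone.utc)
    orders["is_late"] = pd.to_datetime(orders["due_date"], utc=True) < now
    return orders
```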
Summary
Hopefully you have a better idea of what language to run your Data Transformations in, why code quality matters and how to improve the quality of your code.
Sponsored by The Oakland Group, a full-service data consultancy.