How to Build a Data Platform: Data Pipelines
We cover: Data Pipelines vs Zero ETL, ELT vs ETL, whether you need Data Orchestration, Data Integration and Reverse ETL tooling, and Data Pipeline best practices.
Note, this is part of an in-progress guide on “How to Build a Data Platform”, subscribe for future updates:
What on earth are Data Pipelines and Why Might I Need Them?
Data Pipelines are the glue that stitches together your input data, transformations and output data so you can turn raw data into curated data. A pipeline can often be thought of as a “recipe“ or, more technically speaking, a Directed Acyclic Graph (DAG), where data moves through a sequence of steps or tasks and each step only starts once the steps it depends on have completed. I will say that some data vendors are moving away from DAGs.
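To make the DAG idea concrete, here is a minimal sketch in plain Python using the standard library’s graphlib: each task only runs once the tasks it depends on have finished. The task names and print statements are purely illustrative.

```python
# A minimal sketch of a pipeline as a DAG: each task runs only after
# every task it depends on has completed. Task names are illustrative.
from graphlib import TopologicalSorter

def extract_orders():
    print("extracting raw orders")

def extract_customers():
    print("extracting raw customers")

def transform_sales():
    print("joining orders and customers into a curated sales table")

def load_report():
    print("loading the curated sales table into the reporting schema")

# Map each task to the set of tasks that must finish before it can start.
dag = {
    extract_orders: set(),
    extract_customers: set(),
    transform_sales: {extract_orders, extract_customers},
    load_report: {transform_sales},
}

# TopologicalSorter yields the tasks in an order that respects the dependencies.
for task in TopologicalSorter(dag).static_order():
    task()
```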
You probably still need Data Pipelines. While the concept of “Zero ETL” has appeared, it is most often actually “Zero EL”: few source data systems can be analysed without some transformation, and Zero ETL pipelines won’t do the transformations for you.
Zero ETL is also still a new concept, so not many data systems support it. In a reasonably large organisation you’ll likely need to extract dozens, hundreds or even thousands of data sources, so you’ll need many data pipelines to extract and load them into your analytical data stores.
What on earth are ETL and ELT and which one is best to use?
When processing data, there are two main patterns to choose from: Extract, Transform and Load (ETL) and Extract, Load and Transform (ELT). While the names are similar, the pattern you pick can have a major impact on your Data Platform or Data Product.
ETL extracts data from a source (normally operational data), transforms it and then loads it into its analytical destination:
In ETL, the transformations happen outside of the analytical storage destination; one common approach has been to use Apache Spark as the transform step for large data loads.
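As a rough illustration (not a prescription), here is what a small ETL job might look like in PySpark, where the transformation happens before anything reaches the analytical store. The bucket paths and column names are made up for the example.

```python
# A minimal ETL sketch with PySpark: transform the data *before* it reaches
# the analytical store. Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw operational data (here, CSV exports).
raw_orders = spark.read.csv("s3://example-bucket/raw/orders/", header=True)

# Transform: clean and aggregate outside the analytical destination.
daily_revenue = (
    raw_orders
    .withColumn("order_date", F.to_date("created_at"))
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write only the curated result to the analytical store.
daily_revenue.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_revenue/")
```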
ELT, however, extracts the data from the source and loads it into its destination, with transformation as the last step:
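And here is the same job sketched as ELT, using DuckDB as a stand-in analytical database: the raw data is loaded as-is first, and the transformation runs as SQL inside the store at the end. File and table names are again illustrative.

```python
# A minimal ELT sketch using DuckDB as a stand-in analytical database:
# load the raw data untouched first, then transform it with SQL inside the store.
import duckdb

con = duckdb.connect("analytics.duckdb")

# Extract + Load: copy the raw source data into the analytical store as-is.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transform: the last step happens inside the analytical store itself.
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT CAST(created_at AS DATE) AS order_date,
           SUM(amount)              AS revenue
    FROM raw_orders
    GROUP BY 1
""")
```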
Some have noted that both ELT and ETL are often really ELTEL and ETLEL, as many organisations use Business Intelligence (BI) applications like Power BI or Tableau downstream of their analytical data stores, usually loading data into the BI application in a batch transfer, because querying analytical data directly from BI applications can run into performance issues and limitations.
ETL has been around longer than ELT, though ELT is slowly replacing ETL as the default way to process data in batches. This is because ELT is, in theory, a simpler pipeline, which should require less maintenance and arguably cost less to run.
We would recommend ELT for most tasks; the only reasons we can think of to use ETL are:
Your source data cannot be copied one to one to the Analytical Database or Lakehouse (it could have a rarely used file type or be unstructured data like audio or video files).
The destination for your data cannot do analytical data transformations (for example, analytical data served to a website).
There are restrictions on processing inside your Analytical Database or Lakehouse.
Why did we use ETL in the past? Storage was far more expensive a decade or two ago, so it made more economic sense to keep raw data out of the database. Analytical processing has also come on leaps and bounds, particularly with the development of Columnar Databases and Data Lakehouses, so you can now efficiently do large data transformations and serve data in the same place.
What on earth is Data Orchestration and Why Might I Need It?
So you’ve got yourself a data pipeline (or a hundred of them); how do you orchestrate, manage and monitor them? Enter Data Orchestration Software, designed for building data pipelines in a consistent way, with the ability to monitor pipelines and alert you if they fail.
You may not need it for one simple pipeline, as you can use simple cron jobs or general workflow automation solutions like AWS Step Functions or Azure Logic Apps to schedule and run pipelines. But if you have complex pipelines that need more options around retries and monitoring, and/or a full programming language, dedicated Data Orchestration Software will suit you better.
You may not need an orchestrator at all if you are building a purely real-time streaming pipeline all the way to its final destination, as streaming data moves from transformation to transformation on its own. There are, though, user interfaces available for seeing how streaming transformations are connected.
Apache Airflow has been a longtime favourite among Data Engineers for Python-based Data Orchestration in the Modern Data Stack. It is open source, but AWS, GCP and Astronomer offer managed options too. It is also seen by some Engineers as getting a bit old and can be difficult to set up, so there are strong newer rivals in Dagster and Prefect.
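For a feel of what orchestration code looks like, here is a minimal Airflow DAG sketch (assuming a recent Airflow 2.x install, where the schedule parameter is available). The orchestrator handles scheduling, retries and dependencies around your tasks; the task bodies here are placeholders.

```python
# A minimal Apache Airflow sketch: the orchestrator handles scheduling,
# retries and dependencies around the pipeline's tasks. Task logic is illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract raw data from the source system")

def transform():
    print("transform the loaded data in the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 2,                         # retry failed tasks automatically
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform only runs after extract succeeds
```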
There are many low- and no-code Data Orchestration Software options too, such as Azure Data Factory, Rivery and Prophecy. We find these solutions are typically easier to get started with, though I feel they don’t scale as well as code-based pipelines. They usually have some integration with version control and DevOps for managing changes, but not as much as a pure-code solution, which can make them painful to update at scale.
For example, Azure Data Factory’s version control exports all configuration to JSON across a number of large files, which makes it hard to see what has changed, and managing parameters between environments is pretty horrible.
What on earth is Data Integration / ETL Tooling and Why Might I Need It?
So you’ve got your Data Pipelines, but some sources of data are proving tricky to ingest; is there any pre-built solution that doesn’t take weeks to program? There are Data Ingestion / ELT tools that are primarily used to copy data from a source system to an analytical system with low- or no-code configuration, making data ingestion and integration faster to build and easier to maintain.
If you find “Data Integration / ETL Tooling“ a bit of a mouthful, I would agree. I prefer to call products that mainly copy data from operational sources to analytical destinations “Data Integration Products”, because they are mainly integrating data into your Platform. Many people use the term “ETL Tooling” as well, as these products are often (but not always) used in ELT pipelines, though I would argue you’re typically only doing EL rather than ELT with them.
Many Data Integration / ETL tools do allow for some, or even a lot of, transformation as well, though in our experience it’s better to use a dedicated transformation product like dbt, as these typically have more customisation and better integration with other data software. dbt is free, widely used open-source software, so it doesn’t add much cost.
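One common pattern, sketched below under the assumption that dbt is installed as a CLI and the project lives at an illustrative path, is to call dbt as the transform step from a pipeline task and fail the task loudly if any model fails.

```python
# A sketch (not the only way) of running dbt as a transform step inside a
# pipeline task. The project path and model selection are illustrative.
import subprocess

def run_dbt_models():
    result = subprocess.run(
        ["dbt", "run", "--select", "staging+", "--project-dir", "/opt/dbt/analytics"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    # A non-zero exit code means one or more models failed; raise so the task fails.
    result.check_returncode()

if __name__ == "__main__":
    run_dbt_models()
```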
Though if you’ve already chosen Data Orchestration software as above, many of them come with out-of-the-box low- or no-code operators for copying data in and out of analytical storage systems (Warehouses, Lakehouses, etc.). If they cover your use cases, it may feel like overkill to buy another piece of software that duplicates something you already have.
Also, data sources that are simpler to ingest, like popular file formats or databases, can be ingested with as little as a few lines of code, again potentially providing more value for money, though you do have to maintain that code and it might be painful to scale up.
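For example, a well-behaved CSV can be pushed into a database in just a few lines of Python with pandas and SQLAlchemy; the connection string, file and table names below are placeholders.

```python
# A sketch of the "few lines of code" case: ingest a well-behaved CSV into a
# database table. Connection string, file and table names are illustrative.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@warehouse-host/analytics")

df = pd.read_csv("exports/customers.csv")                              # extract
df.to_sql("raw_customers", engine, if_exists="replace", index=False)   # load
```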
For more complex use cases, like pulling data from Salesforce or using Change Data Capture (CDC) to stream data in, Data Integration / ETL tooling can look like a more valuable proposition, as integrating those two examples by hand would require a lot of setup and maintenance.
Fivetran is a popular choice and Airbyte is an interesting open-source alternative. Note that there are a lot of choices in this area.
What on earth is Reverse ETL and Why Might I Need it?
Most Data Ingestion / ETL Software only copies data from data sources to analytical systems and not the other way round, so a new type of software has emerged called Reverse ETL, which does the opposite of common ETL/ELT processing. This is also sometimes called Operational Analytics, because you are pushing analytical data back into operational systems for their teams to analyse.
Again, the name Reverse ETL is a bit confusing, as you probably won’t do any transformations in the Reverse ETL product and will use it simply for its EL capability.
Reverse ETL typically pushes data from a Data Warehouse / Lakehouse into typical sources of data like Salesforce, Hubspot, Anaplan, etc. This is useful when some of your Data Warehouse transformations are worth putting back into source systems (deduplicated data, higher quality data, calculated metrics, etc.).
You can write your own code to push data into operational sources, but again, you’d have to maintain it.
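If you did roll your own, it might look something like the sketch below: read curated rows from the warehouse (DuckDB as a stand-in) and push them to an operational system over a hypothetical REST API. The endpoint, token and table names are placeholders, and real systems like Salesforce or Hubspot have their own APIs, rate limits and bulk endpoints to deal with.

```python
# A sketch of hand-rolled "Reverse ETL": push curated rows from the warehouse
# to an operational system. The endpoint, token and table are hypothetical.
import duckdb
import requests

con = duckdb.connect("analytics.duckdb")
rows = con.execute(
    "SELECT customer_id, lifetime_value FROM curated_customers"
).fetchall()

for customer_id, lifetime_value in rows:
    response = requests.patch(
        f"https://crm.example.com/api/customers/{customer_id}",  # hypothetical endpoint
        json={"lifetime_value": lifetime_value},
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly if the operational system rejects the update
```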
Reverse ETL products are sometimes described as “lightweight Master Data Management (MDM)“, as they move data back to the source like an MDM product would, but unlike full MDM products, they leave it up to the data team to turn their data into a single source of truth or golden record. This can be a cheaper alternative to MDM if you don’t need third-party software to create that golden record / single source of truth. We will cover MDM in Data Modelling later on.
Examples of Reverse ETL software are Hightouch and Census, with a lot more to choose from.
Data Pipeline Best Practices
So now you know what Data Pipelines are and the common components used to orchestrate and move data in them. But what is the best way to build and maintain pipelines? Here are some best practices to help you:
For more reliable pipelines and data, have them fail gracefully (see the retry sketch after this list):
Retry on failure
Alert on failure
An if condition that processes data differently on failure
Enable all or most monitoring options you have available. Yes, capturing pipeline logs and metrics costs money, but you’ll likely spend more on debugging if you don’t have logs.
If you are running multiple transformations in a Data Pipeline task, log when each transformation starts and ends: if you have an error, you know exactly where it occurred.
If using a Star Schema, process the dimensions before the facts so you can generate the identifiers the facts need.
Use version control so you can easily see what has changed when the pipeline is updated.
Keep your pipeline tasks as small as possible for quicker and easier testing; like a function or method in programming, they should ideally do only one thing well.
Again, for easier testing and debugging, try to design your pipeline tasks so they can be run on their own, outside the pipeline. No one wants to run an entire 2-hour pipeline when the issue involves a 10-minute task.
Try to design your pipeline tasks so they can be run many times and still achieve the same result if the input data is the same; this is known as idempotence (a sketch follows this list). It’s hard to test and debug if you’re getting different results on every run.
Parametrize values in your pipelines that you change often. For a simple example, to switch easily to a test database for testing, make the database name a parameter, then store your parameters in configuration files, in this case a test config file and a production config file, so you can load the test config when testing. See this example on how to parametrize pipelines, but this can be configured in any pipeline in theory; there is also a short sketch after this list.
Have a set of Data Quality tests in your pipeline: enough tests to give you confidence, but often you have to balance this with making data available in a timely manner. We’ll have a separate deep dive into Data Quality later.
Add timeouts to all tasks, usually at about double to triple your normal running time. Excessively long run times likely mean something has failed (a connection failure or duplicates, for example), so you are often burning money for no good reason by leaving tasks running forever. You can also set alerts for when a task has run too long.
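Here is the retry sketch mentioned in the first practice above: a small wrapper that logs when each task starts and ends, retries a failing task, and raises an alert if it still fails. The send_alert function is a hypothetical stand-in for whatever alerting channel you use (email, Slack, PagerDuty, etc.).

```python
# A minimal sketch of failing gracefully: log each attempt, retry the task,
# and alert if it still fails. send_alert is a hypothetical stand-in.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def send_alert(message: str) -> None:
    logger.error("ALERT: %s", message)  # replace with your real alerting hook

def run_with_retries(task, max_retries: int = 3, delay_seconds: int = 60):
    for attempt in range(1, max_retries + 1):
        try:
            logger.info("starting %s (attempt %d)", task.__name__, attempt)
            result = task()
            logger.info("finished %s", task.__name__)
            return result
        except Exception as exc:
            logger.warning("%s failed on attempt %d: %s", task.__name__, attempt, exc)
            if attempt == max_retries:
                send_alert(f"{task.__name__} failed after {max_retries} attempts")
                raise
            time.sleep(delay_seconds)
```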
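And the idempotence sketch: by deleting a day’s partition before inserting it (DuckDB as a stand-in warehouse, illustrative table names), re-running the task for the same date produces the same result rather than duplicate rows.

```python
# A sketch of an idempotent load: delete the day's rows before re-inserting
# them, so re-running the task for the same date gives the same result.
import duckdb

def load_daily_revenue(con: duckdb.DuckDBPyConnection, run_date: str) -> None:
    con.execute("DELETE FROM daily_revenue WHERE order_date = ?", [run_date])
    con.execute(
        """
        INSERT INTO daily_revenue
        SELECT CAST(created_at AS DATE) AS order_date, SUM(amount) AS revenue
        FROM raw_orders
        WHERE CAST(created_at AS DATE) = ?
        GROUP BY 1
        """,
        [run_date],
    )
```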
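Finally, the parametrization sketch: the same pipeline code reads its database name from a per-environment config file, chosen by an environment variable. The file names, keys and environment variable are illustrative.

```python
# A sketch of parametrizing a pipeline via config files: the same code runs
# against test or production depending on which config is loaded.
import json
import os

def load_config() -> dict:
    # e.g. config.test.json -> {"database": "analytics_test"}
    #      config.prod.json -> {"database": "analytics"}
    environment = os.environ.get("PIPELINE_ENV", "test")
    with open(f"config.{environment}.json") as f:
        return json.load(f)

config = load_config()
database_name = config["database"]  # used wherever the pipeline needs the database
print(f"running pipeline against {database_name}")
```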
Summary
Whew, that was a lot covered. Next we’re going to look at what often goes into Data Pipeline tasks: Data Transformations. We’ll cover SQL vs Python vs other languages in data, plus code quality. Subscribe if you enjoyed this post or are looking forward to the next one!
Sponsored by The Oakland Group, a full service data consultancy.
Photo by Peter Herrmann on Unsplash