Airflow seems to be the PHP of the data ecosystem: widely mocked and looked down upon, called “legacy” by its competitors, even though it arguably has more community support and powers a larger proportion of Data Platforms today than any of them. This might give you the impression that Airflow isn’t worth bothering with in 2023, but this article will at least make the case for considering it, for the reasons below.
And in some ways I get it - Airflow is older than most of its competitors and it shows, especially in its User Interface (depending on who you ask). There have also been changes to how Data Platforms can be built, which have reduced the need for one Data Pipeline to rule them all, making Data Workflow Orchestrators a potentially less essential part of a Data Platform.
But sometimes it’s better to be boring and go with the market leader.
So I’ll discuss what general threats there are to Data Workflow Orchestrators and also compare Airflow to its Pythonic competitors: Prefect, Dagster and Mage.
What I’m not going to do is compare low-/no-code tools against Airflow in this article, though I will say that after trying to work with low-code Workflow Orchestrators in large Data Platforms (>50 pipelines), I much prefer having my pipelines in Python, with full version control, over having no or limited version control, or having to version control thousands of lines of JSON.
But before that, let’s address the first elephant in the room:
Is Any Type of Data Workflow Orchestrator Needed?
Yes, if you have lots of batch or low-frequency events that need to be coordinated, especially if they form a complex pipeline. Though there are threats to this pattern:
Zero ETL: This can reduce the number and/or complexity of pipelines, reducing the need for a service like Airflow. While I see Zero ETL as a potentially useful timesaver, it only works if there is an existing connection between your operational and analytical data stores, and you’ll almost certainly still need transformations to turn operational data into analytical outputs, which may need to be coordinated by a workflow orchestrator.
Airflow is not suited to streaming: I think this is the biggest threat to Airflow and its competitors, but real-time streaming would need to become viable for most organisations for almost all of their data. In my opinion, that will take a while to come to pass, if it ever does. There will always be spreadsheets and reference files to upload to a Data Platform, and many data producers still offer no streaming capability. There are also a number of other real-world issues with streaming compared to batch, which really deserve another 1,000-word article.
No-/low-code: While I said I wouldn’t compare them with Airflow, I should point out that they do make sense for a small number of simple pipelines, especially if you don’t feel confident writing Python (Airflow doesn’t require expert knowledge of Python, but novices may struggle). I still wouldn’t use them for lots of complex pipelines, as mentioned before.
Decentralisation: Data Meshes have rapidly grown in popularity in the last few years; the idea is to have not one central data repository, but one per business domain. In theory this should make pipelines less complex, so there is arguably less need for a Data Workflow Orchestrator. However, I have seen many business domains with dozens of source systems, only a minority of which have streaming capability, so I think there is still a need here, though maybe a smaller one.
So there are threats from multiple directions, but the reality today is that many data producers are still pushing out CSV files to be ingested and transformed, requiring a traditional batch process and an orchestrator to manage it all.
So What’s People’s Problem With Airflow?
Complex to set up, especially outside of a cloud managed service.
Complex to maintain.
Airflow is not suited to streaming pipelines.
No deep integration with dbt. Well, at least out of the box…
Airflow XComs can’t handle large amounts of data.
Doesn’t have the instant feedback that Mage has with its notebooks.
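On the XCom point above: XComs live in Airflow’s metadata database and are meant for small payloads only. The standard workaround is to write large data to shared or object storage and pass just a reference through XCom. Here is a minimal sketch of that pattern in plain Python (no Airflow dependency; the task functions and file names are illustrative, not part of any real DAG):

```python
import json
import os
import tempfile


def extract_task(tmp_dir: str) -> str:
    """Write a large dataset to shared storage and return only its path.

    In a real DAG, this return value would travel through XCom,
    keeping the metadata database free of bulk data."""
    rows = [{"id": i, "value": i * 2} for i in range(10_000)]
    path = os.path.join(tmp_dir, "extract_output.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    return path  # a small string: safe to push through XCom


def transform_task(path: str) -> int:
    """Read the dataset back by reference and aggregate it."""
    with open(path) as f:
        rows = json.load(f)
    return sum(r["value"] for r in rows)


tmp = tempfile.mkdtemp()
ref = extract_task(tmp)        # downstream task receives only the path
total = transform_task(ref)
```

Airflow also supports pluggable custom XCom backends (e.g. backed by S3 or GCS) that automate exactly this reference-passing for you.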
In my opinion, all of the above criticisms are valid, hence the more than a few blogs declaring the death of Airflow; however, there are a few areas where I feel Airflow still wins out today:
Managed Cloud Services are the Best Choice (Or Easiest Choice)
Airflow has managed services on AWS and GCP, and one in preview on Azure. These tackle the first two issues above to an extent.
While I feel some unease handing our entire cloud budget to AWS’s Jeff Bezos to fund his missions to the moon, the non-cloud managed services offer fewer integrations and come with a separate support and licensing team to interface with.
Not to mention, it’s hard to convince non-technical people why they should use a company they’ve never heard of, for Data Engineering reasons they don’t fully understand. Yes, you can build a strong business case to show value for money, but even then it can be an uphill political battle, so it’s tempting to take the easy way out.
Airflow’s cloud managed services are not perfect: AWS’s has autoscaling issues, GCP’s can be slow to start any dbt job, and Azure’s is still in beta with many networking features missing.
Also, it’s not hard to deploy Prefect, Dagster and Mage on the cloud if you feel comfortable with Terraform.
Finally, the more tied you are to a cloud provider, the harder it is to change, which can make a completely cloud neutral service seem like a viable option - though Airflow even has an option here, in Astronomer.
The Enterprise Security Tax
Airflow is Open Source Software in almost the purest sense; it does not hold back any enterprise options, unlike some of its competitors.
This means security features like Role-Based Access Control (RBAC) and Single Sign-On (SSO) cost extra with those competitors, sometimes a lot extra. Prefect, to its credit, does best here by actually publishing a price for SSO and RBAC, but that starts at $450 a month, potentially costing more than your Airflow deployment before you even add users at $79 a month each and pay for server costs. Dagster doesn’t publish a price for Enterprise SSO at all (so let’s assume it’s more expensive), and Mage only supports LDAP.
Most security reviews I’ve been through have mandated SSO and RBAC, among other security controls.
Airflow’s SSO does require extra configuration to set up, probably more than its competitors’, so bear that in mind, though it will likely still be value for money if you feel confident setting it up. Airflow also has free, deep RBAC integration.
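To give a feel for what that extra configuration looks like: Airflow’s webserver authentication is handled by Flask AppBuilder, configured via a `webserver_config.py` file. Below is a rough sketch of switching on OAuth-based SSO; the provider name, URLs, and scopes are placeholders (your identity provider’s documentation will have the real values), so treat this as an outline rather than a working config.

```python
# webserver_config.py - sketch of enabling SSO via OAuth in Airflow's
# Flask AppBuilder layer. Provider details below are placeholders.
from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH                   # switch from the default DB login to OAuth
AUTH_USER_REGISTRATION = True            # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = "Viewer"   # least-privilege default RBAC role

OAUTH_PROVIDERS = [
    {
        "name": "okta",                  # hypothetical provider for illustration
        "token_key": "access_token",
        "icon": "fa-circle-o",
        "remote_app": {
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "api_base_url": "https://example.okta.com/oauth2/v1/",
            "client_kwargs": {"scope": "openid profile email groups"},
            "access_token_url": "https://example.okta.com/oauth2/v1/token",
            "authorize_url": "https://example.okta.com/oauth2/v1/authorize",
        },
    }
]
```

Mapping identity-provider groups onto Airflow’s RBAC roles takes further work on top of this, which is where most of the setup effort tends to go.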
Finally, it’s worth mentioning that Airflow can be deployed in your cloud or on an on-premise server without missing any features, paying large enterprise-tier costs, or having to communicate with a control plane over the public internet. You may find, however, that the work required to configure and maintain Airflow on your own outstrips those enterprise costs, so it’s worth comparing the two.
Airflow Operators - The Sharpest Double Edged Sword
Airflow Operators are the actions or tasks that make up an Airflow pipeline, most commonly copying data from one data store to another or transforming data.
On the one hand, Airflow Operators are amazing: they come free with the Airflow service, and there are over 1,000 to choose from. On the other, they are not as modular as they could be, especially when it comes to copying data.
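The modularity complaint comes from how transfer operators are shaped: each one hard-wires a specific source to a specific destination (e.g. S3-to-Redshift), so every new pairing means a whole new operator rather than composable extract and load halves. A rough illustration of that shape in plain Python (the class names are hypothetical and this deliberately omits any real Airflow or cloud dependency):

```python
class S3ToPostgresOperator:
    """Sketch of the transfer-operator shape: one class per
    (source, destination) pair, rather than composable halves."""

    def __init__(self, s3_key: str, table: str):
        self.s3_key = s3_key
        self.table = table

    def execute(self) -> str:
        # A real operator would download from S3 and COPY into Postgres;
        # here we just report what would happen.
        return f"copy {self.s3_key} -> {self.table}"


class GCSToPostgresOperator(S3ToPostgresOperator):
    """Same destination, different source: still needs its own class."""

    def execute(self) -> str:
        return f"copy gcs://{self.s3_key} -> {self.table}"


op = S3ToPostgresOperator("raw/orders.csv", "analytics.orders")
result = op.execute()
```

With N sources and M destinations you can end up needing on the order of N×M operators, which is the trade-off against generic data integration tools.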
Also, you need to write code, potentially lots of code, to integrate your source data with your destination in Airflow, whereas a no- or low-code data integration solution can be many times quicker and cheaper, or many times slower and costlier, depending on your situation. I cover the pros and cons of this in more detail in one of my earlier newsletters:
I personally think Airflow Operators are worth trying out to see if they work for you before spending money on another tool (imagine paying enterprise-tier pricing for both a workflow orchestrator and a data integration tool!).
Airflow’s competitors do also have integrations, but far fewer in number (Prefect, Dagster; I couldn’t find a page for Mage). So you could end up saving quite a lot of time and money taking advantage of Airflow Operators compared to the other options.
Summary
Airflow isn’t the only game in town; in fact, I’ve deployed Prefect to clients a few times with great success and loved using it. I’ve also been following Dagster and Mage closely, and both offer interesting new ideas on building Data Pipelines that can help boost Data Engineers’ productivity.
That said, there are times when Airflow makes the most sense for an organisation, even in 2023, particularly ones with strong security requirements and large Data Platforms to manage. I do think streaming pipelines could well replace it and its competitors in, say, 5 to 10 years, or even now if you’re lucky enough to be able to make heavy use of streaming technologies.
I will say that a lot of Airflow’s benefits right now come down to inertia and items that don’t directly benefit Data Professionals, but when assessing a software product, you should take all aspects into account, not just its headline features.
Sponsored by The Oakland Group, a full service data consultancy.