The Data Platform Journal Issue #5
This week: Beware of Fivetran? Things that Databases Don't Do, How to become a Data Reliability Engineer, Data Platform Challenges and more!
Good Monday afternoon all, this week we have:
Should you be wary of Fivetran or any data ingestion software?
Things that databases don’t do
How to become a data reliability engineer
Data platform challenges
End-to-end data discovery, observability, and governance on open-source DataHub
What's Fundamentally Wrong with Modern Data Stack?
Beware of Fivetran?
The above Reddit post on the data engineering subreddit had me conflicted: based on my personal experience of implementing Fivetran and similar data ingestion software, I’ve found that buying external software for data ingestion can be either a massive time saver or a massive waste of time. It really depends on how you use it and what your requirements are.
But here are my thoughts on the subject:
Not all data connectors are built the same: you may find one data connector in data ingestion software that works very reliably and another in the same software that is total junk. Test before you commit long-term to buying.
Airbyte is a highly customisable, open source alternative to Fivetran: you can use the UI or Python to build your own connector. Do note that Airbyte only has a small subset of connectors (only 3 destinations!) in General Release as of March 2023.
Fivetran and similar solutions often scale seamlessly up and down; does your custom connector do the same? Does it need to?
Fivetran claims it can even save on data storage costs by applying transformations before reaching your data storage.
In my experience, some data sources are really painful and time consuming to connect to without a preconfigured solution like Fivetran: these tend to be software with large, complicated APIs, such as Salesforce and Oracle NetSuite.
A common way to stream real-time data from databases to analytical storage is Change Data Capture (CDC). Setting up CDC outside of paid software can be done with Debezium, but it requires some technical experience and likely more time to set up and maintain. It also requires paying for a server or container to run Debezium.
Fivetran also supports Teleport Sync as an alternative to CDC, for when you can’t access the source database logs.
It can be quicker to connect data systems together with free Python libraries if the connection configuration is simple and you have experience doing so. Some connectors I’ve written in the past are only 5 to 10 lines of code and haven’t needed changing in 5 years. Others took months to build and needed lots of changes.
If you’re using a data workflow orchestrator like Airflow, Prefect or Azure Synapse, you may find it better value for money to use one of their built-in connectors rather than buying another piece of software. Airflow has over 100 providers, for example.
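To give a feel for the "5 to 10 lines" kind of connector mentioned in the list above, here is a minimal sketch using only the Python standard library. It is an illustration, not one of my production connectors: the function name load_csv_to_sqlite, the orders table and its id and amount columns are all made up for this example, and SQLite stands in for whatever destination you actually use.

```python
import csv
import io
import sqlite3


def load_csv_to_sqlite(csv_text: str, conn: sqlite3.Connection) -> int:
    """Load CSV rows into a hypothetical 'orders' table; return rows read."""
    # Parse the CSV and cast each field to the type the table expects.
    rows = [
        (int(r["id"]), float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE makes re-running the same load safe (no duplicates).
    conn.executemany("INSERT OR REPLACE INTO orders (id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Keying the load on the primary key means a retry after a failure simply overwrites the same rows, which is a large part of why a connector this small can run for years without changes.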
In summary, I think using Fivetran or similar is a good first choice to get a data connection working fast if you can’t easily connect to the data source by existing means. You should still test thoroughly to check it works for you, and watch the costs as you scale, as paying by the row can make costs grow wildly out of control if you are not careful.
In a large organisation or data platform, expect to use a mix of software here, as in our experience it’s rare that one tool or technique will cover all requirements.
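For readers new to the Change Data Capture idea mentioned above, the sketch below shows the core of it: replaying a stream of row-level change events against a copy of a table. The event shape (an op code plus before and after row images) is loosely modelled on the Debezium change event envelope, but it is heavily simplified, and the apply_changes function and the in-memory dictionary replica are assumptions made purely for illustration.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(
    replica: Dict[Any, Row], events: Iterable[Dict[str, Any]], key: str = "id"
) -> Dict[Any, Row]:
    """Replay simplified CDC events onto an in-memory replica keyed by `key`."""
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Create, snapshot read or update: keep the latest row image.
            row = event["after"]
            replica[row[key]] = row
        elif op == "d":
            # Delete: only the 'before' image carries the key.
            replica.pop(event["before"][key], None)
        else:
            raise ValueError(f"Unknown operation: {op!r}")
    return replica


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
replica = apply_changes({}, events)
```

A real pipeline would read these events from a log or message queue and write to analytical storage, and that surrounding plumbing is where most of the setup and maintenance effort goes.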
Things that Databases Don't Do
Databases and SQL are wonderful tools, but they can’t do everything we want (should they?). Gwen Shapira at Nile, who are a database start-up, does a great job explaining what databases might be missing, such as version control, built-in soft deletes and composable APIs.
How to Become a Data Reliability Engineer
One of the top issues in data is making it more reliable and of higher quality, so I can see a greater need for specialist roles in data quality, such as Data Reliability Engineer. Kyle Kirwan, CEO of BigEye, who make a data monitoring product, lists the skills you need to become one, such as end-to-end data monitoring, setting data quality standards and reducing manual fixes (toil).
What are the challenges of building a data platform? (Sponsored Content)
My colleagues at The Oakland Group have written an excellent list of challenges and risks when building a data platform, as well as the solutions you can use to mitigate them.
End-to-End Data Discovery, Observability, and Governance on open-source DataHub
This is an article I was wanting to write myself, but now I can link to this one!
I’ve been trialing DataHub for the past few months and I’ve found it enjoyable to use, though like all open source software, you can’t just drop it in and expect it to work for you (it has a paid managed option too).
While the article was written by an AWS architect, DataHub can be deployed on other clouds. I also feel they left out DataHub’s best feature: it can show test results of datasets in the UI alongside schemas and documentation.
Podcast: What's Fundamentally Wrong with Modern Data Stack with Lauren Balik
Reading Lauren’s blog and Twitter is a wild and sweary ride with a very pessimistic bent on the current data market. This is likely because they have been firefighting issues in lots of companies for the last two years as the owner of a data consultancy.
While I don’t often agree with them, having a voice of caution in the current economic climate is worthwhile. The podcast is less wild, but still very insightful, covering many topics and pains in the data landscape.
Sponsored by The Oakland Group, a full service data consultancy. We’re hiring!