Issue #14: Microsoft takes aim at the Modern Data Stack (Again)
Also Data Council 2023 now on YouTube, SQL vs Python, the Hype behind DuckDB and The State of Data Engineering
This week we have:
Microsoft Fabric Announcement
Data Council Talks on YouTube
How to Build a Data Platform: Data Transformations
What’s the hype behind DuckDB?
The State of Data Engineering in 2023
How to Learn in a Rapidly Changing World
Microsoft Fabric
Having attempted it only three years ago with Synapse Studio, Microsoft is having another go at building the “Data Platform in a Box”, with a flurry of exciting new features. Do note it is in preview, though: many users complained Synapse was undercooked at launch, so no-one should be using Fabric in production for another 6 to 24 months.
Here are my early thoughts on what looks good and bad about Fabric:
The Good
Having just one file format for all data processing is an excellent choice, and Delta Lake, an open-source format commonly used amongst Data Engineers, is a good pick for it. This should, in theory, allow interoperation with non-Microsoft data tools as well.
Power BI can read Delta Lake directly, and Direct Lake allegedly combines the best of both worlds: the real-time reads of DirectQuery and the performance of Import mode. This is potentially a big deal, as I’ve mentioned before that many ETL/ELT workloads were actually ETLEL / ELTEL: you currently often have to copy data into Power BI to get the best performance, increasing the time from source to report. This could be a major time saver and I’m very keen to test it.
Integration with Windows for OneLake could be a big deal, as you can now potentially see all the data files floating about the business in a Data Lake, which should allow business teams to work better with Data and IT teams. Hopefully we’ll see more seamless integration with Excel and SharePoint soon as well.
Synapse Spark pools are no longer terrible, and the Synapse Warehouse is dead, reborn as a Lakehouse. It’s arguably not as good as Databricks (no Photon runtime, no dbt support, etc.), but you can still use Databricks with OneLake anyway.
Better version control for a number of components, in particular Power BI. It will, however, likely output everything in one big JSON file that will be a pain to read through for large Power BI reports, but hey, we’re making steps in the right direction!
Single layer of security: arguably this could be done right now by bunging everything in one Resource Group in Azure, though this should have a more seamless experience and hopefully we don’t need to make a Managed Identity or Service Principal for every connection in the Microsoft data stack.
The Bad
Pricing looks to be moving away from Pay as You Go towards capacities that are not well explained yet; there are rumours of prices starting at $300 per month.
$300+ a month for 0.25 CPU doesn’t sound like great value for money, though you do get a lot of features.
It might be harder now to pick the small number of components of Microsoft’s data stack that you want to use: you’re paying for all or nothing. This could be very problematic in the future for a significant number of users who just want to use Azure Data Factory with Databricks or Snowflake.
Support for using a programming language for data pipelines is still very limited, and there is no mention of Airflow support in Fabric Data Factory. I realise this is maybe a niche issue, but managing the DevOps of Azure Data Factory was a nightmare once you had dozens of pipelines, as you had to wade through thousands of lines of undocumented JSON to find out what had changed. I just want to write my pipeline jobs in Python, please.
Why do I have to learn another language, Kusto, for Real Time Analytics? Just adopt SQL, Python or Java like everyone else, Microsoft.
While Fabric has integration with Purview, there have been no major updates for it: I’ve heard many complaints from clients about Purview having no Data Quality functionality and/or being difficult to configure, so I feel like Microsoft missed a trick, especially as Data Governance feels like a growing concern amongst organisations. This isn’t helped by the fact that Databricks and Snowflake have much more support and integrations amongst Data Governance products than Synapse and Data Factory.
One worry I have is that I couldn’t find much networking documentation for Fabric, though I realise components like Private Endpoints tend to be added later on when Microsoft previews products. I’m hoping this also means the death of Power BI Gateways and Data Factory Self-Hosted Integration Runtimes, as managing these on manually built Windows VMs always felt like an unnecessary hassle.
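On the Python-pipelines point above: the appeal is that a pipeline defined in code is readable, diffable and reviewable in a way generated JSON isn’t. A toy sketch in plain Python (no orchestrator — the function names and data are invented, but Airflow-style tools follow the same shape):

```python
# Each step is a named, individually testable function; the pipeline
# itself is just ordinary code you can read and code-review, unlike
# thousands of lines of generated JSON.
def extract():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

def transform(rows):
    # Keep only the larger transactions
    return [r for r in rows if r["amount"] > 15]

def load(rows):
    print(f"loading {len(rows)} rows")
    return len(rows)

def run_pipeline():
    rows = extract()
    rows = transform(rows)
    return load(rows)

run_pipeline()  # loads 1 row
```

A change to the filter threshold here would show up as a one-line diff in version control, which is exactly what Data Factory’s JSON makes painful.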
In summary, I feel this is mostly aimed at organisations with low technical maturity, which is great, as right now it’s hard for those organisations to figure out how to join up all the elements of the Modern Data Stack. I also know some have been disappointed by the current state of the Modern Data Stack and may welcome this effort by Microsoft.
I think organisations with high technical maturity will be less interested, as they’ll already have their existing, more flexible, higher-performance Modern Data Platforms.
Personally, I like all the new features that are coming, but I worry about the lock-in and the value for money; I suspect a few organisations could fall into the 10% trap if they are not careful. I get why Microsoft did it and who it’s aimed at, and I note OneLake should be accessible to other vendors, but I would have preferred to also have a Pay as You Go option for each of Fabric’s services.
Further Reading:
Benn Stancil makes a good point that Microsoft is building a “Data OS“ and sees a new battle emerging: the Modern Data Stack vs all-in-one solutions like Fabric.
Data Council 2023 Conference Talks are now on YouTube
Data Council conferences are known for being low on marketing buzzwords and fluff and high on technical detail, so they often rank highly amongst Data Professionals.
It was also expensive to attend, especially for people like me who live in Europe, who would have had to pay £2-3k to go. So it’s great to see all 70+ talks uploaded online, where I can watch them while doing the dishes at home. They include talks from vendors like AWS, Microsoft and Confluent, but also from well-known data content creators like Chad Sanderson and Seattle Data Guy.
How to Build a Data Platform: Data Transformations
This week for my “How to Build a Data Platform“ guide, I tackle the most controversial topic in data: when is it best to use SQL or Python for data transformations?
I also discuss Code Quality which is only slightly less controversial!
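For a flavour of the SQL-vs-Python trade-off, here is the same aggregation done both ways, using only Python’s built-in sqlite3 module (the table, columns and data are invented for illustration):

```python
import sqlite3

rows = [("EU", 10), ("US", 25), ("EU", 30)]

# SQL version: declarative, and trivially pushed down to a warehouse
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# Python version: imperative, but easy to unit test and refactor
py_totals = {}
for region, amount in rows:
    py_totals[region] = py_totals.get(region, 0) + amount

assert sql_totals == py_totals  # {"EU": 40, "US": 25}
print(sql_totals)
```

Both produce identical results; the controversy is about which reads better, tests better and scales better for your team, not about capability.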
The next section will be on Architecture, which could be the biggest section of them all, so it might take a week or two longer to finish than normal.
What's the hype behind DuckDB?
Another DuckDB article, Jake?! Yes, but I’m sharing this because I love the tutorial at the end on how to partition data in DuckDB, which looks very nice. If you haven’t yet done much research on DuckDB, then the whole article by Matt Palmer is a great read.
The State of Data Engineering in 2023
Two articles came out last week on what the Data Engineering landscape looks like right now:
Neither of the above is what I’d call the definitive view of Data Engineering (I don’t believe more Engineers are using Airbyte than Fivetran), but each contains interesting insights: Airbyte’s survey notes Pandas is still popular in the face of increased competition, and LakeFS considers what, if anything, could replace the Hive Metastore.
Joe’s Nerdy Rants
One of the authors of “Fundamentals of Data Engineering” has a weekly newsletter! I love the post (rant?) in this issue, “How to Learn in a Rapidly Changing World“, in particular this section:
“Jeff Bezos has a handy illustration of the True North for Amazon - focus on what doesn’t change. For Amazon, True North means customers will always want better deals on high-quality products delivered as quickly as possible. I'll be waiting if you can show me a world where customers want crappy and overpriced products delivered as late as possible.
For data professionals, I see True North as data needs to be as accurate and believable as possible, delivered as quickly and seamlessly as possible, promoting the best data-driven decisions and outcomes. Please let me know if you can show me a world where people want the opposite.“
I think these are excellent values to work towards.
Sponsored by The Oakland Group, a full service data consultancy.