Issue #31: Microsoft Fabric is Generally Available - Should You Adopt It?
Also: Text-to-SQL LLMs and Knowledge Graphs vs. Semantic Layers, The State of Streaming, Why You Need a Data Catalog to Build Data Products, and Automatic Data Platform Optimisation.
Hello all, I’m back after an unplanned break in writing due to a very busy period at work, and honestly, I suspect December will be just as busy. I hope to be back to a regular schedule in January.
Microsoft Fabric is Generally Available - Should You Adopt It?
Investing in Knowledge Graphs Provides Higher Accuracy for LLM-Powered Analytics Systems
Semantic Layer as the Data Interface for LLMs
Data Explained: The State of Streaming
Why You Need a Data Catalog to Build Data Products
Automatic Data Platform Optimisation
Microsoft Fabric is Generally Available - Should You Adopt It?
So Microsoft Fabric, an all-in-one data platform, has been generally available for a few weeks, and as a Microsoft Partner with five years of personal experience designing and deploying data solutions in Azure, I have opinions.
I had some issues with Fabric when it was announced, and, to be fair, Microsoft has since addressed most of those concerns, either in the current product or on the future roadmap (which I’ll cover later).
One thing I should mention early on: I’m asking whether you should adopt Fabric as a data platform, not as a Business Intelligence (BI) solution. The latter is a far easier question to answer, as Fabric has adopted most, if not all, of the features of Power BI, which is arguably the most popular BI application on the market right now and one I’ve had great success using for clients in the past.
Not quite best-in-class security
The consensus among Oakland engineers is that Fabric is great for proofs of concept and for low- to medium-data-maturity organisations that aren’t too fussed about having the best security.
Why? Fabric has no private link support, and most enterprise cloud security teams on Azure mandate private links, partly because Microsoft itself says including them in all data solutions is security best practice. If that describes your organisation, you’ll likely stick with Synapse or Databricks on Azure.
Fabric security is still very good, but not the best.
It all depends on the data maturity of the organisation
As I mentioned above, Fabric is a great option for low-data-maturity organisations, as it’s very easy to adopt if you already have Microsoft, Office 365, or Azure accounts. This may explain why I’ve seen lots of excitement among Power BI and Power Platform users: Fabric massively expands their data capabilities without requiring them to learn cloud infrastructure.
I’ve also seen a lot of negative opinions about Fabric from experienced data engineers due to its focus on no- and low-code tooling. Most experienced data engineers (myself included) just want to write SQL and/or Python without being limited by their tools, and Fabric doesn’t allow that in Pipelines or real-time analytics (yet).
For example, Microsoft has been spending a lot of time adding features to its low-code dataflows and pipelines, whereas I would prefer it added more support for managed Airflow or something similar, bringing Fabric closer in line with GCP and AWS (the sketch below shows the kind of code-first workflow I mean).
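To make that concrete, here’s a minimal sketch of the code-first orchestration experience I’d like Fabric to offer as a managed service, using Apache Airflow’s TaskFlow API. The DAG, table names, and the `warehouse` connection are hypothetical placeholders, and this assumes the common-sql provider is installed:

```python
# A minimal, hypothetical daily pipeline: plain SQL plus plain Python,
# no drag-and-drop designer required.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(schedule="@daily", start_date=datetime(2023, 12, 1), catchup=False)
def daily_sales_refresh():
    # Step 1: a SQL task against a warehouse connection.
    load_raw = SQLExecuteQueryOperator(
        task_id="load_raw_sales",
        conn_id="warehouse",  # hypothetical Airflow connection id
        sql="INSERT INTO staging.sales SELECT * FROM landing.sales_files;",
    )

    # Step 2: plain Python for anything SQL can't express cleanly.
    @task
    def validate_row_counts() -> None:
        print("row-count checks would run here")

    load_raw >> validate_row_counts()


daily_sales_refresh()
```

This is exactly the sort of thing AWS (MWAA) and GCP (Cloud Composer) already offer as managed services.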
This is also the second Microsoft data product refresh in a few years, and Azure data engineers are weary of adapting to yet another suite of products rather than seeing existing products that haven’t been out very long (Synapse) catch up with the competition (Databricks, Snowflake, etc.).
Also note that there isn’t full automation support right now, so platform admins might be tearing their hair out over the prospect of managing Fabric on a large scale.
Roadmap
Private link support is on the roadmap for 2024, though, and speaking of the Fabric roadmap, I’m quite excited about what’s in there:
Better Data Factory / Pipelines git integration
The current iteration converts everything to JSON, which is awful to review in pull requests and painful to merge
It also has issues when using Infrastructure as Code (Terraform, Bicep, etc.)
SQL for real-time analytics
I’m hoping “On-premises data gateway (OPDG)” means we won’t have to build yet another virtual machine just so Pipelines / Data Factory can connect to on-premises sources.
Better automation support everywhere (SDK, REST API)
This feels a bit “jam tomorrow”: I’ve been burned in the past by great-sounding features turning out to be rubbish or never arriving, but I’m cautiously optimistic about Fabric’s future.
Investing in Knowledge Graphs Provides Higher Accuracy for LLM-Powered Analytics Systems
As you can see in the table below, Knowledge Graphs can deliver a massive boost to text-to-SQL query accuracy:
What is interesting about this paper is that it was evaluated on a somewhat realistic 13-table schema* based on the insurance industry rather than some noddy one-table dataset. Major kudos to the authors for putting in the effort and money to build it.
That said, it only achieves 71% accuracy on simple queries, which says to me that LLM outputs still need an expert to double-check them, so I think we’re still a little way from a CFO getting analytical insights from text prompts alone.
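For readers unfamiliar with the technique, the core idea is roughly this: instead of handing the LLM raw table definitions, you serialise the knowledge graph’s ontology (entities, relationships, business meanings) into the prompt so the model can ground its SQL. Below is a minimal sketch of that idea; the mini-ontology and schema are entirely hypothetical, and this is not the paper’s actual pipeline:

```python
# Minimal sketch: grounding text-to-SQL in knowledge-graph context.
# The ontology below is a hypothetical toy, not the paper's benchmark.

# A tiny hand-written ontology: entities, columns, and relationships.
ONTOLOGY = {
    "Policy": {
        "table": "insurance.policy",
        "columns": {"policy_id": "unique policy number", "premium": "annual premium"},
        "relations": [("HELD_BY", "Customer", "policy.customer_id = customer.customer_id")],
    },
    "Customer": {
        "table": "insurance.customer",
        "columns": {"customer_id": "unique customer id", "region": "sales region"},
        "relations": [],
    },
}


def ontology_to_context(ontology: dict) -> str:
    """Serialise the graph into prompt text the LLM can ground its SQL on."""
    lines = []
    for entity, spec in ontology.items():
        cols = ", ".join(f"{col} ({desc})" for col, desc in spec["columns"].items())
        lines.append(f"{entity} -> table {spec['table']}: {cols}")
        for rel, target, join in spec["relations"]:
            lines.append(f"  {entity} {rel} {target} via {join}")
    return "\n".join(lines)


def build_prompt(question: str) -> str:
    # The assembled prompt would be sent to whichever LLM you use.
    return (
        "Using only the schema below, write one SQL query.\n\n"
        f"{ontology_to_context(ONTOLOGY)}\n\n"
        f"Question: {question}\nSQL:"
    )


print(build_prompt("What is the average premium by region?"))
```

The extra relationship and business context is what lifts accuracy over prompting with bare DDL, which is the effect the paper is measuring.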
Even with these results, I’m not sure it’s worthwhile to use Knowledge Graphs with LLMs unless it’s a large-scale solution: you’ll have to build and maintain two data analytics systems rather than the typical one. And I can tell you, as someone who has designed and/or built more than half a dozen data platforms in the last six years, that building just one analytical system is hard enough work, even with SaaS solutions and the cloud.
On top of that, you have to keep the Knowledge Graph in sync with your relational database, which multiplies your maintenance problems (for example, you now have two sources of truth rather than one).
*I know 13 tables is a bit on the small side for a large enterprise dataset, but it’s a much more complex dataset than the ones I’ve seen used to test LLMs elsewhere.
Semantic Layer as the Data Interface for LLMs
But maybe semantic layers pair even better with LLMs than Knowledge Graphs do? Jason Ganz, who works in Developer Experience at dbt Labs, presents dbt’s promising initial findings on using the dbt Semantic Layer with LLMs.
One issue dbt will face here is that the two biggest developers of LLMs, Microsoft and Google, also have something like a semantic layer in Power BI and Looker, respectively. That said, dbt may have an angle with organisations that don’t want to be locked into a single LLM stack or BI application.
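The appeal of the approach is easy to sketch: rather than asking the LLM to write raw SQL, you ask it to choose from a governed list of metrics and dimensions, and the semantic layer compiles the actual query. The toy below illustrates the idea with a hypothetical metric registry and compile step; it is not dbt’s API (the real Semantic Layer has its own interfaces):

```python
# Toy illustration of the semantic-layer-as-LLM-interface idea.
# METRICS, DIMENSIONS, and compile_to_sql() are hypothetical stand-ins,
# not dbt's actual Semantic Layer API.

METRICS = {
    "revenue": {"table": "fct_orders", "expr": "SUM(amount)"},
    "order_count": {"table": "fct_orders", "expr": "COUNT(*)"},
}
DIMENSIONS = {"region": "region", "order_month": "order_month"}  # columns on fct_orders


def compile_to_sql(metric: str, group_by: str) -> str:
    """The semantic layer, not the LLM, owns SQL generation."""
    m = METRICS[metric]
    dim = DIMENSIONS[group_by]
    return (
        f"SELECT {dim}, {m['expr']} AS {metric}\n"
        f"FROM {m['table']}\n"
        f"GROUP BY {dim}"
    )


# The LLM only has to emit a tiny, validatable request like this,
# which is much harder to get wrong than free-form SQL over a raw schema.
llm_output = {"metric": "revenue", "group_by": "region"}
print(compile_to_sql(**llm_output))
```

Because every metric definition is governed in one place, a wrong answer from the LLM is a wrong metric choice, which is easy to spot, rather than subtly wrong SQL.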
Data Explained: The State of Streaming
I’ve mentioned quite a few articles on streaming and/or real-time data in this newsletter, but I don’t think I’ve ever shared an introduction to streaming here, and Data Engineer Matt Palmer has written one of the best intros I’ve seen in this article.
I’d also generally recommend following Matt Palmer, especially if you like Jujutsu Kaisen references in your data engineering articles.
Why You Need a Data Catalog to Build Data Products
Last week, I was speaking to a data governance professional at Oakland about how difficult it is to get anyone outside of data governance to care about the topic.
That probably explains why some in data governance use comics or music to lower the barrier to entry for non-experts.
Hugo Lu, Co-Founder and CEO of Orchestra, instead tries well-written prose to make the case for data teams adopting data catalogs. I especially liked this sentence: “Catalogs offer a way for data practitioners to finally collaborate with business users effectively.”
Automatic Data Platform Optimisation
With prices rising for both on-premises hardware and the cloud, there has been a lot of noise in the data world about reducing data platform costs.
That said, I haven’t seen many articles comparing the vendors that try to save you money on data storage and compute, so I love this article by Phil Dakin.
Sponsored by The Oakland Group, a full-service data consultancy. Download our guide or contact us if you want to find out more about how we build data platforms!