How to Build a Data Platform: Classic Data Platform vs Data Fabric vs Data Mesh
Or Centralised vs. Decentralised Platforms
Note: this is part of an in-progress guide on “How to Build a Data Platform”; subscribe for future updates.
Introduction
So you want to build a Data Platform: you know where to start building to bring the most value, and you’ve picked your cloud or on-premises location to process your data. What next?
The next big decision is arguably how centralised you want your Data Platform to be. This isn’t a binary decision but more of a sliding scale, even if some people present it as a binary for sales reasons.
We present three options: Classic Data Platform, Data Fabric and Data Mesh. In practice, however, we’ve seen organisations build each of these architecture patterns in a dozen different ways, and combine ideas from more than one pattern into a single architecture.
This post won’t say which is best, but rather what kind of organisation is best suited to each architecture. Don’t believe anyone who says that their version of Data Fabric or Mesh is the one true way for all organisations, or that one of the above options is “dead“.
Classic Data Platform
A “classic” centralised Data Platform is one where all data is processed in one place, since duplicating data across silos costs more to build and maintain. It also helps create a “Single Source of Truth“: everyone knows there is only one place to look when searching for analytical data. This architecture makes the most sense for organisations that have only a few sources and can build their data platform in weeks or months.
Calling this architecture a “classic” Data Platform architecture might make it sound legacy, but it can be the best way to design a Data Platform if you are just starting and have relatively low complexity in Data Processing. Stick with simple if you don’t need anything more complex.
Centralisation Issues
Centralising all your data can be an impossible dream if you have hundreds of disparate databases and lakes in data silos. And even if you do manage to crowbar dozens of sources into one location, it can end up being the data equivalent of a big ball of mud if not carefully managed: unmanageable and painfully slow to react to change.
It can also be hard to manage Data Governance and Data Access at scale without bespoke tools.
While we’ll mention Data Modelling strategies for helping to manage hundreds or even thousands of tables in the Data Modelling section, there are also two architectural “solutions”, called the Data Mesh and Data Fabric.
Both Centralised and Decentralised: Data Fabric
Data Fabrics manage data at scale by leveraging metadata: data about your data (what columns a table has, who owns it, and so on). Usually metadata is put to work, or made “active“, by using a Data Catalog or Data Discovery software, which we’ll discuss in the later Data Governance section.
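To make that concrete, here’s a minimal sketch (in Python, with hypothetical field names, not any real catalog’s schema) of the kind of record a Data Catalog might hold for one table, and how “active” metadata is simply metadata you can query and act on:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal, hypothetical Data Catalog record: data about the data."""
    table_name: str
    columns: dict[str, str]             # column name -> data type
    owner: str                          # team accountable for the data
    domain: str                         # business domain it belongs to
    tags: list[str] = field(default_factory=list)

orders = CatalogEntry(
    table_name="sales.orders",
    columns={"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
    owner="sales-data-team",
    domain="Sales",
    tags=["pii:none", "refresh:daily"],
)

# "Active" metadata: metadata you can query and act on, e.g. finding
# every table a given team owns when reviewing access requests.
def tables_owned_by(catalog: list[CatalogEntry], owner: str) -> list[str]:
    return [e.table_name for e in catalog if e.owner == owner]

print(tables_owned_by([orders], "sales-data-team"))  # ['sales.orders']
```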
While everyone agrees on having only one Data Catalog for a Fabric, there is a lot of difference of opinion on whether a Data Fabric can process analytical data in more than one location. Some Data Fabric explanations say it can, as long as all data can be accessed from one location (which sounds a lot like a Data Mesh from a technology point of view), while other descriptions talk about processing data in one centralised location.
Data Catalogs have been around for a while, so you could argue Data Fabrics are not offering anything new; however, Data Fabrics try to use them in a way that improves Data Analytics efficiency, to the point where you can use metadata to make automated analytical decisions with Knowledge Graphs.
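As a toy example of what that could look like, here’s a tiny lineage “graph” built from plain Python dictionaries (all dataset names are made up): given a table about to change, it finds every downstream asset that would be affected, which is the kind of automated decision a Fabric aims to make from metadata alone.

```python
# A toy lineage "knowledge graph" as an adjacency list: each edge points
# from a dataset to the assets built on top of it. Names are hypothetical.
lineage = {
    "crm.customers": ["sales.orders_enriched"],
    "sales.orders": ["sales.orders_enriched"],
    "sales.orders_enriched": ["finance.revenue_report", "marketing.churn_model"],
}

def downstream_of(graph: dict[str, list[str]], node: str) -> set[str]:
    """Walk the graph to find every asset affected by a change to `node`."""
    affected, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

# e.g. warn the owners of everything downstream before a breaking schema change:
print(downstream_of(lineage, "sales.orders"))
# {'sales.orders_enriched', 'finance.revenue_report', 'marketing.churn_model'}
```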
That said, automated decision making through metadata is arguably more theory than reality at present for most organisations, even with recent advances in Generative AI: some data suggests only a third of organisations have a Data Catalog, and an even smaller percentage use them effectively.
You can ignore all the AI aspects of a Data Fabric and just concentrate on providing a single place to access (and probably process) all analytical data.
Arguably the biggest benefit of a Data Fabric is that it largely keeps the organisational structure the same (if you want it to!), allowing for a quicker, less painful migration to a new technology or architecture pattern. Data Fabric products don’t often define how data teams in an organisation should be structured, though they usually expect a strong central team to enforce Data Governance.
Data Fabric Issues
So you’ve decentralised your technology but not your teams; this can cause issues with deciding what to prioritise and build, as a central data or IT function knows less about the business than those who work in it, and may face more central blockers to growing at the pace of the business.
One example is having all data access requests go through one (usually underfunded) team, which again does not have much knowledge of how the business operates.
Another issue is that some Data Fabric products talk about one consistent data layer for large organisations, but also talk about anyone being able to quickly add data to a Fabric. I don’t think this is actually possible without turning your Fabric into a swamp of duplicated, poor-quality data, as great care often needs to be taken not to add data that duplicates what already exists in a single large model.
The above issue can be avoided somewhat by creating Data Marts in a Fabric, though it will still take time to curate them.
Next up is Data Mesh, which can fix the above issues, but the price can be a heavy one to pay.
Mostly/Fully Decentralised: Data Mesh
A Data Mesh creates a data discovery layer over all of your data, removing many of the pains of data silos in a fraction of the time that centralisation would take. Scale also becomes an issue in a centralised system, as any change is more likely to have downstream impacts.
There are four pillars to Data Mesh:
Domain Ownership: the business domains, directorates or departments own the data and data processing (not a central IT or data team).
Data as a Product / Data Products: bringing a business domain’s operational and analytical data closer together and applying product thinking to data.
Self-Service Infrastructure as a Platform: either every product team learns Infrastructure as Code, or they use some kind of low- or no-code portal to build their data infrastructure (see the sketch after this list).
Federated Computational Governance: similar to Domain Ownership, but where the business, not the central IT or data team, manages its own Data Governance.
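As a sketch of the self-service pillar, here is the kind of declarative request a low- or no-code portal might accept from a domain team; every field name and resource type below is an assumption for illustration, not any real product’s API. The domain team describes what it needs, and platform-owned code validates and provisions it.

```python
# A hypothetical, declarative infrastructure request from a domain team.
request = {
    "domain": "marketing",
    "product": "campaign_performance",
    "resources": [
        {"type": "bucket", "name": "raw-campaign-events"},
        {"type": "warehouse_schema", "name": "campaign_marts"},
        {"type": "scheduled_job", "name": "daily_load", "cron": "0 6 * * *"},
    ],
}

# The central platform team owns the validation and provisioning code.
ALLOWED_TYPES = {"bucket", "warehouse_schema", "scheduled_job"}

def validate(req: dict) -> list[str]:
    """Return a list of problems; an empty list means it can be provisioned."""
    problems = []
    if not req.get("domain"):
        problems.append("every request must name an owning domain")
    for res in req.get("resources", []):
        if res["type"] not in ALLOWED_TYPES:
            problems.append(f"unsupported resource type: {res['type']}")
    return problems

print(validate(request))  # [] -> hand off to your IaC tooling of choice
```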
If this sounds like a chaotic outcome where anything goes, in reality many Data Meshes still have a central team to manage platform-level operations such as security, governance and monitoring: this is called a hub and spoke model, though you may argue that brings a Data Mesh closer to functioning like a Fabric.
The phrase I like for federated architectures is “trust, but verify“: you want independent, autonomous teams, but you also want to monitor and review those teams regularly to check they are aligned with the organisation’s goals, not breaking the law, have good security practices, and so on.
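In practice the “verify” half can be partly automated. Below is a hedged sketch that applies central policies to each domain’s product metadata (the fields are hypothetical) and reports failures for review, rather than blocking teams up front.

```python
# A toy federated governance check over product metadata (fields are hypothetical).
products = [
    {"name": "orders", "domain": "Sales", "owner": "sales-team", "pii_reviewed": True},
    {"name": "leads", "domain": "Marketing", "owner": None, "pii_reviewed": False},
]

def governance_report(products: list[dict]) -> dict[str, list[str]]:
    """Apply central policies to every domain's products; return failures."""
    failures: dict[str, list[str]] = {}
    for p in products:
        issues = []
        if not p.get("owner"):
            issues.append("no named owner")
        if not p.get("pii_reviewed"):
            issues.append("PII review missing")
        if issues:
            failures[p["name"]] = issues
    return failures

print(governance_report(products))
# {'leads': ['no named owner', 'PII review missing']}
```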
The benefit of all this is that Data Engineering, Science and Analytics sit closer to the business, rather than in a central data or IT function that knows less about the business, and they face fewer central blockers to growing at the pace of the business.
This should also bring source data producers closer to analytical teams, so data producers think more about the analytical impact of their data, and there is a better chance of data issues being fixed at source.
Finally, it stops data models from getting uncontrollably large by dividing them up by business domain and Data Products.
A Data Mesh can feel similar to a collection of Data Marts, but a Data Mesh differs in two ways:
Data Products in a mesh can contain both analytical and operational data, whereas a Data Mart only contains analytical data (as sketched below).
Data Marts are owned by the central data or IT team, whereas a Data Product in a mesh is owned by the business domain.
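Here’s a minimal sketch of that first difference, with illustrative (not standardised) names: one Data Product exposing an analytical output alongside an operational one, owned by a business domain rather than central IT.

```python
from dataclasses import dataclass

@dataclass
class OutputPort:
    """How consumers read from a Data Product; names are illustrative only."""
    name: str
    kind: str  # "analytical" (e.g. a warehouse table) or "operational" (e.g. an API)

@dataclass
class DataProduct:
    name: str
    owning_domain: str              # the business domain, not a central IT team
    output_ports: list[OutputPort]

orders_product = DataProduct(
    name="orders",
    owning_domain="Sales",
    output_ports=[
        OutputPort("orders_history", kind="analytical"),  # warehouse table
        OutputPort("orders_api", kind="operational"),     # low-latency API
    ],
)

# Unlike a Data Mart, one product can expose both kinds of data side by side:
print({p.kind for p in orders_product.output_ports})  # {'analytical', 'operational'}
```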
Just to be clear, Data Mesh architectures also include Knowledge Graphs, Data Catalogs or Data Discovery, so all the active metadata benefits of a Data Fabric can also be found in a Data Mesh.
Data Mesh Issues
A Data Mesh is more likely to cause duplication of work and data, which can look inefficient and cost more, but you are trading that off against teams having more flexibility and less cognitive load, which should allow them to be more efficient and save costs elsewhere.
While the technology part of a Data Mesh is hard, the organisational change needed to fully realise a Data Mesh is even harder, as you may have to split up a central data team into many domain-oriented data teams and reallocate budgets away from IT to business departments like Sales or Human Resources.
Finally, while there are many Data Fabric products to buy off the shelf, there aren’t as many for Data Mesh, so it will likely be more work to connect multiple technologies together to make a Data Mesh. You may be fine with that if you buy into the Modern Data Stack way of thinking: having multiple best-in-class products in your Data Platform.
Summary
Classic Data Platforms are probably the quickest, least risky and easiest way to get started, though they usually have limits to their scaling. You could even think of one as a single “Data Product“, as long as it’s only serving one business domain.
Both Data Fabric and Data Mesh make more sense as you scale your data team and business, and can afford all the extra hardware and software like Data Catalogs, Knowledge Graphs and Federated Query Layers.
Data Fabrics are ideal for organisations that do not want to go through the pain of organisational change but want to manage data at scale faster. As James Serra says, Data Fabrics are a technology-centric change rather than an organisational one.
Data Meshes ideally suit large, technically mature organisations that already have decentralised data teams. While there is more risk with a Data Mesh, there may be more reward too, especially for organisations that are already highly decentralised.
I would argue it makes sense to decentralise more as you scale, to avoid the big ball of mud and the loss of agility: go from classic to fabric and finally to mesh. Though I could also see the point of skipping fabric and going straight to mesh, or staying on a Data Fabric, as Data Platform re-architectures are expensive.
There is also nothing stopping you combining components of both a Fabric and a Mesh, which will be more expensive, but a component like a Knowledge Graph can bring value to a Mesh, even though Knowledge Graphs are often associated only with Data Fabrics.
Finally, as mentioned above, there is a lot of variety in how organisations implement these architectures; for example, this article offers six different ways to implement a Data Mesh, all with different degrees of decentralisation, and there is probably more variety out there that is undocumented.
Note: you may wonder how Data Warehouses, Lakes and Lakehouses fit into these architectures. Any of them can be used in a classic, fabric or mesh architecture, so we use the catch-all term “Analytical Data Processing“, which you can replace with a Warehouse, Lake or Lakehouse.
We will have a separate post on Data Warehouse vs. Data Lake vs. Data Lakehouse soon for the guide.
Special thanks to Graham Thomas for reviewing this post!
Sponsored by The Oakland Group, a full service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!