How to Build A Data Platform: Data Modelling
We ask: Do I Need any Data Modelling? What is the Data Modelling Workflow? We also cover Model Layers, Model Types, and why you might need Master Data Management.
Note: This is part of an in-progress guide on “How to Build a Data Platform”, subscribe for future updates!
Do I Need any Data Modelling?
With the rise of Query-Based Modelling adopted by large companies and enterprise “super” data models being broken up by Data Products, you may be forgiven for thinking that data modelling is in decline, maybe even dead. Even so, I think there is still a place for some modelling in every dataset.
Why? Well, almost all data has some kind of structure, so putting data in a consistent shape and style can allow data teams to make changes faster as you scale up your Data Platform. Thinking a bit about your data model before implementing it is like drawing a diagram of a chair before you cut the wood to build it.
This doesn’t mean you have to spend weeks or months modelling every dataset. At a minimum, define rules for consistent labels, such as column and table names, with sufficient documentation, starting with your most used data, just as you would with any well-written code, even if you’re only using a Data Lake or real-time streaming transformations.
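To make that concrete, here is a minimal sketch of what such naming rules might look like, enforced with a small Python check; the convention itself (snake_case names, layer prefixes on tables) is an illustrative assumption, not a standard:

```python
import re

# Illustrative convention (an assumption, not a standard): snake_case names,
# and every table name carries a layer prefix.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")
ALLOWED_TABLE_PREFIXES = ("raw_", "conformed_", "curated_")

def check_names(table: str, columns: list[str]) -> list[str]:
    """Return a list of naming-convention violations for one table."""
    problems = []
    if not table.startswith(ALLOWED_TABLE_PREFIXES):
        problems.append(f"table '{table}' is missing a layer prefix {ALLOWED_TABLE_PREFIXES}")
    for name in (table, *columns):
        if not NAME_PATTERN.match(name):
            problems.append(f"'{name}' is not snake_case")
    return problems

print(check_names("curated_sales_orders", ["order_id", "OrderDate"]))
# ["'OrderDate' is not snake_case"]
```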
Data Modelling can also help with Data Governance and debugging data issues if done well; you are documenting the data ecosystem with clear diagrams and easy-to-understand data models that relate to the business.
Data should be easy to visualise in reports and dashboards while also performing well enough that analysts are not waiting hours or even days for queries to run; both are heavily affected by how you model data.
Data Modelling Workflow
Just like in Solution Architecture, you want to start with the Business Processes and Business Rules: what is the Business Process from source to report for your data, and what rules must it comply with? Of all the steps, this is arguably the most crucial to get right, as you want to build a data solution that matches stakeholders’ expectations.
This stage is often called requirements gathering, which is a whole topic in itself and has many textbooks on it.
The next stage is Conceptual Modelling: if you are dealing with lots of data assets, they are usually easier to understand when grouped into business domains, like Finance, Human Resources and Sales, and entities, like Employees, Customers and Projects. This part is sometimes skipped, especially for small models that only have one or two concepts.
You’ll want to use the same language as the business domain in the first two stages so that Subject Matter Experts (SMEs) from that domain can verify your modelling.
Logical Modelling: this stage formally defines the relationships between tables, what unique identifiers to use and what columns tables will have. This part is also sometimes skipped, especially on small models, where it is combined with the physical model stage to save time writing two different models. One reason to keep this stage is that it stays largely technology-independent while diving deeper into the model’s details.
The last stage is Physical Modelling, which defines the column types and data constraints. After this stage is complete, you can use tooling such as SqlDBM or Erwin Data Modeler to convert a physical model diagram into SQL code, saving a lot of time creating tables by hand.
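As a rough illustration of that hand-off (not how SqlDBM or Erwin work internally), here is a sketch that renders a tiny, made-up physical model, with column types and constraints, into CREATE TABLE statements:

```python
# Hypothetical physical model: each table maps column names to a SQL type plus constraints.
physical_model = {
    "department": {
        "department_id": "INT NOT NULL PRIMARY KEY",
        "department_name": "VARCHAR(100) NOT NULL",
    },
    "employee": {
        "employee_id": "INT NOT NULL PRIMARY KEY",
        "full_name": "VARCHAR(200) NOT NULL",
        "department_id": "INT REFERENCES department(department_id)",
    },
}

def to_ddl(model: dict) -> str:
    """Render the physical model as CREATE TABLE statements."""
    statements = []
    for table, columns in model.items():
        cols = ",\n  ".join(f"{name} {spec}" for name, spec in columns.items())
        statements.append(f"CREATE TABLE {table} (\n  {cols}\n);")
    return "\n\n".join(statements)

print(to_ddl(physical_model))
```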
I’d recommend checking out Sonny Rivera’s post which goes into more detail on modelling stages if you’re interested.
Model Layers
The other important consideration is how we process our data through our data pipeline, from source to report. Generally, there are three layers: Raw, Conformed and Curated (there are many naming variations, with Databricks confusing everyone with Bronze, Silver and Gold, but they achieve the same aims).
The Raw layer is usually a straight copy of the source data with essential metadata, plus maybe some transformations to fit the technology the data is being moved to, so it requires little modelling.
The Conformed layer cleans and deduplicates data to create a Single Source of Truth (more detail later). It may also apply some enterprise-level business rules. This is arguably the hardest layer to design, as you have to balance today’s use cases against adaptability in the future.
The Curated layer holds analytical data that directly adds business value in some way, so it often has a very narrow focus, answering one or two business questions. You may end up splitting this layer into business-domain Data Marts.
I’ve seen some data models and pipelines combine the Conformed and Curated layers, which can save some time on smaller data models, but it makes your conformed layer narrower in focus and harder to adapt in the future.
Finally, you have to be careful with the curated stage so that you don’t end up with hundreds, if not thousands, of tables created for one-off queries and never used again. Combat this with careful pruning based on usage metrics, or by putting more experimental and/or exploratory analytics in a separate layer and/or schema.
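To make the layers concrete, here is a minimal pandas sketch of one dataset’s journey from Raw to Curated; the column names and cleaning rules are illustrative assumptions, not a prescription:

```python
import pandas as pd

# Raw layer: a straight copy of the source, plus essential load metadata.
raw_orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "customer": ["Acme Ltd", "ACME LTD ", "Globex"],
    "amount": [100.0, 100.0, 250.0],
    "_loaded_at": pd.Timestamp("2024-01-01"),
})

# Conformed layer: clean and deduplicate to one trusted record per order.
conformed_orders = (
    raw_orders
    .assign(customer=raw_orders["customer"].str.strip().str.title())
    .drop_duplicates(subset="order_id", keep="last")
)

# Curated layer: a narrow, business-facing answer to one question (revenue per customer).
curated_revenue_by_customer = (
    conformed_orders.groupby("customer", as_index=False)["amount"].sum()
)
print(curated_revenue_by_customer)
```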
How to Model Data?
Agile, Value-Oriented Modelling
Model what you need, not every data source you have, to get a quicker Time To Insight (TTI). As mentioned earlier in the guide, you should first look at your anticipated analytical outputs and then only model the data that is needed for those outputs.
Maybe your Data Platform is small and only has a few tables, in which case model the entire Data Platform. However, if your Data Platform will potentially contain hundreds of tables and take years to implement, we recommend only modelling the data you will use in the next few weeks or months.
Model Types
The most common data model type historically has been the Star Schema or Kimball model, especially when it comes to reports and dashboards in Business Intelligence (BI). For example, some testing has shown that data for Power BI is often best modelled as a Star Schema.
This does not mean Star Schemas are always the best type of model, and with newer technologies like Columnar Data Warehouses, streaming, various flavours of NoSQL and Data Lake, you may require slightly different methods of Data Modelling, so model your data according to your use cases and technology.
Arguably the most common alternative is One Big Table (OBT), a much simpler model where the most commonly used attributes and measures are put in, well, one big table, possibly supplemented by small reference tables.
The benefit of this style is that you’ll have to make fewer joins with other tables, which can be an expensive and difficult-to-debug operation if there are lots of tables to join. OBT also matches the data model required by most Data Science, AI and ML algorithms, which often expect one big table (or matrix) of features and target values.
The downside is that you can end up with a massive table of dozens, if not hundreds, of columns, which can make it hard to understand and use. Also, as mentioned above, BI applications like Power BI generally prefer Star Schemas.
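Here is a tiny sketch of the difference, using made-up sales data: the star schema keeps a fact table plus dimension tables and joins at query time, while OBT denormalises them up front so consumers never have to:

```python
import pandas as pd

# Star schema: a narrow fact table referencing small dimension tables by key.
fact_sales = pd.DataFrame({"date_key": [20240101, 20240102], "product_key": [1, 2], "revenue": [100.0, 250.0]})
dim_product = pd.DataFrame({"product_key": [1, 2], "product_name": ["Widget", "Gadget"], "category": ["Tools", "Toys"]})
dim_date = pd.DataFrame({"date_key": [20240101, 20240102], "month": ["2024-01", "2024-01"]})

# One Big Table: join the dimensions in once, up front.
obt_sales = fact_sales.merge(dim_product, on="product_key").merge(dim_date, on="date_key")
print(obt_sales[["month", "product_name", "category", "revenue"]])
```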
Another big factor in how you model your data is how often its structure changes. If the structure changes a lot (from data point to data point, or on a daily basis), you may even decide to use semi-structured data storage.
Activity Schemas offer an alternative here: data is stored in a tabular database with columns for unique identifiers and metadata, while dataset attributes and/or semi-structured elements are kept in a JSON column. This keeps data models smaller, meaning less maintenance and potentially better performance.
The downside is that it requires more effort and expertise to use JSON columns in BI applications and queries. I wrote a longer post on Activity Schemas if you want to know more.
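As a rough sketch (the column names are assumptions based on the general Activity Schema idea, not a formal spec), each row records one activity for one entity, with the variable attributes pushed into a JSON column:

```python
import json
import pandas as pd

# One activity stream table: fixed identifier/metadata columns, variable attributes in JSON.
activity_stream = pd.DataFrame({
    "activity_id": ["a1", "a2"],
    "entity_id": ["customer_42", "customer_42"],
    "activity": ["signed_up", "placed_order"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "feature_json": [
        json.dumps({"channel": "web"}),
        json.dumps({"order_id": 1, "amount": 100.0}),
    ],
})

# Consumers only parse the JSON when they need those attributes.
orders = activity_stream[activity_stream["activity"] == "placed_order"].copy()
orders["amount"] = orders["feature_json"].map(lambda s: json.loads(s).get("amount"))
print(orders[["entity_id", "ts", "amount"]])
```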
Also consider Data Vault models for better write performance while keeping a more tabular structure. The downside of Data Vault is that it can make your models bigger, so they may require more maintenance. Also, like Activity Schemas, Data Vault models are usually not ideal for direct use in BI applications or for models with only a small number of data assets.
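A minimal sketch of the core Data Vault shapes (hub, link, satellite) with made-up employee data; real implementations are stricter about hash keys, load dates and record sources:

```python
import pandas as pd

# Hub: one row per business key.
hub_employee = pd.DataFrame({
    "employee_hk": ["h1"], "employee_id": ["E001"],
    "load_date": ["2024-01-01"], "record_source": ["hr_system"],
})

# Satellite: descriptive attributes tracked over time against the hub key (append-only,
# which is where much of the write-performance benefit comes from).
sat_employee_details = pd.DataFrame({
    "employee_hk": ["h1", "h1"],
    "load_date": ["2024-01-01", "2024-06-01"],
    "full_name": ["Jane Doe", "Jane Smith"],
    "department": ["Sales", "Sales"],
})

# Link: relationships between hubs, for example employee to project.
link_employee_project = pd.DataFrame({
    "link_hk": ["l1"], "employee_hk": ["h1"], "project_hk": ["p1"], "load_date": ["2024-01-01"],
})

# Getting the current view back requires picking the latest satellite row per hub key.
print(sat_employee_details.sort_values("load_date").groupby("employee_hk").tail(1))
```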
With all these modelling styles supporting different use cases, it may make sense to use multiple model types if you have varied workloads and analytical technologies. A common pattern is to use performance-oriented models like OBT, Activity Schema or Data Vault closer to the source data, and then Kimball/Star Schemas closer to the consumption side, where they make large data models easier to understand.
Non-Database Data Should Still be Modelled
Data Lakes still need modelling: they should be structured in such a way that it’s easy to find the data you want, and thought should be put into how large Data Lake files are partitioned to get the best performance.
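For example, here is a sketch of writing date-partitioned Parquet files so that consumers can prune by date; the path and partition column are just assumptions (and writing straight to object storage would additionally need something like s3fs):

```python
import pandas as pd

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "payload": ["a", "b", "c"],
})

# Partitioning lays files out as data_lake/events/event_date=2024-01-01/part-0.parquet,
# so queries that filter on event_date only read the folders they need.
events.to_parquet("data_lake/events/", partition_cols=["event_date"])
```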
The same goes for streaming, logs, event data and semi-structured data (JSON, NoSQL): thinking about and documenting how the data will be used, joined and changed over time helps create a consistent structure that makes refactoring and debugging issues easier in the future.
Even metadata should ideally be modelled, so you have clear and consistent requirements for data ownership and usage on all datasets for easier governance.
Single Source of Truth
A lot of modelling effort can go into creating a Single Source of Truth - one record, dataset or model to rule them all for each entity. For example, the customer or employee dataset that (nearly) everyone should use in an organisation. This makes new analytics and fixing issues much easier, as you always know the best, most trusted place to look.
Getting to this deduplicated nirvana can be very difficult, especially if duplicated data is already widely used. Master Data Management (MDM) can help massively here, but unless you are a large enterprise it can cost a lot, and it will take time and effort to implement.
What is Master Data Management (MDM) and do I need it?
Master Data Management can check multiple data sources, combine them into one correct record, and send it back to the data source if the source needs updating.
You can use algorithms such as Levenshtein Distance or Cosine Similarity to compare two possibly duplicate records, and then use other metadata, like update time, to work out which record’s details are correct, giving you a Single Source of Truth. That said, today’s off-the-shelf MDM software often uses more complex Machine Learning algorithms to do such matching.
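As a toy illustration of that kind of matching (not how any particular MDM product implements it), this sketch scores two hypothetical customer records with a normalised similarity ratio from the standard library and trusts the most recently updated one:

```python
from difflib import SequenceMatcher

record_a = {"name": "Jon Smith", "email": "jon@example.com", "updated_at": "2024-03-01"}
record_b = {"name": "John Smith", "email": "jon@example.com", "updated_at": "2024-05-01"}

def similarity(a: str, b: str) -> float:
    """Normalised string similarity (1.0 = identical); a stand-in for Levenshtein or cosine scores."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity(record_a["name"], record_b["name"])
if score > 0.85 and record_a["email"] == record_b["email"]:
    # Treat them as the same customer and trust the most recently updated record's details.
    golden = max([record_a, record_b], key=lambda r: r["updated_at"])
    print(f"Matched with score {score:.2f} -> golden record: {golden}")
```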
One example of MDM in action is customer records: large organisations likely store customer data in multiple places across departments, and it can be important for marketing and legal reasons to keep them consistent and up-to-date.
Off-the-shelf MDM software can be expensive (starting at >£10k per month) and can take a while to integrate, so it can be cheaper to roll your own solution for simple use cases; for complex, large-scale cases, though, you may find MDM is value for money. For example, scanning a table of billions of rows for a “fuzzy match” can be expensive in compute resources, so high-performance off-the-shelf MDM software can provide optimisations you would otherwise have to work out yourself.
I also haven’t mentioned that MDM software often comes with a nice User Interface, alerts and so on, so it can be used by people beyond Data Engineers.
I should also mention Reverse ETL, which is typically much cheaper than MDM but just sends analytical data back to data sources like Salesforce, with no matching algorithms. This is great if you already have clean, deduplicated data in an analytical data store and want to send it back to the data source.
There is also Entity Resolution and Identity Resolution, which feel mostly like rebranded MDM for the cloud age, possibly without the moving of deduplicated data back to the data source.
If you do implement MDM, I would look at setting it up in registry style first, especially in a complex environment: you deduplicate incoming data and assign unique identifiers, but don’t use it to conform data to a golden single source, which can involve a lot of work removing and changing existing data. You can use this style initially as a trial run to see whether MDM is producing the data you want.
One other important implementation consideration with MDM is whether you send deduplicated, cleaned data back to the source* or not (consolidation style): while sending data back to the source will lead to better Data Quality, the source technology might not accept updates from a third party, and the data source owner may not have the budget or the will to incorporate MDM into their systems.
*There are two ways to do this: centralised, which queries the master record before creating a record in a data source or transactional data system; or coexistence style, where data sources are ingested post-creation, then cleaned, deduplicated, and sent back.
Summary
Another big topic with lots of takeaways:
All data that is used by your customers (internal or external) should have at least a bit of modelling and documentation, if only to keep it consistent and easy to find.
Model only what you need.
Think about multiple model types for varied workloads.
Single Source of Truth and MDM are very powerful but can be very expensive and time consuming to implement.
I hope you enjoyed reading this; leave any comments or questions below!
Cover photo by Karen Vardazaryan on Unsplash
Sponsored by The Oakland Group, a full service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!