How to Build a Data Platform: Data Governance
Why is Data Governance important?, "Agile" Data Governance, Ownership Over Tooling and Do I Need a Data Catalog
Introduction
I’ve worn a number of hats in my career, but Data Governance isn’t one of them, so I can’t claim to be an expert.
That said, I have implemented governance controls on many Data Platforms and have opinions about it. I also don’t think I can, in good conscience, talk about “How to Build a Data Platform” without mentioning Data Governance, as I feel all data solutions benefit from some governance.
See this as an “Architect’s or Engineer’s guide to Data Governance” rather than a deep dive into Data Governance. For that, I’d check out my colleague’s expert in-depth guide.
Why is Data Governance important?
Data Governance is often the last thing on an engineer’s or analyst’s mind when they are swamped with stakeholder requests that needed to be answered yesterday.
But it’s likely they got into that situation partly because of a lack of governance:
Stakeholders who can’t discover data on their own will instead ask an overworked data team.
A lack of data documentation inside the data team increases time to insight and generally makes the data team miserable, because everything is harder to do than it should be.
Unclear data access specifications mean it takes weeks, months or even years to get access to data when it should only take hours.
There is no clear ownership of data assets, so data access requests are slow or impossible to resolve, and data that nobody updates becomes a legacy dataset.
Little or no usage metrics are collected on data assets, which means decisions about where to invest in data take more effort and the money is more likely to be misspent.
Data Quality is poor, so the data needs more work before it can be analysed.
I would argue that a lack of governance is to data what technical debt is to software. This “data debt” is okay in small amounts, but it can soon build up enough to become a millstone around the neck of the data team, if not the whole organisation, making everyone less efficient.
I haven’t even mentioned the security and legal compliance aspects of Data Governance: if you collect personal data and have improper governance, you are at risk of embarrassing and costly data breaches and / or legal action.
Govern what you use, not every data asset you have
But comprehensive Data Governance in a large organisation is often expensive, and there is never enough budget, right?
One way to combat this is to start by governing only the data that is actively used (in reports, dashboards, etc.). Once you have a handle on actively used data, look at data that is rarely or never used, with a view to seeing whether you can delete it without any impact (though this can be done through FinOps efforts too).
It’s also worthwhile adopting an agile strategy: start in one area and get feedback on what worked and what didn’t before rolling it out to other areas of the organisation.
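As a rough illustration of governing what you use, the sketch below splits assets into "govern first" and "review for deletion" based on usage metrics. The AssetUsage fields, the 90-day window and the sample asset names are all assumptions made for the example; in practice the numbers would come from whatever query logs or BI usage statistics your platform exposes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class AssetUsage:
    """Hypothetical usage record for one data asset."""
    name: str
    last_queried: Optional[date]  # None means no recorded queries
    queries_last_90_days: int


def split_by_usage(
    assets: List[AssetUsage], today: date, active_window_days: int = 90
) -> Tuple[List[AssetUsage], List[AssetUsage]]:
    """Return (govern_first, review_for_deletion)."""
    cutoff = today - timedelta(days=active_window_days)
    govern_first = [
        a for a in assets
        if a.last_queried is not None and a.last_queried >= cutoff
    ]
    review_for_deletion = [
        a for a in assets
        if a.last_queried is None or a.last_queried < cutoff
    ]
    # Most-queried assets first, so governance effort goes where the usage is.
    govern_first.sort(key=lambda a: a.queries_last_90_days, reverse=True)
    return govern_first, review_for_deletion


# Made-up sample data for illustration only.
assets = [
    AssetUsage("finance.ledger", date(2024, 5, 1), 75),
    AssetUsage("sales.orders", date(2024, 5, 30), 420),
    AssetUsage("tmp.export_2019", date(2019, 11, 2), 0),
    AssetUsage("staging.unused", None, 0),
]
govern_first, review_for_deletion = split_by_usage(assets, today=date(2024, 6, 1))
print("Govern first:", [a.name for a in govern_first])
print("Review for deletion:", [a.name for a in review_for_deletion])
```

The second list is only a set of candidates: someone still has to confirm that deleting an asset has no downstream impact.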
Data Ownership and Standards Are More Important Than Tooling
I know, right, an engineer recommending less focus on tooling? Wild.
I’ve seen many Data Governance initiatives fail due to the organisation buying a fancy Data Catalog and then not doing enough work to make the Catalog an essential part of the organisation’s data ecosystem.
To help with that, we recommend implementing data ownership across all essential data and then implementing robust standards and processes, locally by the data owners themselves and globally by a central data team.
Having Data Owners (or Stewards or Custodians) means someone (or a team) is responsible for the quality, documentation, security and access of the data assets, which helps resolve many of the issues we spoke about at the top of the article.
After that, I’d look at implementing a cross-organisation metadata model or schema for all governed assets so that minimum Data Governance requirements are set. There may also be local policies set to extend the enterprise metamodel, like Finance wanting certain metadata tracked for legal reasons.
The metadata model can look like a complex, nested JSON document, but it can also be three tables in a spreadsheet: data sources, tables and columns.
Both of the above are often easier to do before adopting any tooling (re-factoring configuration is usually harder than starting from scratch) and still provide benefits if your tooling is only partly implemented or not at all.
With active data owners, you’ll have all data assets kept up-to-date in any tooling, and a consistent metadata model makes it easier to import and maintain any metadata in tooling.
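To make the "three tables" version of the metadata model more concrete, here is a minimal sketch using Python dataclasses for data sources, tables and columns, plus a check for minimum governance requirements. The field names and the specific rules (every source and table has an owner, every table and column has a description) are assumptions for illustration, not a standard; your own metamodel will differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    data_type: str
    description: str = ""
    contains_personal_data: bool = False


@dataclass
class Table:
    name: str
    owner: str = ""
    description: str = ""
    columns: List[Column] = field(default_factory=list)


@dataclass
class DataSource:
    name: str
    owner: str = ""
    tables: List[Table] = field(default_factory=list)


def governance_gaps(source: DataSource) -> List[str]:
    """List where a source falls short of the (assumed) minimum requirements."""
    gaps = []
    if not source.owner:
        gaps.append(f"{source.name}: data source has no owner")
    for table in source.tables:
        prefix = f"{source.name}.{table.name}"
        if not table.owner:
            gaps.append(f"{prefix}: table has no owner")
        if not table.description:
            gaps.append(f"{prefix}: table has no description")
        for column in table.columns:
            if not column.description:
                gaps.append(f"{prefix}.{column.name}: column has no description")
    return gaps


# Made-up example source.
crm = DataSource(
    name="crm",
    owner="CRM Team",
    tables=[
        Table(
            name="customers",
            description="One row per customer",
            columns=[
                Column("customer_id", "integer", "Primary key"),
                Column("email", "text", contains_personal_data=True),
            ],
        )
    ],
)
for gap in governance_gaps(crm):
    print(gap)
```

The same structure maps directly onto three spreadsheet tabs or three database tables, which is part of why agreeing on it before adopting tooling makes later imports easier.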
Do I need a Data Governance Framework?
I know that setting out to build a framework for anything requires time, and you’ll have many competing concerns, so I understand if you feel reluctant to build one, especially if you are a small team with a limited budget.
But there comes a point where fighting lots of local Data Governance battles becomes less efficient than building out a framework to reduce Data Governance issues over the long term.
Like I mentioned at the start of the post, I’m not going to go in depth on frameworks, but I will say that whatever framework you use, make sure it’s cyclical so that it’s always improving and you are acting on any emerging issues in a timely manner.
You Might Not Even Need Fancy Expensive Tooling
Another budget hack is to just use a spreadsheet, which should do the job in a small Data Platform, especially if you have a low number of schema changes (a few times a month). The downside is that you can’t easily track who has changed what in the spreadsheet, which is itself a Data Governance concern…
Another option is to import metadata into a database or data lake, which is more work to set up but allows for more scale than a spreadsheet and can potentially track changes.
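As a sketch of what tracking changes could look like in a database, the example below uses SQLite from Python's standard library with one table for current column descriptions and one for the change history. The table names, column names and the set_description helper are hypothetical, chosen only for this illustration.

```python
import sqlite3


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a small metadata store with an audit table (illustrative schema)."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS column_metadata (
            table_name  TEXT NOT NULL,
            column_name TEXT NOT NULL,
            description TEXT NOT NULL,
            PRIMARY KEY (table_name, column_name)
        );
        CREATE TABLE IF NOT EXISTS metadata_changes (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            table_name      TEXT NOT NULL,
            column_name     TEXT NOT NULL,
            old_description TEXT,
            new_description TEXT NOT NULL,
            changed_by      TEXT NOT NULL,
            changed_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        """
    )
    return conn


def set_description(conn, table_name, column_name, description, changed_by):
    """Update a column description and record who changed it."""
    with conn:  # commits both writes together, or rolls both back on error
        row = conn.execute(
            "SELECT description FROM column_metadata "
            "WHERE table_name = ? AND column_name = ?",
            (table_name, column_name),
        ).fetchone()
        old_description = row[0] if row else None
        conn.execute(
            "INSERT OR REPLACE INTO column_metadata "
            "(table_name, column_name, description) VALUES (?, ?, ?)",
            (table_name, column_name, description),
        )
        conn.execute(
            "INSERT INTO metadata_changes "
            "(table_name, column_name, old_description, new_description, changed_by) "
            "VALUES (?, ?, ?, ?, ?)",
            (table_name, column_name, old_description, description, changed_by),
        )


conn = create_store()
set_description(conn, "orders", "customer_email", "Email address", "alice")
set_description(conn, "orders", "customer_email", "Customer contact email", "bob")
history = conn.execute(
    "SELECT old_description, new_description, changed_by "
    "FROM metadata_changes ORDER BY id"
).fetchall()
print(history)
```

This addresses the "who changed what" gap of a plain spreadsheet, at the cost of having to build and maintain the import and update paths yourself.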
For cheap metadata visualisation and discovery, think about importing governance metadata into a Business Intelligence (BI) application like Power BI / Tableau so metadata can be discovered alongside the reports and dashboards.
What is Data Catalog and / or Data Discovery Software and do I Need It?
The main point of Data Catalogs (or the more aspirational title “Data Discovery”) is to collect metadata about data assets, such as data ownership and column and table names, import it into a data store (often graph-based) and then combine that with a User Interface (UI) that searches by key terms so users can discover data.
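Stripped of the UI and the scale, the core idea can be sketched as a keyword index over asset metadata. The toy example below is an assumption-laden illustration (the asset dictionary shape, the tokenize, build_index and search helpers, and the sample assets are all made up) and is nowhere near what a real catalog's search engine does.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(assets):
    """Map each search term to the set of asset ids whose metadata contains it."""
    index = defaultdict(set)
    for asset_id, asset in assets.items():
        text = " ".join([asset["table"], asset["owner"], *asset["columns"]])
        for term in tokenize(text):
            index[term].add(asset_id)
    return index


def search(index, query):
    """Return the asset ids that match every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Made-up metadata for two assets.
assets = {
    "sales.orders": {
        "table": "orders",
        "owner": "Sales Team",
        "columns": ["order_id", "customer_email"],
    },
    "crm.customers": {
        "table": "customers",
        "owner": "CRM Team",
        "columns": ["customer_id", "email"],
    },
}
index = build_index(assets)
print(sorted(search(index, "email")))
print(sorted(search(index, "sales email")))
```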
But Data Catalog software can often seem expensive, starting at hundreds of pounds per month and scaling up rapidly. Why? Because it has many components that need to work well together: a Search Engine, a Relational Database, a Web Server for the User Interface, a Graph Database and maybe a Streaming Cluster. Collectively, this costs a lot to run.
I wouldn’t recommend building your own unless you have at least a few million dollars or pounds to burn (though some still do, as Data Governance is a bit different in every organisation, making it hard to find the tooling that exactly matches your requirements).
So are they worth bothering with?
At a large scale, yes. Data discovery software has powerful search engines that can search over millions of data assets in fractions of a second, a search that might take a long time or be impossible in a normal spreadsheet or database.
They also often have:
Integrations with popular data sources, making metadata import much easier to set up.
Real-time updates.
Actions that are sent to Teams or Slack if there are any important metadata changes or issues.
Integration with Data Quality tooling like Great Expectations.
Automatic Personally Identifiable Information (PII) detection.
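Of the features above, automatic PII detection is the easiest to illustrate in a few lines. The sketch below flags a column if its name contains a common hint or if most sampled values look like email addresses. The hint list, the email pattern, the 0.5 threshold and the looks_like_pii helper are all assumptions for the example; commercial catalogs typically use far richer rules or trained classifiers, and a heuristic like this will produce both false positives and false negatives.

```python
import re

# Hypothetical column-name hints; extend for your own naming conventions.
NAME_HINTS = re.compile(
    r"(email|phone|mobile|birth|dob|surname|first_?name|last_?name|address|postcode)",
    re.IGNORECASE,
)
# Deliberately loose email shape: something@something.something
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_pii(column_name, sample_values, threshold=0.5):
    """Heuristic check: name hint, or enough sampled values look like emails."""
    if NAME_HINTS.search(column_name):
        return True
    values = [str(v) for v in sample_values if v]
    if not values:
        return False
    matches = sum(1 for v in values if EMAIL_VALUE.match(v))
    return matches / len(values) >= threshold


print(looks_like_pii("customer_email", []))
print(looks_like_pii("contact", ["a@example.com", "b@example.org"]))
print(looks_like_pii("order_total", ["10.50", "3.99"]))
```

Anything flagged this way is best treated as a prompt for the data owner to review, not as a final classification.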
One interesting trend we’re also seeing is getting Data Catalogs for “free” alongside other data solutions (Databricks Unity Catalog, Monte Carlo, Starburst Gravity and Dagster Asset Catalog) as “Data Catalog lite”.
They lack many features found in a bespoke Data Catalog, but can help data teams have a smaller, cheaper on-ramp into Data Catalogs without buying another product to maintain.
Summary
As mentioned at the top, this is only a lightweight summary of Data Governance and we did not cover:
Data Culture
Data Literacy
All the Data Governance Roles
Data Quality (though I have an upcoming section on this)
That said, my takeaways are:
All Data Projects / Products and Platforms benefit from at least a little bit of Data Governance.
Start small and don’t initially focus on governing all the data.
Ownership and Standards should ideally exist before starting to look at any tooling.
Once you do implement tooling for governance, there are a variety of options, from a simple spreadsheet to an enterprise Data Catalog that costs millions per year in licences.
My last comment is that, like Data Quality, successful Data Governance requires effort from the whole organisation, not just Data Governance professionals.
Sponsored by The Oakland Group, a full service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!