Note: this is part of an in-progress guide on “How to Build a Data Platform”; subscribe for future updates.
Introduction
Designing architecture is about making trade-offs; there is rarely a perfect solution, especially when you have numerous and/or complex requirements. It is more often about designing a ‘good enough’ architecture that meets as many requirements as you can. Therefore, we won’t suggest one ‘golden’ architecture, but discuss the pros and cons of common approaches.
At its core, being an architect is about making decisions: the architectures, technologies and configurations you choose will determine the success or failure of a data platform build and its adoption. It’s why I phrase a lot of my headings in this guide as a decision to be made. Given the ever-increasing scope of solutions, the sheer number of decisions that need to be made can feel somewhat daunting. As such, we've provided a fairly exhaustive list below.
In addition, providing the context around how these decisions were made, operating in an ever-moving world and enabling other teams are core aspects of an architect's role. Therefore, we've included some short contextual sections on these to get you thinking.
So What Decisions Do We Need to Make to Build a Data Platform?
What is our Roadmap?
Is it feasible in terms of time, money and current organisational capabilities?
Cloud vs. On-premise?
Do we get an “All-in-one Data Platform” or a Modern Data Stack?
Build vs. buy?
What is our security strategy?
Is Any Migration Needed?
How to migrate?
Centralised (Classic / Data Fabric) vs. Decentralised (Data Mesh)?
Are we implementing Data Products?
Data Processing:
Batch vs. Real Time?
Real Time Processing: Lambda vs. Kappa Architectures?
Data Lakehouse vs. Data Warehouses/Analytical Databases vs. Transactional Databases?
Do we need Data Workflow Orchestration Software?
ETL vs ELT
Do we need Data Integration/ETL Software?
Do we need Reverse ETL?
What Data Model are we implementing?
Data Analytics:
Are we doing Data Science?
A dozen more Data Science questions we should ask here…
Do we need Business Intelligence Software?
Self Service vs Pre-built Reports?
Do we need a Metrics Store?
Do we need Data Quality / Reliability / Observability Software?
Are we implementing Data Contracts?
What is our Data Governance Strategy?
Do we need Data Catalog / Data Discovery Software?
What is our Testing Strategy?
What is our DevOps/DataOps Strategy?
How do we connect to our source datasets?
How do we deliver the Data Platform?
What is our Disaster Recovery Strategy?
What is our Support Strategy?
Note that there is also a whole set of product decisions (“What Public Cloud Platform do I pick?”) to answer, which I left out for brevity. I also left out anything to do with Data Strategy, as I feel that’s a whole other set of questions and a whole other guide to read.
Yes, there are a lot of questions to be answered, but they do not need to be answered all at once, in order, or even at all. If all you need is a Data Warehouse/Lakehouse, then just build that.
Architecture, like most processes in Software Engineering, should be an iterative, agile process where you design and document enough to be able to start building and then redesign based on what you’ve learned.
How Do I Capture Decisions?
This is a painful question to answer, because it requires a fair amount of documentation to explain your decisions. It can feel like a frustrating waste of time, but when you’re spending hundreds of thousands or even millions of pounds/dollars of an organisation's money, you need to show you are spending it responsibly and efficiently.
Some organisations have a long Software Design Document to fill out or ask that you follow a large framework like TOGAF, especially if you are designing for a regulated industry. There are more agile practices taking hold, such as Architectural Decision Records (ADRs), which focus on:
The context of the decision
The decision made
Why you made the decision
The value added by the decision
The alternatives you researched and rejected
We’ve already seen Amazon, Google, Red Hat and Spotify each promote the use of ADRs.
I personally enjoy reading, writing and reviewing ADRs, as they are split into bite-sized one- or two-pagers that are easier to write and read than a 200-page document tackled in one go.
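To make the five ADR points above concrete, here is a minimal sketch of one way to hold an ADR as structured data and render it as a short page. It is only an illustration: the class name `DecisionRecord`, its field names, the output layout and the example content are all assumptions made for this sketch, not a standard ADR template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecisionRecord:
    """One ADR, with a field for each of the five points listed above.

    The class and field names are hypothetical, chosen for this sketch.
    """
    title: str
    context: str
    decision: str
    rationale: str
    value_added: str
    alternatives_rejected: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the record as a short plain-text page."""
        lines = [
            self.title,
            "",
            "Context: " + self.context,
            "Decision: " + self.decision,
            "Why: " + self.rationale,
            "Value added: " + self.value_added,
            "Alternatives researched and rejected:",
        ]
        if self.alternatives_rejected:
            lines.extend("  - " + alt for alt in self.alternatives_rejected)
        else:
            lines.append("  - (none recorded)")
        return "\n".join(lines)


# Invented example content, for illustration only.
example = DecisionRecord(
    title="ADR 001: Batch vs. Real Time",
    context="Finance reports are refreshed once a day.",
    decision="Use daily batch processing.",
    rationale="No current requirement needs data fresher than one day.",
    value_added="Lower running cost and a simpler platform to support.",
    alternatives_rejected=["Lambda architecture", "Kappa architecture"],
)

print(example.render())
```

Keeping each record this small is what makes ADRs quick to write and review; if a new requirement later changes the answer, you write a new record rather than rewriting a large design document.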
Go with the Flow?
When architecting in a large enterprise, it can be tempting to break the rules and ignore the existing organisational design process, as it seems like you’re just wasting money and time jumping through pointless hoops. From painful experience, though, don’t try this unless authorised to, as you’ll quickly find your IT and security stakeholders have a poor opinion of you and are less willing to help you deploy your solution.
That said, if you think you have a strong argument to skip or alter part of the process, then feel free to ask: I’ve found design review boards to be more flexible in reality than what they document.
Though generally, work with IT and security teams rather than against them, read all their relevant documentation and try to get them on board with your designs as early as possible.
At Scale, Build Patterns and Services
If working at the enterprise level of a large data team, design reusable architectural patterns and services that teams can adopt and build themselves, rather than one team becoming the gatekeeper and building everything.
The last point is important at scale: if one team has to build everything, there is a very real danger that data becomes trapped at source because the business domains can’t get the data insights they want while they are waiting for the central data team to build them. This isn’t so much a problem in small data teams, but as you scale, you may see your central team backlog balloon out of control and shadow IT emerge as the business gives up on the central data team.
Do remember, though, that these patterns and services are best-practice guidelines; making them too rigid and mandatory can get in the way of delivering the most business value.
Summary
And that’s my overview of Architecting a Data Platform. Apologies if I missed anything out, as I am again trying to cover in an article a subject that has had hundreds of books written on it!
Next, I’ll dive into some of the technical decisions you need to make, the first of which is Cloud vs. On-Premise, followed by Centralised (Classic, Fabric) vs. Decentralised (Data Mesh) Data Platform Architectures.
Sponsored by The Oakland Group, a full service data consultancy.