Hi all, this week we have:
How Self-Service Analytics Impact Data Platforms
Data Pipeline Orchestrators - The Emerging Force in the Modern Data Stack (MDS)?
When to Build or Kill Your Data Product Ideas
From Data Platform to ML Platform
dbt Shows Off New Features For its Cloud Service
How Self-Service Analytics Impact Data Platforms
I believe most decisions made in data should sit on a continuum between two or more options, or land on a compromise, rather than picking a binary extreme.
And I think Self-Service vs. Guided Analytics (pre-built reports) is one of those examples, as in a sufficiently large organisation, you are going to see a mix of both approaches.
Now I’m not going to make a typical comparison of the two, as there are already a number of great articles on that. Besides, comparison articles can’t cover all the numerous use cases and personas in analytics (see the appendix below).
The trick is to find the right balance for your organisation based on your current analytics use cases. This balance will differ in each organisation and, most importantly, differ over time as your organisation becomes more data-informed, new technology comes on the scene, and data analytics use cases change.
Self-Service is Technically Everywhere
While some dream of 100% guided analytics, since it’s fully governed and requires less data literacy training and documentation, the reality is that there is always a need for some custom ad-hoc analysis.
I also believe pretty much every organisation is doing self-service analytics; it’s just that not all of them are aware of it. If you’re using Excel for analytics, you’re doing self-service analytics!
And with Excel, you are doing ungoverned self-service analytics, as Excel doesn’t have much in the way of query logging, data lineage and version control features out of the box.
But self-service done right requires more upfront costs than guided analytics: you need the right tools, governance, and training in place so self-service doesn’t turn into a report-dumping ground.
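To make "governed self-service" a bit more concrete, here's a minimal Python sketch of the query logging that Excel lacks out of the box. Everything here is illustrative (made-up table, user and function names, with a throwaway in-memory SQLite database standing in for a warehouse): each ad-hoc query is recorded with who ran it, when, and the SQL text, giving you an audit trail and the raw material for lineage.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative sketch only: a thin wrapper that records every ad-hoc
# query before running it -- the kind of logging Excel doesn't give you.
AUDIT_LOG = []  # a real platform would persist this to a durable store


def run_governed_query(conn, sql, user):
    """Run an ad-hoc query, recording who ran what and when."""
    AUDIT_LOG.append({
        "user": user,
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "sql": sql,
    })
    return conn.execute(sql).fetchall()


# Demo with a throwaway in-memory database and made-up sales data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])

rows = run_governed_query(
    conn, "SELECT region, SUM(amount) FROM sales GROUP BY region",
    user="analyst1")
print(rows)                  # per-region totals
print(AUDIT_LOG[0]["user"])  # the audit trail knows who asked
```

The point isn't the wrapper itself, it's that governance tooling sits between the analyst and the data, which is exactly the upfront cost a spreadsheet workflow skips.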
But What Does This Have To Do With Data Platforms?
As I mentioned in my guide, a data platform should ideally be designed backwards from its analytical use cases, so we know the right tools to choose for our platform and how best to use them. How you do analytics can have a massive impact on your data platform.
Increased self-service use cases will impact platform design in a number of ways:
Modelling: you will need a more adaptable core data model
Performance and Cost: self-service modelling will probably put more strain on your warehouse or lakehouse. Rather than relying on static indexes and partitions, you may want to look at more dynamic alternatives based on recent usage, such as those offered by Starburst, Databricks and Snowflake.
Data Governance: will be harder with self-service, so more platform processes and tooling are required to manage it.
Tooling choices: low-code and no-code tooling can increase self-service, but may come with a lack of governance features like version control.
Summary
The above considerations might put you off self-service, but as I mentioned before, it’s nearly impossible to prevent some self-service from occurring, so it’s best to plan to govern it. Plus, it can give you business benefits such as a lower time to insight.
Likewise, I can’t see a fully self-service approach working for most organisations, as it will simply cost too much to train everyone to be a data analytics wizard, even if all staff want to be one (unlikely).
So it’s about finding the right balance for you right now and building the best data platform for that balance.
Appendix: Analytics Personas
While this didn’t fit well into my post above, I wanted to give you an idea of the range of use cases that fall under guided analytics and self-service by looking at the types of end users in analytics:
Report viewers:
Require relatively lower data literacy training.
Very much a guided analytics use case.
Doesn’t have to be static reports, though the more filters you add to a report, the more training it requires.
Report Builders:
Now we’re into the domain of self-service analytics: building ad-hoc experiment reports or production-ready reports for report viewers.
Usually exclusively using BI tools such as Power BI or Tableau.
Should be trained in the data tooling they use, as well as data literacy.
Report Modellers:
Still often using BI tools, but also likely comfortable using data warehouses and lakehouses, with knowledge of SQL.
Will also often build the reports too.
They require a high level of data literacy: at least the basics of data architecture.
Likely data specialists with some domain expertise.
Analytics Engineers:
Designing their own data pipelines with tooling like dbt.
Very quick time to insights, but can be hard to govern at scale.
Likely data specialists with high data literacy.
Data Scientists:
Will often require access to conformed and even raw data.
Will likely still need training on how best to communicate their AI and ML models to non-scientists.
May design models that can be accessed in a self-service manner for investigation.
Data Pipeline Orchestrators - The Emerging Force in the Modern Data Stack (MDS)?
Timo Dechau, Chief Content Maker at Deepskydata, often writes great posts, and this is another: pointing out that Data Pipeline Orchestrators offer a common layer across various Data Platform / Modern Data Stack tooling, similar to a fully integrated data platform.
He also does a deep-dive comparison of various Data Pipeline Orchestrators and wonders if they’ll still exist in 10 years’ time.
When to Build or Kill Your Data Product Ideas
There are lots of posts on LinkedIn telling you that your analytics “must have business value”, which I agree with, but how do you decide? Stanislav Dmitriev, Head of Marketing at ellie.ai, gives a nice intro to figuring out which data product ideas should be built and which should be killed.
From Data Platform to ML Platform
This is a great (and accidental!) companion article to my article on why data foundations matter in AI and ML, as this looks at how to start with the data foundations of a typical data platform and add AI and ML functionality with a deep dive into MLOps.
Written by Ming Gao, Tech Lead Manager at ByteDance (TikTok).
Don’t Use DISTINCT as a “Join-Fixer”
A great deep dive into the performance impact of the DISTINCT SQL operator by Aaron Bertrand, Staff Database Reliability Engineer at Stack Overflow.
However, Snowflake processes DISTINCT differently, so this advice doesn’t apply to all databases!
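For anyone who hasn't seen the anti-pattern the article targets, here's a small self-contained sketch (Python with sqlite3 and made-up customer/order tables, so all names are illustrative): a join used purely as a filter duplicates rows, and DISTINCT is then bolted on to hide the duplicates, whereas EXISTS asks the intended question without producing duplicates in the first place.

```python
import sqlite3

# Throwaway in-memory database with illustrative tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Anti-pattern: the join multiplies rows (Ada has two orders),
# and DISTINCT is used afterwards to de-duplicate the damage.
distinct_rows = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
""").fetchall()

# Better: EXISTS asks the real question ("at least one order?")
# and never produces duplicates, so nothing needs de-duplicating.
exists_rows = conn.execute("""
    SELECT c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

# Both return the same set of customers; only the second avoids
# generating and then discarding duplicate rows.
print(sorted(distinct_rows))
print(sorted(exists_rows))
```

The performance characteristics differ by engine (hence the Snowflake caveat above), but the semantic point holds everywhere: if a join exists only to filter, express it as a semi-join rather than de-duplicating after the fact.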
dbt Shows Off New Features For its Cloud Service
While I love open-source dbt (or “dbt Core”), I’ve always struggled to see much value in dbt Cloud, especially at its current pricing. For example, I don’t think most organisations need a semantic layer, as they likely already have one in their Business Intelligence (BI) applications, and its orchestration features are limited compared to best-in-class orchestration libraries.
But dbt mesh and dbt explorer sound interesting and could add value to organisations that have hundreds or even thousands of models across many data products.
Though note, there is already work in the open-source community to replicate some of dbt mesh’s features with dbt loom; Data Engineer Christophe Blefari shows off a working example of it.
Sponsored by The Oakland Group, a full-service data consultancy. Download our guide or contact us if you want to find out more about how we build data platforms!
Cover Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash