Firstly, an advert: I’m doing a related talk on “Past, Present and Future of Data Engineering“ in Leeds UK on April 18th, so if you’re in the area, it would be great to see you there!
I know, I know, another “you can’t kill SQL; SQL will be around for another 50 years!” post. But while researching the history of Data Engineering for a talk (see above), I was amused by how many times people have tried to replace SQL and relational databases over their history, and I had to write a blog post about it.
And yes, I know I’ve used SQL and relational databases interchangeably here, despite them being two different things, but in my opinion the two are closely connected.
QUEL
Interestingly, relational data modelling was created a few years before SQL, in 1970, by Edgar Codd, and SQL had some stiff competition from QUEL*, which was invented by the same person (Michael Stonebraker) who went on to help create PostgreSQL.
QUEL had its own pros and cons, listed here, and could be argued to be the better language, one that was also more suited to Codd’s vision. But SQL received major backing from IBM and Oracle, so QUEL was always fighting an uphill battle.
I also quite like QUEL’s copy syntax, which I think appeared years before an equivalent arrived in SQL:
copy student(name=c0, comma=d1, age=c0, comma=d1, sex=c0, comma=d1, address=c0, nl=d1)
into "/student.txt"
*The SQL name was apparently a pun on QUEL: “SEQUEL”.
Object-Oriented Databases
Back in the 1980s, when Object-Oriented Programming (OOP) was the coolest kid on the block, there was an attempt to bring databases into the realm of OOP so programmers didn’t need to learn more than one language. But it became a niche technology, arguably because relational models are more robust for transactional updates*. They sort of still live on in Oracle databases as SQL Object Types and a few niche vendors.
*Note: since it all happened in the 1980s, it was hard to research on the internet exactly why Object-Oriented Databases did not take off…
Document Databases
In the 2000s, the use of semi-structured data exploded, and for a period of time, querying such data in SQL was a massive pain, giving rise to analytics usage with Document Databases such as MongoDB with MQL.
But SQL databases added a JSON type, which made querying semi-structured data easier, and data ingestion tools like Fivetran made loading third-party APIs into relational tables easier too. ElasticSearch also remains popular in some analytics niches, such as searching logs.
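As an illustration of the kind of JSON querying modern relational databases offer, here’s a minimal sketch using Python’s built-in sqlite3 module (SQLite has shipped its JSON functions by default since version 3.38; exact syntax varies by database, and the table and fields here are made up for the example):

```python
# Sketch: querying semi-structured JSON inside a relational database,
# using Python's built-in sqlite3. The 'events' table and payload shape
# are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"user": "alice", "action": "login", "meta": {"ip": "10.0.0.1"}}',),
)

# json_extract pulls values out of a JSON document with a path expression,
# so semi-structured data can be filtered and projected like ordinary columns.
row = conn.execute(
    """
    SELECT json_extract(payload, '$.user')    AS user,
           json_extract(payload, '$.meta.ip') AS ip
    FROM events
    WHERE json_extract(payload, '$.action') = 'login'
    """
).fetchone()
print(row)  # ('alice', '10.0.0.1')
```

Other databases spell this differently (PostgreSQL has `->>` and `jsonb_path_query`, Snowflake has colon-path syntax), but the idea is the same: the JSON lives in a column and is queried in plain SQL.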
Graph Database Languages (Cypher, Gremlin, etc.)
Graph databases have the sales pitch of beating SQL and relational databases on performance and code conciseness when queries involve many joins (say, 5 to 10). Though I’d argue they’ve stayed fairly niche, as it can be easier to remodel your data to do fewer joins than to adopt a whole new database.
-- SQL
SELECT actors.name
FROM actors
JOIN acted_in ON acted_in.actor_id = actors.id
JOIN movies ON movies.id = acted_in.movie_id
WHERE movies.title = 'The Matrix'
-- Cypher
MATCH (actor:Actor)-[:ACTED_IN]->(movie:Movie {title: 'The Matrix'})
RETURN actor.name
Also, the SQL standard now supports graph querying (SQL/PGQ, added in SQL:2023), which could limit graph database usage even more for data analytics. There has even been serious debate recently about whether graph databases have any point at all.
Java
Using Java and/or Scala for large data transformations (Hadoop) was quite popular in data analytics in the 2000s, but Hadoop is now dead, and a lot of Hadoop usage ended up going through SQL anyway via tools such as Hive (which has survived much better than the rest of the Hadoop ecosystem, partially due to being, well, SQL).
Java has had more success in real-time streaming and remains a popular language for that use case, but we are also now increasingly seeing SQL targeted as a primary language for new streaming products.
Why? Java is verbose and difficult to learn quickly unless you’re an experienced, expensive programmer, which makes it a poor language for ad-hoc queries. I can still see use cases for it in large, complex streaming transformations, though.
Python
This is the most painful one for me, as five years ago I built everything with pySpark and thought I only needed SQL for ad-hoc queries now and again. I really liked the idea of using one language for Infrastructure, Business Intelligence, streaming, and Data Science. After some bitter experiences, that reality has proved harder to achieve.
One problem with Python is maintainability at large scale, which requires experienced, expensive teams of developers to manage well (though, to be fair, you can argue the same about SQL). Another, bigger problem is migration and re-training: most new data platforms involve some kind of migration from a mostly SQL-based platform, and it’s easier to move between dialects of SQL than to a different language.
Python still has a role in data (we are using pySpark for a few clients right now), especially in Data Science, though even there you can find SQL extensions that allow limited ML/AI capability.
I talk more in depth about Python and SQL here.
Low Code
While low-code has recently been increasingly successful in some areas of Data Engineering and Analytics, such as Orchestration, Data Visualisation (Power BI/Tableau) and Data Ingestion, I think low-code data transformations at scale are a poor idea: lock-in, an often poor debugging experience, DataOps as an afterthought, and a more limited community than SQL. That said, I can still see use cases for simple transformations in small, low-maturity data teams.
LLMs(?)
This is more of a guess than a history lesson, as many companies are betting right now on querying via natural language with LLMs. But at the moment, scientific papers suggest querying structured data with LLMs is only 30 to 70% accurate, and that’s only if you attach an expensive Knowledge Graph / Semantic Model to it.
This doesn’t even take into account asking the right questions to solve a business problem, which often requires years of experience in requirements analysis and in-depth knowledge of your domain to be good at.
If AI does get good enough to answer vague business questions with near 100% accuracy, then it is also likely clever enough to take over the world, at which point we'll have bigger fish to fry. I still think LLMs have many great use cases, such as data discovery, code generation, and debugging, but I currently see LLMs more as a tool to enhance my SQL coding process than as a replacement.
So why do engineers keep trying to replace SQL?
SQL has become very close to a “universal data language”, and it serves four quite different use cases across many different domains:
To quickly perform ad-hoc queries on data (DQL)
To transform data as part of a data pipeline (DML)
To define data models (DDL)
To administer data access (DCL)
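To make those four roles concrete, here’s a minimal sketch using Python’s built-in sqlite3 (the table is invented for the example; SQLite doesn’t support DCL statements like GRANT, so that one is shown as a comment in PostgreSQL-style syntax):

```python
# A minimal sketch of SQL's four statement categories, run via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the data model
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# DML: transform/load data, e.g. as a pipeline step
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])

# DQL: ad-hoc querying
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 35.5

# DCL: administer access -- not supported by SQLite, but in e.g. PostgreSQL:
#   GRANT SELECT ON orders TO analyst;
```

One language covering modelling, loading, querying and access control is a big part of why SQL is so sticky.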
I think SQL does all four tasks well enough, but learning SQL can feel like a lot of effort for non-engineers who just want ad-hoc querying, and certain complex and/or dynamic problems are arguably easier to solve in an imperative language like Python or Java.
For the first problem, low code can sometimes be better suited to low-maturity organisations, and there are some SQL-like languages (Malloy, PRQL) that can fix the second problem to some extent, though given SQL’s history, it’s more likely SQL will absorb the best parts of these languages than be replaced by them.
What I’m getting at is that SQL has become a “good enough” language for a high percentage of data tasks. There will always be edge cases where other languages make better sense; the important point is to recognise those edge cases when they come up, so you’re not hammering a square peg into a round hole.
The same goes for relational data modelling, which I talk more about in detail in my data modelling article.
Cover Photo by GR Stocks on Unsplash
Sponsored by The Oakland Group, a full service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms!