SQL - Using CTE's for DRY Queries

2023-12-24

Recently in a work project, I wanted to ensure that despite moving languages, a colleague would still get the opportunity to develop their technical skills. One area I really value in experienced, seasoned engineers is SQL. Most of us use it, and it saves us re-discovering problems folks have solved since the 70’s and have iterated on for decades.

We had an ambitious timeline and volume of data; with what seemed like some common rules; hey, I think I know something that might work for that… CTE’s!

None of our endpoints did the exact same thing, but if you squinted, you could see common parts. I think it’s likely this would be true of many systems.

Let’s consider that you are in a system, where you can access records, only if they belong to your user.


SELECT * FROM record WHERE created_by = ?

We all know this simplistic SQL “won’t scale”. You might want to add complex pagination or filtering logic to this query, so how, besides manually doing that, in multiple places, can we achieve this?

Enter the CTE’s

CTE’s are a way of writing SQL, so that your “main queries”, can use dependent queries, without resorting to custom engine extensions, or user-defined functions. More than this, they also are able to be used, without re-evaluation, multiple times.

If I were to define a more complex CTE as (MySQL syntax):


WITH
my_records AS (
	SELECT * FROM record WHERE created_by = ?
)
SELECT * FROM my_records
WHERE created_at BETWEEN ? AND ?

Here my specific endpoint, can use the initial query, and only needs to remember to pass the parameters it needs.

If I house the SQL for my CTE within a function, I can ensure I have the arguments I need, using static analysis for inputs.

More complex logic

Now your system iterates. No longer can one user own a record. The companies buying your product want more! Multiple people might have read or write access to specific records, but you still need to search and filter for only records they have access to.

Firstly, we might select teams that our user belongs to. Then because in our fictional model, multiple teams have access to a single document, or record, we'll also query a joining table.


WITH
my_teams AS (
	SELECT team.id AS team_id, team.name AS team_name 
	FROM teams
	JOIN user_team ON user_id = ?
)
records_i_have_access_to AS (
	SELECT
		record.*,
		r_own.id AS table_ownership_id,
		t.*,
		r_own.access AS team_access
	FROM record
	JOIN record_ownership r_own ON r_own.record_id = table.id 
	JOIN my_teams t ON t.team_id = t_own.team_id
)
SELECT * FROM records_i_have_access_to
WHERE access = ?

It’s easy to see from here how you could add more complex access logic, I hope. Perhaps a change might be that a team a single member is responsible for approving edits per-team. Perhaps records need an external approver, so you add more relationships.

Note that the query in these examples, as well as the DDL are contrived. I am unable to share work code or live client datasets.

Sidebar: there are few free lunches

One of the difficulties with DRY SQL, is that your ORM framework, may need to be traded for a lower-level querying tool. I would not suggest for a moment that you smash together strings; please note my use of `?` meant to denote a SQL parameter, which you also might recognise as beginning with a \$[0-9]+ syntax

In the project we were working on, SQLAlchemy, sort-of supports CTE’s; so long as you do not fetch results or call execute on a Query object, you can call .cte(‘some-name’) on the query, or call .subquery(‘some-name’); BUT you also sacrifice type safety and occasionally have to carry the mental burden of knowing which columns are available. A Solid naming convention and using functions, with strongly typed params helps us here, but it’s not been without issue.

An example of where this helped recently

Before breaking up for Christmas this year, an internal collaborator (I’m an external collaborator) was looking at a query, literally causing 5XX errors in an API.

By using CTE’s I was able to analyse parts of a 5-table join, to ensure that we had indexes, and I never had to understand, more than a minimal piece of the query at a time.

By working at this for about half-a-day, by the end of the day, I had communicated to them and their team the issue, and some indexes which I’d tested on a similar (scale) database. I was able to communicate both query-plan, and concrete timings for improvements, without harming their system, and confidently execute in their environment.

We’d reduced a query from over an hour, perhaps several (it had not completed since the day prior) to just a few seconds.

The query could now be run by an async, non-request-blocking part of the code.

We not had to rip out the particular SQL engine or perform any costly or complex system migrations to access the performance characteristics we needed yet.

This was in a dataset of hundreds of millions of records.

I Hope this helps. What might you use a CTE for?

By Lewis Cowles