This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization". There is no difference in effort between writing a join and doing the ORM double round-trip in the vast majority of cases. People are so afraid of joins that I see them collecting ids with a subselect instead, because "joins are slow". The worst is usually some kind of pseudo-join followed by an aggregate or filtering in the application code. It drives me up the wall when I see it in code review, usually because I get into some argument about "joins are slow" (with no evidence) and then I have to go rewrite the query, and maybe add an index, to show that, yes - an aggregate that takes seconds and a ton of memory in application code can in fact take milliseconds in the database.
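To make that concrete, here's the kind of thing I mean, sketched against a made-up orders(id, customer_id, amount) table - one GROUP BY instead of fetching every row and summing in a loop:

-- instead of SELECT * and summing per customer in application code:
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;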
The NoSQL people have really done a lot of brain damage to this industry.
It's so pervasive that I've started using this kind of question in our technical interviews; doing a double round-trip ends the interview for anyone above junior level.
Here is some actionable advice for speeding up a join with indexes. You'll need one on each table, covering the join columns. For example:
SELECT ...
FROM table_A
JOIN table_B ON table_A.column_A1 = table_B.column_B1
AND table_A.column_A2 = table_B.column_B2
You can add indexes like this:
- table_A(column_A1, column_A2)
- table_B(column_B1, column_B2)
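In SQL that would look something like this (the index names are just illustrative):

CREATE INDEX idx_a_join ON table_A (column_A1, column_A2);
CREATE INDEX idx_b_join ON table_B (column_B1, column_B2);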
If both tables are large enough, this query can probably take advantage of those indexes to perform a merge join.
Postgres tip: you can also add columns to the INCLUDE part of the index to speed up filters in the WHERE conditions. You might even get an index-only scan! Look into covering indexes to learn more about it.
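For example, something along these lines (Postgres 11+; some_filter_column is a made-up column you'd be filtering on in the WHERE clause):

-- the INCLUDE'd column is stored in the leaf pages but is not part of the key
CREATE INDEX idx_a_covering ON table_A (column_A1, column_A2) INCLUDE (some_filter_column);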
Not related, but at my job one of our queries just stopped working because the data got too big. We use LEFT JOINs everywhere because we don't want to lose the main table's data. Do you think the trick you just mentioned could optimize our LEFT JOIN as well?
Most likely, yes.
Composite indexes are really cool, because if the database can satisfy a join from the index alone, it can usually perform it completely in memory instead of doing a table scan and hitting the disk for the values.
Another trick people usually shy away from: temporary tables.
It might seem slow and wasteful to create a table (plus indices) just to store results for a fraction of a second, but for very large tables with large indices, extracting a smaller slice of the base table, indexing it, and joining against that can be orders of magnitude faster!
There is even dedicated syntax for that: CREATE TEMPORARY TABLE. Temporary tables are local to the connection and get dropped automatically at the end of the SQL session.
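A minimal sketch of the pattern, with made-up table and column names:

-- extract the small, relevant slice of a huge table
CREATE TEMPORARY TABLE recent_orders AS
SELECT id, customer_id, amount
FROM orders
WHERE created_at > now() - interval '7 days';

-- an index on the temp table is cheap to build because the table is small
CREATE INDEX idx_recent_orders_customer ON recent_orders (customer_id);

-- join against the much smaller temp table instead of the full orders table
SELECT c.name, SUM(r.amount) AS total
FROM customers c
JOIN recent_orders r ON r.customer_id = c.id
GROUP BY c.name;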
They are also great for storing the results of (non-correlated) subqueries, because for large result sets not every database is able to find the proper optimizations - MySQL versions < 8, for example.
I really recommend trying that one. So far I've been able to fix every "query takes too long" problem that resisted other solutions that way.
I suppose so. I would just check that these tables are vacuumed often (preferably with autovacuum); otherwise the query planner might decide not to use the index, depending on the state of the visibility map.
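If you're on Postgres, pg_stat_user_tables shows when (auto)vacuum last ran, e.g. for the tables from the example above:

SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('table_a', 'table_b');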
Agreed, partial indexes were a game changer for me. After that, window functions opened up a whole new world of awesomeness. Nothing has been the same since I figured out how to use them effectively. One of the biggest eureka moments of my entire career.
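For anyone who hasn't tried them, a tiny illustrative sketch (made-up orders table): latest order per customer, no self-join needed:

SELECT *
FROM (
  SELECT id, customer_id, amount,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
  FROM orders
) latest
WHERE rn = 1;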
They can be used to speed up queries that have WHERE clauses, so I can see how that might have caused some confusion, since partial indexes have WHERE clauses in their definition.
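For comparison, a partial index looks something like this (made-up status column) - here the WHERE clause is part of the index definition, so only matching rows get indexed and the index stays small:

CREATE INDEX idx_orders_pending ON orders (customer_id) WHERE status = 'pending';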
Interesting, I've never seen INCLUDE before - probably because we use a version of Postgres that does not support it - but I see now that it can be useful compared to a plain CREATE INDEX idx ON table (colA, colB), because the INCLUDE'd columns don't have to participate in the b-tree ordering. Good stuff. Just another piece of ammunition for upgrading our db to a newer version with all sorts of new goodies I've been reading about lately.
> This article speaks to me. So many times I have needed to go back and fix queries that were naively written this way like it was some kind of "optimization"
In some cases, doing joins in the application is more performant than making the database do it. It's usually better to do it with a join, but depending on the data you're joining you might incur significant slowdowns.
It's always better to start with the join and only evaluate an application-side join if there is a need to improve performance, however. Nonetheless, a sweeping statement like yours doesn't help either.