Hacker News
Ways to Tweak Slow SQL Queries (helenanderson.co.nz)
281 points by helenanders26 on Sept 1, 2019 | 62 comments


Heed the caution about CTEs being optimization fences. I had a case just this past week, where one of our engineers needed help with a query that used a LEFT JOIN against a CTE which generated text-search vectors for some subset of rows, in some large table. The query was taking 9-10 minutes, cache-warmed.

Used that way — an outer join against a CTE — the planner was forced to generate a plan that produced a row for every candidate row in the table being used as input for the text-search, whether or not it would be needed, visiting nearly all of its pages.

Just by moving the CTE inline (that is: "LEFT JOIN ( #{ subquery } ) AS blah"), the query completed in 17ms.

This way, the planner could apply conditions from the rest of the query to the subplan from which text-search vector rows were being produced, such that only pages containing rows it already knew it would care about would even be retrieved.
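
Schematically, the change looked something like this (table and column names invented for illustration; the real query was more involved):

    -- Before: LEFT JOIN against a CTE; the whole CTE gets materialized
    WITH search_vectors AS (
        SELECT id, to_tsvector('english', body) AS vec
        FROM documents
        WHERE doc_type = 'article'
    )
    SELECT d.id, d.title, sv.vec
    FROM documents d
    LEFT JOIN search_vectors sv ON sv.id = d.id
    WHERE d.owner_id = 42;

    -- After: same subquery inlined, so the planner can drive it from the
    -- rows that already match d.owner_id = 42
    SELECT d.id, d.title, sv.vec
    FROM documents d
    LEFT JOIN (
        SELECT id, to_tsvector('english', body) AS vec
        FROM documents
        WHERE doc_type = 'article'
    ) AS sv ON sv.id = d.id
    WHERE d.owner_id = 42;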

The rest of this article is pretty on-point, too.

Source: This stuff is my day job.

EDIT: As a counter-point, because they are incredibly useful, I've also had countless cases where rewriting a query to use CTEs was a several-orders-of-magnitude win. This also wasn't the only possible fix; the qualifying conditions in the CTE could have been improved, eliminating the extra work where it would have occurred.

Like all things computers, the real answer is, "it depends..."


On Postgres 12, CTEs are not optimization fences anymore. [0]

[0] https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...


Not quite true. The default behavior changes, and new syntax is added allowing users to specify their desired behavior, but if the sub-query in the CTE does certain things, it will still be materialized instead, regardless.

That is: they mostly aren't a fence any more, but can be when you want them to, and sometimes still must be.
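
For reference, the new syntax looks like this (illustrative query):

    -- Keep the pre-12 behaviour: compute the CTE once, as a fence
    WITH w AS MATERIALIZED (
        SELECT * FROM big_table WHERE flag
    )
    SELECT * FROM w WHERE key = 123;

    -- Allow inlining, so outer predicates can be pushed into the CTE
    WITH w AS NOT MATERIALIZED (
        SELECT * FROM big_table WHERE flag
    )
    SELECT * FROM w WHERE key = 123;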


Like all things computers, the real answer is, "it depends..."

This is what makes the job exist. If it was algorithmic we'd already have been replaced by EXPLAIN


Each of the ~5000 stored procedures I had to strip out from a client’s large MySQL cluster agrees with this statement.


Query planners are like compilers: sophisticated and sometimes surprising, but not perfect. Writing SQL is just like writing any other code: optimize for an easy-to-understand query first, and for performance if needed (and always comment your performance optimizations)!

They are also like just-in-time hotspot optimizations: the query planner keeps statistics, and the second run of a query can be 10x or more faster than the first run. (Side note: most analytics queries in my experience are run against cold data, so it's usually worth optimizing for first-query performance.) This stuff is also my day job, and fun!


I think I read somewhere that they are fixing some of the optimization issues for CTEs in the next release of Postgres. Does anyone know whether that's true or not?


Are you thinking of the "[NOT] MATERIALIZED" clause?

Yes, that's coming.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...


Thanks for the info! I am a big fan of the readability of CTEs. Though I guess in production code we use SQLAlchemy, so it would just be a matter of swapping `.cte()` for `.subquery()` probably...


IIRC Postgres 12 is supposed to remove most of the optimization fences around CTEs.


The best strategy isn't mentioned: learning how to read EXPLAIN output and acting on it.

> Using wildcards at the start and end of a LIKE will slow queries down

That's only true if you don't have the right indexes in place. A wildcard at the end can use a regular (btree) index. Even a LIKE query with wildcards in other places can be fast with a trigram index.

https://niallburkley.com/blog/index-columns-for-like-in-post...
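
For instance (hypothetical table; the trigram variant assumes the pg_trgm extension is installed):

    -- A plain btree index can serve prefix searches: name LIKE 'ander%'
    CREATE INDEX users_name_idx ON users (name);

    -- A trigram index can serve infix searches: name LIKE '%ander%'
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX users_name_trgm_idx ON users USING gin (name gin_trgm_ops);

    SELECT * FROM users WHERE name LIKE '%ander%';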


> A wildcard at the end can use a regular (btree) index.

Depending on the locale your DB is in, it might be a little less "regular". I recently fell into that trap.

http://blog.cleverelephant.ca/2016/08/pgsql-text-pattern-ops...
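
The trap, roughly: with a non-"C" collation a plain btree index won't be used for LIKE 'abc%' at all, and you need the pattern operator class instead (sketch, hypothetical table):

    -- Lets LIKE 'ander%' use the index even under a non-"C" collation
    CREATE INDEX users_name_pattern_idx ON users (name text_pattern_ops);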


"Pull out the names of only the columns you need instead of using SELECT * to speed things up further."

This is not performance-related in most cases. Unless the bottleneck is the amount of data being transferred to a client over a network connection and there is a large number of columns, or you have an index which matches the limited column list query exactly, there will be no performance difference between SELECT * and SELECT <column list>.

In columnar DBs it does matter, because the fewer columns you select, the less data gets accessed on disk. However, this does not hold true for row stores, because data is stored in such a way that the whole row gets accessed no matter how many columns are specified in the query.

There are many other good reasons why SELECT * is acceptable only in development queries, but performance is not one of them.


Some joins may be completely eliminated if the fields coming from them are omitted from the SELECT list. This happens when the DBMS can prove (through FOREIGN KEYS) that the joined row always exists, without having to ever physically retrieve that row.

This is especially important when querying views (or inline functions, if your DBMS supports them) built on top of several other layers of views.
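
A minimal sketch of the idea (hypothetical tables; which joins can actually be removed varies by DBMS, e.g. PostgreSQL currently only removes certain LEFT JOINs, while SQL Server can also drop inner joins backed by trusted foreign keys):

    -- orders.customer_id is NOT NULL and references customers(id)
    CREATE VIEW order_details AS
    SELECT o.id, o.total, c.name AS customer_name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;

    -- No customer columns requested: the join to customers may be eliminated
    SELECT id, total FROM order_details;

    -- customer_name requested: the join has to be performed
    SELECT id, total, customer_name FROM order_details;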


In terms of performance, the primary rationale for only retrieving columns you need would be to allow the database engine to select a proper index to use, specifically, one that covers all the columns you are making use of. This can actually have a massive impact on the execution plan, far more than simply fetching say 2x more data, as you point out.

In fact, I'd go so far as to say that the practice of selecting all columns makes it practically impossible to make use of covering indexes for high impact queries.


It's not about choosing the right index. The point here is that the database can only read the index (without going into the table) if all the columns you need and use in the where clause are in the index. Usually index segments are smaller, so it could be faster. I am talking from the Oracle perspective, but I believe PostgreSQL works in a similar way.


Yes, of course, indexes work like that with where clauses. However here I was referring to covering indexes, where the columns in your _select_ are also present (after the ones in the where clause / join / predicate). This prevents having to do extra seeks into the clustered index to fetch that data, and can make very significant differences, especially if row counts are high.

Clearly, this type of index is both expensive to maintain, and the benefits are lessened if you include every column in a table in the covering index. So, to actually have real benefits, you need to have carefully crafted queries which only select what they need, and these being matched to carefully crafted covering indexes. Basically, you cannot have these - which are one of the better weapons you have for performance - if you always select everything.


I am talking about INDEX FAST FULL SCAN (Oracle) for example. If you do 'select distinct last_name from users;' and you have an index on users (last_name) you do not need to read the table at all. It is enough to read the index.

Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.


> Otherwise it does not matter how many columns you select, because you will have to read all the columns for every row anyway.

It does matter. An index can actually contain a cache of values for some columns. This is done with the INCLUDE clause.

     CREATE INDEX idx
         ON books ( author_id )
         INCLUDE ( book_title )

This select will use only the index:

       select author_id, book_title
       from books
       where author_id = 123

This select will use the index, and then have to follow it and fetch data out of the rows:

       select *
       from books
       where author_id = 123

https://use-the-index-luke.com/blog/2019-04/include-columns-...


I fully agree with both comments, yours and branko_d's.

The way I read the article, it seems to suggest that removing columns from the select list would somehow always result in better performance. In my opinion, in the majority of cases there will be no measurable impact on performance.

In some specific cases, however, yes, it can make a big difference. In Postgres it also seems like selecting all columns explicitly, but out of order, can be detrimental to performance:

https://www.postgresql.org/message-id/5562FE06.9030903@lab.n...

So, it depends :)


PostgreSQL can use multiple indexes. So if you have millions of rows and 3 indexes on the table, on 3 different columns, PostgreSQL can use any and all of those indexes to filter the data.

Selecting only the columns you need just prevents the database from having to go back to the clustered index to retrieve the remaining information that wasn't included on the index.

Selecting * from a table has no impact on the indexes used. But it can be expensive to need to go back to the clustered index to get the additional information.
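
A sketch of that situation (invented table; PostgreSQL can combine the single-column indexes with bitmap scans):

    CREATE INDEX orders_customer_idx ON orders (customer_id);
    CREATE INDEX orders_status_idx   ON orders (status);
    CREATE INDEX orders_created_idx  ON orders (created_at);

    -- Any or all of the three indexes can be combined to filter rows;
    -- the table itself is only visited for columns not in the indexes.
    SELECT *
    FROM orders
    WHERE customer_id = 42
      AND status = 'shipped'
      AND created_at >= DATE '2019-01-01';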


> not true for row stores because data is stored in such a way that the whole row gets accessed no matter how many columns get specified in the query.

This is not true, at least not with the likes of MS SQL Server / postgres / Oracle / etc. If the query can be satisfied with just what is in an index the clustered index or heap (where all the row data is together) does not need to be referenced. Worse: if you have off-page data (NVARCHAR(MAX) columns with values that aren't very short, to give one common example in the case of SQL Server) then not only is the engine reading the CI/heap unnecessarily to serve your query but it is potentially making extra unnecessary per-value page reads on top of that.

Also, for more complex queries that cause the building of intermediate results in memory or, worse, on disk (anything where you see a sort step in the query plan, for instance), you are increasing the amount of data that needs to be spooled in this way. In some circumstances this might happen to the data more than once during a single run of your query.

Or if columns are actually the result of UDFs - they will be getting run per value per row even though the result is not needed, and that can also affect choices the query planner makes wrt parallelism options. You might not know your query is "more complex" either: do you know for sure that you are referencing a simple table or a view that is doing some jiggery-pokery to derive values? It might be a simple table now but that may change in later versions of the model/application, and now you have that extra computation happening for values you might not actually care about. It doesn't have to be something as complex as a table being replaced by a view that is doing extra work: the entity might later have a free-form note record attached to it in the form of a VARCHAR(MAX) column, and those extra off-page reads are hitting your query's performance server-side (not just in terms of what travels over the network) when you don't actually need the results of those reads.

> There are many other^H^H^H^H^H good reasons why SELECT * is acceptable only in development queries

I'd say "to be avoided where possible outside of dev/test queries" as there are always exceptions where it isn't possible due to [bad] design elsewhere, but yes.

> but performance is not one of them.

This I disagree with. It can definitely affect performance. And even if it doesn't affect performance on this particular query, at least not by a measurable amount, being selective about what you select is a good habit to train into yourself.


Nobody is saying that it's not a good habit to form. It would be better to advise to review your column lists and remove redundant columns. Firstly because schemas change, and then because yes, you might hit an index (especially if you know it's there, as you should) and get that speed-up as well.

My issue with this advice is that it reinforces the idea that "less columns in select list" = "less data accessed" _in all cases_, which as we all agree is not true. Even more so if you have a relatively well designed database, with no crazy amount of views on views (or any for that matter), UDFs, huge columns with binary data, etc, etc.

PS: "...result of UDFs - they will be getting run per value per row..." -> not always. Inline TVFs get expanded, so you should probably be quite careful with all other user-defined functions in SQL Server anyway.


> it reinforces the idea that "less columns in select list" = "less data accessed" _in all cases_

I for one try to avoid absolutes like that. "Less columns in select list = potentially less data accessed & processed" is a better wording. Along with "and there are no cases, unless there is a QP bug around the matter that I'm unaware of, where it changes things for the worse" if I'm feeling more wordy.

The problem with that, though, is that some (I've worked with some difficult people!) see the "potentially" as a reason to just drop in the * "for now" - their reasoning being that it is a premature optimisation: why type the extra if it won't definitely improve performance? This despite the warnings that while it might not affect performance now, it may in future (possibly due to changes elsewhere), making this a trickier point to find and fix - and that even ignoring the performance issue, it is good practise for other reasons.

> more so if you have a relatively well designed database

Oh, how I long for such nirvana!

> Inline tvfs get expanded

But scalar functions do not, well not until you are using 2019 which is due for release soon, and even then only some can be. And even if your SVF doesn't result in any reads, so inlining in this sense doesn't matter, there is extra CPU work going on.

(assuming SQL Server of course, details will of course differ elsewhere)


For query plans using a sort node, there can be a major difference in performance depending on the row width.


I just reject SELECT * queries in CI/CD, because schema changes will change what is returned by such queries.


If you sort the result of a SELECT in such a way that can't use an index (i.e. a file-sort), you will get significant performance gains from MySQL by minimizing the columns accessed. MySQL materializes the rows then sorts them. Wider rows, more data shuffling.


This largely depends on the RDBMS. When generating estimates for memory grants, SQL Server will look at the row size as well as expected row counts. I've had particularly bad ORM queries throw together products of joined tables with all possible columns in the projection. Between the insane memory grant request and the IO throughput for retrieval, you'd be surprised at how much SELECT * can affect query performance.


Good set of tips, however "Not NOT IN" is not true in general, as a single counterexample:

    EXPLAIN (costs off)
    SELECT relname, relpages
      FROM pg_class
     WHERE relname IN ('pg_class');

                         QUERY PLAN
  ---------------------------------------------------------
   Index Scan using pg_class_relname_nsp_index on pg_class
     Index Cond: (relname = 'pg_class'::name)
  (2 rows)

In general, I'd like to echo others' comments: learning how to use EXPLAIN and how to interpret query plans is the single most important tool, as it allows you to verify your hypothesis instead of relying on rules of thumb.


I actually had a situation where using IN resulted in a query that ran for 35s, and after replacing it with `ANY(VALUES(...` it took ~600ms.

In my case `column IN ('value1', 'value2')` was converted to an array like this: `((column)::text ANY ('{value1,value2}'::text[]))`, an index on the column was completely ignored and PG performed a sequential scan.

After changing it to `column = ANY (VALUES ('value1'),('value2'))` it created a table in memory and used it when performing an index scan.
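
Spelled out, the rewrite was roughly this shape (illustrative names, and far fewer values than the real query had):

    -- Original: rewritten internally to an array comparison, and in this
    -- case the planner chose a sequential scan
    SELECT *
    FROM events
    WHERE external_id IN ('value1', 'value2', 'value3');

    -- Rewrite: the VALUES list behaves like a small in-memory table that
    -- gets joined against the index on external_id
    SELECT *
    FROM events
    WHERE external_id = ANY (VALUES ('value1'), ('value2'), ('value3'));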

I have a feeling this is maybe some kind of bug in the query planner? This is on PG 9.6.8.


I think I'd need more information to diagnose; were there any NULL values in the list? How many values?


The column can have nulls but after a quick check there's only a single record with a null value (probably added as a test, no idea).

In the bad plan a sequential scan was performed and then it removed over 4 million rows (Rows Removed by Filter: 4562849)

Edit: here is the query planner's output: https://explain.depesz.com/s/eVXI

After rewriting the query (changing IN into ANY(VALUES( and adding an index to the "seven" table): https://explain.depesz.com/s/b4kU


I think adding the index to the “seven” table is probably the more meaningful change.

You said the index was ignored, but based on that plan output the planner made the right choice since the filter on the “sierra_zulu” table (only one there with an ANY so I assume that’s the one) matches over 90% of the rows. An index scan on that would be far worse.

Adding an index on a different table allowed the driving side of the join to be different, and, with already prefiltered data allowed the other index to be useful as well (fewer index lookups). I’d bet with that index addition your IN clause works just fine.


If you look at sierra_zulu (line #3 on both plans) it took almost 22 seconds initially and after query rewrite (and no index change) took 713ms.

So that change of query alone reduced query time from 35s to ~5s; after that the seven table became the bottleneck, and adding an index there reduced the whole query down to 750ms.

I should have saved the intermediate plan to illustrate it, but it essentially looked the same as the second plan, except there was a sequential scan on the seven table.


Some of this is probably SQL Server specific, but we were having some scaling issues on legacy code earlier this year, and I got a lot of bang for my buck checking for these things before diving into specific issues:

- code with a high row count table variable when a temp table would give more accurate execution plans due to better cardinality estimates.

- ultra-complex join criteria with lots of OR logic that performed better as a UNION (see the sketch after this list)

- lazy function calling, where a developer used a function that did more than they needed and could be replaced with a simple join to a table

- looped calls to insert procs that could be done in bulk
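
A sketch of the OR-to-UNION rewrite, with made-up tables (each UNION branch gets its own simpler plan and can use its own index; use UNION ALL if duplicates can't occur or don't matter):

    -- OR across different join conditions tends to defeat index usage
    SELECT o.id
    FROM orders o
    JOIN customers c
      ON c.id = o.customer_id
      OR c.legacy_code = o.legacy_customer_code;

    -- Rewritten: two simple joins combined with UNION
    SELECT o.id
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    UNION
    SELECT o.id
    FROM orders o
    JOIN customers c ON c.legacy_code = o.legacy_customer_code;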

There was another entire category of problem that I guess would be described as "I hate sets" and was mostly attributable to one former employee. These were recognizable immediately upon opening the code and were a nightmare.


>There was another entire category of problem that I guess would be described as "I hate sets" and was mostly attributable to one former employee. These were recognizable immediately upon opening the code and were a nightmare.

Care to elaborate? I need a pick-me-up today.


It was a lot of stuff like

- Get a list of x in a table var

- while loop through x to build another list of y

- while loop through y and update z one row at a time

All of that rather than updating with a join.
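
For contrast, the set-based version of that whole dance is a single statement (hypothetical tables, T-SQL style since this was SQL Server):

    -- Update z directly from the joined x -> y -> z chain, no loops
    UPDATE z
    SET    z.status = y.status
    FROM   table_z AS z
    JOIN   table_y AS y ON y.z_id = z.id
    JOIN   table_x AS x ON x.y_id = y.id
    WHERE  x.is_active = 1;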

Some of his other patterns were reusing variables for different things in a proc, using inefficient functions in a way that they executed once per row before the result set was really reduced much, nesting those functions, and using loops any chance he got.

You'd open up these slow, 500-700 line monster procs and have to figure out what they were doing and refactor them, but it was nearly illegible and there were no tests for them. Really a great reminder of what happens when code reviews aren't done.


That's way worse than I thought, thanks for sharing.


About indexes: they improve query performance, but they also decrease write performance (and of course take space), which is why it's a good idea to remove indices that are unused or rarely used. Indices placed on columns that contain random data and/or are frequently updated (which prevents HOT updates) can especially hurt write performance.


One very powerful "trick" is to use partial indexes where useful. For example, a pretty common pattern is to have tables where most rows aren't actively used most of the time, eg when you have a status column, and most rows are in the final status which you usually don't touch, so most where conditions include statuses which are not the final one. If you create a partial index WHERE status IN ('s1', 's2' ...), the index becomes a lot smaller and a lot faster.
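
For instance (hypothetical orders table):

    -- Only rows still "in flight" are indexed; rows in the final status
    -- never enter the index, keeping it small and hot
    CREATE INDEX orders_open_idx
        ON orders (customer_id)
        WHERE status IN ('pending', 'processing');

    -- The query's predicate matches the index predicate, so the small
    -- partial index can be used
    SELECT *
    FROM orders
    WHERE customer_id = 42
      AND status IN ('pending', 'processing');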


I never thought about that, but that's a neat trick.


AFAIK it's also proportional to the number of rows with data in the indexed columns. Inserting a row with a NULL in the indexed column does not cause an update to the index (at least experimentally it doesn't seem to contribute to index size on SQL Server).

I imagine this is/was a simple optimization for DBMS to implement, but I would love to be corrected otherwise...


If you change a value to null it’s still a row allocated in the index. The index size won’t shrink unless you rebuild it. Removing and deleting data will just end up with fragmentation.


The first principle of SQL optimization should be to reduce the number of queries. If you have a query inside a loop, the performance may be catastrophic. Using conditionals in queries was sometimes the only solution to avoid loops. The performance gain was enormous.
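
A small sketch of the idea (invented table): instead of issuing one UPDATE per row from a loop in application code, a single statement with a CASE expression does it in one round trip.

    UPDATE accounts
    SET    tier = CASE
                    WHEN lifetime_spend >= 10000 THEN 'gold'
                    WHEN lifetime_spend >= 1000  THEN 'silver'
                    ELSE 'bronze'
                  END
    WHERE  tier IS NULL;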


To put it lightly, the article leaves much to be desired: no mention of execution plans, collecting statistics, partitioning, the cost of indexing (e.g. when should you disable an index and rebuild it?), materialized views, etc.


Surprised they didn't mention citext columns


> If you are calling multiple views, or worse, views on views on views you are asking the query engine to run multiple queries to return your columns.

I thought databases flatten all view layers into a single SQL statement before executing, i.e. no performance impact except for a couple of milliseconds to flatten out the views.


Yes, this was my thought as well. Views being flattened is certainly what it has always looked like was happening, based on the EXPLAIN output I've seen.


Number one has got to be: look at the execution plan. If it's complicated, try to simplify it, and that's where all these particular things should be checked.


Agreed. Related point: collecting statistics.


> Try to avoid the use of 'IN' or 'NOT IN'. By doing this you are performing a full table scan as the query engine looks through every row to check if the condition is met.

It's not clear to me why IN () can't do an index scan or hash join?


The thing with PostgreSQL is that it may decide that it can find the result faster without the index than it can with the index. This is sometimes due to the fact that our development machines have a lot of memory, and it's easier for PostgreSQL to keep the table data in memory and not bother with the index.

This is a problem when trying to test an index on a small dataset. It may look like the index is not useful, when in fact it will be useful.
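
One way to sanity-check that on a small development dataset is to temporarily discourage sequential scans and compare the plans (a session-level debugging aid only, not something to set in production; table name is illustrative):

    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'someone@example.com';

    RESET enable_seqscan;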


Yeah, IN/NOT IN should be able to use an index/hash scan when given normal values. I am under the impression from the note, though, that this is related to subqueries inside an IN/NOT IN clause, e.g.: SELECT * FROM tableA WHERE col IN (SELECT col FROM tableB)


IN () definitely does use indexes in most cases on Postgres. The article is wrong there.


Yes, in general, but I was hit with a case where Postgres started to switch to a full table scan depending on the argument count inside `IN ()`, while in reality using the index was faster in each and every case. I had to split it into multiple queries with 100 args each, even though a few months earlier it didn't have any issue with thousands.


I’m guessing the index in question was a partial index defined with a “WHERE column NOT NULL” restriction?

This was a limit of the optimizer previously (IN clauses are broken down into AND/OR groups to prove inferences but only if <= 100 items).

But Postgres 12 includes a patch I wrote so that the optimizer can prove the NOT NULL inference directly from an array operator of any size.


In my case the index is not a partial index, but it is also on PG 9.6.8; I provided more details in another response [1].

Unfortunately we are stuck with 9.6.x: before my time, my company decided to use AWS Aurora, and currently there's no easy path to do a major version upgrade from 9.6.x (and anyway 10.x is the most recent available version) :( But any information on why this is happening would be appreciated.

[1] https://news.ycombinator.com/item?id=20863418


That's why I wrote "in most cases" :) Anyway, in a situation like that where you have lots of values in the IN clause, I don't think that using = instead of IN would even be an option.


This article has sage advice, but SQL optimization goes far beyond it. Optimization is a knee-jerk reaction for me whenever I do database development, and it involves every level (e.g. should I use IsNull or Coalesce? Try both, look at execution plans, etc.).


These articles are fine, but it is very rare to see people using a profiler to discover what needs to be optimized.


It should be reflected somewhere that this article applies in part only to PostgreSQL (some of the tips are more generic, though).


Immediately following the list TOC:

> "This is written with PostgreSQL in mind but if you're using something else, these may still help, give them a try."



