PostgreSQL Query Optimization Tricks
How to Make Queries Faster in Dune Analytics
Context
Ever found yourself with the fatal message on Dune: “Error: Your query took too long to execute…”? Well, consider it a right of passage to the wizard community. Next, you enter the wonderful phase of debugging. There are a couple of steps you should take before rethinking your entire approach. First, see if you can limit your query in the case you accidentally SELECT * from a table the size of “dex.trades”. Second, optimize.
This article is meant to act as a shorthand for ways to reduce query time on Dune Analytics. The goal: save us a headache from future Dune errors and turn everyone into the best Dune wizard they can be.
Tips-and-Tricks
It is important to note that while not listed, the EXPLAIN ANALYZE syntax in PostgreSQL is extremely helpful to understand which part of your query is taking a long time to run. I would suggest using this before trying any of these tricks to just understand the inner workings of the way the query aggregates and filters data.
1. Common Table Expressions (CTE) → Nested queries.
Surprisingly, CTEs, the lovely function that makes large queries more readable, actually makes them run slower. They are often referred to as “optimization fences” since they essentially inhibit the query optimizer from being able to rewrite the queries occurring in the CTE. When they are changed to nested queries (a form of subqueries), the execution time is usually at least a couple of seconds less.
The reason I did not say CTE to subquery is due to the fact the other form of subqueries, corollated queries, often take just as much time as the CTE. This is because, with corollated queries, the inner query is executed for every row of the outer query. This is contrary to the nested query where the inner query is only executed once, and then those results are used for the outer query. An example of the difference is linked here.
2. Select specific columns from CTE or subquery.
While it is easy to SELECT * FROM table, most of the time not all of the columns are not needed to address a particular question. Especially with large datasets, there will often be columns that are extraneous to the purpose of the query. With smart contract data, this is particularly true, as often there is more information about the details of the contract than are needed to calculate mere volume PERSAY. Always focus on only selecting the columns that are needed for the query to run.
3. Filter the queried dataset as much as you can up front (aka USE WHERE CLAUSES).
Are you querying after a specific date? Looking for results where a column equals a specific value? Only want rows where certain criteria are met? Maybe you only want successful transactions? Be as specific as possible. If that protocol didn’t exist before July of 2021 don’t query the whole ethereum.transactions dataset.
4. When you JOIN two queries there are two tips: 1) Explicitly define the ordering of an inner join and 2) If you are using a LEFT JOIN, make sure any WHERE conditions to the ON clause are on the right side of the query.
1) This can also be referred to as making the ordering that a query must join the tables on “explicit”. It is easiest to explain with an example, so I will be using the one from the PostgreSQL docs.
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
While these three queries are semantically the same, the first one takes longer to run than the latter two. This is because the first requires more query planning before it executes. Since it can either join a and b first and then b and c, or c and b first, then a and b, the query planner in PostgreSQL will try to figure out which way is more efficient before executing. This takes time, and thus the query takes longer.
Aka, if you are trying to (inner) JOIN multiple tables, make sure to explicitly dictate in your query how you would like the joining to be done.
2) Basically, this second tip is just so your LEFT JOIN is not acting like an inner join.
Also, even though this does not have to do with query runtime, there is a critical JOIN bug that I must note. If you are not using an inner join, then the location of the WHERE clause(s) will affect the outcome of your query. This article does a good job explaining the reasoning if you are curious.
5. NOT IN is much costlier than NOT EXISTS or LEFT JOIN with NULL columns of one table.
In the code examples below, query one takes much longer than two and three, but queries two and three take approximately the same time.
SELECT id FROM big_table WHERE id NOT IN (SELECT id FROM small_table)SELECT id FROM big_table WHERE NOT EXISTS (SELECT id FROM small_table WHERE small_table.id = big_table.id)SELECT b.id FROM big_table b LEFT JOIN small_table s ON b.id = s.id WHERE s.id IS NULL
However, it is interesting to note that the inverse of these commands is not true: using IN or EXISTS is less costly than a JOIN specifying NOT NULL columns. In fact, in this case, IN and EXISTS are equivalent in terms of time, while the JOIN with NOT NULL takes longer to run. For a more in-depth explanation check out this link.
6. IN is faster than BETWEEN for ranges (with date ranges it is particularly notable).
I am not exactly sure why this is, and this article explains it more technically but also does not come to an answer. Win or doubt though – use IN.
7. Avoid using ORDER BY on a large dataset.
This is because PostgreSQL has to sort the rows and then return the correct ordering. If you do require ordering a large dataset, creating an index will make the sorting occur faster. This is due to the correct index allowing the query execution can skip the sorting step. By “correct” index I mean one that orders the rows in the way you desired to originally ORDER BY. Especially if you want to order over two features, making a multi-column index can increase query execution efficiency significantly. This link explains more about how to efficiently use indexes in PostgreSQL.
8. Indexing over block_time + “event_index” is faster than indexing over block_number + “event_index”.
This is because of the way that the tables are indexed in Dune on the backend. If you are curious to see all the indexes of a specific table in Dune, try the query:
SELECT indexdef FROM pg_indexes WHERE schemaname = 'ethereum' AND tablename = 'transactions'
Where schemaname and tablename can be any schema and table on Dune Analytics. (h/t to @fluidsonic for this tip).
Summary List
- Common Table Expressions (CTE) → Nested queries.
- Select specific columns from CTE or subquery.
- Filter the queried dataset as much as you can up front (aka USE WHERE CLAUSES).
- When you JOIN two queries there are two tips: 1) Explicitly define the ordering of an inner join and 2) If you are using a LEFT JOIN, make sure any WHERE conditions to the ON clause are on the right side of the query.
- NOT IN is much costlier than NOT EXISTS or LEFT JOIN with NULL columns of one table.
- IN is faster than BETWEEN for ranges (with date ranges it is particularly notable).
- Avoid using ORDER BY on a large dataset.
- Indexing over block_time + “event_index” is faster than indexing over block_number + “event_index”.
I hope this article was helpful! The information is an aggregate of my own research as well as the community response to this Tweet asking for tips/tricks:
Feel free to reach out on Twitter @_grace_lily if you have any questions/comments. Let’s keep crushing those queries and dashboards wizards!