SQLite: Dropping an index for a 300x speedup

Table of Contents

For a small, personal project I use sqlite as a time series database and make queries like the following:

SELECT referer,
       COUNT(*) AS views
FROM   logs
WHERE  host = 'comments.nbsoftsolutions.com' AND
       epoch >= 1551630947 AND
       epoch < 1551632947
GROUP  BY referer
ORDER  BY views DESC

Over time, I’ve noticed that this query has been increasingly time consuming, sometimes taking minutes to complete. I thought I created proper indices:

CREATE index idx_epoch on logs(epoch);
CREATE index idx_host ON logs(host);

I ran ANALYZE, and still no changes. I was at my wits end, seriously thought about looking into another database. Then I enabled sqlite’s EXPLAIN QUERY PLAN and saw the following output.

0|0|0|SEARCH TABLE logs USING INDEX idx_host (host=?)
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR ORDER BY

Omitting the epoch timestamp index from a time series table is a red flag. SQLite can only use one index from each table and it was choosing the wrong one.

After crawling sqlite docs, I found a way to disable the host index using the unary “+” operator.

Now our query looks like:

SELECT referer,
       COUNT(*) AS views
FROM   logs
WHERE  +host = 'comments.nbsoftsolutions.com' AND
       epoch >= 1551630947 AND
       epoch < 1551632947
GROUP  BY referer
ORDER  BY views DESC

And the new query plan picks up on our hint:

EARCH TABLE logs USING INDEX idx_epoch (epoch>? AND epoch<?)
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR ORDER BY

Execution time decrease from 30 seconds to 0.1 second, a 300x speedup by dropping the host index from consideration.

Being able to write raw sql like this is a major reason why I tend to have a disdain for ORMs that abstract everything away. Sometimes we need to go to a lower level.

Let’s see if we can’t tease out why sqlite naively makes the wrong decision. From the docs on choosing from multiple indices:

the sqlite_stat1 table might indicate that an equality constraint on column x reduces the search space to 10 rows on average, whereas an equality constraint on column y reduces the search space to 3 rows on average

The contents of sqlite_stat1 in my scenario:

logs|idx_host|3165918 7685
logs|idx_epoch|3165918 2

This table states that an equality constraint on host reduces the search space to an average of 7685, and an average of 2 for epoch. Considering that our epoch range is [1551630947, 1551632947), a 2000 difference, I would have expected that sqlite would have realized that (2 * 2000 < 7685). Even updating the estimate that idx_host index narrows the results down to 10 million rows changes nothing:

UPDATE sqlite_stat1
SET stat = '3165918 9999999'
WHERE idx = 'idx_host';

Thus we can conclude that SQLite will always prefer an equality constraint in index evaluations versus any range constraint. So update your queries or drop offending indices if you have to.

If this is not true, or anyone has new / differing information, feel free to comment. This is with sqlite 3.22.0.

Comments

If you'd like to leave a comment, please email [email protected]

2020-03-03 - MightyPaul

Great tutorial. It explains very good the most important aspects (Indexing + EXPLAIN QUERY PLAN) needed to speed up SELECT queries in SQLite! This helped me a lot to understand and speed up the SELECT query in my database with 100 millions of rows. Before, my SELECT query was performing a SCAN of the complete database, now it’s doing a super fast SEARCH. Thx a lot!