Several users have reported that certain queries with IN predicates can't use index scans even though all the columns in the query are indexed. What's worse, if you reformulate your query without IN, the indexes are used. Let's take some example query. Suppose we have a table with two indexed columns:
CREATE TABLE t1 (
col1 INTEGER,
col2 INTEGER,
...
CREATE TABLE t1 (
col1 INTEGER,
col2 INTEGER,
...
KEY key1( col1, col2 )
);
Let's take a look at some queries that could take advantage of the key1 index to read rows without accessing the table.
- SELECT col1, col2 FROM t1 WHERE col1 = 100;
- SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200;
- SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200 OR col1 > 300 AND col1 < 400;
- SELECT col1, col2 FROM t1 WHERE col1 = 100 AND col2 > 100 AND cold2 < 200;
These queries will use what MySQL calls Index Range Scans. (although the first query could also use Ref Scan). This access method will fetch rows from the index trees given a start and end value. It's also possible to read multiple intervals, each with a start and end value, as we saw in query 3 above.
A special case of intervals is when the endpoints are the same value. Range scans can be used for conditions such as col1 = 100 because it's equivalent to the interval 100 <= col1 <= 100. This way we can use range scans for a broader class of queries.
Armed with multiple-interval scans, a.k.a. multi-range reads, or MRR for short, we can use the range access for queries such as
SELECT col1, col2 FROM t1
WHERE col1 = 100 or col1 = 200 or col1 = 300;
We can use all columns in the index of course:
SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
OR col1 = 200 AND col2 = 200
OR col1 = 300 AND col2 = 300;
At some point, this syntax becomes unwieldy. And this isn't just aesthetics, for really big queries, we get a combinatorial blowup which can cause parsing to take a long time. This is the reason why SQL has IN predicates to say the same thing:
SELECT col1, col2 FROM t1
WHERE col1 = 100 OR col2 = 200 OR col2 = 300;
means the same as
SELECT col1, col2 FROM t1
WHERE col1 IN (100, 200, 300);
And for rows it gets even more convenient:
SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
OR col1 = 200 AND col2 = 200
OR col1 = 300 AND col2 = 300;
can be written as
SELECT col1, col2 FROM t1
WHERE (col1, col2) IN ((100, 100), (200, 200), (300, 300));
The problem that users saw is that suddenly MySQL doesn't use MRR any more, and resorts to scanning the entire index. This is because the range optimizer ignored IN conditions over rows. The range optimizer is the sub-optimizer that analyzes conditions and translates them into a multi-range structure that can be handed to the storage engine to fetch the rows from the index. It handled IN predicates as long as they were over scalars or just a single row, but completely ignored lists of rows.
As of 5.7.3 this hole in the net is stitched up. The range optimizer gladly opens the door for queries with IN predicates as long as
- The predicate is only IN, not NOT IN.
- The row on the predicate's left-hand side is only indexed column references, in the same index.
- The rows contain only constants or come from a previously read table in nested-loops join.
Note that 'constants' is a pretty broad category. It consists of pre-evaluated expressions, even some sub-queries, SQL variables and similar beings.