tisdag 10 december 2013

Range access: now in an IN predicate near you.

Several users have reported that certain queries with IN predicates can't use index scans even though all the columns in the query are indexed. What's worse, if you reformulate your query without IN, the indexes are used. Let's take some example query. Suppose we have a table with two indexed columns:

CREATE TABLE t1 ( 
  col1 INTEGER,
  col2 INTEGER,
  ...
  KEY key1( col1, col2 )
);

Let's take a look at some queries that could take advantage of the key1 index to read rows without accessing the table.

  1. SELECT col1, col2 FROM t1 WHERE col1 = 100;
  2. SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200;
  3. SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200 OR col1 > 300 AND col1 < 400;
  4. SELECT col1, col2 FROM t1 WHERE col1 = 100 AND col2 > 100 AND cold2 < 200;

These queries will use what MySQL calls Index Range Scans. (although the first query could also use Ref Scan). This access method will fetch rows from the index trees given a start and end value. It's also possible to read multiple intervals, each with a start and end value, as we saw in query 3 above.

A special case of intervals is when the endpoints are the same value. Range scans can be used for conditions such as col1 = 100 because it's equivalent to the interval 100 <= col1 <= 100. This way we can use range scans for a broader class of queries.

Armed with multiple-interval scans, a.k.a. multi-range reads, or MRR for short, we can use the range access for queries such as 

SELECT col1, col2 FROM t1
WHERE col1 = 100 or col1 = 200 or col1 = 300;

We can use all columns in the index of course:

SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
   OR col1 = 200 AND col2 = 200 
   OR col1 = 300 AND col2 = 300;

At some point, this syntax becomes unwieldy. And this isn't just aesthetics, for really big queries, we get a combinatorial blowup which can cause parsing to take a long time. This is the reason why SQL has IN predicates to say the same thing:

SELECT col1, col2 FROM t1
WHERE col1 = 100 OR col2 = 200 OR col2 = 300;

means the same as

SELECT col1, col2 FROM t1
WHERE col1 IN (100, 200, 300);

And for rows it gets even more convenient:

SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
   OR col1 = 200 AND col2 = 200 
   OR col1 = 300 AND col2 = 300;

can be written as

SELECT col1, col2 FROM t1
WHERE (col1, col2) IN ((100, 100), (200, 200), (300, 300));

The problem that users saw is that suddenly MySQL doesn't use MRR any more, and resorts to scanning the entire index. This is because the range optimizer ignored IN conditions over rows. The range optimizer is the sub-optimizer that analyzes conditions and translates them into a multi-range structure that can be handed to the storage engine to fetch the rows from the index. It handled IN predicates as long as they were over scalars or just a single row, but completely ignored lists of rows.

As of 5.7.3 this hole in the net is stitched up. The range optimizer gladly opens the door for queries with IN predicates as long as
  • The predicate is only IN, not NOT IN.
  • The row on the predicate's left-hand side is only indexed column references, in the same index.
  • The rows contain only constants or come from a previously read table in nested-loops join.
Note that 'constants' is a pretty broad category. It consists of pre-evaluated expressions, even some sub-queries, SQL variables and similar beings.

onsdag 15 maj 2013

The Outer Join to Inner Join Coversion


It is a central part of the MySQL philisophy to try and help you as much as you can. There are many occasions when it could tell you that what you are asking for is utterly stupid or give you a bad execution plan because "you asked for it". But we're friendly here. We don't do that. One of those cases is when you run a query with an outer join but you really meant an inner join, or you don't care.

Inner joins are almost always cheaper to execute than outer joins and that is why MySQL rewrites them to inner joins whenever it can.

An outer join may have WHERE conditions on any of the columns in the result set. In fact, it is a common trick to find non-matching rows using the IS NULL predicate. Here's an example:

TABLE person

name  id
Tom   1
Dick  2
Harry 3

TABLE car

brand owner_id mfg_year
Chevy 1        2007
Volvo 3        2008

The query SELECT name FROM person LEFT OUTER JOIN car ON id = owner_id WHERE brand IS NULL would give you the names of people that don't have a car. The IS NULL predicate is TRUE for an UNKNOWN value. Nearly all other predicates are what we refer to as null-rejecting. A null-rejecting predicate is one that is UNKNOWN if any of its arguments is UNKNOWN. This goes for most SQL predicates: >, <, =.

On the top-level of the where clause, there is an implicit IS TRUE predicate. This way you never get to see rows where it is unknown if they match your search condition or not. So the where clause in our query above is equivalent to

WHERE brand IS NULL IS TRUE

We could also query

SELECT name
FROM person LEFT OUTER JOIN car ON id = owner_id
WHERE mfg_year = 2008 IS UNKNOWN

The result of which would be "Dick", because it is indeed unknown if Dick's car was made in 2008 if he doesn't have a car. Computation wise, the condition is evaluated bottom-up: UNKNOWN = 2008 is UNKNOWN.

We are now ready to turn this notion into a rule-of-thumb: If there are predicates in the where clause that are not nested inside predicates that turn UNKNOWN into TRUE, for example IS NULL or IS UNKNOWN (i.e. null-rejecting), then we can turn the outer join into an inner join before we optimize, because the result set is the same. Unfortunately this translation does not show up in EXPLAIN, but it does in the optimizer trace. Here's how to see it.

SET optimizer_trace="enabled=on";


SELECT name
FROM person LEFT OUTER JOIN car ON id = owner_id
WHERE mfg_year = 2008;
+-------+
| name  |
+-------+
| Harry |
+-------+
SELECT trace FROM information_schema.optimizer_trace;
...

    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "outer_join_to_inner_join",
                "JOIN_condition_to_WHERE",
                "parenthesis_removal"
              ],
              "expanded_query": "/* select#1 */ select `person`.`name` AS `name` from `person` join `car` where ((`car`.`mfg_year` = 2008) and (`person`.`id` = `car`.`owner_id`))"

...