tisdag 10 december 2013

Range access: now in an IN predicate near you.

Several users have reported that certain queries with IN predicates can't use index scans even though all the columns in the query are indexed. What's worse, if you reformulate your query without IN, the indexes are used. Let's take some example query. Suppose we have a table with two indexed columns:

CREATE TABLE t1 ( 
  col1 INTEGER,
  col2 INTEGER,
  ...
  KEY key1( col1, col2 )
);

Let's take a look at some queries that could take advantage of the key1 index to read rows without accessing the table.

  1. SELECT col1, col2 FROM t1 WHERE col1 = 100;
  2. SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200;
  3. SELECT col1, col2 FROM t1 WHERE col1 > 100 AND col1 < 200 OR col1 > 300 AND col1 < 400;
  4. SELECT col1, col2 FROM t1 WHERE col1 = 100 AND col2 > 100 AND cold2 < 200;

These queries will use what MySQL calls Index Range Scans. (although the first query could also use Ref Scan). This access method will fetch rows from the index trees given a start and end value. It's also possible to read multiple intervals, each with a start and end value, as we saw in query 3 above.

A special case of intervals is when the endpoints are the same value. Range scans can be used for conditions such as col1 = 100 because it's equivalent to the interval 100 <= col1 <= 100. This way we can use range scans for a broader class of queries.

Armed with multiple-interval scans, a.k.a. multi-range reads, or MRR for short, we can use the range access for queries such as 

SELECT col1, col2 FROM t1
WHERE col1 = 100 or col1 = 200 or col1 = 300;

We can use all columns in the index of course:

SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
   OR col1 = 200 AND col2 = 200 
   OR col1 = 300 AND col2 = 300;

At some point, this syntax becomes unwieldy. And this isn't just aesthetics, for really big queries, we get a combinatorial blowup which can cause parsing to take a long time. This is the reason why SQL has IN predicates to say the same thing:

SELECT col1, col2 FROM t1
WHERE col1 = 100 OR col2 = 200 OR col2 = 300;

means the same as

SELECT col1, col2 FROM t1
WHERE col1 IN (100, 200, 300);

And for rows it gets even more convenient:

SELECT col1, col2 FROM t1
WHERE col1 = 100 AND col2 = 100
   OR col1 = 200 AND col2 = 200 
   OR col1 = 300 AND col2 = 300;

can be written as

SELECT col1, col2 FROM t1
WHERE (col1, col2) IN ((100, 100), (200, 200), (300, 300));

The problem that users saw is that suddenly MySQL doesn't use MRR any more, and resorts to scanning the entire index. This is because the range optimizer ignored IN conditions over rows. The range optimizer is the sub-optimizer that analyzes conditions and translates them into a multi-range structure that can be handed to the storage engine to fetch the rows from the index. It handled IN predicates as long as they were over scalars or just a single row, but completely ignored lists of rows.

As of 5.7.3 this hole in the net is stitched up. The range optimizer gladly opens the door for queries with IN predicates as long as
  • The predicate is only IN, not NOT IN.
  • The row on the predicate's left-hand side is only indexed column references, in the same index.
  • The rows contain only constants or come from a previously read table in nested-loops join.
Note that 'constants' is a pretty broad category. It consists of pre-evaluated expressions, even some sub-queries, SQL variables and similar beings.

onsdag 15 maj 2013

The Outer Join to Inner Join Coversion


It is a central part of the MySQL philisophy to try and help you as much as you can. There are many occasions when it could tell you that what you are asking for is utterly stupid or give you a bad execution plan because "you asked for it". But we're friendly here. We don't do that. One of those cases is when you run a query with an outer join but you really meant an inner join, or you don't care.

Inner joins are almost always cheaper to execute than outer joins and that is why MySQL rewrites them to inner joins whenever it can.

An outer join may have WHERE conditions on any of the columns in the result set. In fact, it is a common trick to find non-matching rows using the IS NULL predicate. Here's an example:

TABLE person

name  id
Tom   1
Dick  2
Harry 3

TABLE car

brand owner_id mfg_year
Chevy 1        2007
Volvo 3        2008

The query SELECT name FROM person LEFT OUTER JOIN car ON id = owner_id WHERE brand IS NULL would give you the names of people that don't have a car. The IS NULL predicate is TRUE for an UNKNOWN value. Nearly all other predicates are what we refer to as null-rejecting. A null-rejecting predicate is one that is UNKNOWN if any of its arguments is UNKNOWN. This goes for most SQL predicates: >, <, =.

On the top-level of the where clause, there is an implicit IS TRUE predicate. This way you never get to see rows where it is unknown if they match your search condition or not. So the where clause in our query above is equivalent to

WHERE brand IS NULL IS TRUE

We could also query

SELECT name
FROM person LEFT OUTER JOIN car ON id = owner_id
WHERE mfg_year = 2008 IS UNKNOWN

The result of which would be "Dick", because it is indeed unknown if Dick's car was made in 2008 if he doesn't have a car. Computation wise, the condition is evaluated bottom-up: UNKNOWN = 2008 is UNKNOWN.

We are now ready to turn this notion into a rule-of-thumb: If there are predicates in the where clause that are not nested inside predicates that turn UNKNOWN into TRUE, for example IS NULL or IS UNKNOWN (i.e. null-rejecting), then we can turn the outer join into an inner join before we optimize, because the result set is the same. Unfortunately this translation does not show up in EXPLAIN, but it does in the optimizer trace. Here's how to see it.

SET optimizer_trace="enabled=on";


SELECT name
FROM person LEFT OUTER JOIN car ON id = owner_id
WHERE mfg_year = 2008;
+-------+
| name  |
+-------+
| Harry |
+-------+
SELECT trace FROM information_schema.optimizer_trace;
...

    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "outer_join_to_inner_join",
                "JOIN_condition_to_WHERE",
                "parenthesis_removal"
              ],
              "expanded_query": "/* select#1 */ select `person`.`name` AS `name` from `person` join `car` where ((`car`.`mfg_year` = 2008) and (`person`.`id` = `car`.`owner_id`))"

...

tisdag 10 april 2012

DATETIME DEFAULT NOW() finally available.

Having been rather desired for a while now, this feature is finally available as of MySQL Server version 5.6.5. It started out as the innocuous bug #27645 back in 2007, not really commanding much attention from anyone. But since, the page has seen around a hundred posts from users. This is a lot of posts for a bug page.

When I got to work on this, I started out looking at how functions in the default clause worked for the one supported case, CURRENT_TIMESTAMP (a.k.a. NOW() in MySQL) for TIMESTAMP columns. It turned out to be a little more complex than it had to, the TIMESTAMP type has a lot of special rules attached to it that no other types have, not even DATETIME.

One thing that struck me as a little odd was that you can only have one column with a function in the default or on update clause. So I had to take the decision how to deal with that restriction as we introduced DATETIME DEFAULT CURRENT_TIMESTAMP. Should we forbid having a TIMESTAMP DEFAULT CURRENT TIMESTAMP if there was a DATETIME DEFAULT CURRENT_TIMESTAMP? No matter how I looked at it, there would have to be a bunch of new convoluted special rules added that I couldn't motivate to myself. Moreover, a lot of users seemed to request having one DATETIME column with "default now" and another one "on update now", so they could track both creation and last update time to a row. The request was not lost on me: I decided to lift the one-column-with-NOW() restriction altogether. It made documentation simpler and avoided having rules applying only TIMESTAMP spill over on DATETIME. As expected, there was no technical reason for forbidding it, and the code involved needed an overhaul anyways.

The rules

I will not attempt to describe the various TIMESTAMP rules here, the manual does a great job at it already. Instead I will present the new set of rules for how DATETIME DEFAULT / ON UPDATE CURRENT_TIMESTAMP works.
  1. Trying to store a null value in a DATETIME NOT NULL column will always yield an error.
  2. A DATETIME column accepting nulls has NULL as default value unless specified.
  3. There are no implicit defaults for DATETIME columns except the one above.
  4. A DATETIME DEFAULT CURRENT_TIMESTAMP column will store CURRENT_TIMESTAMP if and only if an insert statement does not list the column or store a full row.
  5. A DATETIME ON UPDATE CURRENT_TIMESTAMP column will store CURRENT_TIMESTAMP if the row is updated and the new row is different from the previous row.
These rules sure look simple enough, don't they? Well, of course there are some more ins and outs to it if you look close enough. Hopefully you should not have to memorize all of these scenarios but it would be unfair of me not to at least mention them.

In rule #4, "an insert statement" includes, of course INSERT, LOAD and REPLACE statements. LOAD statements will store "0000-00-00 00:00:00" (a.k.a. the 'zero date') in a DATETIME column if the file contains a null at the column's position. Under strict sql_mode, the entire statement will fail. TIMESTAMP columns still store CURRENT_TIMESTAMP in this case, regardless of default.

In rule #5, the statements that can update a row are UPDATE and INSERT ... ON DUPLICATE KEY UPDATE ... when there is a duplicate primary key.

Lifting the one-column-with-function-default restriction also has some side-effects when you add or move columns.

  • As before, the first column in a table that is declared simply "TIMESTAMP" will implicitly be promoted to TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP. If you add another column that is "TIMESTAMP" before it, they will both be as that long line I just wrote. Previously, you would get an error in this case.
  • Oh, yeah, error number 1293, a.k.a. ER_TOO_MUCH_AUTO_TIMESTAMP_COLS [sic], can't happen anymore. There's no limit to how many columns with CURRENT_TIMESTAMP you can have. Knock yourself out.

When is NOW()?

In MySQL, CURRENT_TIMESTAMP is defined as the query start time, the time when the query arrives at the server. This means that the value is constant during the execution of the query. So don't be surprised if you update a million rows of type DATETIME(6) ON UPDATE NOW(6) - microsecond precision - and they all get the same update time.



onsdag 30 juni 2010

Problems with subquery transformations

MySQL transforms IN predicates into nearly-equivalent EXISTS predicates, and uses a set of mechanisms to catch all special cases. This transformation, outlined below, also applies to NOT IN predicates since a NOT IN b parses to NOT a IN b. Expressions such as

SELECT * FROM table1 WHERE column1 IN (SELECT column2 FROM table2)

get transformed into

SELECT * FROM table1 WHERE EXISTS
( SELECT column2 FROM table2
WHERE column1 = column2 OR column2 IS NULL )

since the latter can be executed more efficiently. For instance an index can be used for accessing only the rows of table2 that will actually participate in the final result. The added WHERE condition is called a guarded condition, or trigger condition, both terms are used. In order to get correct result, the condition column1 = column2 must be deactivated if we are comparing against a NULL value for column1. This guarantees correct results for most cases but there are some additional cases to consider.

If the IN predicate is used directly in the WHERE clause (e.g. not as a sub-expression of IS UNKNOWN,) a NULL value is treated as false. There is a special property top_level_item of an expression node that tells the execution to filter out NULL values from the subquery. But there are problems with it.
  1. It does not propagate through NOT nodes (or any other nodes for that matter.)
  2. It is only applied to the WHERE and HAVING clauses. Since MySQL allows quantified predicates in the SELECT list - this is not in the SQL standard - an extra hack is needed to make it work. The solution is to add - when top_level_item is false - another guarded condition as HAVING clause. Remember that MySQL allows HAVING without GROUP BY - yet again extending the standard. Hence we end up with
SELECT * FROM table1 WHERE EXISTS
( SELECT column2
FROM table2
WHERE column1 = column2 OR column2 IS NULL
HAVING NOT column2 IS NULL )

This will filter out NULL values causing the result of the subquery to be empty.

1 and 2 together caused Bug#51070. For row valued predicands, e.g.

SELECT * FROM t WHERE (a, b) NOT IN (SELECT c, d FROM u)

there may be partial matches for any two rows. A partial match occurs when the corresponding positions in two rows are either equal, or at least one is NULL. The design above handles these cases just fine as long as NULL values come from the outer table. When there is a NULL value in the row from the inner table, however, it gets filtered out by the HAVING clause. Hence the subquery, now rewritten into

SELECT c, d FROM u WHERE c = a OR b = d
HAVING NOT c IS NULL AND NOT d IS NULL

yields an empty result and the value of IN is false, causing NOT IN to yield true.