NULL vs NOT NULL columns and Index usage

Recently I was asked to look at a problem where a simple query refused to use an index.

The table had several million rows in it with a single index on one column populated with 3 values (in my sample the values are A, B, and C.)

CREATE TABLE null_test
(
    col1           NUMBER,
    col2           NUMBER,
    col3           NUMBER,
    col4           NUMBER,
    test_column    VARCHAR2(1),
    col6           NUMBER,
    col7           NUMBER,
    col8           NUMBER,
    col9           NUMBER,
    col10          NUMBER
);

INSERT INTO null_test
        SELECT LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               CHR(ASCII('A') + MOD(LEVEL, 3)),
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL,
               LEVEL
          FROM DUAL
    CONNECT BY LEVEL <= 3000000;

create index null_test_index on null_test(test_column) ;

exec dbms_stats.gather_table_stats(user,'NULL_TEST');

When a query only references columns in an index then it seems logical the optimizer would decide to use the index only and not reference the table at all. However, as my friend discovered, the exact opposite happened.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

Then my friend followed up with another simple query specifying all of the values found in the table. This time, the index was used.

SQL> explain plan for select test_column from null_test where test_column in ('A','B','C');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1415   (2)| 00:00:01 |
|*  1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1415   (2)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("TEST_COLUMN"='A' OR "TEST_COLUMN"='B' OR "TEST_COLUMN"='C')

13 rows selected.

This is where I was contacted to determine why this seemingly odd behavior would occur.
The reason is, TEST_COLUMN is defined to allow nulls. It just so happens the table is fully populated but Oracle can’t be sure of that at query time. An index won’t create a null leaf node, so if someone inserts a row with a null value, the index would miss it. Therefore, the optimizer can’t trust the index to return reliable results for “select test_column from null_test” because there could be one or more null values in the results. So the optimizer specifies a table scan as a reliable answer.

The second query imposes an implicit not-null constraint on the results. The only results that are possibly allowed are those “in (‘A’,’B’,’C’)”. A NULL will, obviously, not satisfy that condition. Therefore the optimizer knows it can trust the index as a reliable search mechanism.

Even if you supply a hint to tell the optimizer you really want to use the index, the optimizer will be obliged to negate that hint as it could produce invalid results

SQL> explain plan for select /* INDEX(null_test null_test_index) */ test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------

Plan hash value: 3898976153

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3000K|  5859K|  6519   (1)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| NULL_TEST |  3000K|  5859K|  6519   (1)| 00:00:01 |
-------------------------------------------------------------------------------

8 rows selected.

So, now the question becomes: “Are nulls really valid for the test column?” If they are, then a table scan is the only way to include the rows with NULL values along with the populated rows.
If the nulls are not valid, then the column should be defined to reflect that rule.

SQL> alter table null_test modify (test_column varchar2(1) not null);

Table altered.

SQL> explain plan for select test_column from null_test;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------

Plan hash value: 107206137

----------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  3000K|  5859K|  1407   (1)| 00:00:01 |
|   1 |  INDEX FAST FULL SCAN| NULL_TEST_INDEX |  3000K|  5859K|  1407   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

8 rows selected.

Now, with the column defined as not null, the optimizer knows it can trust the index to contain all of the values in the table because the table can’t have any NULLs.

Many developers will think of the column definitions merely in terms of data “correctness;” but it’s important to remember those same data quality rules are meta-data for the optimizer as well. Thus they can and will influence the optimizer plans.

Leave a ReplyCancel reply