
Predicates: Joins, Filters, and Pushability

In a general sense, you could say almost all SQL issues are in some way related to misapplied predicates. In this article I’ll focus on two classes of problems.

The first is queries where there is confusion between join predicates and filtering predicates. The second involves predicates that cannot be pushed.

Before digging in too deep, it’s probably a good idea to explain what a predicate is. A predicate is a logical condition such as those found in the WHERE or ON clauses of a query. A few simple examples:

  • WHERE col3 IN ('a','b','c')
  • tab1 JOIN tab2 ON tab1.id = tab2.parent_id
  • WHERE x.col1 > y.col2

Of these, the first is a filtering predicate and the second is a join predicate. The third is also a join predicate, but it may be harder to recognize as one because it is embedded within a WHERE clause rather than an ON clause.

These logical conditions are distinct from those found in a CASE expression: predicates control which rows are returned by a query, whereas the conditions of a CASE expression control how columns within rows are processed. It is legal to use a CASE expression within a join or filtering predicate, but that is an extra layer of complexity that is often avoidable. I’ll return to this later in the section on pushing predicates.

Join Predicates

Every join is Cartesian

When someone comes to me with a problem where their query returns too many rows and/or duplicate rows, I use a simple mnemonic to remind myself of one of the possible causes: “Every join is Cartesian”. I’m sure some SQL purists will balk at that, but please allow me to explain. Early in most developers’ SQL training they learn about problems caused by a Cartesian Join, Cartesian Product, Cartesian Explosion, or Cartesian Error. I’m sure there are other, even less flattering, names for the condition as well. These joins arise when every row of one table is joined to every row of the other table, creating a result set of (x * y) rows, where x and y are the number of rows in the two joined tables.

So, with this description, how can every join be Cartesian? As soon as a non-trivial condition (i.e., something more meaningful than 1=1 or NULL IS NULL) is added to the join, doesn’t that prevent it from being Cartesian? In a strictly mathematical sense, yes, it could; but in a sense that actually helps solve the problem, no.

If you think of join conditions in terms of the (x * y) result, then an accidental Cartesian product is less likely. Even if you don’t create a true Cartesian product, if your conditions cause multiple rows to match each other then you will get result-set expansion. A simple example might be temporal data of customer contacts: a table with ID, CUSTOMER_NAME, CONTACT_NAME, ADDRESS, PHONE, etc., where a given customer has two rows, one for a daytime contact and one for an evening contact, or primary and secondary contacts. If you join orders, customers, and contacts, you will get duplicated order records in your results if you don’t filter to a single contact.

This might seem like an overly simplified and obvious example, but the Cartesian error can sneak into results without an obvious cause presenting itself in the data. Using the example above, maybe you use the contact table to pull the addresses associated with orders, but your query doesn’t return the contact name. Looking at the query results you may see two identical rows with no indication of what caused the duplicates, because the differentiating data wasn’t selected. Going back to your query with the thought in mind that all joins are Cartesian will (for me, anyway) help identify the problem.

The mnemonic won’t solve the problem, but it will hopefully help identify it. I’ve worked with many developers who only looked at joins as look-ups and filters without ever considering the multiplicative nature of the operation.
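The contacts example can be sketched in a few lines. SQLite (via Python) stands in for any SQL engine here, and the table layout is invented for illustration, not taken from a real schema:

```python
import sqlite3

# Hypothetical tables for the customer-contacts example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders   (order_id INTEGER, customer_id INTEGER);
CREATE TABLE contacts (customer_id INTEGER, contact_name TEXT, address TEXT);
INSERT INTO orders   VALUES (100, 1);
INSERT INTO contacts VALUES (1, 'Day contact',     '123 Main St'),
                            (1, 'Evening contact', '123 Main St');
""")

# One order joined to two matching contact rows: the order is duplicated.
# Because contact_name is not selected, the duplicates look identical.
rows = con.execute("""
    SELECT o.order_id, c.address
      FROM orders o JOIN contacts c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)
```

One order in, two identical-looking rows out, with nothing in the output to explain why.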

Inner and Outer Joins

The predicates of these joins are the same. That is, you use the same comparison operators and functions in the same way with INNER and OUTER joins. The functional difference comes from the join syntax itself, either INNER JOIN (the default) or LEFT/RIGHT OUTER JOIN or Oracle’s distinctive outer join syntax using the (+) operator. This article isn’t meant to detail the full specification of all join variants; but I will give a quick summary.

An inner join returns results for all row pairings that satisfy the join conditions. An outer join can be thought of in terms of a base table and a joined table. The outer join will return all row pairings that satisfy the join conditions, just as an inner join would. In addition, the outer join will also return a row for each row of the base table that does not have a pairing with the joined table. Any values that would be read from the joined table will be NULL for these unpaired rows. In the case of a FULL OUTER JOIN, each table acts as both base and joined table. Thus all row pairs satisfying the join condition will be returned, and rows from each table that don’t have a matching row in the other table will also be returned, again with NULL values supplied for the missing rows.

Inner Join Predicates

In the case of an inner join, all predicates pertaining to a table can be considered as part of the join conditions. For example, if I have a filtering rule that only applies to one table then these are all equivalent:

SELECT * from A INNER JOIN B ON  A.id = B.id WHERE B.flag='Y';
SELECT * from A INNER JOIN B ON  A.id = B.id AND B.flag='Y';
SELECT * from A, B WHERE A.id = B.id AND B.flag='Y'; 

So, the optimizer is free to filter results to only the ‘Y’ records of B first, then join that subset to the A rows based on their mutual ID column. It would be functionally equivalent to join all of A and B based on ID first, and then filter that combined result set to only return those that have B.flag = ‘Y’. Either way, the set of returned results will always be the same.
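A quick way to convince yourself of this equivalence is to run all three forms against the same data. This sketch uses SQLite through Python rather than Oracle, with made-up two-column tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER, flag TEXT);
INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (1, 'Y'), (2, 'N'), (3, 'Y');
""")

queries = [
    "SELECT * FROM a INNER JOIN b ON a.id = b.id WHERE b.flag = 'Y'",
    "SELECT * FROM a INNER JOIN b ON a.id = b.id AND b.flag = 'Y'",
    "SELECT * FROM a, b WHERE a.id = b.id AND b.flag = 'Y'",
]
results = [sorted(con.execute(q).fetchall()) for q in queries]
print(results[0])                     # only the 'Y' rows, in every form
assert results[0] == results[1] == results[2]
```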

Outer Join Predicates

With outer joins you need to be more careful where you put your conditions. Let’s compare the first two queries above if they are LEFT OUTER JOINs instead of INNER JOINs.

SELECT * from A LEFT OUTER JOIN B ON  A.id = B.id WHERE B.flag='Y';
SELECT * from A LEFT OUTER JOIN B ON  A.id = B.id AND B.flag='Y';

These two are NOT functionally equivalent. The first query will perform the join of all A and B before evaluating the B.flag condition. If a row in A does not have a corresponding row in B, that row will still appear in the join results until the B.flag condition is applied. At that point, the joined result for that row will be excluded, because all B values will be NULL due to the outer join, so B.flag = 'Y' will not be true for any outer-joined rows. In this case, the results are equivalent to those of an INNER JOIN, except with the extra resource consumption of performing the OUTER JOIN operation.

The second query pushes the B.flag condition into the join itself. Thus the only B records that will be evaluated as part of the join will be those where B.flag = ‘Y’. Then, any A rows left unpaired will be represented in the join results with NULLs for any B values.
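The difference is easy to demonstrate with a toy data set; again SQLite via Python stands in for Oracle, with invented tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER, flag TEXT);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (1, 'Y');   -- no B row at all for a.id = 2
""")

# Condition in the WHERE clause: applied after the outer join,
# so the unmatched A row (all-NULL B values) is filtered out.
in_where = con.execute(
    "SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id WHERE b.flag = 'Y'"
).fetchall()

# Condition in the ON clause: applied during the join,
# so the unmatched A row survives with NULLs for B.
in_on = con.execute(
    "SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.flag = 'Y'"
).fetchall()

print(in_where)   # behaves like an inner join
print(in_on)      # preserves the unmatched A row
```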

Instead of vague A, B and flag identifiers, a more concrete example might help illustrate the difference. In the next snippets I’ll count the number of orders per customer for the current month (as of this writing, February 2020) using inner and outer joins as well as moving the date criteria in and out of the join predicates.

SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c INNER JOIN orders o
  3        ON o.customer_id = c.customer_id
  4     WHERE o.order_date >= DATE '2020-02-01'
  5       AND o.order_date < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME  COUNT(O.ORDER_ID)
----------- -------------- -----------------
          1 ACME                           5
          2 Globex                         3

SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c INNER JOIN orders o
  3        ON o.customer_id = c.customer_id
  4       AND o.order_date >= DATE '2020-02-01'
  5       AND o.order_date < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME  COUNT(O.ORDER_ID)
----------- -------------- -----------------
          1 ACME                           5
          2 Globex                         3

SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c LEFT OUTER JOIN orders o
  3        ON o.customer_id = c.customer_id
  4     WHERE o.order_date >= DATE '2020-02-01'
  5       AND o.order_date < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME  COUNT(O.ORDER_ID)
----------- -------------- -----------------
          1 ACME                           5
          2 Globex                         3

Note how all three queries return the same results. From the inner joins we expect as much; the outer join, though, might not give the results we really want in the report, and if it does, then an inner join is clearly more appropriate to achieve the same results.

Compare those results to the following, where the date filter is part of the join predicates. Here the third customer, Initech, is present in the final results even though they have not placed an order yet this month.

SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c LEFT OUTER JOIN orders o
  3        ON o.customer_id = c.customer_id
  4       AND o.order_date >= DATE '2020-02-01'
  5       AND o.order_date < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME  COUNT(O.ORDER_ID)
----------- -------------- -----------------
          1 ACME                           5
          2 Globex                         3
          3 Initech                        0

These differences aren’t due just to ANSI join syntax. The same is true of Oracle’s “(+)” syntax for outer joins. In the examples below, first I’ll join only on the ids and then filter by dates, resulting in the Initech customer being excluded for having no orders (i.e. functionally equivalent to an inner join, but with the work of an outer join). In the second query, the dates will be part of the join criteria, allowing the outer join to return the Initech row with 0 orders.

SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c, orders o
  3     WHERE o.customer_id (+) = c.customer_id
  4       AND o.order_date >= DATE '2020-02-01'
  5       AND o.order_date < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME  COUNT(O.ORDER_ID)
----------- -------------- -----------------
          1 ACME                           5
          2 Globex                         3


SQL>   SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
  2      FROM customers c, orders o
  3     WHERE o.customer_id (+) = c.customer_id
  4       AND o.order_date(+) >= DATE '2020-02-01'
  5       AND o.order_date(+) < DATE '2020-03-01'
  6  GROUP BY c.customer_id, c.customer_name
  7  ORDER BY c.customer_id;

CUSTOMER_ID CUSTOMER_NAME   COUNT(O.ORDER_ID)
----------- --------------  -----------------
          1 ACME                            5
          2 Globex                          3
          3 Initech                         0

Where the outer join is equivalent to an inner join, I’ve noted that even though the results are equivalent, there is different work involved. That is not entirely true, but it’s better practice to think of your code as if it were. The optimizer can sometimes figure out on its own that an outer join could be evaluated as an inner join. Sometimes the optimizer can do a better job of it than a person, by examining all of the constraints and conditions to determine when NULL values will be excluded. In 11g, Oracle added an optimizer hint, OUTER_JOIN_TO_INNER, that you can see in the explain plan outline when such a conversion occurs. The queries above are simple enough that the optimizer should be able to figure it out, and in the plan excerpt below you can see it did.

-----------------------------------------------------
| Id  | Operation                     | Name        |
-----------------------------------------------------
|   0 | SELECT STATEMENT              |             |
|   1 |  SORT ORDER BY                |             |
|   2 |   NESTED LOOPS                |             |
|   3 |    NESTED LOOPS               |             |
|   4 |     VIEW                      | VW_GBF_7    |
|   5 |      HASH GROUP BY            |             |
|*  6 |       TABLE ACCESS FULL       | ORDERS      |
|*  7 |     INDEX UNIQUE SCAN         | SYS_C007672 |
|   8 |    TABLE ACCESS BY INDEX ROWID| CUSTOMERS   |
-----------------------------------------------------

Outline Data
-------------
... 
      OUTER_JOIN_TO_INNER(@"SEL$2BFA4EE4" "O"@"SEL$1")
...

Even though the optimizer can do this type of conversion for you, it is better to write inner joins as inner joins, not as outer joins that may or may not get simplified for you.

First and most importantly, writing it correctly forces you to think through the operations and determine what the correct results should be. This in turn helps whoever follows you read and understand the code. If you write an outer join that always functions as an inner join, it prompts the question “Does this code do what it is supposed to?”

Second, writing an inner join when you mean an inner join saves the optimizer steps of trying to figure out if it can and should rearrange the logic to simplify the query.

Furthermore, the optimizer can’t always figure out all permutations to ensure a proper simplification. So if you use outer joins as a safety net, hoping the optimizer will always clean it up for you, you’ll be disappointed (as will your customers).

Last, outer joins and inner joins are intended to be functionally different. Using an outer join when there should have been an inner join is not only confusing and inefficient, it’s also likely to be error prone. It’s only in exceptional circumstances that you might get an optimizer assist.

Filter Predicates

These were already covered a little in the previous section but, despite their simplicity, they still deserve a deeper discussion. Where join predicates are the workhorses that connect two tables, filters cover everything else. Every condition applied to a value is some kind of filter on the final results.

There may be a bit of a jargon issue here. I’m using “filter” in the common English sense, meaning these predicates act to “sift” the data: they determine whether a row will or will not be included in the result set. They may apply to the final result set of a query, or to any number of intermediate result sets, as we saw in the join conditions above. The date filters could apply to an intermediate result set, determining which order rows would be joined to the customers, or they could be applied after the customers and orders were joined, filtering out the rows that weren’t needed.

That may seem an obvious usage of the word, but I wanted to clarify because Oracle also uses the word “filter” in a similar but narrower context. When classifying predicates within a query plan, a predicate may be used to drive access or it may be used to filter. In both cases, access and filter create a restriction on which rows will be returned (that is, they both “filter” as originally described). The difference is that an access predicate drives some sort of physical operation, like an index scan, whereas a filter predicate is applied to all rows returned from some operation, which could itself include an index scan.

One may ask if one type is better than the other. The answer is a resounding maybe! It depends on what is being filtered and what the operations are. If an index scan is the most efficient operation then an access predicate to drive the index is good. If an index would not be helpful, then a filter predicate on a table may be best. It’s also possible you may use both. Perhaps I use an indexed date column with an access predicate to find all orders of a particular day, but then I also use a filter predicate on those orders based on some order attributes.

It’s also possible that an access predicate should be used, but the query is constructed in such a way that a filter predicate is required. The classic cause of this is applying a function or operation to a column.

For example, these two sets of conditions are equivalent in determining which rows will be returned:

AND o.order_date >= DATE '2020-02-01'
AND o.order_date < DATE '2020-03-01'

AND TRUNC(o.order_date,'mm') = DATE '2020-02-01'

An index on order_date can use access predicates with the first set of conditions to find rows; but that same index cannot be used for the second condition using TRUNC. So a filter predicate would have to be applied to every row to determine whether it was a valid result candidate. This can be ameliorated through function-based indexes (FBIs), and in some cases that would be the best solution. FBIs can be limiting, though, because they are only usable when queries use the same function. If you queried with TRUNC(order_date), then a TRUNC(order_date,'mm') index would not help; you’d have to create yet another index to cover daily truncation values. If you sometimes query by week, quarter, or year, do you want to create an additional index for every possible function usage? That’s clearly not a scalable solution. A single, simple index on the base column itself is the most versatile; you just have to write your queries to utilize it.
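The same effect is visible in other engines. In this SQLite sketch (table and index names invented), wrapping the column in a function turns a keyed index search into a full scan; EXPLAIN QUERY PLAN is SQLite’s rough analogue of the access/filter distinction:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT);
CREATE INDEX idx_orders_date ON orders(order_date);
INSERT INTO orders VALUES (1, '2020-02-10'), (2, '2020-03-05');
""")

def plan(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row holds the operation text.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Bare column in a range condition: the index can drive access.
ranged = plan("""SELECT * FROM orders
                 WHERE order_date >= '2020-02-01'
                   AND order_date <  '2020-03-01'""")

# Function wrapped around the column: every row must be checked.
wrapped = plan("SELECT * FROM orders WHERE substr(order_date, 1, 7) = '2020-02'")

print(ranged)    # a SEARCH using idx_orders_date
print(wrapped)   # a full-table SCAN; the index is unusable
```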

Filter predicates might seem like an expensive option because they are applied to every row in a given set. That is true, but they can also be used as a shortcut. In the query below I’ve repeated the inner join example above, except I’ve added an additional predicate so that the query only returns rows once March has begun, because a partial month isn’t helpful to my accounting. The resulting plan adds a FILTER operation at step 2.

SELECT c.customer_id, c.customer_name, COUNT(o.order_id)
      FROM customers c INNER JOIN orders o
        ON o.customer_id = c.customer_id
     WHERE SYSDATE >=  DATE '2020-03-01'
       AND o.order_date >= DATE '2020-02-01'
       AND o.order_date < DATE '2020-03-01'
  GROUP BY c.customer_id, c.customer_name
  ORDER BY c.customer_id;

-------------------------------------------------------------------
| Id  | Operation                               | Name            |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                 |
|   1 |  SORT GROUP BY                          |                 |
|*  2 |   FILTER                                |                 |
|   3 |    NESTED LOOPS                         |                 |
|   4 |     NESTED LOOPS                        |                 |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| ORDERS          |
|*  6 |       INDEX RANGE SCAN                  | IDX_ORDERS_DATE |
|*  7 |      INDEX UNIQUE SCAN                  | SYS_C0061807    |
|   8 |     TABLE ACCESS BY INDEX ROWID         | CUSTOMERS       |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(SYSDATE@!>=TO_DATE(' 2020-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   6 - access("O"."ORDER_DATE">=TO_DATE(' 2020-02-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND 
              "O"."ORDER_DATE"<TO_DATE(' 2020-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   7 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")

When running the query and reaching step 2, if the FILTER predicate does not evaluate to TRUE, execution does not need to continue into the children of step 2, thereby bypassing all of the IO and processing below it. Comparing the autotrace statistics for the original and modified queries, we can see there was no IO at all (zero gets) in the modified version.

Original
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         10  consistent gets
          0  physical reads
          0  redo size
        541  bytes sent via SQL*Net to client
        546  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          2  rows processed

Modified with SYSDATE predicate
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          0  consistent gets
          0  physical reads
          0  redo size
        398  bytes sent via SQL*Net to client
        535  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          0  rows processed

These data sets are tiny and ran on one of my sandboxes, so performance isn’t an issue here; but if this were a production system with thousands, millions, or even billions of rows, allowing a plan to skip steps entirely could lead to significant performance gains. Sometimes this type of execution can be an indication of a problem. If a SQL statement finishes in some miraculously fast time but appears to be returning incomplete results, the problem may be an early filter predicate negating data reads. Looking through the plan for FILTER operations and the corresponding filter predicates may pinpoint a logic error.

Pushable Predicates

I alluded to these above when comparing two join variations. When a predicate is pushed it means the predicate is evaluated in a different logical position within a query than it appears syntactically. It’s possible a single condition could be pushed multiple places. In the examples below I’ll start with an intentionally contrived but simple query and progress through more complicated constructs to illustrate pushing.

SELECT *
  FROM (SELECT * FROM customers)
 WHERE customer_id = 1;

The query selects from an unfiltered inline view and, on the outside, filters the results to a single customer. If the optimizer executed the query in logical steps that matched the syntax, it would be extremely inefficient: it would read all of the data in the table for the inline view and then throw away most of the results to return a single customer. As you might hope and expect, the optimizer does not execute the query that way. Instead, it pushes the filter into the subquery. In fact, it also eliminates the subquery step entirely, so it evaluates the query in the simpler form:

SELECT *
  FROM customers
 WHERE customer_id = 1;
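This flattening is easy to observe in other engines as well. In SQLite (through Python, with a made-up customers table), the plan for the nested form shows a single keyed search of the base table, meaning the inline view has been eliminated and the predicate pushed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
INSERT INTO customers VALUES (1, 'ACME'), (2, 'Globex');
""")

# Join the EXPLAIN QUERY PLAN operation text into one string.
detail = " ".join(row[3] for row in con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM (SELECT * FROM customers) WHERE customer_id = 1
"""))
print(detail)   # one keyed SEARCH of customers; no separate subquery step
```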

That’s perhaps too simple an example, because the logical rearrangement to eliminate the inline view makes the push implicit; but hopefully it gives an idea of where this section is headed. In the next example, we’ll see a condition assigned to one table pushed through the join into the other table. This logical rearrangement is a basic application of the transitive property we learn in elementary school.

  SELECT c.customer_name, COUNT(o.order_id)
    FROM customers c INNER JOIN orders o 
      ON o.customer_id = c.customer_id
   WHERE c.customer_id = 1
GROUP BY c.customer_name;

The plan for this query shows the customer_id = 1 condition is pushed not just into the customers table but into orders as well. So, instead of checking that each order’s customer_id equals the customer table’s value, the optimizer can go directly to the id needed by using the same condition.

----------------------------------------------------------------
| Id  | Operation                     | Name                   |
----------------------------------------------------------------
|   0 | SELECT STATEMENT              |                        |
|   1 |  HASH GROUP BY                |                        |
|   2 |   NESTED LOOPS                |                        |
|   3 |    TABLE ACCESS BY INDEX ROWID| CUSTOMERS              |
|*  4 |     INDEX UNIQUE SCAN         | PK_CUSTOMERS           |
|*  5 |    INDEX RANGE SCAN           | IDX_ORDERS_CUSTOMER_ID |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("C"."CUSTOMER_ID"=1)
   5 - access("O"."CUSTOMER_ID"=1)

The goal of a pushed predicate is to cause evaluation of the predicate earlier within the execution in order to propagate fewer rows through the other steps of the plan. If you have views within views, whether inline, refactored, or defined as objects within the database, the pushability of predicates becomes integral to efficient performance. If every layer of all views must be evaluated in their entirety before applying filters at the outside final layer it could make some query stacks effectively unusable.

Using our customers and orders tables from earlier, let’s create a view that summarizes monthly counts per customer.

CREATE OR REPLACE VIEW monthly_customer_orders AS
  SELECT c.customer_id,
         c.customer_name, 
         TRUNC(o.order_date, 'mm') order_month,
         COUNT(o.order_id) order_count
    FROM customers c INNER JOIN orders o 
      ON o.customer_id = c.customer_id
GROUP BY c.customer_id,
         c.customer_name,
         TRUNC(o.order_date, 'mm');

Some query structures will allow pushing of some predicates but not others. Compare the plans of the following three queries:

SELECT *
  FROM monthly_customer_orders
 WHERE customer_id = 1;

------------------------------------------------------------------------
| Id  | Operation                             | Name                   |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                        |
|   1 |  HASH GROUP BY                        |                        |
|   2 |   NESTED LOOPS                        |                        |
|   3 |    TABLE ACCESS BY INDEX ROWID        | CUSTOMERS              |
|*  4 |     INDEX UNIQUE SCAN                 | PK_CUSTOMERS           |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| ORDERS                 |
|*  6 |     INDEX RANGE SCAN                  | IDX_ORDERS_CUSTOMER_ID |
------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("C"."CUSTOMER_ID"=1)
   6 - access("O"."CUSTOMER_ID"=1)

SELECT *
  FROM monthly_customer_orders
 WHERE order_count > 25;

------------------------------------------
| Id  | Operation            | Name      |
------------------------------------------
|   0 | SELECT STATEMENT     |           |
|*  1 |  FILTER              |           |
|   2 |   HASH GROUP BY      |           |
|*  3 |    HASH JOIN         |           |
|   4 |     TABLE ACCESS FULL| CUSTOMERS |
|   5 |     TABLE ACCESS FULL| ORDERS    |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT(*)>25)
   3 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")

SELECT *
  FROM monthly_customer_orders
 WHERE customer_id = 1
   AND order_count > 25;

-------------------------------------------------------------------------
| Id  | Operation                              | Name                   |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                        |
|*  1 |  FILTER                                |                        |
|   2 |   HASH GROUP BY                        |                        |
|   3 |    NESTED LOOPS                        |                        |
|   4 |     TABLE ACCESS BY INDEX ROWID        | CUSTOMERS              |
|*  5 |      INDEX UNIQUE SCAN                 | PK_CUSTOMERS           |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED| ORDERS                 |
|*  7 |      INDEX RANGE SCAN                  | IDX_ORDERS_CUSTOMER_ID |
-------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(COUNT(*)>25)
   5 - access("C"."CUSTOMER_ID"=1)
   7 - access("O"."CUSTOMER_ID"=1)

Note how the CUSTOMER_ID condition is pushed through the view down to the underlying tables to act as an access predicate. This is not only an efficient lookup through the index but, more importantly, an earlier evaluation of the predicate, so the other customers don’t need to be returned and processed.

The ORDER_COUNT condition, though, cannot be pushed into the view. The reason is that the count for any customer can’t be known until all of the orders for that customer have been found and processed. So the count condition can only be applied as a filter step at the very end of processing. Even though the filter is ID 1 of the plan, it does not act as a shortcut, because it can’t be evaluated until after its child operations have completed.
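Another way to see why ORDER_COUNT can’t be pushed: a predicate on an aggregated column is really a HAVING condition in disguise, testable only once the group is complete. A SQLite sketch with an invented schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
CREATE VIEW order_counts AS
  SELECT customer_id, COUNT(*) AS order_count
    FROM orders GROUP BY customer_id;
""")

# Filtering the view's aggregate column...
via_view = con.execute(
    "SELECT * FROM order_counts WHERE order_count > 1").fetchall()

# ...is the same as a HAVING clause: evaluated only after grouping.
via_having = con.execute("""
    SELECT customer_id, COUNT(*) FROM orders
     GROUP BY customer_id HAVING COUNT(*) > 1""").fetchall()

print(via_view)
assert via_view == via_having
```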

There is no universal solution to pushable vs unpushable predicates. The main takeaway for you here is to simply be aware of the difference. I’ve worked with many developers who look at explain plans in terms of the operation steps through tables, indexes, and joins; but didn’t take the next step of looking at which predicates were being evaluated during each operation and how they were being evaluated.

When you see a FILTER step in a plan, what condition or conditions is it evaluating? Is there some information you know that would make it legal to evaluate those conditions earlier? If so, what do you need to give the optimizer to make that happen? Maybe the predicate is exactly correct but the problem is the order of joins. Are you missing statistics, indexes, partitions, or clustering that would tell the optimizer it has more tools to work with? Maybe the problem is the predicate itself. Are you forcing implicit or explicit conversions that aren’t necessary? Perhaps additional conditions would be helpful in cases where the optimizer, for whatever reason, isn’t able to make the logical leaps of what is both legal and beneficial. Sometimes obviously true transitivity is only obvious to human eyes and the optimizer doesn’t find it; I’ve blogged about that phenomenon before. In those cases, some redundant conditions may be the nudge the optimizer needs to push predicates it otherwise wouldn’t know were correct to push.

Impossible Predicates

Sometimes it is impossible to create a predicate at all, pushable or unpushable, because a necessary value isn’t exposed. For example, let’s say we didn’t want to look at all orders for a month, but only those placed on weekends. Since the view doesn’t expose the individual order dates, it’s not possible to write such a query against the view in its current form.

If we have access to the underlying tables, then we can bypass the view and query the orders directly; but this is not always possible. Views are often created as an interface between users and the raw data. This could be for security, compatibility, formatting, or to ensure business rules are always applied to the data before end-user processing. So, to ensure proper processing we may be forced to use a particular API, in the form of a view, or possibly a function returning a table collection.

In some cases you may be able to embed variable predicates within a view that will allow you to “push” a predicate manually where Oracle would not be able to on its own. One method is to include the evaluation of a system context as part of the view to determine the results it will return. This does make the view more complicated, and likely slower because it must do more work in order to determine which rows to return. This is especially true where our conditions require using functions on the table columns, so each row must pass its values into the functions before being able to evaluate the context condition.

Here I’ll give an example where I assume we have an “ORDERS” context defined, and we’ll check whether it has a “FILTER” value populated. If it’s not populated, we return everything. If it is populated, we restrict the rows returned to only weekend days or holidays, where those are determined by function calls.

CREATE OR REPLACE VIEW monthly_customer_orders
AS
      SELECT c.customer_id,
             c.customer_name,
             TRUNC(o.order_date, 'mm') order_month,
             COUNT(o.order_id) order_count
        FROM customers c INNER JOIN orders o ON o.customer_id = c.customer_id
       WHERE SYS_CONTEXT('orders','filter') IS NULL
          OR (    SYS_CONTEXT('orders','filter') = 'weekend'
              AND TO_CHAR(o.order_date, 'Dy') IN ('Sat', 'Sun')
             )
          OR (    SYS_CONTEXT('orders','filter') = 'holiday'
              AND is_holiday(o.order_date) = 'Y'
             )
    GROUP BY c.customer_id, c.customer_name, TRUNC(o.order_date, 'mm');

So, using this view with a pushed predicate requires calling a stored procedure first to set the context filter value, and then querying the view. This method is cumbersome, but it allows a great deal of flexibility and can result in significant performance gains for some problems.
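Oracle’s SYS_CONTEXT has no direct SQLite equivalent, but the pattern itself can be mimicked with an application-defined function that the view calls to read session state. This is only an analogue for illustration; the table, view, and function names are invented:

```python
import sqlite3

state = {"filter": None}   # stands in for the Oracle context

con = sqlite3.connect(":memory:")
# The view calls this function on each evaluation, like SYS_CONTEXT.
con.create_function("ctx_filter", 0, lambda: state["filter"])

con.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT);
INSERT INTO orders VALUES (1, '2020-02-01'),   -- a Saturday
                          (2, '2020-02-03');   -- a Monday

CREATE VIEW filtered_orders AS
  SELECT * FROM orders
   WHERE ctx_filter() IS NULL
      OR (ctx_filter() = 'weekend'
          AND strftime('%w', order_date) IN ('0', '6'));
""")

before = con.execute("SELECT COUNT(*) FROM filtered_orders").fetchone()[0]
state["filter"] = "weekend"          # "set the context", then re-query
after = con.execute("SELECT COUNT(*) FROM filtered_orders").fetchone()[0]
print(before, after)   # all rows first, weekend rows only after
```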

I used a method similar to this to solve a problem with a complicated system of layered views. I can’t share the actual code, but I can describe the methodology. The raw data was stored with GMT/UTC timestamps. Views on top of that truncated the values to days to create summaries in US/Eastern time. Views on those further summarized the data into months. Views on top of those applied other functions, formatting, and aggregations. All of this was fine until the uppermost layer of views needed to be joined to other systems that recorded time in GMT/UTC, corresponding to the raw data’s timestamps. Because of required logic within the views it wasn’t possible to join directly to the underlying table; but joining from top-level GMT values through US/Eastern monthly and daily aggregations wasn’t efficiently possible either. There would always be 4 or 5 hours of overlap in months converted from GMT to US/Eastern. The solution was to call a procedure that set a series of filtering values based on the other system’s GMT dates. The bottom view was modified to use the package’s values as part of its WHERE clauses, thus allowing the filters from the second system to skip through the views and be applied directly in the bottom layer against the raw data: GMT to GMT.

Using table functions allows dynamic SQL to be used, creating the opportunity for tailored queries based on specific criteria. This also allows parameters to be passed directly to the function as part of a query, rather than the two-step process of calling a setter procedure before querying a view.
