Using Correct Data Types - Sean D. Stuber

One of the fundamental aspects of many programming languages is the notion of data types. SQL and PL/SQL are no exception. This article is not an argument one way or the other on whether strongly-typed languages are superior to weakly-typed, or untyped languages. The argument is moot. If you’re working with an Oracle database, you will work with data types. Which side of the strong vs weak debate you fall irrelevant. Since they are here, how you will use them is the important question.

Unfortunately, improper type use is a fairly common problem with database applications. Numbers get stored as text. Dates are stored as numbers or text or both. CHAR and VARCHAR2 are used interchangeably. NUMBER, FLOAT, BINARY_FLOAT, and BINARY_DOUBLE also lead to confusion and misapplication.

In other languages, the use of the wrong data type may impose syntax limitations, extra memory consumption and possibly more cpu to process . In Oracle, with SQL or PL/SQL the same can be true. It’s also important to note data types can act as implicit constraints. For example, if you use a real DATE type you can’t create a value of April 87, 2020. If you were using a text type though, it’s as easy as I just did in the previous sentence. Similarly if you used an integer type you could hold a value of 20200487 just as you could 20200430 for April 30, 2020. Not only are these implicit constraints convenient to ensure data quality, since a database involved, it can be assumed some of those values will be stored in that database for future use. Thus proper data types help keep data clean while processing and when persisted.

Within a database the use of correct data types helps prevent the need for conversions. Oracle can attempt to process conditions like ‘123’ = 123 (a text value compared to a numeric value,) but doing so means at least one side will be converted implicitly into a consistent form. i.e. either the number must be converted to text, or the text converted into a number. The implicit conversions often confuse new developers because in many cases they “just work.” Thus giving the impression there is native interoperability between the data types.

Below is an example using the sample HR schema EMPLOYEES table. My query is a simple primary key lookup using a bind variable; but my variable is not of a numeric type. In order to process the query Oracle must perform a conversion to get the query to be meaningful. This conversion can be seen in the explain plan.

select * from hr.employees where employee_id = :my_text_variable;

-----------------------------------------------------
| Id  | Operation                   | Name          |
-----------------------------------------------------
|   0 | SELECT STATEMENT            |               |
|   1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES     |
|*  2 |   INDEX UNIQUE SCAN         | EMP_EMP_ID_PK |
-----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"=TO_NUMBER(:MY_TEXT_VARIABLE))

In terms of performance this is clearly not ideal. This conversion must take place every time the access condition is evaluated. Some might try to get clever and put evaluate the conversion once by using a scalar subquery; but that doesn’t work either. The subquery will still return a text value and then that returned value will be passed to the TO_NUMBER function implicitly.

select * from hr.employees 
 where employee_id = (select :my_text_variable from dual);

-----------------------------------------------------
| Id  | Operation                   | Name          |
-----------------------------------------------------
|   0 | SELECT STATEMENT            |               |
|   1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES     |
|*  2 |   INDEX UNIQUE SCAN         | EMP_EMP_ID_PK |
|   3 |    FAST DUAL                |               |
-----------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"=TO_NUMBER( (SELECT :MY_TEXT_VARIABLE FROM "SYS"."DUAL" "DUAL")))

One solution to this would be to remove the implicit conversion and use an explicit one and put that conversion inside a scalar subquery so it will be evaluated just once.

select * from hr.employees 
 where employee_id = (select to_number(:my_text_variable) from dual);

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"= (SELECT TO_NUMBER(:MY_TEXT_VARIABLE) FROM "SYS"."DUAL" "DUAL"))

Doing this should ameliorate the bulk of the performance issues but it makes the query more cumbersome (especially if there are multiple variables involved.) It would be less code, easier to read, and more efficient had they simply both been defined to be the same data types.

This problem can be further complicated in joins where the optimizer will decide which side of the condition needs to be evaluated. It will try to do the least amount of work based on its cost estimations; but it can still get it wrong. More importantly though, since the condition is for potentially many rows. So the conversion must take place for every row processed on one side of the join.

select * from hr.employees e 
   inner join security_team  s 
           on e.employee_id = s.employee_id;

Predicate Information (identified by operation id):
---------------------------------------------------
  4 - access("E"."EMPLOYEE_ID"=TO_NUMBER("S"."EMPLOYEE_ID"))

You might be able to get some benefit from scalar subquery caching with syntax similar to that used with the variable; but as we saw above, the code becomes more convoluted even with just a single condition. Subquery caching has limits, so it’s not guaranteed to fix all issues. Plus you now have the responsibility of making sure you put the conversion on the correct side of the condition for optimal performance.

select * from hr.employees e 
   inner join security_team  s 
           on e.employee_id = (select to_number(s.employee_id) from dual);
           
------------------------------------------------------
| Id  | Operation                    | Name          |
------------------------------------------------------
|   0 | SELECT STATEMENT             |               |
|   1 |  NESTED LOOPS                |               |
|   2 |   NESTED LOOPS               |               |
|   3 |    TABLE ACCESS FULL         | SECURITY_TEAM |
|*  4 |    INDEX UNIQUE SCAN         | EMP_EMP_ID_PK |
|   5 |     FAST DUAL                |               |
|   6 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES     |
------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("E"."EMPLOYEE_ID"= (SELECT TO_NUMBER(:B1) FROM "SYS"."DUAL" "DUAL"))

Hopefully these simple examples adequately explain why a single value of the wrong type can lead to complexity and performance impacts; both of which are made worse as the queries grow larger and the number of conversions increases. It’s helpful though to actually measure the impact. I created two collections of ids, one numeric (NUMBER) and one text (VARCHAR2(6). Then looped through each array and timed how long it took to exhaustively loop through all rows by indexed lookup.

SELECT *
  INTO v_customer
  FROM sh.customers
 WHERE cust_id = v_numeric_ids(i);

SELECT *
  INTO v_customer
  FROM sh.customers
 WHERE cust_id = v_text_ids(i);

Numeric: 03.574954000
Text:    03.663864000

The results show a 2.5% difference using just one variable conversion on my test server with the Oracle sample data. For this small scale test the difference may be negligible; but for production use on larger data volumes and more complex queries, that performance penalty will grow and can become noticeably expensive. Worse, this is just one query. Every query that forces an extra conversion will have similar wastefulness in their executions.

This is why it is important to use correct data types in your object creation as well as good practices in coding. Otherwise you may end up with a system that is perpetually compensating for an initial design flaw, causing all activity to run more expensively than necessary.