SQL and Set Theory - Sean D. Stuber

I polled some friends for topics and one asked for an article on collections. I’ve written about collections before; but usually in the context of solving some other problem rather than focusing on them specifically. Before digging into the collection types and the SQL and PL/SQL syntax of using them; I want to start with a something more fundamental to their usage – and that is Set Theory.

Most of the concepts I’ll be reviewing are learned in elementary school; but are often forgotten or dismissed when developers begin coding. This can be problematic coming from a background in procedural languages. Changing one’s thinking to SQL’s declarative style sometimes feels unnatural or awkward. So let’s refresh some of the key topics and see how they relate to SQL and from there expand into collection types.

A set, in general, is a pretty easy thing to grasp. It’s simply a collection of things, such as sets of integers or the set of real number. Outside of mathematics a set could represent a collection of people or other items. Formalizing our understanding, sets have certain properties. The elements of a set normally have no order. The elements may be divided into mutually disjoint subsets. For example, within the set of integers the elements may be divided into even and odd, or positive and negative sets. We can also perform various operations on a set:

Element membership ( ∈ ∉ )
This is simply a Boolean operation. An element is or is not a member of a set.
e.g. 7 ∈ Integers

Complement ( ′ )
The complement of a set is the set of all elements that are not in that set.
e.g. The set of even integers is the complement of the set of odd integers.
It’s important to note that complements are usually defined with respect to some space.
The even/odd complementary numbers only apply within the space of integers. Within real numbers, the complement of the even numbers would be all odd numbers, rational non-integers and irrational numbers.

Relative Complement ( – )
The name of this operation is a bit intimidating, but its symbol and concept are fairly easy. This is, more or less, subtraction of sets. Given sets A and B, the relative complement of A with respect to B is written A –B. The result of this operation is the subset of elements in A that are not members of B.

Intersection ( ⋂ )
The intersection of two sets is the subset of elements that are members of both sets.

Union ( ⋃ )
The union of two sets is the distinct set of elements that are members of either set. This is sort of like adding sets, except the result has no duplicates.

Cartesian Product ( × )
The product of two sets creates a new set of pairs of elements from each. That is, for each element a₁.. a_n of set A and each element b₁.. b_n of set B the Cartesian Product (A × B) will yield a set of element-pairs {(a₁,b₁),(a1,b2)..(a₁,b_n)..(a_n,b₁)..(a_n,b_n)}

The operators may have clear functions and are certainly usable as is; but it can be unwieldy to work with sets as lists of elements as well as sometimes difficult to picture the results of series of abstract symbols. When working with sets as a whole, rather than individual elements it can be helpful to use Venn and Euler diagrams. An Euler diagram is a collection of circles or other enclosed shapes that represent a set. Intersections, unions and complements of sets are shown by the degree to which the shapes are overlapped or separate as illustrated below.

A Venn diagram is a special case of Euler diagram that illustrates all possible combinations of set relations. The simple 2 circle diagrams above are Venn diagrams because they show the areas of each set that are distinct and the area intersecting.

It’s important to note that the illustration is meant to represent the relations; but does not necessarily imply element membership within those relations. For example, in the two-circle diagrams above you can see 4 areas delimited: the set outside of both sets, the set of just A, the set of just B, and the intersection of A and B. If we define A to be even integers and B to be odd integers, then the set outside of A and B (the complement of the union) would be all non-integral numbers.

The intersection of A and B would be the empty set because there are no numbers that are both even and odd.

The diagrams can be particularly helpful when combining the symbolic operators in combination. The end result might be hard to follow simply by reading the symbols. For example, it might not be immediately obvious the statement

(A′ ⋃ B ′) = (A ⋂ B) ′

is true. However, visualizing with a simple Venn diagram makes is quite apparent.

Complement of the intersection of A and B

As mentioned above, these set operations are probably familiar to most readers from elementary school; but possibly a little rusty on the specifics. Hopefully this refresher helped . The question though is: “how do these operations correlate to SQL statements in Oracle?”

First, database tables map well to sets. Tables are a collection of distinct rows, the rows have no intrinsic order within a table (as far as sql statements are concerned) and tables may be partitioned (albeit a separately licensed feature.)

Furthermore, the basic set operations have direct SQL syntax counterparts:

Row membership – In/Exists
Complement – Not In / Not Exists
Relative Complement – Minus
Intersection – Intersect
Union – Union
Cartesian Product – Join (I wrote about this previously here)

It’s easy to say and show that the words correspond; but the real test is to see the set theory and sql in action.

The following example helps illustrate the point.

From the diagram above it’s easy to see the intersection of the three ingredients, flour, eggs and milk go to form the basis for pancakes. Let’s represent our graphed data with tables.

SQL> select * from ingredient;
INGREDIENT_ID NAME
------------- --------------------
            1 Flour
            2 Egg
            3 Milk

Next we’ll create a table to indicate the results of each intersection shown.

SQL> select * from recipe;
 RECIPE_ID NAME
---------- --------------------
         1 Pasta
         2 Omelette
         3 Batter
         4 Pancakes

And finally, a table that maps the ingredients to recipes.

SQL> select * from recipe_ingredient;
 RECIPE_ID INGREDIENT_ID
---------- -------------
         1             1
         1             2
         2             2
         2             3
         3             1
         3             3
         4             1
         4             2
         4             3

So, the given this data one might ask, can we verify the data to the Venn diagram? Can we determine the ingredients needed to make Pancakes?

SQL> select i.*
  2  from recipe r, ingredient i, recipe_ingredient ri
  3  where r.recipe_id = ri.recipe_id
  4    and i.ingredient_id = ri.ingredient_id
  5    and r.name = 'Pancakes';
INGREDIENT_ID NAME
------------- --------------------
            1 Flour
            2 Egg
            3 Milk

That’s a little too easy though. What if want to construct a query in the other direction? That is, given the three ingredients and a set of recipes, which recipes will use all of the ingredients?

We can try a set of joins. This structure mirrors how one might pursue it procedurally. That is, iterating over each ingredient looking to see where it might overlap with the next ingredient until you’ve exhausted ingredients 1, 2, and 3.

SQL> SELECT r.name
  2    FROM recipe r,
  3         ingredient i1,
  4         recipe_ingredient ri1,
  5         ingredient i2,
  6         recipe_ingredient ri2,
  7         ingredient i3,
  8         recipe_ingredient ri3
  9   WHERE i1.name = 'Milk'
 10     AND i2.name = 'Flour'
 11     AND i3.name = 'Egg'
 12     AND i1.ingredient_id = ri1.ingredient_id
 13     AND i2.ingredient_id = ri2.ingredient_id
 14     AND i3.ingredient_id = ri3.ingredient_id
 15     AND r.recipe_id = ri1.recipe_id
 16     AND r.recipe_id = ri2.recipe_id
 17     AND r.recipe_id = ri3.recipe_id;
NAME
--------------------
Pancakes

That’s kind of cumbersome though. What if we try implementing the Venn diagram more directly in SQL with “Intersect” of the various recipes for each ingredient?

SQL> SELECT r.name
  2    FROM recipe r, ingredient i, recipe_ingredient ri
  3   WHERE r.recipe_id = ri.recipe_id
  4     AND i.ingredient_id = ri.ingredient_id
  5     AND i.name = 'Milk'
  6  INTERSECT
  7  SELECT r.name
  8    FROM recipe r, ingredient i, recipe_ingredient ri
  9   WHERE r.recipe_id = ri.recipe_id
 10     AND i.ingredient_id = ri.ingredient_id
 11     AND i.name = 'Flour'
 12  INTERSECT
 13  SELECT r.name
 14    FROM recipe r, ingredient i, recipe_ingredient ri
 15   WHERE r.recipe_id = ri.recipe_id
 16     AND i.ingredient_id = ri.ingredient_id
 17     AND i.name = 'Egg';
NAME
--------------------
Pancakes

That seems a little more reasonable but still a lot of code. Does Oracle have any built-in methods of solving the problem? Answer: Yes, we can try aggregating the ingredients of each recipe and see which aggregate contains all of the ingredients. In 11gR2 and above we can use LISTAGG and check if all three ingredients can be found somewhere in the list of items for each recipe.

SQL> SELECT name
  2    FROM (SELECT
  3              r.name,
  4              LISTAGG(i.name, ',') WITHIN GROUP (ORDER BY 1) l
  5            FROM recipe r, ingredient i, recipe_ingredient ri
  6           WHERE r.recipe_id = ri.recipe_id
  7             AND i.ingredient_id = ri.ingredient_id
  8          GROUP BY r.name)
  9   WHERE l LIKE '%Egg%'
 10     AND l LIKE '%Milk%'
 11     AND l LIKE '%Flour%';
NAME
--------------------
Pancakes

And finally – the most direct application of set-to-sql syntax. Instead of aggregating into a text string, aggregate into a collection with the COLLECT function introduced in 10g and then check if all of our ingredients are part of the recipe’s collection of ingredients.

SQL> SELECT name
  2    FROM (SELECT
  3             r.name,
  4             CAST(COLLECT(i.name) AS ora_mining_varchar2_nt) c
  5            FROM recipe r, ingredient i, recipe_ingredient ri
  6           WHERE r.recipe_id = ri.recipe_id
  7             AND i.ingredient_id = ri.ingredient_id
  8          GROUP BY r.name)
  9   WHERE 'Egg' MEMBER OF c
 10     AND 'Milk' MEMBER OF c
 11     AND 'Flour' MEMBER OF c;
NAME
--------------------
Pancakes

These last two are definitely the most compact in terms of syntax; but how do they compare in terms of performance? Using autotrace we can capture a couple of key metrics for each of them.

`Join ------------------- 43 consistent gets 0 sorts (memory)`	`Intersection -------------------- 54 consistent gets 6 sorts (memory)`
`ListAgg ---------------------- 18 consistent gets 1 sorts (memory)`	`Collect -------------------- 18 consistent gets 1 sorts (memory)`

We can see that using aggregations is clearly the most effective; which makes sense given that the other two methods needed to query the tables multiple times in order to achieve the same result.

Another option, looking at the diagram, we can see we are only interested in the recipes that have exactly 3 ingredients. From our diagram we know that 3 is the entire realm of ingredients so we don’t really need to specify them individually in the query. So, given this domain knowledge we can say any recipe that has 3 ingredients will necessarily have milk, eggs and flour.

Implementing that in sql is quite simple and very effective.

SQL> SELECT   r.name
  2      FROM recipe r, recipe_ingredient ri
  3     WHERE r.recipe_id = ri.recipe_id
  4  GROUP BY r.name
  5    HAVING COUNT(ri.recipe_id) = 3;

NAME
--------------------
Pancakes

Statistics
----------------------------------------------------
         12  consistent gets
          0  sorts (memory)

And this method is the best of all. It’s not using anything in the syntax that specifically maps to set operations; but it is an implementation of information that is able to be pulled directly from the Venn diagram.

Summary

The example above may seem contrived and simplistic – and I agree, it is. Most real problems will involve more than three values and the intersection of the values won’t necessarily produce a single result. In some cases it maybe zero (the empty set) or it may produce many rows.

The point though was to explore the various ways to implement the diagram and think about what the set relations meant and how they could be used to describe the results. This thinking in terms of descriptions is the key to the declarative nature of SQL.

Keeping these operations in mind for tables and sql results will then help as we explore Oracle’s collection types.