Several years ago I needed to calculate the skewness of some data. That is, an indication of how asymmetric points are within a distribution. Unfortunately, I found there was no native feature to do this within SQL. So I wrote my own with the Oracle Data Cartridge Interface (ODCI.) Using ODCI you can create your own aggregate functions like MIN, MAX, and AVG.
Recently I had cause to revisit that function and thought I could do a little better. This also seemed like a good opportunity to share some of my work in tribute to Joel Kallman for all he gave to the Oracle community.
A happy discovery while looking into my old code – in the intervening years since my first implmentation Oracle had provided the SKEWNESS_SAMP function to their SQL implementation beginning with the 21c release.
Before I get into the doing things the hard way of implementing my own function, I’ll show the native way.
Here I’m calculating a small, known data set 1,2,4,8,16.
SQL> SELECT skewness_samp(n)
2 FROM ( SELECT POWER(2, LEVEL - 1) n
3 FROM DUAL
4* CONNECT BY LEVEL <= 5);
SKEWNESS_SAMP(N)
___________________________________________
1.32531470981340446231691036153849023328
Next I'll use dba_objects which has around100,000 objects in my test database. Your results will vary.
SQL> SELECT skewness_samp(object_id)
2* FROM dba_objects;
SKEWNESS_SAMP(OBJECT_ID)
___________________________________________
1.26477910446002342095053639554088606526
To implement my calculations, I'll be using the following formula:
Where:
- n = number of elements in the sample
= the ith element within the sample (Sum where i=1 to n)
= the mean of the sample
= the standard deviation of the sample
So, using the same data as above, we can test this formula.
SQL> WITH sample_data AS
2 (SELECT POWER(2, LEVEL - 1) n
3 FROM DUAL
4 CONNECT BY LEVEL <= 5)
5 SELECT SUM(z) skewness
6 FROM (SELECT COUNT(*) OVER()
7 / (COUNT(*) OVER() - 1)
8 / (COUNT(*) OVER() - 2)
9 * POWER((n - AVG(n) OVER()) / STDDEV(n) OVER(), 3) z
10* FROM sample_data);
SKEWNESS
___________________________________________
1.32531470981340493277036905772942591043
So, it's not exactly the same as the native function, but differing only after 15 decimal places I think it's safe to say it's a limitation of internal rounding errors.
Using this logic we can build a function to iterate on a collection, such as an already aggregated set of values. With my function I'll have it return null and 0 for data sets of 0, 1, or 2 values.
CREATE OR REPLACE FUNCTION numtabskewness(p_nums IN numtab)
RETURN NUMBER
IS
v_result NUMBER;
v_count INTEGER := p_nums.COUNT;
BEGIN
-- Implement the formula
-- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
--
-- Where:
-- n = number of elements in the sample
-- x(i) = the ith element with the sample (Sum where i=1..n)
-- m = the mean of the sample
-- s = the standard deviation of the sample
CASE v_count
WHEN 0
THEN
v_result := NULL;
WHEN 1
THEN
v_result := 0;
WHEN 2
THEN
v_result := 0;
ELSE
SELECT SUM(z)
INTO v_result
FROM (SELECT v_count
/ (v_count - 1)
/ (v_count - 2)
* POWER(
(COLUMN_VALUE - AVG(COLUMN_VALUE) OVER())
/ STDDEV(COLUMN_VALUE) OVER(),
3) z
FROM TABLE(p_nums));
END CASE;
RETURN v_result;
END;
Where numtab is a simple nested table collection:
CREATE OR REPLACE TYPE numtab IS TABLE OF NUMBER;
Using it is simply getting a collection and passing it to the function.
SQL> SELECT numtabskewness(nums)
FROM (SELECT numtab(1, 2, 4, 8, 16) nums FROM DUAL);
NUMTABSKEWNESS(NUMS)
___________________________________________
1.32531470981340493277036905772942591043
SQL> SELECT numtabskewness(CAST(COLLECT(object_id) AS numtab)) FROM dba_objects;
NUMTABSKEWNESS(CAST(COLLECT(OBJECT_ID)ASNUMTAB))
___________________________________________________
1.26477910446002362104514032748246287218
That works, but seems kind of inefficient to use sql within the pl/sql to do the calculation. We can iterate directly on the collection itself.
CREATE OR REPLACE FUNCTION sds.numtabskewness2(p_nums IN numtab)
RETURN NUMBER
IS
v_variance NUMBER := 0;
v_stddev NUMBER;
v_mean NUMBER := 0;
v_temp NUMBER := 0;
v_count INTEGER := p_nums.COUNT;
v_result NUMBER;
BEGIN
-- Implement the formula
-- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
--
-- Where:
-- n = number of elements in the sample
-- x(i) = the ith element with the sample (Sum where i=1..n)
-- m = the mean of the sample
-- s = the standard deviation of the sample
CASE v_count
WHEN 0
THEN
v_result := NULL;
WHEN 1
THEN
v_result := 0;
WHEN 2
THEN
v_result := 0;
ELSE
-- Calculate the mean from the total and count
FOR i IN 1 .. v_count
LOOP
v_mean := v_mean + p_nums(i);
END LOOP;
v_mean := v_mean / v_count;
-- calculate the variance by iterating through all of the values again
FOR i IN 1 .. v_count
LOOP
v_variance := v_variance + ((p_nums(i) - v_mean) ** 2);
END LOOP;
v_variance := v_variance / (v_count-1);
-- Standard deviation of population = square root of variance
v_stddev := SQRT(v_variance);
-- Sum the cube of the difference from mean divided by standard deviation
FOR i IN 1 .. v_count
LOOP
v_temp := v_temp + POWER((p_nums(i) - v_mean) / v_stddev, 3);
END LOOP;
-- apply count factoring against the sum
v_result := v_count / (v_count - 1) / (v_count - 2) * v_temp;
END CASE;
RETURN v_result;
END;
/
Testing the function on the same sample data used previously, note the calculated value is very similar to but not exactly the same as the previous version for a large sample. Again, something small in the internal rounding; but this time only after 37 decimal places.
SQL> SELECT numtabskewness2(nums)
FROM (SELECT numtab(1, 2, 4, 8, 16) nums FROM DUAL);
NUMTABSKEWNESS2(NUMS)
___________________________________________
1.32531470981340493277036905772942591043
SQL> SELECT numtabskewness2(CAST(COLLECT(object_id) AS numtab)) FROM dba_objects;
NUMTABSKEWNESS(CAST(COLLECT(OBJECT_ID)ASNUMTAB))
___________________________________________________
1.26477910446002362104514032748246287204
These work as long as we already have an aggregated collection, but it would be simpler to just call an aggregate directly. This is how I chose to implement my original function, using the ODCI syntax. Using this user-defined aggregate we can invoke it with a simple:
SELECT skewness(x) FROM sample_data;
Unfortunately, we can't use native aggregate results within our user-defined aggregate. So, we'll have to generate the count, mean, and standard deviation ourselves. To that end, we'll keep a running count and total along with the actual values as we go.
CREATE OR REPLACE TYPE skewness_agg_type AS OBJECT
(
v_values numtab,
STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_agg_type)
RETURN NUMBER,
MEMBER FUNCTION odciaggregateiterate(self IN OUT skewness_agg_type,
p_value IN NUMBER)
RETURN NUMBER,
MEMBER FUNCTION odciaggregatemerge(self IN OUT skewness_agg_type,
ctx2 IN skewness_agg_type)
RETURN NUMBER,
MEMBER FUNCTION odciaggregateterminate(self IN skewness_agg_type,
returnvalue OUT NUMBER,
flags IN NUMBER)
RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY skewness_agg_type
IS
STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_agg_type)
RETURN NUMBER
IS
BEGIN
-- Initialize to an empty collection
ctx := skewness_agg_type(numtab());
RETURN odciconst.success;
END odciaggregateinitialize;
MEMBER FUNCTION odciaggregateiterate(self IN OUT skewness_agg_type,
p_value IN NUMBER)
RETURN NUMBER
IS
BEGIN
-- For each new value,
-- add the value to our collection
self.v_values.EXTEND;
self.v_values(self.v_values.COUNT) := p_value;
RETURN odciconst.success;
END odciaggregateiterate;
MEMBER FUNCTION odciaggregatemerge(self IN OUT skewness_agg_type,
ctx2 IN skewness_agg_type)
RETURN NUMBER
IS
BEGIN
-- If merging two sub-results (possibly from parallel threads)
-- The collection should hold all values from both intermediate results
self.v_values := self.v_values MULTISET UNION ALL ctx2.v_values;
RETURN odciconst.success;
END odciaggregatemerge;
MEMBER FUNCTION odciaggregateterminate(self IN skewness_agg_type,
returnvalue OUT NUMBER,
flags IN NUMBER)
RETURN NUMBER
IS
v_count INTEGER := self.v_values.COUNT;
BEGIN
-- Implement the formula
-- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
--
-- Where:
-- n = number of elements in the sample
-- x(i) = the ith element with the sample (Sum where i=1..n)
-- m = the mean of the sample
-- s = the standard deviation of the sample
SELECT SUM(z)
INTO returnvalue
FROM (SELECT v_count
/ (v_count - 1)
/ (v_count - 2)
* POWER(
(COLUMN_VALUE - AVG(COLUMN_VALUE) OVER())
/ STDDEV(COLUMN_VALUE) OVER(),
3) z
FROM TABLE(self.v_values));
RETURN odciconst.success;
END odciaggregateterminate;
END;
/
CREATE OR REPLACE FUNCTION skewness(p_value NUMBER)
RETURN NUMBER
PARALLEL_ENABLE
AGGREGATE USING skewness_agg_type;
/
Now we have an aggregate function we can test it with the same sample data above and should get the same result.
SQL> SELECT skewness(n)
2 FROM ( SELECT POWER(2, LEVEL - 1) n
3 FROM DUAL
4 CONNECT BY LEVEL <= 5);
SKEWNESS(N)
___________________________________________
1.32531470981340493277036905772942591043
SQL> SELECT skewness(object_id)
2 FROM dba_objects;
SKEWNESS(OBJECT_ID)
___________________________________________
1.26477910446002362104514032748246287218
As expected, the terminating member function using the SQL-based approach returns the same results as the SQL results shown above. Rewriting the terminator to iterate across the collection with pl/sql, will return those results. Below I've only listed the terminate function, the rest of the object body is the same as above.
MEMBER FUNCTION odciaggregateterminate(self IN skewness_agg_type,
returnvalue OUT NUMBER,
flags IN NUMBER)
RETURN NUMBER
IS
v_variance NUMBER := 0;
v_stddev NUMBER;
v_mean NUMBER := 0;
v_temp NUMBER := 0;
v_count INTEGER := v_values.COUNT;
BEGIN
-- Implement the formula
-- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
--
-- Where:
-- n = number of elements in the sample
-- x(i) = the ith element with the sample (Sum where i=1..n)
-- m = the mean of the sample
-- s = the standard deviation of the sample
CASE v_count
WHEN 0
THEN
returnvalue := NULL;
WHEN 1
THEN
returnvalue := 0;
WHEN 2
THEN
returnvalue := 0;
ELSE
-- Calculate the mean from the total and count
FOR i IN 1 .. v_count
LOOP
v_mean := v_mean + v_values(i);
END LOOP;
v_mean := v_mean / v_count;
-- Calculate the variance by iterating through all of the values
FOR i IN 1 .. v_count
LOOP
v_variance := v_variance + ((v_values(i) - v_mean) ** 2);
END LOOP;
v_variance := v_variance / (v_count-1);
-- Standard deviation = square root of variance
v_stddev := SQRT(v_variance);
-- Sum the cube of the difference from mean divided by standard deviation
FOR i IN 1 .. v_count
LOOP
v_temp :=
v_temp + POWER((v_values(i) - v_mean) / v_stddev, 3);
END LOOP;
-- apply count factoring against the sum
returnvalue := v_count / (v_count - 1) / (v_count - 2) * v_temp;
END CASE;
RETURN odciconst.success;
END odciaggregateterminate;
As expected, the results are the same as the previous pl/sql-based iteration.
SQL> SELECT skewness(n)
2 FROM ( SELECT POWER(2, LEVEL - 1) n
3 FROM DUAL
4 CONNECT BY LEVEL <= 5);
SKEWNESS(N)
___________________________________________
1.32531470981340493277036905772942591043
SQL> SELECT skewness(object_id)
2 FROM dba_objects;
SKEWNESS(OBJECT_ID)
___________________________________________
1.26477910446002362104514032748246287204
If you are unfamiliar with the ODCI syntax, the constructs above may appear overly complex; but they are actually fairly easy to use.
Each aggregate needs 4 parts
- initializer - this is the method invoked by the function before processing any rows. It sets up initial values, in this case an empty collection.
- iterator - This method is invoked once for each row processed. For this example, it appends the new value to the stored collection.
- merge - this method might not be called but is necessary to ensure proper processing of sub-results. The most common reason to encounter these would be parallel queries where 2 threads each process some subset of the data and the PQ coordinator must merge the results into a single total result
- terminator - this method is invoked after the last row is processed and a final result needs to be returned
If you have 21c or later then I recommend using the native functions over these manual methods, but if you have an older version or if you need to adjust the implementation to accommodate your own rounding rules, I hope these examples provide a useful framework to build on.