Close

Calculating Skewness of a Sample #JoelKallmanDay

Several years ago I needed to calculate the skewness of some data. That is, an indication of how asymmetric points are within a distribution. Unfortunately, I found there was no native feature to do this within SQL. So I wrote my own with the Oracle Data Cartridge Interface (ODCI.) Using ODCI you can create your own aggregate functions like MIN, MAX, and AVG.

Recently I had cause to revisit that function and thought I could do a little better. This also seemed like a good opportunity to share some of my work in tribute to Joel Kallman for all he gave to the Oracle community.

A happy discovery while looking into my old code – in the intervening years since my first implmentation Oracle had provided the SKEWNESS_SAMP function to their SQL implementation beginning with the 21c release.

Before I get into the doing things the hard way of implementing my own function, I’ll show the native way.

Here I’m calculating a small, known data set 1,2,4,8,16.

SQL> SELECT skewness_samp(n)
  2    FROM (    SELECT POWER(2, LEVEL - 1) n
  3                FROM DUAL
  4*         CONNECT BY LEVEL <= 5);

                           SKEWNESS_SAMP(N)
___________________________________________
   1.32531470981340446231691036153849023328

Next I'll use dba_objects which has around100,000 objects in my test database. Your results will vary.

SQL> SELECT skewness_samp(object_id)
  2*   FROM dba_objects;

                   SKEWNESS_SAMP(OBJECT_ID)
___________________________________________
   1.26477910446002342095053639554088606526

To implement my calculations, I'll be using the following formula:

\displaystyle \frac{n}{(n-1)(n-2)}\sum_i^n\left(\frac{x_i-\overline{x}}{\sigma}\right)^3

Where:

  • n = number of elements in the sample
  • x_i = the ith element within the sample (Sum where i=1 to n)
  • \overline{x} = the mean of the sample
  • \sigma = the standard deviation of the sample

So, using the same data as above, we can test this formula.

SQL> WITH sample_data AS
  2     (SELECT POWER(2, LEVEL - 1) n
  3                FROM DUAL
  4          CONNECT BY LEVEL <= 5)
  5  SELECT SUM(z) skewness
  6    FROM (SELECT   COUNT(*) OVER()
  7                 / (COUNT(*) OVER() - 1)
  8                 / (COUNT(*) OVER() - 2)
  9                 * POWER((n - AVG(n) OVER()) / STDDEV(n) OVER(), 3) z
 10*           FROM sample_data);

                                   SKEWNESS
___________________________________________
   1.32531470981340493277036905772942591043

So, it's not exactly the same as the native function, but differing only after 15 decimal places I think it's safe to say it's a limitation of internal rounding errors.

Using this logic we can build a function to iterate on a collection, such as an already aggregated set of values. With my function I'll have it return null and 0 for data sets of 0, 1, or 2 values.

CREATE OR REPLACE FUNCTION numtabskewness(p_nums IN numtab)
    RETURN NUMBER
IS
    v_result   NUMBER;
    v_count    INTEGER := p_nums.COUNT;
BEGIN
    -- Implement the formula
    -- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
    --
    -- Where:
    --   n = number of elements in the sample
    --   x(i) = the ith element with the sample (Sum where i=1..n)
    --   m = the mean of the sample
    --   s = the standard deviation of the sample
    CASE v_count
        WHEN 0
        THEN
            v_result := NULL;
        WHEN 1
        THEN
            v_result := 0;
        WHEN 2
        THEN
            v_result := 0;
        ELSE
            SELECT SUM(z)
              INTO v_result
              FROM (SELECT   v_count
                           / (v_count - 1)
                           / (v_count - 2)
                           * POWER(
                                   (COLUMN_VALUE - AVG(COLUMN_VALUE) OVER())
                                 / STDDEV(COLUMN_VALUE) OVER(),
                                 3) z
                      FROM TABLE(p_nums));
    END CASE;

    RETURN v_result;
END;

Where numtab is a simple nested table collection:

CREATE OR REPLACE TYPE numtab IS TABLE OF NUMBER;

Using it is simply getting a collection and passing it to the function.

SQL> SELECT numtabskewness(nums)
  FROM (SELECT numtab(1, 2, 4, 8, 16) nums FROM DUAL);

                       NUMTABSKEWNESS(NUMS)
___________________________________________
   1.32531470981340493277036905772942591043

SQL> SELECT numtabskewness(CAST(COLLECT(object_id) AS numtab)) FROM dba_objects;

   NUMTABSKEWNESS(CAST(COLLECT(OBJECT_ID)ASNUMTAB))
___________________________________________________
           1.26477910446002362104514032748246287218

That works, but seems kind of inefficient to use sql within the pl/sql to do the calculation. We can iterate directly on the collection itself.

CREATE OR REPLACE FUNCTION sds.numtabskewness2(p_nums IN numtab)
    RETURN NUMBER
IS
    v_variance   NUMBER := 0;
    v_stddev     NUMBER;
    v_mean       NUMBER := 0;
    v_temp       NUMBER := 0;
    v_count      INTEGER := p_nums.COUNT;
    v_result     NUMBER;
BEGIN
    -- Implement the formula
    -- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
    --
    -- Where:
    --   n = number of elements in the sample
    --   x(i) = the ith element with the sample (Sum where i=1..n)
    --   m = the mean of the sample
    --   s = the standard deviation of the sample

    CASE v_count
        WHEN 0
        THEN
            v_result := NULL;
        WHEN 1
        THEN
            v_result := 0;
        WHEN 2
        THEN
            v_result := 0;
        ELSE
            -- Calculate the mean from the total and count
            FOR i IN 1 .. v_count
            LOOP
                v_mean := v_mean + p_nums(i);
            END LOOP;

            v_mean := v_mean / v_count;

            -- calculate the variance by iterating through all of the values again
            FOR i IN 1 .. v_count
            LOOP
                v_variance := v_variance + ((p_nums(i) - v_mean) ** 2);
            END LOOP;

            v_variance := v_variance / (v_count-1);

            -- Standard deviation of population = square root of variance
            v_stddev := SQRT(v_variance);

            -- Sum the cube of the difference from mean divided by standard deviation
            FOR i IN 1 .. v_count
            LOOP
                v_temp := v_temp + POWER((p_nums(i) - v_mean) / v_stddev, 3);
            END LOOP;

            -- apply count factoring against the sum
            v_result := v_count / (v_count - 1) / (v_count - 2) * v_temp;
    END CASE;

    RETURN v_result;
END;
/

Testing the function on the same sample data used previously, note the calculated value is very similar to but not exactly the same as the previous version for a large sample. Again, something small in the internal rounding; but this time only after 37 decimal places.

SQL> SELECT numtabskewness2(nums)
  FROM (SELECT numtab(1, 2, 4, 8, 16) nums FROM DUAL);

                       NUMTABSKEWNESS2(NUMS)
___________________________________________
   1.32531470981340493277036905772942591043

SQL> SELECT numtabskewness2(CAST(COLLECT(object_id) AS numtab)) FROM dba_objects;

   NUMTABSKEWNESS(CAST(COLLECT(OBJECT_ID)ASNUMTAB))
___________________________________________________
           1.26477910446002362104514032748246287204

These work as long as we already have an aggregated collection, but it would be simpler to just call an aggregate directly. This is how I chose to implement my original function, using the ODCI syntax. Using this user-defined aggregate we can invoke it with a simple:

SELECT skewness(x)
  FROM sample_data;

Unfortunately, we can't use native aggregate results within our user-defined aggregate. So, we'll have to generate the count, mean, and standard deviation ourselves. To that end, we'll keep a running count and total along with the actual values as we go.

CREATE OR REPLACE TYPE skewness_agg_type AS OBJECT
(
    v_values numtab,
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_agg_type)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateiterate(self      IN OUT skewness_agg_type,
                                         p_value   IN     NUMBER)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregatemerge(self   IN OUT skewness_agg_type,
                                       ctx2   IN     skewness_agg_type)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateterminate(self          IN     skewness_agg_type,
                                           returnvalue      OUT NUMBER,
                                           flags         IN     NUMBER)
        RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY skewness_agg_type
IS
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_agg_type)
        RETURN NUMBER
    IS
    BEGIN
        -- Initialize to an empty collection
        ctx := skewness_agg_type(numtab());
        RETURN odciconst.success;
    END odciaggregateinitialize;

    MEMBER FUNCTION odciaggregateiterate(self      IN OUT skewness_agg_type,
                                         p_value   IN     NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        -- For each new value,
        -- add the value to our collection
        self.v_values.EXTEND;
        self.v_values(self.v_values.COUNT) := p_value;
        RETURN odciconst.success;
    END odciaggregateiterate;

    MEMBER FUNCTION odciaggregatemerge(self   IN OUT skewness_agg_type,
                                       ctx2   IN     skewness_agg_type)
        RETURN NUMBER
    IS
    BEGIN
        -- If merging two sub-results (possibly from parallel threads)
        --  The collection should hold all values from both intermediate results
        self.v_values := self.v_values MULTISET UNION ALL ctx2.v_values;
        RETURN odciconst.success;
    END odciaggregatemerge;

    MEMBER FUNCTION odciaggregateterminate(self          IN     skewness_agg_type,
                                           returnvalue      OUT NUMBER,
                                           flags         IN     NUMBER)
        RETURN NUMBER
    IS
        v_count   INTEGER := self.v_values.COUNT;
    BEGIN
        -- Implement the formula
        -- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
        --
        -- Where:
        --   n = number of elements in the sample
        --   x(i) = the ith element with the sample (Sum where i=1..n)
        --   m = the mean of the sample
        --   s = the standard deviation of the sample

        SELECT SUM(z)
          INTO returnvalue
          FROM (SELECT   v_count
                       / (v_count - 1)
                       / (v_count - 2)
                       * POWER(
                               (COLUMN_VALUE - AVG(COLUMN_VALUE) OVER())
                             / STDDEV(COLUMN_VALUE) OVER(),
                             3) z
                  FROM TABLE(self.v_values));

        RETURN odciconst.success;
    END odciaggregateterminate;
END;
/
CREATE OR REPLACE FUNCTION skewness(p_value NUMBER)
    RETURN NUMBER
    PARALLEL_ENABLE
    AGGREGATE USING skewness_agg_type;
/

Now we have an aggregate function we can test it with the same sample data above and should get the same result.

SQL> SELECT skewness(n)
  2    FROM (    SELECT POWER(2, LEVEL - 1) n
  3                FROM DUAL
  4          CONNECT BY LEVEL <= 5);

                                SKEWNESS(N)
___________________________________________
   1.32531470981340493277036905772942591043


SQL> SELECT skewness(object_id)
  2    FROM dba_objects;

                        SKEWNESS(OBJECT_ID)
___________________________________________
   1.26477910446002362104514032748246287218

As expected, the terminating member function using the SQL-based approach returns the same results as the SQL results shown above. Rewriting the terminator to iterate across the collection with pl/sql, will return those results. Below I've only listed the terminate function, the rest of the object body is the same as above.

    MEMBER FUNCTION odciaggregateterminate(self          IN     skewness_agg_type,
                                           returnvalue      OUT NUMBER,
                                           flags         IN     NUMBER)
        RETURN NUMBER
    IS
        v_variance   NUMBER := 0;
        v_stddev     NUMBER;
        v_mean       NUMBER := 0;
        v_temp       NUMBER := 0;
        v_count      INTEGER := v_values.COUNT;
    BEGIN
        -- Implement the formula
        -- n/((n-1)(n-2)) SUM((x(i) - m)/s)^3
        --
        -- Where:
        --   n = number of elements in the sample
        --   x(i) = the ith element with the sample (Sum where i=1..n)
        --   m = the mean of the sample
        --   s = the standard deviation of the sample

        CASE v_count
            WHEN 0
            THEN
                returnvalue := NULL;
            WHEN 1
            THEN
                returnvalue := 0;
            WHEN 2
            THEN
                returnvalue := 0;
            ELSE
                -- Calculate the mean from the total and count
                FOR i IN 1 .. v_count
                LOOP
                    v_mean := v_mean + v_values(i);
                END LOOP;

                v_mean := v_mean / v_count;

                -- Calculate the variance by iterating through all of the values
                FOR i IN 1 .. v_count
                LOOP
                    v_variance := v_variance + ((v_values(i) - v_mean) ** 2);
                END LOOP;
                
                v_variance := v_variance / (v_count-1);

                -- Standard deviation = square root of variance
                v_stddev := SQRT(v_variance);

                -- Sum the cube of the difference from mean divided by standard deviation
                FOR i IN 1 .. v_count
                LOOP
                    v_temp :=
                        v_temp + POWER((v_values(i) - v_mean) / v_stddev, 3);
                END LOOP;

                -- apply count factoring against the sum
                returnvalue := v_count / (v_count - 1) / (v_count - 2) * v_temp;
        END CASE;

        RETURN odciconst.success;
    END odciaggregateterminate;

As expected, the results are the same as the previous pl/sql-based iteration.

SQL> SELECT skewness(n)
  2    FROM (    SELECT POWER(2, LEVEL - 1) n
  3                FROM DUAL
  4          CONNECT BY LEVEL <= 5);

                                SKEWNESS(N)
___________________________________________
   1.32531470981340493277036905772942591043


SQL> SELECT skewness(object_id)
  2    FROM dba_objects;

                        SKEWNESS(OBJECT_ID)
___________________________________________
   1.26477910446002362104514032748246287204

If you are unfamiliar with the ODCI syntax, the constructs above may appear overly complex; but they are actually fairly easy to use.

Each aggregate needs 4 parts

  • initializer - this is the method invoked by the function before processing any rows. It sets up initial values, in this case an empty collection.
  • iterator - This method is invoked once for each row processed. For this example, it appends the new value to the stored collection.
  • merge - this method might not be called but is necessary to ensure proper processing of sub-results. The most common reason to encounter these would be parallel queries where 2 threads each process some subset of the data and the PQ coordinator must merge the results into a single total result
  • terminator - this method is invoked after the last row is processed and a final result needs to be returned

If you have 21c or later then I recommend using the native functions over these manual methods, but if you have an older version or if you need to adjust the implementation to accommodate your own rounding rules, I hope these examples provide a useful framework to build on.

Leave a Reply