Close

Calculating Skewness of a Population

After completing my article about skewness of a sample for Joel Kallman Day I decided I could do one more and write up the related functionality for skewness of a population.

Like skewness of a sample, the purpose of the function is to indicate the degree and direction of asymmetry in a data distribution. Like the SKEWNESS_SAMP function in my previous article, the 21c release of the database includes the SKEWNESS_POP aggregate.

The usage is similar and here I’ll compare the results of the population vs sample skewness. Here I’m calculating a small, known data set 1,2,4,8,16.

SQL> select skewness_pop(n), skewness_samp(n) from
  2  (select power(2,level-1) n from dual connect by level <= 5);

     SKEWNESS_POP(N)                            SKEWNESS_SAMP(N)
____________________ ___________________________________________
   0.889048134816954    1.32531470981340446231691036153849023328

For a small sample like this, the difference in skew can be quite large . For larger samples, the two values will tend to grow closer together. Using dba_objects as my data population with about 100 thousand rows, the results are beginning to converge.

SQL> select skewness_pop(object_id), skewness_samp(object_id)
  2    from dba_objects;

   SKEWNESS_POP(OBJECT_ID)                    SKEWNESS_SAMP(OBJECT_ID)
__________________________ ___________________________________________
         1.264888176352768    1.26490775402173997059265535485057799779

To implement my calculations, I'll be using the following formula:

\displaystyle \frac{\sum_i^n\left(x_i-\overline{x}\right)^3/n}{\sigma^3}

Where:

  • n = number of data points in the population
  • x_i = the ith element of the population (Sum where i=1 to n)
  • \overline{x} = the mean of the population
  • \sigma = the standard deviation of the population

So, using the same data as above, we can test this formula and see we come up with similar results to the native function.

SQL> WITH
  2      sample_data
  3      AS
  4          (    SELECT POWER(2, LEVEL - 1) n
  5                 FROM DUAL
  6           CONNECT BY LEVEL <= 5)
  7    SELECT SUM(z) / cnt / POWER(sigma, 3) skewness
  8      FROM (SELECT POWER(n - AVG(n) OVER(), 3) z,
  9                   COUNT(*) OVER() cnt,
 10                   STDDEV_POP(n) OVER() sigma
 11              FROM sample_data)
 12  GROUP BY cnt, sigma;

                                    SKEWNESS
____________________________________________
   0.889048134816954315589774168371754332528

Using this logic we can build a function to iterate on a collection, such as an already aggregated set of values. With my function I'll have it return null and 0 for data sets of 0, 1, or 2 values, just as the native function does.

CREATE OR REPLACE FUNCTION numtab_population_skewness(p_nums IN numtab)
    RETURN NUMBER
IS
    v_variance   NUMBER := 0;
    v_stddev     NUMBER;
    v_mean       NUMBER := 0;
    v_temp       NUMBER := 0;
    v_count      INTEGER := p_nums.COUNT;
    v_result     NUMBER;
BEGIN
    -- Implement the formula
    --     sum((x(i)-m)^3) / n / (s^3)
    --
    -- Where:
    --   n = number of data points in the population
    --   x(i) = the ith element of the sample (Sum where i=1..n)
    --   m = the mean of the population
    --   s = the standard deviation of the population
    CASE v_count
        WHEN 0
        THEN
            v_result := NULL;
        WHEN 1
        THEN
            v_result := 0;
        WHEN 2
        THEN
            v_result := 0;
        ELSE
            -- Calculate the mean from the total and count
            FOR i IN 1 .. v_count
            LOOP
                v_mean := v_mean + p_nums(i);
            END LOOP;

            v_mean := v_mean / v_count;

            -- Standard deviation of population = square root of (variance/(N))
            -- So, first step is calculate the variance by iterating through all of the values
            FOR i IN 1 .. v_count
            LOOP
                v_variance := v_variance + ((p_nums(i) - v_mean) ** 2);
            END LOOP;

            v_stddev := SQRT(v_variance / v_count);

            -- Sum the cube of the difference from mean divided by standard deviation
            FOR i IN 1 .. v_count
            LOOP
                v_temp := v_temp + (p_nums(i) - v_mean) ** 3;
            END LOOP;

            v_result := v_temp / v_count / (v_stddev ** 3);
    END CASE;

    RETURN v_result;
END;
/

Where numtab is a simple nested table collection as used in the previous article:

CREATE OR REPLACE TYPE numtab IS TABLE OF NUMBER;

Using it is simply getting a collection and passing it to the function.

SQL> SELECT numtab_population_skewness(nums)
  2    FROM (SELECT numtab(1, 2, 4, 8, 16) nums FROM DUAL);

            NUMTAB_POPULATION_SKEWNESS(NUMS)
____________________________________________
   0.889048134816954315589774168371754332528

That works, but as with calculating skewness of samples, it will be easier if we can call the aggregate directly on a column of values. So we'll create a user-defined aggregate with the ODCI syntax using similar logic.

CREATE OR REPLACE TYPE skewness_pop_agg_type AS OBJECT
(
    v_values numtab,
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_pop_agg_type)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateiterate(self      IN OUT skewness_pop_agg_type,
                                         p_value   IN     NUMBER)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregatemerge(self   IN OUT skewness_pop_agg_type,
                                       ctx2   IN     skewness_pop_agg_type)
        RETURN NUMBER,
    MEMBER FUNCTION odciaggregateterminate(self          IN     skewness_pop_agg_type,
                                           returnvalue      OUT NUMBER,
                                           flags         IN     NUMBER)
        RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY skewness_pop_agg_type
IS
    STATIC FUNCTION odciaggregateinitialize(ctx IN OUT skewness_pop_agg_type)
        RETURN NUMBER
    IS
    BEGIN
        -- Initialize to an empty collection
        ctx := skewness_pop_agg_type(numtab());
        RETURN odciconst.success;
    END odciaggregateinitialize;

    MEMBER FUNCTION odciaggregateiterate(self      IN OUT skewness_pop_agg_type,
                                         p_value   IN     NUMBER)
        RETURN NUMBER
    IS
    BEGIN
        -- For each new value,
        -- add the value to our collection
        self.v_values.EXTEND;
        self.v_values(self.v_values.COUNT) := p_value;
        RETURN odciconst.success;
    END odciaggregateiterate;

    MEMBER FUNCTION odciaggregatemerge(self   IN OUT skewness_pop_agg_type,
                                       ctx2   IN     skewness_pop_agg_type)
        RETURN NUMBER
    IS
    BEGIN
        -- If merging two sub-results (possibly from parallel threads)
        --  The collection should hold all values from both intermediate results
        self.v_values := self.v_values MULTISET UNION ALL ctx2.v_values;
        RETURN odciconst.success;
    END odciaggregatemerge;

    MEMBER FUNCTION odciaggregateterminate(self          IN     skewness_pop_agg_type,
                                           returnvalue      OUT NUMBER,
                                           flags         IN     NUMBER)
        RETURN NUMBER
    IS
        v_variance   NUMBER := 0;
        v_stddev     NUMBER;
        v_mean       NUMBER := 0;
        v_temp       NUMBER := 0;
        v_count      INTEGER := v_values.COUNT;
        v_result     NUMBER;
    BEGIN
        -- Implement the formula
        --     sum((x(i)-m)^3) / n / (s^3)
        --
        -- Where:
        --   n = number of data points in the population
        --   x(i) = the ith element of the sample (Sum where i=1..n)
        --   m = the mean of the population
        --   s = the standard deviation of the population
        CASE v_count
            WHEN 0
            THEN
                v_result := NULL;
            WHEN 1
            THEN
                v_result := 0;
            WHEN 2
            THEN
                v_result := 0;
            ELSE
                -- Calculate the mean from the total and count
                FOR i IN 1 .. v_count
                LOOP
                    v_mean := v_mean + v_values(i);
                END LOOP;

                v_mean := v_mean / v_count;

                -- Standard deviation of population = square root of variance
                -- So, first step is calculate the variance by iterating through all of the values again
                FOR i IN 1 .. v_count
                LOOP
                    v_variance := v_variance + ((v_values(i) - v_mean) ** 2);
                END LOOP;

                v_variance := v_variance / v_count;

                v_stddev := SQRT(v_variance);

                -- Sum the cube of the difference from mean
                FOR i IN 1 .. v_count
                LOOP
                    v_temp := v_temp + (v_values(i) - v_mean) ** 3;
                END LOOP;

                returnvalue := v_temp / v_count / (v_stddev ** 3);
        END CASE;

        RETURN odciconst.success;
    END odciaggregateterminate;
END;
/
CREATE OR REPLACE FUNCTION population_skewness(p_value NUMBER)
    RETURN NUMBER
    PARALLEL_ENABLE
    AGGREGATE USING skewness_pop_agg_type;
/

Now we have an aggregate function we can test it against the native function to see how it compares.

SQL> SELECT skewness_pop(n), population_skewness(n)
  2    FROM (    SELECT POWER(2, LEVEL - 1) n
  3                FROM DUAL
  4          CONNECT BY LEVEL <= 5);

     SKEWNESS_POP(N)                       POPULATION_SKEWNESS(N)
____________________ ____________________________________________
   0.889048134816954    0.889048134816954315589774168371754332528

SQL> SELECT skewness_pop(object_id), population_skewness(object_id)
  2    FROM dba_objects;

   SKEWNESS_POP(OBJECT_ID)            POPULATION_SKEWNESS(OBJECT_ID)
__________________________ _________________________________________
         1.265078306594029    1.265078306594028618334920641985970805

This looks pretty good.

As with the ODCI function I created for the samples, I recommend using the native function instead of the manual approach, provided you are on 21c or higher. But, if you have an older version where these functions aren't available yet, then using ODCI or the function over a collection above will allow you to calculate the values yourself.

Also, you can see the native function returns results with less precision than the manual calculations. This is, undoubtedly a way to deal with rounding errors at higher precision. If this is unacceptable, then using the pl/sql frameworks above may help you achieve greater precision in your results; but be careful as they are still subject to the precision limitations of the native functions used in the intermediate calculations.

I know skewness calculations aren't often the most pressing of concerns in database development, but they are another tool in your toolbox for better understanding your data. I hope this helps and thank you for reading. Questions, as always, are welcome.

Leave a Reply