Mean, Median, and Mode of summarized data

If you have a column of values, Oracle provides functions to find each of these values: AVG() for the mean (average), MEDIAN() for the median, and STATS_MODE for the mode. STATS_MODE came in 10g. Before that the mode is also possible but requires extra steps for counting and sorting.

For example, given the values 1, 2, 2, 3, 5, 7 ,8, the mean, median, and mode are 4, 3, and 2 respectively.

CREATE TABLE numbers (n)
AS
    (SELECT 1 FROM DUAL
     UNION ALL
     SELECT 2 FROM DUAL
     UNION ALL
     SELECT 2 FROM DUAL
     UNION ALL
     SELECT 3 FROM DUAL
     UNION ALL
     SELECT 5 FROM DUAL
     UNION ALL
     SELECT 7 FROM DUAL
     UNION ALL
     SELECT 8 FROM DUAL);

SELECT AVG(n) FROM numbers;
    AVG(N)
----------
         4

SELECT MEDIAN(n),
       PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY n) pd -- 9i
  FROM numbers;

 MEDIAN(N)         PD
---------- ----------
         3          3

SELECT STATS_MODE(n) FROM numbers;

STATS_MODE(N)
-------------
            2

That’s relatively straightforward; and many data sets can be processed in this manner. Note the MEDIAN value can also be calculated with the 50% percentile of the PERCENTILE_DISC function. MEDIAN was introduced in 10g. In 9i you needed PERCENTILE_DISC. In 8i or lower you would need to find the median yourself, probably with self-joins or sub-queries to find the middle element of the sorted values.

Also note, STATS_MODE only returns a single value, even if there are multiple values tied with the same count. According to the Oracle SQL Reference: “If more than one mode exists, then Oracle Database chooses one and returns only that one value.” Unfortunately, there is no way provided, as of 21c, to control which value will be returned. In my testing it has been the lowest of all mode values but that is not documented behavior, so I would not count on that always being the case.

To return all modes you can sort the values by their counts and then return those with the greatest counts. The following examples I use row limiting clause and RANK function to perform these steps. DENSE_RANK would also work here. Also, the RANK/DENSE_RANK solution would work for versions earlier than 10g as well. You could also use this method with a modified order by to return just a single value as well. Eliminating the “WITH TIES” portion of the row limiting clause. You could also switch RANK for ROW_NUMBER in the case of a single value as well.

SELECT DISTINCT n
    FROM (SELECT n, COUNT(*) OVER (PARTITION BY n) cnt FROM numbers)
ORDER BY cnt DESC
   FETCH FIRST 1 ROW WITH TIES;

SELECT DISTINCT n
  FROM (SELECT n, RANK() OVER (ORDER BY cnt DESC) rnk
          FROM (SELECT n, COUNT(*) OVER (PARTITION BY n) cnt FROM numbers))
 WHERE rnk = 1;

         N
----------
         2
         8

However, some data is stored in an already summarized form. That is, values with counts. None of the functions above can be applied directly to a table of this form.

create table numbers (n, cnt)
    AS
        (SELECT 1, 3 FROM DUAL
         UNION ALL
         SELECT 2, 9 FROM DUAL
         UNION ALL
         SELECT 3, 5 FROM DUAL
         UNION ALL
         SELECT 4, 1 FROM DUAL
         UNION ALL
         SELECT 5, 2 FROM DUAL
         UNION ALL
         SELECT 6, 20 FROM DUAL);

A simple way to approach data of this structure would be to expand the rows using the counts to create a number of rows matching the count for each value. Thus, for the data above, there would be three rows of 1, nine rows of 2, five rows of 3, one row of 4, two rows of 5, and twenty rows of 6.
There are multiple ways of doing the expansion, I’ll use a recursive query to generate row copies, and then apply the functions above on them.

WITH
    list_all (n, idx, cnt)
    AS
        (SELECT n, 1 idx, cnt FROM numbers
         UNION ALL
         SELECT n, idx + 1, cnt
           FROM list_all
          WHERE idx < cnt)
  SELECT DISTINCT
         n,
         cnt,
         AVG(n) OVER() mymean,
         PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY n) OVER() pd, -- 9i
         MEDIAN(n) OVER() mymedian, -- 10g or higher
         LAST_VALUE(n)
             OVER(ORDER BY cnt
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) mymode
    FROM list_all
ORDER BY n;

         N        CNT     MYMEAN         PD   MYMEDIAN     MYMODE
---------- ---------- ---------- ---------- ---------- ----------
         1          3       4.25          5        5.5          6
         2          9       4.25          5        5.5          6
         3          5       4.25          5        5.5          6
         4          1       4.25          5        5.5          6
         5          2       4.25          5        5.5          6
         6         20       4.25          5        5.5          6

That works for relatively low counts; but it does not scale well as the counts grow. Not only do the functions have to process all those rows, but the act of generating the rows becomes an expense as well. Using the sample data above the query executed in less than 0.1 seconds on my test system. With counts in the hundreds of thousands, the query still functions but the execution time extended to an unacceptable 70 seconds.

CREATE TABLE numbers
(
    n,
    cnt
)
AS
    (SELECT 1, 300000 FROM DUAL
     UNION ALL
     SELECT 2, 900000 FROM DUAL
     UNION ALL
     SELECT 3, 500000 FROM DUAL
     UNION ALL
     SELECT 4, 100000 FROM DUAL
     UNION ALL
     SELECT 5, 200000 FROM DUAL
     UNION ALL
     SELECT 6, 2000000 FROM DUAL);

With these larger counts, rather than generating the rows, we can calculate the mean, median, and mode directly.

The mean (average) is simple enough, it’s just dividing the sum of the product of the values and counts by the sum of the counts.

SELECT SUM(n * cnt) / SUM(cnt) mymean FROM numbers;

    MYMEAN
----------
      4.25

Mode is quite easy, simply sorting by the counts in descending order and returning the top value. There can be more than one mode in a data set, so you should return ties. You could do this with RANK/DENSE_RANK or a row-limiting clause.

SELECT n, cnt
FROM numbers
ORDER BY cnt DESC
FETCH FIRST 1 ROW WITH TIES;

SELECT n
  FROM (SELECT n, RANK() OVER (ORDER BY cnt DESC) rnk FROM numbers)
 WHERE rnk = 1;

         N        CNT
---------- ----------
         6    2000000

If, for some reason, you only wanted a single value returned among ties, then as mentioned above with unsummarized ties – you could modify the ORDER BY clauses in these examples to control which value is returned among ties. Eliminating “WITH TIES” from the row-limiting clause and optionally using ROW_NUMBER instead of RANK or DENSE_RANK would work then select the single value.

While those are fairly straightforward, calculating the median value is a little trickier. Without generating all of the rows we still need to find which value would be in the middle of the set if the rows did exist.

To that end, I envision the set in ranges of sorted values. For the data above, there are four million total values. The first 300000 of them are value 1. Value 2 would be the next 900000 values in order, followed by 500000 3s, 100000 4s, 200000 5s, and finally 2000000 6s. The low and high indexes of each value then can be determined by summing the counts up to that value.

SELECT n, cnt, cnt_total, lo_range, hi_range
  FROM (SELECT n,
               cnt,
               SUM(cnt) OVER() cnt_total,
                 NVL(
                     SUM(cnt)
                         OVER(ORDER BY n
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                     0)
               + 1 lo_range,
               SUM(cnt) OVER (ORDER BY n) hi_range
          FROM numbers);

         N        CNT  CNT_TOTAL   LO_RANGE   HI_RANGE
---------- ---------- ---------- ---------- ----------
         1     300000    4000000          1     300000
         2     900000    4000000     300001    1200000
         3     500000    4000000    1200001    1700000
         4     100000    4000000    1700001    1800000
         5     200000    4000000    1800001    2000000
         6    2000000    4000000    2000001    4000000

With 4 million values the median would be the average of the value at the 2000000th index and the value at the 2000001st index. Looking at the boundaries of the ranges we can see the median would fall between 5 and 6.

SELECT MIN(
           CASE
               WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range
               THEN
                   n
           END) lo_n,
       MAX(
           CASE
               WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range
               THEN
                   n
           END) hi_n
  FROM (SELECT n, cnt, cnt_total, lo_range, hi_range
          FROM (SELECT n,
                       cnt,
                       SUM(cnt) OVER() cnt_total,
                         NVL(
                             SUM(cnt)
                                 OVER(
                                     ORDER BY n
                                     ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND     1 PRECEDING),
                             0)
                       + 1 lo_range,
                       SUM(cnt) OVER (ORDER BY n) hi_range
                  FROM numbers));

      LO_N       HI_N
---------- ----------
         5          6

Thus the median is (5+6)/2 or 5.5.

SELECT (lo_n + hi_n) / 2 mymedian
  FROM (SELECT MIN(
                   CASE
                       WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range
                                                           AND hi_range
                       THEN
                           n
                   END) lo_n,
               MAX(
                   CASE
                       WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range
                                                          AND hi_range
                       THEN
                           n
                   END) hi_n
          FROM (SELECT n, cnt, cnt_total, lo_range, hi_range
                  FROM (SELECT n,
                               cnt,
                               SUM(cnt) OVER() cnt_total,
                                 NVL(
                                     SUM(cnt)
                                         OVER(
                                             ORDER BY n
                                             ROWS BETWEEN UNBOUNDED PRECEDING
                                                  AND     1 PRECEDING),
                                     0)
                               + 1 lo_range,
                               SUM(cnt) OVER (ORDER BY n) hi_range
                          FROM numbers)));

  MYMEDIAN
----------
       5.5

In this example, the number of values was even, so the median was an average of the two middle values. With an odd number of values, there will be a single value in the middle. The query above still works.

I’ll adjust the count of the 6 values up by 1 and you can see the median move from 5.5 up to 6 based on the new total.

UPDATE numbers
   SET cnt = 2000001
 WHERE n = 6;
SELECT (lo_n + hi_n) / 2 mymedian
  FROM (SELECT MIN(
                   CASE
                       WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range
                                                           AND hi_range
                       THEN
                           n
                   END) lo_n,
               MAX(
                   CASE
                       WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range
                                                          AND hi_range
                       THEN
                           n
                   END) hi_n
          FROM (SELECT n, cnt, cnt_total, lo_range, hi_range
                  FROM (SELECT n,
                               cnt,
                               SUM(cnt) OVER() cnt_total,
                                 NVL(
                                     SUM(cnt)
                                         OVER(
                                             ORDER BY n
                                             ROWS BETWEEN UNBOUNDED PRECEDING
                                                  AND     1 PRECEDING),
                                     0)
                               + 1 lo_range,
                               SUM(cnt) OVER (ORDER BY n) hi_range
                          FROM numbers)));
  MYMEDIAN
----------
         6

Now I’ll adjust the count of the 6 values down and you can see the median move down to 5 based on the new total.

UPDATE numbers
   SET cnt = 1999999
 WHERE n = 6;

SELECT (lo_n + hi_n) / 2 mymedian
  FROM (SELECT MIN(
                   CASE
                       WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range
                                                           AND hi_range
                       THEN
                           n
                   END) lo_n,
               MAX(
                   CASE
                       WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range
                                                          AND hi_range
                       THEN
                           n
                   END) hi_n
          FROM (SELECT n, cnt, cnt_total, lo_range, hi_range
                  FROM (SELECT n,
                               cnt,
                               SUM(cnt) OVER() cnt_total,
                                 NVL(
                                     SUM(cnt)
                                         OVER(
                                             ORDER BY n
                                             ROWS BETWEEN UNBOUNDED PRECEDING
                                                  AND     1 PRECEDING),
                                     0)
                               + 1 lo_range,
                               SUM(cnt) OVER (ORDER BY n) hi_range
                          FROM numbers)));

  MYMEDIAN
----------
         5

Using these methods the count values have no scaling impact on finding the mean, median, and mode. The total number of value/count pairs will still have an impact but those are processed with standard aggregates or analytic functions which are already efficient for even large volumes.

I hope you these techniques helpful. Questions and comments, as always, are welcome.

2 thoughts on “Mean, Median, and Mode of summarized data”

Iudith Mentzel says:

2022-09-23 at 14:07

Hi Sean,

Just a short remark:

For calculating the “mode”, there is an aggregate function available, I think since version 10g, which, true, went a little unobserved …
So, you can simply do the following:

SELECT STATS_MODE(n)
FROM numbers;

Very nice post, as always 🙂

Cheers & Best Regards,
Iudith Mentzel
1. Sean D. Stuber says:
  
  2022-09-23 at 21:24
  
  Thank you very much for pointing out that oversight. I have added it to the examples above.
  As an added bonus, while including and documenting it, I noticed my initial example also suffered from one of the limitations of STATS_MODE- that is, it only returned a single value.
  So I’ve replaced that example with syntax that will return multiple modes and described modifications if you wanted to return a single value but not leave it up to Oracle to decide which one among the ties will be returned.
  
  Thanks for reading, and thanks again for the note!

Comments are closed.