If you have a column of values, Oracle provides functions to find each of these values: AVG() for the mean (average), MEDIAN() for the median, and STATS_MODE for the mode. STATS_MODE came in 10g. Before that the mode is also possible but requires extra steps for counting and sorting.
For example, given the values 1, 2, 2, 3, 5, 7 ,8, the mean, median, and mode are 4, 3, and 2 respectively.
CREATE TABLE numbers (n) AS (SELECT 1 FROM DUAL UNION ALL SELECT 2 FROM DUAL UNION ALL SELECT 2 FROM DUAL UNION ALL SELECT 3 FROM DUAL UNION ALL SELECT 5 FROM DUAL UNION ALL SELECT 7 FROM DUAL UNION ALL SELECT 8 FROM DUAL); SELECT AVG(n) FROM numbers; AVG(N) ---------- 4 SELECT MEDIAN(n), PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY n) pd -- 9i FROM numbers; MEDIAN(N) PD ---------- ---------- 3 3 SELECT STATS_MODE(n) FROM numbers; STATS_MODE(N) ------------- 2
That’s relatively straightforward; and many data sets can be processed in this manner. Note the MEDIAN value can also be calculated with the 50% percentile of the PERCENTILE_DISC function. MEDIAN was introduced in 10g. In 9i you needed PERCENTILE_DISC. In 8i or lower you would need to find the median yourself, probably with self-joins or sub-queries to find the middle element of the sorted values.
Also note, STATS_MODE only returns a single value, even if there are multiple values tied with the same count. According to the Oracle SQL Reference: “If more than one mode exists, then Oracle Database chooses one and returns only that one value.” Unfortunately, there is no way provided, as of 21c, to control which value will be returned. In my testing it has been the lowest of all mode values but that is not documented behavior, so I would not count on that always being the case.
To return all modes you can sort the values by their counts and then return those with the greatest counts. The following examples I use row limiting clause and RANK function to perform these steps. DENSE_RANK would also work here. Also, the RANK/DENSE_RANK solution would work for versions earlier than 10g as well. You could also use this method with a modified order by to return just a single value as well. Eliminating the “WITH TIES” portion of the row limiting clause. You could also switch RANK for ROW_NUMBER in the case of a single value as well.
SELECT DISTINCT n FROM (SELECT n, COUNT(*) OVER (PARTITION BY n) cnt FROM numbers) ORDER BY cnt DESC FETCH FIRST 1 ROW WITH TIES; SELECT DISTINCT n FROM (SELECT n, RANK() OVER (ORDER BY cnt DESC) rnk FROM (SELECT n, COUNT(*) OVER (PARTITION BY n) cnt FROM numbers)) WHERE rnk = 1; N ---------- 2 8
However, some data is stored in an already summarized form. That is, values with counts. None of the functions above can be applied directly to a table of this form.
create table numbers (n, cnt) AS (SELECT 1, 3 FROM DUAL UNION ALL SELECT 2, 9 FROM DUAL UNION ALL SELECT 3, 5 FROM DUAL UNION ALL SELECT 4, 1 FROM DUAL UNION ALL SELECT 5, 2 FROM DUAL UNION ALL SELECT 6, 20 FROM DUAL);
A simple way to approach data of this structure would be to expand the rows using the counts to create a number of rows matching the count for each value. Thus, for the data above, there would be three rows of 1, nine rows of 2, five rows of 3, one row of 4, two rows of 5, and twenty rows of 6.
There are multiple ways of doing the expansion, I’ll use a recursive query to generate row copies, and then apply the functions above on them.
WITH list_all (n, idx, cnt) AS (SELECT n, 1 idx, cnt FROM numbers UNION ALL SELECT n, idx + 1, cnt FROM list_all WHERE idx < cnt) SELECT DISTINCT n, cnt, AVG(n) OVER() mymean, PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY n) OVER() pd, -- 9i MEDIAN(n) OVER() mymedian, -- 10g or higher LAST_VALUE(n) OVER(ORDER BY cnt ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) mymode FROM list_all ORDER BY n; N CNT MYMEAN PD MYMEDIAN MYMODE ---------- ---------- ---------- ---------- ---------- ---------- 1 3 4.25 5 5.5 6 2 9 4.25 5 5.5 6 3 5 4.25 5 5.5 6 4 1 4.25 5 5.5 6 5 2 4.25 5 5.5 6 6 20 4.25 5 5.5 6
That works for relatively low counts; but it does not scale well as the counts grow. Not only do the functions have to process all those rows, but the act of generating the rows becomes an expense as well. Using the sample data above the query executed in less than 0.1 seconds on my test system. With counts in the hundreds of thousands, the query still functions but the execution time extended to an unacceptable 70 seconds.
CREATE TABLE numbers ( n, cnt ) AS (SELECT 1, 300000 FROM DUAL UNION ALL SELECT 2, 900000 FROM DUAL UNION ALL SELECT 3, 500000 FROM DUAL UNION ALL SELECT 4, 100000 FROM DUAL UNION ALL SELECT 5, 200000 FROM DUAL UNION ALL SELECT 6, 2000000 FROM DUAL);
With these larger counts, rather than generating the rows, we can calculate the mean, median, and mode directly.
The mean (average) is simple enough, it’s just dividing the sum of the product of the values and counts by the sum of the counts.
SELECT SUM(n * cnt) / SUM(cnt) mymean FROM numbers; MYMEAN ---------- 4.25
Mode is quite easy, simply sorting by the counts in descending order and returning the top value. There can be more than one mode in a data set, so you should return ties. You could do this with RANK/DENSE_RANK or a row-limiting clause.
SELECT n, cnt FROM numbers ORDER BY cnt DESC FETCH FIRST 1 ROW WITH TIES; SELECT n FROM (SELECT n, RANK() OVER (ORDER BY cnt DESC) rnk FROM numbers) WHERE rnk = 1; N CNT ---------- ---------- 6 2000000
If, for some reason, you only wanted a single value returned among ties, then as mentioned above with unsummarized ties – you could modify the ORDER BY clauses in these examples to control which value is returned among ties. Eliminating “WITH TIES” from the row-limiting clause and optionally using ROW_NUMBER instead of RANK or DENSE_RANK would work then select the single value.
While those are fairly straightforward, calculating the median value is a little trickier. Without generating all of the rows we still need to find which value would be in the middle of the set if the rows did exist.
To that end, I envision the set in ranges of sorted values. For the data above, there are four million total values. The first 300000 of them are value 1. Value 2 would be the next 900000 values in order, followed by 500000 3s, 100000 4s, 200000 5s, and finally 2000000 6s. The low and high indexes of each value then can be determined by summing the counts up to that value.
SELECT n, cnt, cnt_total, lo_range, hi_range FROM (SELECT n, cnt, SUM(cnt) OVER() cnt_total, NVL( SUM(cnt) OVER(ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + 1 lo_range, SUM(cnt) OVER (ORDER BY n) hi_range FROM numbers); N CNT CNT_TOTAL LO_RANGE HI_RANGE ---------- ---------- ---------- ---------- ---------- 1 300000 4000000 1 300000 2 900000 4000000 300001 1200000 3 500000 4000000 1200001 1700000 4 100000 4000000 1700001 1800000 5 200000 4000000 1800001 2000000 6 2000000 4000000 2000001 4000000
With 4 million values the median would be the average of the value at the 2000000th index and the value at the 2000001st index. Looking at the boundaries of the ranges we can see the median would fall between 5 and 6.
SELECT MIN( CASE WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) lo_n, MAX( CASE WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) hi_n FROM (SELECT n, cnt, cnt_total, lo_range, hi_range FROM (SELECT n, cnt, SUM(cnt) OVER() cnt_total, NVL( SUM(cnt) OVER( ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + 1 lo_range, SUM(cnt) OVER (ORDER BY n) hi_range FROM numbers)); LO_N HI_N ---------- ---------- 5 6
Thus the median is (5+6)/2 or 5.5.
SELECT (lo_n + hi_n) / 2 mymedian FROM (SELECT MIN( CASE WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) lo_n, MAX( CASE WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) hi_n FROM (SELECT n, cnt, cnt_total, lo_range, hi_range FROM (SELECT n, cnt, SUM(cnt) OVER() cnt_total, NVL( SUM(cnt) OVER( ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + 1 lo_range, SUM(cnt) OVER (ORDER BY n) hi_range FROM numbers))); MYMEDIAN ---------- 5.5
In this example, the number of values was even, so the median was an average of the two middle values. With an odd number of values, there will be a single value in the middle. The query above still works.
I’ll adjust the count of the 6 values up by 1 and you can see the median move from 5.5 up to 6 based on the new total.
UPDATE numbers SET cnt = 2000001 WHERE n = 6; SELECT (lo_n + hi_n) / 2 mymedian FROM (SELECT MIN( CASE WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) lo_n, MAX( CASE WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) hi_n FROM (SELECT n, cnt, cnt_total, lo_range, hi_range FROM (SELECT n, cnt, SUM(cnt) OVER() cnt_total, NVL( SUM(cnt) OVER( ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + 1 lo_range, SUM(cnt) OVER (ORDER BY n) hi_range FROM numbers))); MYMEDIAN ---------- 6
Now I’ll adjust the count of the 6 values down and you can see the median move down to 5 based on the new total.
UPDATE numbers SET cnt = 1999999 WHERE n = 6; SELECT (lo_n + hi_n) / 2 mymedian FROM (SELECT MIN( CASE WHEN FLOOR((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) lo_n, MAX( CASE WHEN CEIL((cnt_total + 1) / 2) BETWEEN lo_range AND hi_range THEN n END) hi_n FROM (SELECT n, cnt, cnt_total, lo_range, hi_range FROM (SELECT n, cnt, SUM(cnt) OVER() cnt_total, NVL( SUM(cnt) OVER( ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) + 1 lo_range, SUM(cnt) OVER (ORDER BY n) hi_range FROM numbers))); MYMEDIAN ---------- 5
Using these methods the count values have no scaling impact on finding the mean, median, and mode. The total number of value/count pairs will still have an impact but those are processed with standard aggregates or analytic functions which are already efficient for even large volumes.
I hope you these techniques helpful. Questions and comments, as always, are welcome.
Hi Sean,
Just a short remark:
For calculating the “mode”, there is an aggregate function available, I think since version 10g, which, true, went a little unobserved …
So, you can simply do the following:
SELECT STATS_MODE(n)
FROM numbers;
Very nice post, as always 🙂
Cheers & Best Regards,
Iudith Mentzel
Thank you very much for pointing out that oversight. I have added it to the examples above.
As an added bonus, while including and documenting it, I noticed my initial example also suffered from one of the limitations of STATS_MODE- that is, it only returned a single value.
So I’ve replaced that example with syntax that will return multiple modes and described modifications if you wanted to return a single value but not leave it up to Oracle to decide which one among the ties will be returned.
Thanks for reading, and thanks again for the note!