How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
`SELECT MEDIAN(salary) FROM employee`

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
```SELECT AVG(salary) FROM ( SELECT salary FROM employee WHERE salary IS NOT NULL ORDER by salary ASC LIMIT n,p) T```
where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
```SELECT AVG(salary) FROM ( SELECT salary FROM employee WHERE salary IS NOT NULL ORDER by salary ASC LIMIT p OFFSET n) T```
This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
```SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) FROM employee```
For DB2, the following query is used in Talend Open Profiler:
```SELECT AVG(salary) FROM ( SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee ) x WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) ) OR ( MOD(total,2) = 1 AND rn = next )```
For Microsoft SQL Server, we used the TOP clause as follows
```SELECT AVG(CAST(salary AS NUMERIC)) FROM ( SELECT TOP n salary FROM ( SELECT TOP m salary FROM employee WHERE salary IS NOT NULL ORDER BY salary ASC ) AS FOO ORDER BY salary DESC ) AS BAR ```
where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
```SELECT AVG(CAST (salary AS NUMERIC)) FROM ( SELECT DISTINCT salary FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) STT WHERE (SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) AS SOU WHERE SOU.salary <= STT.salary) AND (SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) AS SUR WHERE SUR.salary >= STT.salary) ) T```
This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.

12 responses

5 04 2010

In Sybase SQL Anywhere (v 11.0):
The CUME_DIST function provides a simple method to determine the median of a set of values. CUME_DIST can be used to compute the median value successfully in the face of ties and whether the input contains an even or odd number of rows. Essentially, you need only determine the first row with a CUME_DIST value of greater than or equal to 0.5.

The following query returns the product information for the product with the median unit price:

SELECT FIRST *
FROM ( SELECT Description, Quantity, UnitPrice,
CUME_DIST( ) OVER ( ORDER BY UnitPrice ASC ) AS CDist
FROM Products ) As DT
WHERE CDist >= 0.5
ORDER BY CDist;

That’s a straight lift from the help file; hope it works!

5 04 2010

Thanks Michael. I’ll check the version of our Sybase server and I’ll try this CUME_DIST function.

6 04 2010

Shouldn’t it be LIMIT n, p in the mysql variant?

6 04 2010

Thanks Goran,
you’re right, the correct syntax is LIMIT offset, rows in MySQL. Hence it is indeed LIMIT n,p where n is the offset and p is the number of returned rows. I fix this error immediately and also for Postgresql where I inverted n and p too.
Now it’s fixed.

23 07 2010

I’m a little late to this party, but the following seems to work for me in Sybase and only uses ABS() and AVG(). Am I missing some boundary condition?

select avg(salary) from employee t where abs((select count(*) from employee where salary >=t.salary) – (select count(*) from employee where salary <=t.salary)) <= 1

19 11 2010

For MS-SQL:
WITH
tt AS
(SELECT Salary AS VAL
FROM employee)
SELECT AVG(VAL)
FROM (SELECT TOP(2-((SELECT COUNT(*) FROM tt) % 2)) VAL
FROM (SELECT TOP((SELECT COUNT(*) FROM tt)/2+1) VAL
FROM tt
ORDER BY VAL asc) AS T1
ORDER BY VAL desc) AS T

7 04 2011

Yes I was missing something in my previous post. This one does not fail when the median value appears multiple times in the table. I still like how it only uses abs and avg and so works on very old Transact SQL varieties (Sybase/MS). Note that it is slow for more than about 50000 rows.

select avg(salary) from employee t
where abs((select count(*) from employee where salary >=t.salary)-
(select count(*) from employee where salary <=t.salary)) <=
(select count(*) from employee where salary=t.salary)

29 08 2011

On Ingres VectorWise currently:

select avg(c) from (
select c, rn from (
select
c,
row_number() over (order by c asc) as rn
from t
) t1
where rn >= (select cast(((select count(*) from t) / 2.0 + 0.5) as int))
and rn <= (select cast(((select count(*) from t) / 2.0 + 1.0) as int))
) r

Please, correct me, if it's wrong 😉

16 01 2012

This is great, but doesn’t make sense for the MS solution. Perhaps should be:

SELECT AVG(CAST(salary AS NUMERIC)) FROM (
SELECT TOP p salary FROM (
SELECT TOP m salary FROM employee
WHERE salary IS NOT NULL ORDER BY salary ASC
) AS FOO
ORDER BY salary DESC
) AS BAR

as you are using average for 2 values in the middle not m or n values..

16 01 2012

Sorry it shouldn’t be m or p: imho it should be 2 (N is an even number) and 1 (N is odd number)

16 01 2012

I insist: please test and have a look at the execution plan! For me the best version for SQL Server was indeed:
WITH
tt AS
(SELECT Salary AS VAL
FROM employee)
SELECT AVG(VAL)
FROM (SELECT TOP(2-((SELECT COUNT(*) FROM tt) % 2)) VAL
FROM (SELECT TOP((SELECT COUNT(*) FROM tt)/2+1) VAL
FROM tt
ORDER BY VAL asc) AS T1
ORDER BY VAL desc) AS T

19 04 2012

what is mean difference between sql server 2005 and sql server 2008