How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
SELECT MEDIAN(salary) FROM employee

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT p OFFSET n) T

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employee

For DB2, the following query is used in Talend Open Profiler:
SELECT AVG(salary) FROM (
SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee
) x
WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )
OR
( MOD(total,2) = 1 AND rn = next )

For Microsoft SQL Server, we used the TOP clause as follows
SELECT AVG(CAST(salary AS NUMERIC)) FROM (
SELECT TOP n salary FROM (
SELECT TOP m salary FROM employee
WHERE salary IS NOT NULL ORDER BY salary ASC
) AS FOO
ORDER BY salary DESC
) AS BAR

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT AVG(CAST (salary AS NUMERIC)) FROM (
SELECT DISTINCT salary FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) STT
WHERE
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SOU
WHERE SOU.salary <= STT.salary)
AND
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SUR
WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.

Advertisements