In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes

`SELECT MEDIAN(salary) FROM employee`

But for other databases, things begin to be more tricky. Let’s take **MySQL** first. One way to compute the median is the following:

`SELECT AVG(salary) FROM (`

SELECT salary FROM employee

WHERE salary IS NOT NULL

ORDER by salary ASC

LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For **Postgresql**, the query is similar to the MySQL query and uses LIMIT too.

`SELECT AVG(salary) FROM (`

SELECT salary FROM employee

WHERE salary IS NOT NULL

ORDER by salary ASC

LIMIT p OFFSET n) T

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).

For **Oracle 9i**, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:

`SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)`

FROM employee

For **DB2**, the following query is used in Talend Open Profiler:

`SELECT AVG(salary) FROM (`

SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee

) x

WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )

OR

( MOD(total,2) = 1 AND rn = next )

For **Microsoft SQL Server**, we used the TOP clause as follows

`SELECT AVG(CAST(salary AS NUMERIC)) FROM (`

SELECT TOP n salary FROM (

SELECT TOP m salary FROM employee

WHERE salary IS NOT NULL ORDER BY salary ASC

) AS FOO

ORDER BY salary DESC

) AS BAR

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on **Sybase ASE** is the following:

`SELECT AVG(CAST (salary AS NUMERIC)) FROM (`

SELECT DISTINCT salary FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) STT

WHERE

(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) AS SOU

WHERE SOU.salary <= STT.salary)

AND

(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) AS SUR

WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.

MichaelR(12:42:52) :In Sybase SQL Anywhere (v 11.0):

The CUME_DIST function provides a simple method to determine the median of a set of values. CUME_DIST can be used to compute the median value successfully in the face of ties and whether the input contains an even or odd number of rows. Essentially, you need only determine the first row with a CUME_DIST value of greater than or equal to 0.5.

The following query returns the product information for the product with the median unit price:

SELECT FIRST *

FROM ( SELECT Description, Quantity, UnitPrice,

CUME_DIST( ) OVER ( ORDER BY UnitPrice ASC ) AS CDist

FROM Products ) As DT

WHERE CDist >= 0.5

ORDER BY CDist;

That’s a straight lift from the help file; hope it works!

scorreia(14:59:13) :Thanks Michael. I’ll check the version of our Sybase server and I’ll try this CUME_DIST function.

goran(19:18:49) :Shouldn’t it be LIMIT n, p in the mysql variant?

scorreia(21:07:24) :Thanks Goran,

you’re right, the correct syntax is

LIMIT offset, rowsin MySQL. Hence it is indeed LIMIT n,p where n is the offset and p is the number of returned rows. I fix this error immediately and also for Postgresql where I inverted n and p too.Now it’s fixed.

(kendall)(17:54:45) :I’m a little late to this party, but the following seems to work for me in Sybase and only uses ABS() and AVG(). Am I missing some boundary condition?

select avg(salary) from employee t where abs((select count(*) from employee where salary >=t.salary) – (select count(*) from employee where salary <=t.salary)) <= 1

Olivier Comte(12:54:34) :For MS-SQL:

WITH

tt AS

(SELECT Salary AS VAL

FROM employee)

SELECT AVG(VAL)

FROM (SELECT TOP(2-((SELECT COUNT(*) FROM tt) % 2)) VAL

FROM (SELECT TOP((SELECT COUNT(*) FROM tt)/2+1) VAL

FROM tt

ORDER BY VAL asc) AS T1

ORDER BY VAL desc) AS T

(kendall)(11:45:46) :Yes I was missing something in my previous post. This one does not fail when the median value appears multiple times in the table. I still like how it only uses abs and avg and so works on very old Transact SQL varieties (Sybase/MS). Note that it is slow for more than about 50000 rows.

select avg(salary) from employee t

where abs((select count(*) from employee where salary >=t.salary)-

(select count(*) from employee where salary <=t.salary)) <=

(select count(*) from employee where salary=t.salary)

Konrad Procak(13:48:34) :On Ingres VectorWise currently:

select avg(c) from (

select c, rn from (

select

c,

row_number() over (order by c asc) as rn

from t

) t1

where rn >= (select cast(((select count(*) from t) / 2.0 + 0.5) as int))

and rn <= (select cast(((select count(*) from t) / 2.0 + 1.0) as int))

) r

Please, correct me, if it's wrong ;)

dinabee(17:37:08) :This is great, but doesn’t make sense for the MS solution. Perhaps should be:

SELECT AVG(CAST(salary AS NUMERIC)) FROM (

SELECT TOP p salary FROM (

SELECT TOP m salary FROM employee

WHERE salary IS NOT NULL ORDER BY salary ASC

) AS FOO

ORDER BY salary DESC

) AS BAR

as you are using average for 2 values in the middle not m or n values..

dinabee(17:41:02) :Sorry it shouldn’t be m or p: imho it should be 2 (N is an even number) and 1 (N is odd number)

Olivier Comte(20:03:58) :I insist: please test and have a look at the execution plan! For me the best version for SQL Server was indeed:

WITH

tt AS

(SELECT Salary AS VAL

FROM employee)

SELECT AVG(VAL)

FROM (SELECT TOP(2-((SELECT COUNT(*) FROM tt) % 2)) VAL

FROM (SELECT TOP((SELECT COUNT(*) FROM tt)/2+1) VAL

FROM tt

ORDER BY VAL asc) AS T1

ORDER BY VAL desc) AS T

jitendra mali(20:06:31) :what is mean difference between sql server 2005 and sql server 2008