In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes

`SELECT MEDIAN(salary) FROM employee`

But for other databases, things begin to be more tricky. Let’s take **MySQL** first. One way to compute the median is the following:

`SELECT AVG(salary) FROM (`

SELECT salary FROM employee

WHERE salary IS NOT NULL

ORDER by salary ASC

LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For **Postgresql**, the query is similar to the MySQL query and uses LIMIT too.

`SELECT AVG(salary) FROM (`

SELECT salary FROM employee

WHERE salary IS NOT NULL

ORDER by salary ASC

LIMIT p OFFSET n) T

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).

For **Oracle 9i**, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:

`SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)`

FROM employee

For **DB2**, the following query is used in Talend Open Profiler:

`SELECT AVG(salary) FROM (`

SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee

) x

WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )

OR

( MOD(total,2) = 1 AND rn = next )

For **Microsoft SQL Server**, we used the TOP clause as follows

`SELECT AVG(CAST(salary AS NUMERIC)) FROM (`

SELECT TOP n salary FROM (

SELECT TOP m salary FROM employee

WHERE salary IS NOT NULL ORDER BY salary ASC

) AS FOO

ORDER BY salary DESC

) AS BAR

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on **Sybase ASE** is the following:

`SELECT AVG(CAST (salary AS NUMERIC)) FROM (`

SELECT DISTINCT salary FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) STT

WHERE

(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) AS SOU

WHERE SOU.salary <= STT.salary)

AND

(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (

SELECT salary FROM employee

UNION ALL

SELECT salary FROM employee

) AS SUR

WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.