The other day I had to calculate the median per group within MySQL. Turns out this is not as easy as it sounds mostly due to the fact that there exists no median() function in MySQL. After quite some trial and error and some helpful tipps I figured it out though. Especially for people that are not that experienced with MySQL, here is my own writeup of the solution.
Here’s the example we’re starting with. A series of data points for a two groups of candidates (male and female). We want to calculate the median height per group.
gender | height |
---|---|
male | 1.85 |
male | 1.75 |
male | 1.6 |
female | 1.55 |
female | 1.6 |
male | 1.6 |
male | 2.01 |
female | 1.8 |
We’re starting very easy by selecting both columns and sorting them by group and height.
SELECT
gender, height
FROM
heights
ORDER BY gender , height asc
Returning the following result:
gender | height |
---|---|
female | 1.55 |
female | 1.6 |
female | 1.8 |
male | 1.6 |
male | 1.6 |
male | 1.75 |
male | 1.85 |
male | 2.01 |
Next up, we’ll need to count the rows per group and add it to the table as another column.
SELECT
gender,
height,
(SELECT
COUNT(*)
FROM
heights
WHERE
a.gender = gender) AS total_of_group
FROM
(SELECT
gender, height
FROM
heights
ORDER BY gender , height) AS a
Resulting:
gender | height | total_of_group |
---|---|---|
female | 1.55 | 3 |
female | 1.6 | 3 |
female | 1.8 | 3 |
male | 1.6 | 5 |
male | 1.6 | 5 |
male | 1.75 | 5 |
male | 1.85 | 5 |
male | 2.01 | 5 |
Now the select becomes a little bit more complex but this is the real important part. We’re adding two variables row_number and median_group. Row number should count up 1 each row until we’re hitting the next group, then it should start again at 1. We’re achieving this with the median_group helper variable. At each result the median_group variable is set to the rows group name (here gender). The case select detects if there has been a change of the groups name and resets the counter accordingly.
SET @row_number:=0;
SET @median_group:='';
SELECT
@row_number:=CASE
WHEN @median_group = gender THEN @row_number + 1
ELSE 1
END AS count_of_group,
@median_group:=gender AS median_group,
gender,
height,
(SELECT
COUNT(*)
FROM
heights
WHERE
a.gender = gender) AS total_of_group
FROM
(SELECT
gender, height
FROM
heights
ORDER BY gender , height) AS a
Result:
count_of_group | median_group | gender | height | total_of_group |
---|---|---|---|---|
1 | female | female | 1.55 | 3 |
2 | female | female | 1.6 | 3 |
3 | female | female | 1.8 | 3 |
1 | male | male | 1.6 | 5 |
2 | male | male | 1.6 | 5 |
3 | male | male | 1.75 | 5 |
4 | male | male | 1.85 | 5 |
5 | male | male | 2.01 | 5 |
Now we have everything to actually calculate the median. We’re wrapping another select around the current one and apply the median calculations.
By dividing the count of the group by 2 (total_of_group / 2.0) and doing the same adding 1 (total_of_group / 2.0 + 1) we’re defining a search range for our select that we apply to the count_of_group. What this means is that we select only those rows that lie between this range, effectively selecting the middle of each groups running count.
We’re also applying an average to the select (AVG(height)) because there might not always be only one row that gets selected for each group.
SET @row_number:=0;
SET @median_group:='';
SELECT
median_group, AVG(height) AS median
FROM
(SELECT
@row_number:=CASE
WHEN @median_group = gender THEN @row_number + 1
ELSE 1
END AS count_of_group,
@median_group:=gender AS median_group,
gender,
height,
(SELECT
COUNT(*)
FROM
heights
WHERE
a.gender = gender) AS total_of_group
FROM
(SELECT
gender, height
FROM
heights
ORDER BY gender , height) AS a) AS b
WHERE
count_of_group BETWEEN total_of_group / 2.0 AND total_of_group / 2.0 + 1
GROUP BY median_group
Final result:
median_group | median |
---|---|
female | 1.6 |
male | 1.75 |
And we’re done. Exactly what we wanted to achieve. Here is the SQL Fiddle of the whole thing. Please let me know if this post helped you or if there is an easier way of doing this.