How to calculate the median per group with MySQL
• public
The other day I had to calculate the median per group within MySQL. Turns out this is not as easy as it sounds mostly due to the fact that there exists no median() function in MySQL. After quite some trial and error and some helpful tipps I figured it out though. Especially for people that are not that experienced with MySQL, here is my own writeup of the solution.
Here’s the example we’re starting with. A series of data points for a two groups of candidates (male and female). We want to calculate the median height per group.
gender | height |
---|---|
male | 1.85 |
male | 1.75 |
male | 1.6 |
female | 1.55 |
female | 1.6 |
male | 1.6 |
male | 2.01 |
female | 1.8 |
We’re starting very easy by selecting both columns and sorting them by group and height.
SELECT gender, height FROM heights ORDER BY gender , height asc
Returning the following result:
gender | height |
---|---|
female | 1.55 |
female | 1.6 |
female | 1.8 |
male | 1.6 |
male | 1.6 |
male | 1.75 |
male | 1.85 |
male | 2.01 |
Next up, we’ll need to count the rows per group and add it to the table as another column.
SELECT gender, height, (SELECT COUNT(*) FROM heights WHERE a.gender = gender) AS total_of_group FROM (SELECT gender, height FROM heights ORDER BY gender , height) AS a
Resulting:
gender | height | total_of_group |
---|---|---|
female | 1.55 | 3 |
female | 1.6 | 3 |
female | 1.8 | 3 |
male | 1.6 | 5 |
male | 1.6 | 5 |
male | 1.75 | 5 |
male | 1.85 | 5 |
male | 2.01 | 5 |
Now the select becomes a little bit more complex but this is the real important part. We’re adding two variables row_number and median_group. Row number should count up 1 each row until we’re hitting the next group, then it should start again at 1. We’re achieving this with the median_group helper variable. At each result the median_group variable is set to the rows group name (here gender). The case select detects if there has been a change of the groups name and resets the counter accordingly.
SET @row_number:=0; SET @median_group:=''; SELECT @row_number:=CASE WHEN @median_group = gender THEN @row_number + 1 ELSE 1 END AS count_of_group, @median_group:=gender AS median_group, gender, height, (SELECT COUNT(*) FROM heights WHERE a.gender = gender) AS total_of_group FROM (SELECT gender, height FROM heights ORDER BY gender , height) AS a
Result:
count_of_group | median_group | gender | height | total_of_group |
---|---|---|---|---|
1 | female | female | 1.55 | 3 |
2 | female | female | 1.6 | 3 |
3 | female | female | 1.8 | 3 |
1 | male | male | 1.6 | 5 |
2 | male | male | 1.6 | 5 |
3 | male | male | 1.75 | 5 |
4 | male | male | 1.85 | 5 |
5 | male | male | 2.01 | 5 |
Now we have everything to actually calculate the median. We’re wrapping another select around the current one and apply the median calculations.
By dividing the count of the group by 2 (total_of_group / 2.0) and doing the same adding 1 (total_of_group / 2.0 + 1) we’re defining a search range for our select that we apply to the count_of_group. What this means is that we select only those rows that lie between this range, effectively selecting the middle of each groups running count.
We’re also applying an average to the select (AVG(height)) because there might not always be only one row that gets selected for each group.
SET @row_number:=0; SET @median_group:=''; SELECT median_group, AVG(height) AS median FROM (SELECT @row_number:=CASE WHEN @median_group = gender THEN @row_number + 1 ELSE 1 END AS count_of_group, @median_group:=gender AS median_group, gender, height, (SELECT COUNT(*) FROM heights WHERE a.gender = gender) AS total_of_group FROM (SELECT gender, height FROM heights ORDER BY gender , height) AS a) AS b WHERE count_of_group BETWEEN total_of_group / 2.0 AND total_of_group / 2.0 + 1 GROUP BY median_group
Final result:
median_group | median |
---|---|
female | 1.6 |
male | 1.75 |
And we’re done. Exactly what we wanted to achieve. Here is the SQL Fiddle of the whole thing. Please let me know if this post helped you or if there is an easier way of doing this.