How to calculate the median per group with MySQL

Median of Group in MySQL

The other day I had to calculate the median per group within MySQL. Turns out this is not as easy as it sounds mostly due to the fact that there exists no median() function in MySQL. After quite some trial and error and some helpful tipps I figured it out though. Especially for people that are not that experienced with MySQL, here is my own writeup of the solution.

Here’s the example we’re starting with. A series of data points for a two groups of candidates (male and female). We want to calculate the median height per group.

gender height
male 1.85
male 1.75
male 1.6
female 1.55
female 1.6
male 1.6
male 2.01
female 1.8

We’re starting very easy by selecting both columns and sorting them by group and height.

SELECT 
	gender, height
FROM
	heights
ORDER BY gender , height asc

Returning the following result:

gender height
female 1.55
female 1.6
female 1.8
male 1.6
male 1.6
male 1.75
male 1.85
male 2.01

Next up, we’ll need to count the rows per group and add it to the table as another column.

SELECT 
	gender,
	height,
	(SELECT 
		COUNT(*)
		FROM
			heights
		WHERE
			a.gender = gender) AS total_of_group
FROM
	(SELECT 
		gender, height
	FROM
		heights
	ORDER BY gender , height) AS a

Resulting:

gender height total_of_group
female 1.55 3
female 1.6 3
female 1.8 3
male 1.6 5
male 1.6 5
male 1.75 5
male 1.85 5
male 2.01 5

Now the select becomes a little bit more complex but this is the real important part. We’re adding two variables row_number and median_group. Row number should count up 1 each row until we’re hitting the next group, then it should start again at 1. We’re achieving this with the median_group helper variable. At each result the median_group variable is set to the rows group name (here gender). The case select detects if there has been a change of the groups name and resets the counter accordingly.

SET @row_number:=0; 
SET @median_group:='';

SELECT 
	@row_number:=CASE
		WHEN @median_group = gender THEN @row_number + 1
		ELSE 1
	END AS count_of_group,
	@median_group:=gender AS median_group,
	gender,
	height,
	(SELECT 
			COUNT(*)
		FROM
			heights
		WHERE
			a.gender = gender) AS total_of_group
FROM
	(SELECT 
		gender, height
	FROM
		heights
	ORDER BY gender , height) AS a

Result:

count_of_group median_group gender height total_of_group
1 female female 1.55 3
2 female female 1.6 3
3 female female 1.8 3
1 male male 1.6 5
2 male male 1.6 5
3 male male 1.75 5
4 male male 1.85 5
5 male male 2.01 5

Now we have everything to actually calculate the median. We’re wrapping another select around the current one and apply the median calculations.

By dividing the count of the group by 2 (total_of_group / 2.0) and doing the same adding 1 (total_of_group / 2.0 + 1) we’re defining a search range for our select that we apply to the count_of_group. What this means is that we select only those rows that lie between this range, effectively selecting the middle of each groups running count.

We’re also applying an average to the select (AVG(height)) because there might not always be only one row that gets selected for each group.

SET @row_number:=0; 
SET @median_group:='';

SELECT 
	median_group, AVG(height) AS median
FROM
	(SELECT 
		@row_number:=CASE
				WHEN @median_group = gender THEN @row_number + 1
				ELSE 1
			END AS count_of_group,
			@median_group:=gender AS median_group,
			gender,
			height,
			(SELECT 
					COUNT(*)
				FROM
					heights
				WHERE
					a.gender = gender) AS total_of_group
	FROM
		(SELECT 
		gender, height
	FROM
		heights
	ORDER BY gender , height) AS a) AS b
WHERE
	count_of_group BETWEEN total_of_group / 2.0 AND total_of_group / 2.0 + 1
GROUP BY median_group

Final result:

median_group median
female 1.6
male 1.75

And we’re done. Exactly what we wanted to achieve. Here is the SQL Fiddle of the whole thing. Please let me know if this post helped you or if there is an easier way of doing this.