Sunday, September 24, 2023

Median and Quartiles by Group in SQLite

Data distributions are often characterized by their range mean, median, and quartiles.  For data sets in SQLite, the range (min and max) and mean are easily calculated with built-in aggregate functions.  Calculation of quartiles (the first quartile, the median, and the third quartile) require custom SQL code, as there are no built-in aggregate functions for these values.  This previous post contains code to calculate the median, but that technique cannot be extended to calculate the first and third quartiles.

There are other available examples of calculation of quartiles in SQLite, but (as of this writing) all of them (that I have found) yield a result that exactly matches one of the values in the data set.  These results may be only approximate for some data sets, specifically when there is an even number of rows overall or within two adjacent quartiles.

The code below takes the same approach to calculating the first and third quartiles as is standard for the median:

  • If there is an odd number of rows, take the middle value
  • If there is an even number of rows, take the average of the highest value in the lower bin (i.e., the lower half for the median) and the lowest value in the upper bin.

The solution presented here differs from others in accommodating the second of these cases.  In addition, it allows for calculation of these statistics for subgroups of the overall data set.

This illustration is designed to be run on a table with the following columns:

  • "id": A categorical column used to divide the data set into subgroups.  Statistics will be calculated for each unique "id".
  • "value": A numerical column for which the median and quartiles will be calculated.

Such a table can be produced with the following code:

create table testdata as
select *
from
	(select value as id from generate_series(1,8)) as d
	cross join (select random() as value from generate_series(1, 10000));

This approach uses SQLite's 'ntile()' function to subdivide the data set (or each subgroup) into equal or nearly-equal subsets.  The sizes and bounding values for each of these subsets are then extracted to allow determination of which of the two cases listed above should be used for each quartile and the median.

with nt as (
	SELECT
		id, value,
		ntile(4) over (partition by id order by value) as ntile_4
	FROM
		testdata
	),
idlist as (select distinct id from nt),
q1 as (
	select id, max(value) as q1_max, count(*) as q1_n
	from nt where ntile_4 = 1 group by id),
q2 as (
	select id, min(value) as q2_min, max(value) as q2_max, count(*) as q2_n
	from nt where ntile_4 = 2 group by id),
q3 as (
	select id, min(value) as q3_min, max(value) as q3_max, count(*) as q3_n
	from nt where ntile_4 = 3 group by id),
q4 as (
	select id, min(value) as q4_min, count(*) as q4_n
	from nt where ntile_4 = 4 group by id),
s as (
	SELECT
		st.id,
		q1_max, q1_n,
		q2_min, q2_max, q2_n,
		q3_min , q3_max , q3_n,
		q4_min, q4_n
	from
		idlist as st
		left join q1 on q1.id=st.id
		left join q2 on q2.id=st.id
		left join q3 on q3.id=st.id
		left join q4 on q4.id=st.id)
select
	id,
	case when q1_n = q2_n then (q1_max + q2_min)/2
		else case when q1_n > q2_n then q1_max else q2_min end end as q1,
	case when q1_n + q2_n = q3_n + q4_n then (q2_max + q3_min)/2
		else case when q1_n + q2_n > q3_n + q4_n then q2_max else q3_min end end as median,
	case when q3_n = q4_n then (q3_max + q4_min)/2
		else case when q3_n > q4_n then q3_max else q4_min end end as q3
from s
order by id
;

No comments:

Post a Comment