Sunday, September 24, 2023

Median and Quartiles by Group in SQLite

Data distributions are often characterized by their range mean, median, and quartiles.  For data sets in SQLite, the range (min and max) and mean are easily calculated with built-in aggregate functions.  Calculation of quartiles (the first quartile, the median, and the third quartile) require custom SQL code, as there are no built-in aggregate functions for these values.  This previous post contains code to calculate the median, but that technique cannot be extended to calculate the first and third quartiles.

There are other available examples of calculation of quartiles in SQLite, but (as of this writing) all of them (that I have found) yield a result that exactly matches one of the values in the data set.  These results may be only approximate for some data sets, specifically when there is an even number of rows overall or within two adjacent quartiles.

The code below takes the same approach to calculating the first and third quartiles as is standard for the median:

  • If there is an odd number of rows, take the middle value
  • If there is an even number of rows, take the average of the highest value in the lower bin (i.e., the lower half for the median) and the lowest value in the upper bin.

The solution presented here differs from others in accommodating the second of these cases.  In addition, it allows for calculation of these statistics for subgroups of the overall data set.

This illustration is designed to be run on a table with the following columns:

  • "id": A categorical column used to divide the data set into subgroups.  Statistics will be calculated for each unique "id".
  • "value": A numerical column for which the median and quartiles will be calculated.

Such a table can be produced with the following code:

create table testdata as
select *
from
	(select value as id from generate_series(1,8)) as d
	cross join (select random() as value from generate_series(1, 10000));

This approach uses SQLite's 'ntile()' function to subdivide the data set (or each subgroup) into equal or nearly-equal subsets.  The sizes and bounding values for each of these subsets are then extracted to allow determination of which of the two cases listed above should be used for each quartile and the median.

with nt as (
	SELECT
		id, value,
		ntile(4) over (partition by id order by value) as ntile_4
	FROM
		testdata
	),
idlist as (select distinct id from nt),
q1 as (
	select id, max(value) as q1_max, count(*) as q1_n
	from nt where ntile_4 = 1 group by id),
q2 as (
	select id, min(value) as q2_min, max(value) as q2_max, count(*) as q2_n
	from nt where ntile_4 = 2 group by id),
q3 as (
	select id, min(value) as q3_min, max(value) as q3_max, count(*) as q3_n
	from nt where ntile_4 = 3 group by id),
q4 as (
	select id, min(value) as q4_min, count(*) as q4_n
	from nt where ntile_4 = 4 group by id),
s as (
	SELECT
		st.id,
		q1_max, q1_n,
		q2_min, q2_max, q2_n,
		q3_min , q3_max , q3_n,
		q4_min, q4_n
	from
		idlist as st
		left join q1 on q1.id=st.id
		left join q2 on q2.id=st.id
		left join q3 on q3.id=st.id
		left join q4 on q4.id=st.id)
select
	id,
	case when q1_n = q2_n then (q1_max + q2_min)/2
		else case when q1_n > q2_n then q1_max else q2_min end end as q1,
	case when q1_n + q2_n = q3_n + q4_n then (q2_max + q3_min)/2
		else case when q1_n + q2_n > q3_n + q4_n then q2_max else q3_min end end as median,
	case when q3_n = q4_n then (q3_max + q4_min)/2
		else case when q3_n > q4_n then q3_max else q4_min end end as q3
from s
order by id
;

Monday, September 4, 2023

Why Don't People Ask More Questions?

More than what?  said a person who asks a lot of questions.

This is not a new question. There are numerous summaries of reasons why people don't ask questions, and suggestions for how one can become a better question-asker. A lot of existing discussion is focused on general conversations; the focus here is specifically on technical conversations in professional settings, where the goal is to convey information from one party to another (that is, report-talk rather than rapport-talk (Tannen 1990).

This post summarizes commonly suggested reasons why people don't ask more question, touches briefly on some philosophy of communication, and suggests an additional (partial) answer that doesn't appear to be discussed elsewhere.


Common Reasons that People Don't Ask Questions


Commonly-cited reasons for a person's reluctance to ask questions are:

  • Risk aversion -- This may take several forms:
    • Unwillingness to ask what may be a stupid question, and suffer embarassment or reputational harm.
    • Unwillingness to seem pushy (this overlaps somewhat with deference to authority, below).
    • Fear of consequences -- asking questions that challenge the status quo might be perceived as having a risk to one's position or standing.
    • Avoidance of transparency -- fear of having the same question turned back on oneself, or of being asked to explain the basis of the question.
  • Deference to authority -- Feeling that one does not have the position, or background, for asking questions to be acceptable to others (imposter syndrome).
  • Egotism -- This takes two forms:
    • Desiring to present oneself as knowledgeable, and not needing to ask questions.
    • Desiring, and impatient, to speak oneself, so not wishing to preserve another person's centrality to the conversation by asking questions.
  • Apathy -- The topic is uninteresting.
  • Animus -- The speaker is uninteresting or worse.
  • Sex- or culturally-determined differences — women may ask fewer questions in some circumstances, particularly those that may be confrontational (Carter et al., 2018; Schmidt and Davenport, 2017, Tannen 1990).

These are potential explanations for Why don't people ask the questions that they have?, but something else to consider is Why don't people think of questions to ask?  Simple explanations are apathy (again) and the lack of critical thinking skills.  However, differing contexts of speaker and listener may also play a large role.


Conversational Implicature and Ambiguity


The reasons why any questions at all might need to be asked can be attributed to communication problem: either a lack of clarity on the speaker's part, or a lack of understanding on the questioner's part. The existence of such communication problems seems like it should be an aberration in natural language. Natural languages have evolved, however, to allow ambiguity, and there may even be some advantages to this ambiguity (Piantadosi et al. 2012). Languages can be constructed to minimize or eliminate ambiguity (cf. Lojban), but natural languages have not developed to be free of ambiguity.

 

Conversational implicature is the term used to describe the disjunction in communication when what is said is not what is meant. Conversational implicature underlies many common forms of communication, ranging from humor to political dog whistles. But it is not limited to such special cases; it is common in everyday conversation. A humdrum example from Paul Grice, who coined the term (reproduced from Benotti and Blackburn [2014]) is:


Man standing by his car: "I am out of petrol."

Passer-by: "There is a garage around the corner."


The effectiveness of conversational implicature in this case rests on common context and inferences, such as:

  • Both parties know that petrol is available at garages.
  • "Around the corner" means that the garage is within walking distance.
  • The passer-by knows that the garage is open and has petrol to sell.

Implicature simplifies conversation by eliminating the need to explicitly state those things that are part of the participants' shared context, or that can be reasonably inferred.


Much--even most--meaning in conversation does not reside in the words spoken at all, but is filled in by the person listening. -- D. Tannen (1990)


Although the result of conversational implicature can be ambiguity, ambiguity may have important benefits for the simplification of communication (Piantadosi et al. 2012).


A meta-context for the use of conversational implicature is that the conversation is governed by the Cooperative Principle: all parties to the conversation are making a good-faith effort to exchange information reliably (Benotti and Blackburn 2014; Bi 2019). Therefore, underlying all the other contextual and inferential information that is required for communication is the presumption that both parties are engaging in honest communication.


Lack of clarity, lack of understanding, and even misperception arise when the participants do not have a shared context, or when the same inference is not reasonable to, or is not made by, both parties.


Commonality of context and inferential ability is assumed by the speaker unless they are actually attempting to mislead. It is incumbent on the listener—the potential questioner—to identify when the speaker's context or inferences may be different than their own. If the listener does not have any relevant context, and cannot make any relevant inference, the problem should be immediately obvious to the listener. The listener's response in such cases should be either to ask a question or to submit to losing the thread of the conversation.


An even more challenging situation occurs when the listener's context is different from that of the speaker, or when the inference made by the listener is different from that of the speaker. This will lead to misunderstanding that may not be recognized by either the speaker or the listener. If the speaker understands their audience, they should refrain from using implicatures that can be misperceived in this way. However, the listener must be constantly alert for the possibilities of differing contexts and implications. This burden requires a mode of listening that goes beyond the simple acquisition of information, and that engages the ability to imagine the existence of alternate contexts and implications.



More Reasons Why People Don't Ask More Questions


Recognition of the consequences of conversational implicature allows the identification of several possible additional reasons why people don't ask questions—specifically, why they don't have questions to ask. These reasons are not commonly discussed:

  • The listener does not recognize the existence of contextual and inferential assumptions made by the speaker (whether or not their own assumptions would be the same or different).
  • The listener recognizes the contextual and inferential assumptions but either cannot, or does not make the effort to, imagine potential alternatives.

That is, sometimes people don't ask questions because they don't realize that they don't actually understand what was said.  This can result from a failure to see through language's fog of ambiguity, or to 'see' the wrong thing entirely.

 

These reasons are not necessarily independent of those listed in the first section above. Some may have had the experience of working with others who comprehensively question assumptions—or at least, voice those questions—in some social circumstances but not in others.


For example, although violation of the Cooperative Principle may be a deliberate and acceptable strategy to improve communication (Bi 2019), it may also seem to be transgressive in that it is uncooperative. In combination with other factors listed above, some individuals may feel that the transgressive cost outweighs the informational benefit.

 

On the other side of the coin are those who frequently ask questions. Some individuals may be less inclined to adhere to the Cooperative Principle, because of their nature or preferred (possibly strategic) mode of interaction. This tendency may also be situational. Alternatively, some individuals may not very sensitive to the norms of cooperative social conventions.


Summary


Conversational implicature and the ambiguity that it encompasses provide a basis for understanding why individuals do not ask questions in some circumstances.




References


Benotti, L., and P. Blackburn. 2014. Context and Implicature. Context in Computing, Springer New York. pp.419-436.


Bi, M. 2019. Analysis of the Conversational Implicature of Violating Cooperative Principles in Daily Discourse. American Journal of History and Culture, 2:13.


Carter, A. J., A. Croft, D. Lukas, and G. M. Sandstrom. 2018. Women's visibility in academic seminars: Women ask fewer questions than men. PLOS ONE, Public Library of Science (PLoS), 2018, 14, e0212146.


Piantadosi, S. T., H. Tily, and E. Gibson. 2012. The communicative function of ambiguity in language. Cognition 122:280-291.


Schmidt, S. J. and J. R. Davenport, J. R. 2017. Who asks questions at astronomy meetings?

Nature Astronomy, 1, 0153.


Tannen, D. 1990. You Just Don't Understand: Women and Men in Conversation. Harper-Collins, New York.



Wednesday, August 30, 2023

Calculating with Significant Digits in PostgreSQL

Data that are stored in databases may be measurement results that are inherently imprecise. The precision of a measurement is often represented by the number of significant digits associated with the measurement. Calculations performed with such measurements should preserve, or propagate, the significant digits so that the precision of the calculated result is also represented appropriately.


Numeric data types supported by database management systems (DBMSs) do not have any attribute that allows the precision of each numeric value to be represented. If the precision of a measured value is to be represented by significant digits, and stored with the value, then ordinarily a separate table column must be used for the number of significant digits for each measurement.


There are other approaches to representing precision that can be used. One is to store all measured values as text. This is not a wholly adequate approach because, unless scientific notation is used, when there are multiple trailing zeroes to the left of the decimal point, the precision cannot be represented accurately. Also, this is an unworkable approach if any calculations are to be performed: text values must be cast to numeric values before calculation, which results in loss of the purely visual representation of the number of significant digits, and consequently no way to propagate precision estimates throughout calculation.


Also, precision may be represented in ways other than using significant digits. Recording a standard deviation for each measurement, instead of the significant digits, is one alternate approach. This may be appropriate when calibration data, or counting statistics for radioactivity measurements, provide a statistical estimate of the precision of each value.


This post is focused solely on the case where precision is represented by the number of significant digits for each measured value.


Among commonly used DBMSs, PostgreSQL is uniquely suited for managing imprecise measurement data because of the ease of creating composite data types, and operators and aggregate functions for those data types. The use of composite data types, operators, and aggregate functions for measurement data are illustrated in the following sections.


 

A Composite Data Type for Measurement Data


Postgres allows users to extend the built-in data types either with user-defined types that are defined using C or some other low-level language, or with composite types that are defined using the CREATE TYPE statement. The CREATE TYPE statement provides all the functionality that is needed to implement a measurement type that combines both a floating-point number and an integer value for the significant digits. The definition of this type is simply:


create type meas_value as (
	value double precision,
	sig_figs integer);

 

Construction of, and access to, values of composite types like this are well described in the PostgreSQL documentation.


To perform calculations with values of this type, operators or aggregate types, or both, must be defined. In addition, to handle calculations with significant digits, other supporting functions are also needed.


 

Supporting Functions for Calculations with Significant Digits


Some types of calculations using measurement values require tracking of the least significant place of each measured value, where "place" refers to the one's place, 10's place, 100's place, etc.  For example, the least significant place (LSP) of a sum of measured values is equal to the highest LSP of any of the summands.

The LSP of a measured value can be found by subtracting the number of significant digits from the most significant place (MSP).  The MSP can be found by taking the base-10 logarithm of the value, and converting the result to an integer.

After a sum is calculated and its LSP found, the LSP must be converted back into the number of significant digits.  This operation is almost exactly the same as the previous one, subtracting the LSP from the MSP.  Functions named 'lsp()' and 'figs_from_lsp()' that perform these operations are:

-- Find the LSP for a measured value and sig. figs.
CREATE FUNCTION lsp(in value double precision, in ppf integer) RETURNS integer
language sql immutable leakproof strict parallel safe
RETURN floor(log(abs(value))) + 1 - ppf;

-- Calculate the sig. digits for a measured value and LSP
CREATE FUNCTION figs_from_lsp(in value double precision, in ppf integer) RETURNS integer
language plpgsql immutable leakproof strict parallel safe
as $$
DECLARE
	figs integer;
BEGIN
	figs := floor(log(abs(value))) + 1 - ppf;
	IF figs < 1 then
		figs := 1;
	END IF;
	RETURN figs;
END;
$$;

In Postgres, the 'log()' function is the base-10 logarithm.  The logarithm is converted to an integer by taking its floor and adding one.  It may seem that these two operations could be replaced by a single 'ceil()' (ceiling) function, but they cannot because the ceiling function produces integers for fractional powers of ten (0.1, 0.01, 0.001, etc.) that are inconsistent with the desired results.

The 'figs_from_lsp()' function performs essentially the same operation as the 'lsp()' function, but guarantees that at least one significant digit will be returned.  Operations such as subtraction can lead to a total loss of precision, but as a practical matter even such results are considered (in this implementation) to have at least one significant digit.  An alternative approach would be to raise an exception when there is a total loss of precision.

Although calculations should ordinarily be carried out using as many digits of precision as are available, rounding a measured value to the proper number of significant digits may be desirable before displaying or reporting the value.  The following function will round the value field of a 'meas_value' data type to the given number of significant digits:

CREATE FUNCTION mv_round(mv meas_value) returns meas_value
language plpgsql immutable leakproof strict parallel safe
AS $$
DECLARE
	f double precision;
BEGIN
	f := 10^lsp(mv.value, mv.sig_figs);
	return cast((round(mv.value / f) * f, mv.sig_figs) as meas_value);
END;
$$;

Note that this function does not do the same thing as Postgres' own 'round()' function.  Postgres' function rounds a double-precision number to the nearest integer, whereas 'mv_round()' rounds to the given number of significant digits.  However, the 'mv_round()' function uses Postgres' 'round()' function, and therefore follows its tie-breaking behavior.

Also note that because of the inherently imprecise nature of common implementations of floating-point numbers, rounding a measured value to a specific number of significant digits does not necessarily eliminate all 'noise' bits from the floating-point representation.  Comparison of two rounded measured values should be done with the same caution as should be used for floating-point numbers in general.  The equality operator described in the next section provides a robust means of comparing two data values of the 'meas_value' type.


Operators for the 'meas_value' Composite Type

Operators are binary and unary operators such as "+", "-", "*", and "/". A new operator for a custom type has two parts:

  • A function that carries out the operation, taking one or two arguments
  • A CREATE OPERATOR statement that specifies the operator symbol, the associated function, the data type(s) for the argument(s).  Additional attributes may be specified for some types of operators.

Addition

The infix operator of "+" (i.e., sum) for two measurement value ('meas_value') data types, is defined as follows:

CREATE FUNCTION mv_add_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value + mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR + (
	function = mv_add_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Difference

The operator for the difference of two measured values ("-") is similar to that for addition:

CREATE FUNCTION mv_sub_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value - mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	return rv;
END;
$$;

CREATE OPERATOR - (
	function = mv_sub_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Multiplication

The multiplication and division operators only need to compare the number of significant digits for the two multiplicands, without any computation of the LSP.  The result of multiplication and division is only as precise (has only the number of significant digits) as the least precise of the multiplicands.  The multiplication operator can be defined as follows:

CREATE FUNCTION mv_prod_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value * mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	return rv;
END;
$$;

CREATE OPERATOR * (
	function = mv_prod_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Division

The division operator is defined similarly to the multiplication operator:

CREATE FUNCTION mv_div_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value / mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR / (
	function = mv_div_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Equality

Operators for composite data types can return any data type.  An example of such an operator is the equality operator, which returns a boolean data type.  Equality between two imprecise measured values (as represented by the 'meas_value' data type) can be evaluated in several different ways:

  • The numeric parts are equal
  • The numeric parts are equal only up to the number of significant digits for each of them
  • The number of significant digits for each of them is also equal.


The following implementation of an equality operator considers two measured values to be equal if a) they are equal up to the number of significant digits for each, and b) they both have the same number of significant digits.  This implementation uses a double equal sign ("==") as an equality operator, rather than the single equal sign that is used for native data types.  A single equal sign might be used by an alternate equality operator, such as one that does not require both values to have the same number of significant digits.

CREATE FUNCTION mv_eq_mv(mv1 meas_value, mv2 meas_value)
RETURNS boolean
language plpgsql
as $$
DECLARE
	rv boolean;
BEGIN
	if mv1.sig_figs <> mv2.sig_figs then
		rv := False;
	else
		if round(mv1.value/(10^lsp(mv1.value, mv1.sig_figs))) = round(mv2.value/(10^lsp(mv2.value, mv2.sig_figs))) then
			rv := True;
		else
			rv := False;
		end if;
	end if;
	return rv;
end;
$$;

CREATE OPERATOR == (
	function = mv_eq_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

Note that this equality test scales both double-precision values to integers prior to comparing them.  This is similar to the 'mv_round()' function, but omits an unnecessary multiplication for the sake of better performance and robustness of the comparison.

Other Operators

Other operators can be defined in the same way.  Comprehensive support for calculations with measured values should include equivalents to the "+", "-", "*", and "/" binary operators that take one 'meas_val' data type and one (precise) floating point number, in either order.


Aggregate Functions for the 'meas_value' Data Type

Although binary operations like addition and multiplication are important, similar operations across multiple rows of a data table are often called for.  Native SQL aggregate functions such as 'sum()' and 'avg()' can be extended to use composite data types such as the 'meas_value' type.  The names of the native functions can be used with the 'meas_value' data type because Postgres allows function overloading (multiple dispatch).

The 'sum()' and 'avg()' aggregate functions for the 'meas_value' type require that the highest LSP be tracked across all of the rows that are operated on.  To minimize conversions back and forth between numbers of significant digits and LSP values, a second custom composite type is used to track ('accumulate') both the running sum and the highest LSP.  This type is defined as follows:

create type meas_value_accum as (
	value double precision,
	hlsp integer,
	nrows integer);

In this composite type, the 'value' field is used to store the running sum, the 'hlsp' field is used to record the highest LSP, and the 'nrows' field is used to record the number of rows operated on.

The definition of aggregate functions in PostgreSQL consists of several different parts, similar to the definition of operators.  For aggregate functions, these parts are:

  • A function that is run for every row
  • If needed, a function that is run after all the rows have been processed
  • A CREATE AGGREGATE statement that identifies the function(s) to be used, the data type to be used, and, if necessary, an initial condition for a value of that data type.

The 'sum()' Aggregate Function

The 'sum()' aggregate function for the 'meas_value' data type requires all three of these elements:

  • A function that is run for every row: 'mv_sum()' in the listing below
  • A function that is run after all rows have been processed: 'mv_sum_final()' in the listing below
  • A CREATE AGGREGATE statement.


The code for these is as follows:

CREATE FUNCTION mv_sum(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_sum_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN cast((mv_curr.value, sfigs) as meas_value);
END;
$$;


CREATE AGGREGATE sum(meas_value) (
	SFUNC = mv_sum,
	STYPE = meas_value_accum,
	FINALFUNC = mv_sum_final,
	INITCOND = '(0, -99999, 0)'
	);

 

The 'avg()' Aggregate Function

The definition of the 'avg()' aggregate function is almost identical to that of the 'sum()' aggregate function.  The only additional actions that are required are to track the number of rows processed in the accumulator, and to divide the sum by the number of rows in the final function.  The code for the 'avg()' aggregate function is as follows:

REATE FUNCTION mv_avg(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	mv_curr.nrows = mv_curr.nrows + 1;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_avg_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN (mv_curr.value / mv_curr.nrows, sfigs);
END;
$$;


CREATE AGGREGATE avg(meas_value) (
	SFUNC = mv_avg,
	STYPE = meas_value_accum,
	FINALFUNC = mv_avg_final
	, INITCOND = '(0, -99999, 0)'
	);

 

Summary

Measurement data are ordinarily imprecise, their precision is frequently represented by their number of significant digits, and calculations using measurements should propagate precision estimates to the final result.  This post provides core code for propagating significant digits through calculations in PostgreSQL.

PostgreSQL allows straightforward creation of composite data types, operators, and aggregate functions to handle measurement data with associated significant digits.  Overloading of these custom operators and aggregate functions on the native database features allow measurement data, including significant digits, to be operated on as easily as any native data type.

In combination with SQL's other features for data selection and summarization, the ability to easily perform calculations with imprecise measurement data enhances the value of PostgreSQL as a data analysis tool.