Wednesday, August 30, 2023

Calculating with Significant Digits in PostgreSQL

Data that are stored in databases may be measurement results that are inherently imprecise. The precision of a measurement is often represented by the number of significant digits associated with the measurement. Calculations performed with such measurements should preserve, or propagate, the significant digits so that the precision of the calculated result is also represented appropriately.


Numeric data types supported by database management systems (DBMSs) do not have any attribute that allows the precision of each numeric value to be represented. If the precision of a measured value is to be represented by significant digits, and stored with the value, then ordinarily a separate table column must be used for the number of significant digits for each measurement.


There are other approaches to representing precision that can be used. One is to store all measured values as text. This is not a wholly adequate approach because, unless scientific notation is used, when there are multiple trailing zeroes to the left of the decimal point, the precision cannot be represented accurately. Also, this is an unworkable approach if any calculations are to be performed: text values must be cast to numeric values before calculation, which results in loss of the purely visual representation of the number of significant digits, and consequently no way to propagate precision estimates throughout calculation.


Also, precision may be represented in ways other than using significant digits. Recording a standard deviation for each measurement, instead of the significant digits, is one alternate approach. This may be appropriate when calibration data, or counting statistics for radioactivity measurements, provide a statistical estimate of the precision of each value.


This post is focused solely on the case where precision is represented by the number of significant digits for each measured value.


Among commonly used DBMSs, PostgreSQL is uniquely suited for managing imprecise measurement data because of the ease of creating composite data types, and operators and aggregate functions for those data types. The use of composite data types, operators, and aggregate functions for measurement data are illustrated in the following sections.


 

A Composite Data Type for Measurement Data


Postgres allows users to extend the built-in data types either with user-defined types that are defined using C or some other low-level language, or with composite types that are defined using the CREATE TYPE statement. The CREATE TYPE statement provides all the functionality that is needed to implement a measurement type that combines both a floating-point number and an integer value for the significant digits. The definition of this type is simply:


create type meas_value as (
	value double precision,
	sig_figs integer);

 

Construction of, and access to, values of composite types like this are well described in the PostgreSQL documentation.


To perform calculations with values of this type, operators or aggregate types, or both, must be defined. In addition, to handle calculations with significant digits, other supporting functions are also needed.


 

Supporting Functions for Calculations with Significant Digits


Some types of calculations using measurement values require tracking of the least significant place of each measured value, where "place" refers to the one's place, 10's place, 100's place, etc.  For example, the least significant place (LSP) of a sum of measured values is equal to the highest LSP of any of the summands.

The LSP of a measured value can be found by subtracting the number of significant digits from the most significant place (MSP).  The MSP can be found by taking the base-10 logarithm of the value, and converting the result to an integer.

After a sum is calculated and its LSP found, the LSP must be converted back into the number of significant digits.  This operation is almost exactly the same as the previous one, subtracting the LSP from the MSP.  Functions named 'lsp()' and 'figs_from_lsp()' that perform these operations are:

-- Find the LSP for a measured value and sig. figs.
CREATE FUNCTION lsp(in value double precision, in ppf integer) RETURNS integer
language sql immutable leakproof strict parallel safe
RETURN floor(log(abs(value))) + 1 - ppf;

-- Calculate the sig. digits for a measured value and LSP
CREATE FUNCTION figs_from_lsp(in value double precision, in ppf integer) RETURNS integer
language plpgsql immutable leakproof strict parallel safe
as $$
DECLARE
	figs integer;
BEGIN
	figs := floor(log(abs(value))) + 1 - ppf;
	IF figs < 1 then
		figs := 1;
	END IF;
	RETURN figs;
END;
$$;

In Postgres, the 'log()' function is the base-10 logarithm.  The logarithm is converted to an integer by taking its floor and adding one.  It may seem that these two operations could be replaced by a single 'ceil()' (ceiling) function, but they cannot because the ceiling function produces integers for fractional powers of ten (0.1, 0.01, 0.001, etc.) that are inconsistent with the desired results.

The 'figs_from_lsp()' function performs essentially the same operation as the 'lsp()' function, but guarantees that at least one significant digit will be returned.  Operations such as subtraction can lead to a total loss of precision, but as a practical matter even such results are considered (in this implementation) to have at least one significant digit.  An alternative approach would be to raise an exception when there is a total loss of precision.

Although calculations should ordinarily be carried out using as many digits of precision as are available, rounding a measured value to the proper number of significant digits may be desirable before displaying or reporting the value.  The following function will round the value field of a 'meas_value' data type to the given number of significant digits:

CREATE FUNCTION mv_round(mv meas_value) returns meas_value
language plpgsql immutable leakproof strict parallel safe
AS $$
DECLARE
	f double precision;
BEGIN
	f := 10^lsp(mv.value, mv.sig_figs);
	return cast((round(mv.value / f) * f, mv.sig_figs) as meas_value);
END;
$$;

Note that this function does not do the same thing as Postgres' own 'round()' function.  Postgres' function rounds a double-precision number to the nearest integer, whereas 'mv_round()' rounds to the given number of significant digits.  However, the 'mv_round()' function uses Postgres' 'round()' function, and therefore follows its tie-breaking behavior.

Also note that because of the inherently imprecise nature of common implementations of floating-point numbers, rounding a measured value to a specific number of significant digits does not necessarily eliminate all 'noise' bits from the floating-point representation.  Comparison of two rounded measured values should be done with the same caution as should be used for floating-point numbers in general.  The equality operator described in the next section provides a robust means of comparing two data values of the 'meas_value' type.


Operators for the 'meas_value' Composite Type

Operators are binary and unary operators such as "+", "-", "*", and "/". A new operator for a custom type has two parts:

  • A function that carries out the operation, taking one or two arguments
  • A CREATE OPERATOR statement that specifies the operator symbol, the associated function, the data type(s) for the argument(s).  Additional attributes may be specified for some types of operators.

Addition

The infix operator of "+" (i.e., sum) for two measurement value ('meas_value') data types, is defined as follows:

CREATE FUNCTION mv_add_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value + mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR + (
	function = mv_add_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Difference

The operator for the difference of two measured values ("-") is similar to that for addition:

CREATE FUNCTION mv_sub_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value - mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	return rv;
END;
$$;

CREATE OPERATOR - (
	function = mv_sub_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Multiplication

The multiplication and division operators only need to compare the number of significant digits for the two multiplicands, without any computation of the LSP.  The result of multiplication and division is only as precise (has only the number of significant digits) as the least precise of the multiplicands.  The multiplication operator can be defined as follows:

CREATE FUNCTION mv_prod_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value * mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	return rv;
END;
$$;

CREATE OPERATOR * (
	function = mv_prod_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Division

The division operator is defined similarly to the multiplication operator:

CREATE FUNCTION mv_div_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value / mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR / (
	function = mv_div_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Equality

Operators for composite data types can return any data type.  An example of such an operator is the equality operator, which returns a boolean data type.  Equality between two imprecise measured values (as represented by the 'meas_value' data type) can be evaluated in several different ways:

  • The numeric parts are equal
  • The numeric parts are equal only up to the number of significant digits for each of them
  • The number of significant digits for each of them is also equal.


The following implementation of an equality operator considers two measured values to be equal if a) they are equal up to the number of significant digits for each, and b) they both have the same number of significant digits.  This implementation uses a double equal sign ("==") as an equality operator, rather than the single equal sign that is used for native data types.  A single equal sign might be used by an alternate equality operator, such as one that does not require both values to have the same number of significant digits.

CREATE FUNCTION mv_eq_mv(mv1 meas_value, mv2 meas_value)
RETURNS boolean
language plpgsql
as $$
DECLARE
	rv boolean;
BEGIN
	if mv1.sig_figs <> mv2.sig_figs then
		rv := False;
	else
		if round(mv1.value/(10^lsp(mv1.value, mv1.sig_figs))) = round(mv2.value/(10^lsp(mv2.value, mv2.sig_figs))) then
			rv := True;
		else
			rv := False;
		end if;
	end if;
	return rv;
end;
$$;

CREATE OPERATOR == (
	function = mv_eq_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

Note that this equality test scales both double-precision values to integers prior to comparing them.  This is similar to the 'mv_round()' function, but omits an unnecessary multiplication for the sake of better performance and robustness of the comparison.

Other Operators

Other operators can be defined in the same way.  Comprehensive support for calculations with measured values should include equivalents to the "+", "-", "*", and "/" binary operators that take one 'meas_val' data type and one (precise) floating point number, in either order.


Aggregate Functions for the 'meas_value' Data Type

Although binary operations like addition and multiplication are important, similar operations across multiple rows of a data table are often called for.  Native SQL aggregate functions such as 'sum()' and 'avg()' can be extended to use composite data types such as the 'meas_value' type.  The names of the native functions can be used with the 'meas_value' data type because Postgres allows function overloading (multiple dispatch).

The 'sum()' and 'avg()' aggregate functions for the 'meas_value' type require that the highest LSP be tracked across all of the rows that are operated on.  To minimize conversions back and forth between numbers of significant digits and LSP values, a second custom composite type is used to track ('accumulate') both the running sum and the highest LSP.  This type is defined as follows:

create type meas_value_accum as (
	value double precision,
	hlsp integer,
	nrows integer);

In this composite type, the 'value' field is used to store the running sum, the 'hlsp' field is used to record the highest LSP, and the 'nrows' field is used to record the number of rows operated on.

The definition of aggregate functions in PostgreSQL consists of several different parts, similar to the definition of operators.  For aggregate functions, these parts are:

  • A function that is run for every row
  • If needed, a function that is run after all the rows have been processed
  • A CREATE AGGREGATE statement that identifies the function(s) to be used, the data type to be used, and, if necessary, an initial condition for a value of that data type.

The 'sum()' Aggregate Function

The 'sum()' aggregate function for the 'meas_value' data type requires all three of these elements:

  • A function that is run for every row: 'mv_sum()' in the listing below
  • A function that is run after all rows have been processed: 'mv_sum_final()' in the listing below
  • A CREATE AGGREGATE statement.


The code for these is as follows:

CREATE FUNCTION mv_sum(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_sum_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN cast((mv_curr.value, sfigs) as meas_value);
END;
$$;


CREATE AGGREGATE sum(meas_value) (
	SFUNC = mv_sum,
	STYPE = meas_value_accum,
	FINALFUNC = mv_sum_final,
	INITCOND = '(0, -99999, 0)'
	);

 

The 'avg()' Aggregate Function

The definition of the 'avg()' aggregate function is almost identical to that of the 'sum()' aggregate function.  The only additional actions that are required are to track the number of rows processed in the accumulator, and to divide the sum by the number of rows in the final function.  The code for the 'avg()' aggregate function is as follows:

REATE FUNCTION mv_avg(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	mv_curr.nrows = mv_curr.nrows + 1;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_avg_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN (mv_curr.value / mv_curr.nrows, sfigs);
END;
$$;


CREATE AGGREGATE avg(meas_value) (
	SFUNC = mv_avg,
	STYPE = meas_value_accum,
	FINALFUNC = mv_avg_final
	, INITCOND = '(0, -99999, 0)'
	);

 

Summary

Measurement data are ordinarily imprecise, their precision is frequently represented by their number of significant digits, and calculations using measurements should propagate precision estimates to the final result.  This post provides core code for propagating significant digits through calculations in PostgreSQL.

PostgreSQL allows straightforward creation of composite data types, operators, and aggregate functions to handle measurement data with associated significant digits.  Overloading of these custom operators and aggregate functions on the native database features allow measurement data, including significant digits, to be operated on as easily as any native data type.

In combination with SQL's other features for data selection and summarization, the ability to easily perform calculations with imprecise measurement data enhances the value of PostgreSQL as a data analysis tool.

No comments:

Post a Comment