Sunday, September 24, 2023

Median and Quartiles by Group in SQLite

Data distributions are often characterized by their range, mean, median, and quartiles.  For data sets in SQLite, the range (min and max) and mean are easily calculated with built-in aggregate functions.  Calculation of quartiles (the first quartile, the median, and the third quartile) requires custom SQL code, as there are no built-in aggregate functions for these values.  This previous post contains code to calculate the median, but that technique cannot be extended to calculate the first and third quartiles.

There are other examples of quartile calculations in SQLite available, but (as of this writing) all of those that I have found yield a result that exactly matches one of the values in the data set.  Such results may be only approximate for some data sets, specifically when there is an even number of rows overall or within two adjacent quartiles.

The code below takes the same approach to calculating the first and third quartiles as is standard for the median:

  • If there is an odd number of rows, take the middle value
  • If there is an even number of rows, take the average of the highest value in the lower bin (i.e., the lower half for the median) and the lowest value in the upper bin.

The solution presented here differs from others in accommodating the second of these cases.  In addition, it allows for calculation of these statistics for subgroups of the overall data set.

This illustration is designed to be run on a table with the following columns:

  • "id": A categorical column used to divide the data set into subgroups.  Statistics will be calculated for each unique "id".
  • "value": A numerical column for which the median and quartiles will be calculated.

Such a table can be produced with the following code (the generate_series() table-valued function used here is provided by SQLite's series extension, which is included in the standard sqlite3 command-line shell but may not be available in all builds):

create table testdata as
select *
from
	(select value as id from generate_series(1,8)) as d
	cross join (select random() as value from generate_series(1, 10000));

This approach uses SQLite's 'ntile()' function to subdivide the data set (or each subgroup) into equal or nearly-equal subsets.  The sizes and bounding values for each of these subsets are then extracted to allow determination of which of the two cases listed above should be used for each quartile and the median.

with nt as (
	SELECT
		id, value,
		ntile(4) over (partition by id order by value) as ntile_4
	FROM
		testdata
	),
idlist as (select distinct id from nt),
q1 as (
	select id, max(value) as q1_max, count(*) as q1_n
	from nt where ntile_4 = 1 group by id),
q2 as (
	select id, min(value) as q2_min, max(value) as q2_max, count(*) as q2_n
	from nt where ntile_4 = 2 group by id),
q3 as (
	select id, min(value) as q3_min, max(value) as q3_max, count(*) as q3_n
	from nt where ntile_4 = 3 group by id),
q4 as (
	select id, min(value) as q4_min, count(*) as q4_n
	from nt where ntile_4 = 4 group by id),
s as (
	SELECT
		st.id,
		q1_max, q1_n,
		q2_min, q2_max, q2_n,
		q3_min , q3_max , q3_n,
		q4_min, q4_n
	from
		idlist as st
		left join q1 on q1.id=st.id
		left join q2 on q2.id=st.id
		left join q3 on q3.id=st.id
		left join q4 on q4.id=st.id)
select
	id,
	case when q1_n = q2_n then (q1_max + q2_min)/2.0
		else case when q1_n > q2_n then q1_max else q2_min end end as q1,
	case when q1_n + q2_n = q3_n + q4_n then (q2_max + q3_min)/2.0
		else case when q1_n + q2_n > q3_n + q4_n then q2_max else q3_min end end as median,
	case when q3_n = q4_n then (q3_max + q4_min)/2.0
		else case when q3_n > q4_n then q3_max else q4_min end end as q3
from s
order by id
;
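
For a quick hand check of the even-count case, the query above can be run against a small, single-group table in place of the randomly generated "testdata" table.  With the eight values below, the expected results are q1 = 2.5, median = 4.5, and q3 = 6.5:

drop table if exists testdata;
create table testdata (id integer, value real);
insert into testdata (id, value) values
	(1, 1.0), (1, 2.0), (1, 3.0), (1, 4.0),
	(1, 5.0), (1, 6.0), (1, 7.0), (1, 8.0);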

Monday, September 4, 2023

Why Don't People Ask More Questions?

More than what?  said a person who asks a lot of questions.

This is not a new question. There are numerous summaries of reasons why people don't ask questions, and suggestions for how one can become a better question-asker. A lot of the existing discussion is focused on general conversation; the focus here is specifically on technical conversations in professional settings, where the goal is to convey information from one party to another (that is, report-talk rather than rapport-talk; Tannen 1990).

This post summarizes commonly suggested reasons why people don't ask more questions, touches briefly on some philosophy of communication, and suggests an additional (partial) answer that doesn't appear to be discussed elsewhere.


Common Reasons that People Don't Ask Questions


Commonly-cited reasons for a person's reluctance to ask questions are:

  • Risk aversion -- This may take several forms:
    • Unwillingness to ask what may be a stupid question, and suffer embarrassment or reputational harm.
    • Unwillingness to seem pushy (this overlaps somewhat with deference to authority, below).
    • Fear of consequences -- asking questions that challenge the status quo might be perceived as having a risk to one's position or standing.
    • Avoidance of transparency -- fear of having the same question turned back on oneself, or of being asked to explain the basis of the question.
  • Deference to authority -- Feeling that one does not have the position or background for one's questions to be acceptable to others (imposter syndrome).
  • Egotism -- This takes two forms:
    • Desiring to present oneself as knowledgeable, and not needing to ask questions.
    • Desiring, and being impatient, to speak oneself, and so not wishing to preserve another person's centrality to the conversation by asking questions.
  • Apathy -- The topic is uninteresting.
  • Animus -- The speaker is uninteresting or worse.
  • Sex- or culturally-determined differences — women may ask fewer questions in some circumstances, particularly those that may be confrontational (Carter et al. 2018; Schmidt and Davenport 2017; Tannen 1990).

These are potential explanations for Why don't people ask the questions that they have?, but something else to consider is Why don't people think of questions to ask?  Simple explanations are apathy (again) and the lack of critical thinking skills.  However, differing contexts of speaker and listener may also play a large role.


Conversational Implicature and Ambiguity


The reasons why any questions at all might need to be asked can be attributed to a communication problem: either a lack of clarity on the speaker's part, or a lack of understanding on the questioner's part. The existence of such communication problems seems like it should be an aberration in natural language. Natural languages have evolved, however, to allow ambiguity, and there may even be some advantages to this ambiguity (Piantadosi et al. 2012). Languages can be constructed to minimize or eliminate ambiguity (cf. Lojban), but natural languages have not developed to be free of ambiguity.

 

Conversational implicature is the term used to describe the disjunction in communication when what is said is not what is meant. Conversational implicature underlies many common forms of communication, ranging from humor to political dog whistles. But it is not limited to such special cases; it is common in everyday conversation. A humdrum example from Paul Grice, who coined the term (reproduced from Benotti and Blackburn [2014]) is:


Man standing by his car: "I am out of petrol."

Passer-by: "There is a garage around the corner."


The effectiveness of conversational implicature in this case rests on common context and inferences, such as:

  • Both parties know that petrol is available at garages.
  • "Around the corner" means that the garage is within walking distance.
  • The passer-by knows that the garage is open and has petrol to sell.

Implicature simplifies conversation by eliminating the need to explicitly state those things that are part of the participants' shared context, or that can be reasonably inferred.


Much--even most--meaning in conversation does not reside in the words spoken at all, but is filled in by the person listening. -- D. Tannen (1990)


Although the result of conversational implicature can be ambiguity, ambiguity may have important benefits for the simplification of communication (Piantadosi et al. 2012).


A meta-context for the use of conversational implicature is that the conversation is governed by the Cooperative Principle: all parties to the conversation are making a good-faith effort to exchange information reliably (Benotti and Blackburn 2014; Bi 2019). Therefore, underlying all the other contextual and inferential information that is required for communication is the presumption that both parties are engaging in honest communication.


Lack of clarity, lack of understanding, and even misperception arise when the participants do not have a shared context, or when the same inference is not reasonable to, or is not made by, both parties.


Commonality of context and inferential ability is assumed by the speaker unless they are actually attempting to mislead. It is incumbent on the listener—the potential questioner—to identify when the speaker's context or inferences may be different than their own. If the listener does not have any relevant context, and cannot make any relevant inference, the problem should be immediately obvious to the listener. The listener's response in such cases should be either to ask a question or to submit to losing the thread of the conversation.


An even more challenging situation occurs when the listener's context is different from that of the speaker, or when the inference made by the listener is different from that of the speaker. This will lead to misunderstanding that may not be recognized by either the speaker or the listener. If the speaker understands their audience, they should refrain from using implicatures that can be misperceived in this way. However, the listener must be constantly alert for the possibilities of differing contexts and implications. This burden requires a mode of listening that goes beyond the simple acquisition of information, and that engages the ability to imagine the existence of alternate contexts and implications.



More Reasons Why People Don't Ask More Questions


Recognition of the consequences of conversational implicature allows the identification of several possible additional reasons why people don't ask questions—specifically, why they don't have questions to ask. These reasons are not commonly discussed:

  • The listener does not recognize the existence of contextual and inferential assumptions made by the speaker (whether or not their own assumptions would be the same or different).
  • The listener recognizes the contextual and inferential assumptions but either cannot, or does not make the effort to, imagine potential alternatives.

That is, sometimes people don't ask questions because they don't realize that they don't actually understand what was said.  This can result from a failure to see through language's fog of ambiguity, or from 'seeing' the wrong thing entirely.

 

These reasons are not necessarily independent of those listed in the first section above. Some may have had the experience of working with others who comprehensively question assumptions—or at least, voice those questions—in some social circumstances but not in others.


For example, although violation of the Cooperative Principle may be a deliberate and acceptable strategy to improve communication (Bi 2019), it may also seem to be transgressive in that it is uncooperative. In combination with other factors listed above, some individuals may feel that the transgressive cost outweighs the informational benefit.

 

On the other side of the coin are those who frequently ask questions. Some individuals may be less inclined to adhere to the Cooperative Principle, because of their nature or preferred (possibly strategic) mode of interaction. This tendency may also be situational. Alternatively, some individuals may not be very sensitive to the norms of cooperative social conventions.


Summary


Conversational implicature and the ambiguity that it encompasses provide a basis for understanding why individuals do not ask questions in some circumstances.




References


Benotti, L., and P. Blackburn. 2014. Context and Implicature. Context in Computing, Springer New York. pp.419-436.


Bi, M. 2019. Analysis of the Conversational Implicature of Violating Cooperative Principles in Daily Discourse. American Journal of History and Culture, 2:13.


Carter, A. J., A. Croft, D. Lukas, and G. M. Sandstrom. 2018. Women's visibility in academic seminars: Women ask fewer questions than men. PLOS ONE 14:e0212146.


Piantadosi, S. T., H. Tily, and E. Gibson. 2012. The communicative function of ambiguity in language. Cognition 122:280-291.


Schmidt, S. J., and J. R. Davenport. 2017. Who asks questions at astronomy meetings? Nature Astronomy 1:0153.


Tannen, D. 1990. You Just Don't Understand: Women and Men in Conversation. Harper-Collins, New York.



Wednesday, August 30, 2023

Calculating with Significant Digits in PostgreSQL

Data that are stored in databases may be measurement results that are inherently imprecise. The precision of a measurement is often represented by the number of significant digits associated with the measurement. Calculations performed with such measurements should preserve, or propagate, the significant digits so that the precision of the calculated result is also represented appropriately.


Numeric data types supported by database management systems (DBMSs) do not have any attribute that allows the precision of each numeric value to be represented. If the precision of a measured value is to be represented by significant digits, and stored with the value, then ordinarily a separate table column must be used for the number of significant digits for each measurement.


There are other approaches to representing precision that can be used. One is to store all measured values as text. This is not a wholly adequate approach because, unless scientific notation is used, when there are multiple trailing zeroes to the left of the decimal point, the precision cannot be represented accurately. Also, this is an unworkable approach if any calculations are to be performed: text values must be cast to numeric values before calculation, which results in loss of the purely visual representation of the number of significant digits, and consequently no way to propagate precision estimates throughout calculation.


Also, precision may be represented in ways other than using significant digits. Recording a standard deviation for each measurement, instead of the significant digits, is one alternate approach. This may be appropriate when calibration data, or counting statistics for radioactivity measurements, provide a statistical estimate of the precision of each value.


This post is focused solely on the case where precision is represented by the number of significant digits for each measured value.


Among commonly used DBMSs, PostgreSQL is uniquely suited for managing imprecise measurement data because of the ease of creating composite data types, and operators and aggregate functions for those data types. The use of composite data types, operators, and aggregate functions for measurement data is illustrated in the following sections.


 

A Composite Data Type for Measurement Data


Postgres allows users to extend the built-in data types either with user-defined types that are defined using C or some other low-level language, or with composite types that are defined using the CREATE TYPE statement. The CREATE TYPE statement provides all the functionality that is needed to implement a measurement type that combines both a floating-point number and an integer value for the significant digits. The definition of this type is simply:


create type meas_value as (
	value double precision,
	sig_figs integer);

 

Construction of, and access to, values of composite types like this are well described in the PostgreSQL documentation.
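
For example, a value of this type can be built with a ROW constructor and a cast, and its fields can be read back with the parenthesized field-selection syntax:

-- Construct a measured value of 12.34 carrying 3 significant digits,
-- then read back its two fields.
select
	(row(12.34, 3)::meas_value).value as value,        -- 12.34
	(row(12.34, 3)::meas_value).sig_figs as sig_figs;  -- 3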


To perform calculations with values of this type, operators or aggregate functions, or both, must be defined. In addition, to handle calculations with significant digits, other supporting functions are also needed.


 

Supporting Functions for Calculations with Significant Digits


Some types of calculations using measurement values require tracking of the least significant place of each measured value, where "place" refers to the one's place, 10's place, 100's place, etc.  For example, the least significant place (LSP) of a sum of measured values is equal to the highest LSP of any of the summands.

The LSP of a measured value can be found by subtracting the number of significant digits from the most significant place (MSP).  The MSP can be found by taking the base-10 logarithm of the value, and converting the result to an integer.

After a sum is calculated and its LSP found, the LSP must be converted back into the number of significant digits.  This operation is almost exactly the same as the previous one, subtracting the LSP from the MSP.  Functions named 'lsp()' and 'figs_from_lsp()' that perform these operations are:

-- Find the LSP for a measured value and sig. figs.
CREATE FUNCTION lsp(in value double precision, in ppf integer) RETURNS integer
language sql immutable leakproof strict parallel safe
RETURN floor(log(abs(value))) + 1 - ppf;

-- Calculate the sig. digits for a measured value and LSP
CREATE FUNCTION figs_from_lsp(in value double precision, in ppf integer) RETURNS integer
language plpgsql immutable leakproof strict parallel safe
as $$
DECLARE
	figs integer;
BEGIN
	figs := floor(log(abs(value))) + 1 - ppf;
	IF figs < 1 then
		figs := 1;
	END IF;
	RETURN figs;
END;
$$;

In Postgres, the 'log()' function is the base-10 logarithm.  The logarithm is converted to an integer by taking its floor and adding one.  It may seem that these two operations could be replaced by a single 'ceil()' (ceiling) function, but they cannot, because the ceiling function produces results for exact powers of ten (1, 10, 0.1, 0.01, etc.) that are inconsistent with the desired results.

The 'figs_from_lsp()' function performs essentially the same operation as the 'lsp()' function, but guarantees that at least one significant digit will be returned.  Operations such as subtraction can lead to a total loss of precision, but as a practical matter even such results are considered (in this implementation) to have at least one significant digit.  An alternative approach would be to raise an exception when there is a total loss of precision.
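
A few worked examples, using the two functions as defined above, may make the place arithmetic clearer:

-- 123.45 with 4 significant digits: MSP = floor(log(123.45)) + 1 = 3,
-- so the LSP is 3 - 4 = -1 (the tenths place).
select lsp(123.45, 4);             -- returns -1

-- 0.052 with 2 significant digits: MSP = -1, so the LSP is -1 - 2 = -3
-- (the thousandths place).
select lsp(0.052, 2);              -- returns -3

-- Going the other way: a value of 123.45 whose LSP is the tenths place
-- (-1) has 3 - (-1) = 4 significant digits.
select figs_from_lsp(123.45, -1);  -- returns 4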

Although calculations should ordinarily be carried out using as many digits of precision as are available, rounding a measured value to the proper number of significant digits may be desirable before displaying or reporting the value.  The following function will round the value field of a 'meas_value' data type to the given number of significant digits:

CREATE FUNCTION mv_round(mv meas_value) returns meas_value
language plpgsql immutable leakproof strict parallel safe
AS $$
DECLARE
	f double precision;
BEGIN
	f := 10^lsp(mv.value, mv.sig_figs);
	return cast((round(mv.value / f) * f, mv.sig_figs) as meas_value);
END;
$$;

Note that this function does not do the same thing as Postgres' own 'round()' function.  Postgres' function rounds a double-precision number to the nearest integer, whereas 'mv_round()' rounds to the given number of significant digits.  However, the 'mv_round()' function uses Postgres' 'round()' function, and therefore follows its tie-breaking behavior.

Also note that because of the inherently imprecise nature of common implementations of floating-point numbers, rounding a measured value to a specific number of significant digits does not necessarily eliminate all 'noise' bits from the floating-point representation.  Comparison of two rounded measured values should be done with the same caution as should be used for floating-point numbers in general.  The equality operator described in the next section provides a robust means of comparing two data values of the 'meas_value' type.
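
For example, using the functions defined above:

-- 123.456 held with 4 significant digits rounds to 123.5 (subject to
-- the floating-point caveat just described).
select mv_round(row(123.456, 4)::meas_value);  -- (123.5,4)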


Operators for the 'meas_value' Composite Type

Operators are binary or unary operators such as "+", "-", "*", and "/". A new operator for a custom type has two parts:

  • A function that carries out the operation, taking one or two arguments
  • A CREATE OPERATOR statement that specifies the operator symbol, the associated function, and the data type(s) for the argument(s).  Additional attributes may be specified for some types of operators.

Addition

The infix "+" (addition) operator for two measurement values (the 'meas_value' data type) is defined as follows:

CREATE FUNCTION mv_add_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value + mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR + (
	function = mv_add_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);
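
A brief example of the new operator (the value field keeps the full floating-point sum; 'mv_round()' can be applied when the result is reported):

-- 12.1 (3 sig. digits) + 0.38 (2 sig. digits): the coarser LSP is the
-- tenths place, so the sum carries 3 significant digits.
select row(12.1, 3)::meas_value + row(0.38, 2)::meas_value;            -- (12.48,3)
select mv_round(row(12.1, 3)::meas_value + row(0.38, 2)::meas_value);  -- (12.5,3)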

 

Difference

The operator for the difference of two measured values ("-") is similar to that for addition:

CREATE FUNCTION mv_sub_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value - mv2.value;
	pl1 = lsp(mv1.value, mv1.sig_figs);
	pl2 = lsp(mv2.value, mv2.sig_figs);
	if pl1 > pl2 then
		rv.sig_figs = figs_from_lsp(rv.value, pl1);
	else
		rv.sig_figs = figs_from_lsp(rv.value, pl2);
	end if;
	return rv;
END;
$$;

CREATE OPERATOR - (
	function = mv_sub_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Multiplication

The multiplication and division operators only need to compare the numbers of significant digits of the two operands, without any computation of the LSP.  The result of a multiplication or division is only as precise (has only as many significant digits) as the least precise of the operands.  The multiplication operator can be defined as follows:

CREATE FUNCTION mv_prod_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value * mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	return rv;
END;
$$;

CREATE OPERATOR * (
	function = mv_prod_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

 

Division

The division operator is defined similarly to the multiplication operator:

CREATE FUNCTION mv_div_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	rv meas_value;
	pl1 integer;
	pl2 integer;
BEGIN
	rv.value = mv1.value / mv2.value;
	if mv1.sig_figs > mv2.sig_figs then
		rv.sig_figs = mv2.sig_figs;
	else
		rv.sig_figs = mv1.sig_figs;
	end if;
	RETURN rv;
END;
$$;

CREATE OPERATOR / (
	function = mv_div_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);
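
Brief examples of both operators:

-- The product and the quotient carry the smaller number of significant
-- digits of the two operands.
select row(2.5, 2)::meas_value * row(3.14159, 6)::meas_value;
	-- value 7.853975 with 2 significant digits; mv_round() reports (7.9,2)
select row(10.0, 3)::meas_value / row(3.0, 2)::meas_value;
	-- value 10/3 with 2 significant digits; mv_round() reports (3.3,2)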

 

Equality

Operators for composite data types can return any data type.  An example of such an operator is the equality operator, which returns a boolean data type.  Equality between two imprecise measured values (as represented by the 'meas_value' data type) can be evaluated in several different ways:

  • The numeric parts are exactly equal
  • The numeric parts are equal up to the number of significant digits for each of them
  • The numeric parts are equal up to the number of significant digits, and the numbers of significant digits are also equal.


The following implementation of an equality operator considers two measured values to be equal if a) they are equal up to the number of significant digits for each, and b) they both have the same number of significant digits.  This implementation uses a double equal sign ("==") as an equality operator, rather than the single equal sign that is used for native data types.  A single equal sign might be used by an alternate equality operator, such as one that does not require both values to have the same number of significant digits.

CREATE FUNCTION mv_eq_mv(mv1 meas_value, mv2 meas_value)
RETURNS boolean
language plpgsql
as $$
DECLARE
	rv boolean;
BEGIN
	if mv1.sig_figs <> mv2.sig_figs then
		rv := False;
	else
		if round(mv1.value/(10^lsp(mv1.value, mv1.sig_figs))) = round(mv2.value/(10^lsp(mv2.value, mv2.sig_figs))) then
			rv := True;
		else
			rv := False;
		end if;
	end if;
	return rv;
end;
$$;

CREATE OPERATOR == (
	function = mv_eq_mv,
	leftarg = meas_value,
	rightarg = meas_value
	);

Note that this equality test scales both double-precision values to integers prior to comparing them.  This is similar to the 'mv_round()' function, but omits an unnecessary multiplication for the sake of better performance and robustness of the comparison.
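
For example:

-- Equal: both values round to 12.5 at 3 significant digits.
select row(12.48, 3)::meas_value == row(12.52, 3)::meas_value;  -- true
-- Not equal: same numeric part, but different significant digits.
select row(12.48, 3)::meas_value == row(12.48, 4)::meas_value;  -- false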

Other Operators

Other operators can be defined in the same way.  Comprehensive support for calculations with measured values should include equivalents to the "+", "-", "*", and "/" binary operators that take one 'meas_value' argument and one (precise) floating-point number, in either order.
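
As an illustration, a minimal sketch of one such mixed-type operator is shown below (the function name 'mv_add_float' is arbitrary).  It treats the floating-point operand as exact, so the least significant place of the result is determined by the measured value alone:

CREATE FUNCTION mv_add_float(mv1 meas_value, x double precision)
RETURNS meas_value
language plpgsql
as $$
DECLARE
	rv meas_value;
BEGIN
	-- The exact operand does not limit precision; keep the LSP of the
	-- measured value.
	rv.value = mv1.value + x;
	rv.sig_figs = figs_from_lsp(rv.value, lsp(mv1.value, mv1.sig_figs));
	RETURN rv;
END;
$$;

CREATE OPERATOR + (
	function = mv_add_float,
	leftarg = meas_value,
	rightarg = double precision
	);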


Aggregate Functions for the 'meas_value' Data Type

Although binary operations like addition and multiplication are important, similar operations across multiple rows of a data table are often called for.  Native SQL aggregate functions such as 'sum()' and 'avg()' can be extended to use composite data types such as the 'meas_value' type.  The names of the native functions can be used with the 'meas_value' data type because Postgres allows function overloading.

The 'sum()' and 'avg()' aggregate functions for the 'meas_value' type require that the highest LSP be tracked across all of the rows that are operated on.  To minimize conversions back and forth between numbers of significant digits and LSP values, a second custom composite type is used to track ('accumulate') both the running sum and the highest LSP.  This type is defined as follows:

create type meas_value_accum as (
	value double precision,
	hlsp integer,
	nrows integer);

In this composite type, the 'value' field is used to store the running sum, the 'hlsp' field is used to record the highest LSP, and the 'nrows' field is used to record the number of rows operated on.

The definition of aggregate functions in PostgreSQL consists of several different parts, similar to the definition of operators.  For aggregate functions, these parts are:

  • A function that is run for every row
  • If needed, a function that is run after all the rows have been processed
  • A CREATE AGGREGATE statement that identifies the function(s) to be used, the data type to be used, and, if necessary, an initial condition for a value of that data type.

The 'sum()' Aggregate Function

The 'sum()' aggregate function for the 'meas_value' data type requires all three of these elements:

  • A function that is run for every row: 'mv_sum()' in the listing below
  • A function that is run after all rows have been processed: 'mv_sum_final()' in the listing below
  • A CREATE AGGREGATE statement.


The code for these is as follows:

CREATE FUNCTION mv_sum(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_sum_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN cast((mv_curr.value, sfigs) as meas_value);
END;
$$;


CREATE AGGREGATE sum(meas_value) (
	SFUNC = mv_sum,
	STYPE = meas_value_accum,
	FINALFUNC = mv_sum_final,
	INITCOND = '(0, -99999, 0)'
	);
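
With the aggregate in place, a column of measured values can be summed like any native column.  The table and column names below are only illustrative:

create table sample_results (
	sample_id text,
	result meas_value);

insert into sample_results values
	('S1', row(12.1, 3)::meas_value),
	('S1', row(0.38, 2)::meas_value),
	('S1', row(5.0, 2)::meas_value);

select sample_id, mv_round(sum(result)) as total
from sample_results
group by sample_id;
	-- S1: (17.5,3) -- the coarsest LSP among the summands is the tenths place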

 

The 'avg()' Aggregate Function

The definition of the 'avg()' aggregate function is almost identical to that of the 'sum()' aggregate function.  The only additional actions that are required are to track the number of rows processed in the accumulator, and to divide the sum by the number of rows in the final function.  The code for the 'avg()' aggregate function is as follows:

CREATE FUNCTION mv_avg(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
	lsp integer;
BEGIN
	if mv_next.value is not null then
		mv_curr.value = mv_curr.value + mv_next.value;
		if mv_next.sig_figs is not null then
			lsp = lsp(mv_next.value, mv_next.sig_figs);
			if mv_curr.hlsp is null then
				mv_curr.hlsp = lsp;
			elsif lsp > mv_curr.hlsp then
				mv_curr.hlsp = lsp;
			end if;
		end if;
	end if;
	mv_curr.nrows = mv_curr.nrows + 1;
	RETURN mv_curr;
END;
$$;


CREATE FUNCTION mv_avg_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
	sfigs integer;
BEGIN
	sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
	RETURN (mv_curr.value / mv_curr.nrows, sfigs);
END;
$$;


CREATE AGGREGATE avg(meas_value) (
	SFUNC = mv_avg,
	STYPE = meas_value_accum,
	FINALFUNC = mv_avg_final
	, INITCOND = '(0, -99999, 0)'
	);
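
Using the same illustrative 'sample_results' table as in the sum() example above:

select sample_id, mv_round(avg(result)) as mean_result
from sample_results
group by sample_id;
	-- S1: (5.83,3) -- the significant digits are computed from the
	-- accumulated sum (17.48), as in mv_sum_final()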

 

Summary

Measurement data are ordinarily imprecise, their precision is frequently represented by their number of significant digits, and calculations using measurements should propagate precision estimates to the final result.  This post provides core code for propagating significant digits through calculations in PostgreSQL.

PostgreSQL allows straightforward creation of composite data types, operators, and aggregate functions to handle measurement data with associated significant digits.  Overloading these custom operators and aggregate functions on the corresponding native operators and function names allows measurement data, including significant digits, to be operated on as easily as any native data type.

In combination with SQL's other features for data selection and summarization, the ability to easily perform calculations with imprecise measurement data enhances the value of PostgreSQL as a data analysis tool.

Saturday, October 2, 2021

Are We in the Twilight Years of Programming Language Development?

The following figure shows the number of new programming languages created per year since 1970.  The data are from the Wikipedia page Timeline of Programming Languages.


There's a consistent decline since the mid-'90s.  Have most of the best programming languages already been developed, so we don't need any more?



Saturday, August 28, 2021

Data Transformation Operations and a Standard Terminology

Data standardization, normalization, rescaling, and transformation: what they are and when to apply them

There are several different types of operations that are commonly applied to data before further analysis.  These operations are applied to individual data values, but are typically based on the attributes of a group of data values.  In a data matrix where columns are variables and rows are instances, the group that is used may be either all of the values in a row or all of the values in a column.  Which type of group to use, as well as which type of operation to apply, depends on both the data themselves and the needs of the analysis to be conducted.

Different types of operations that are commonly applied to data are standardization, rescaling, normalization, and non-linear transformation.  These terms are often used ambiguously and inconsistently.  Misunderstanding what others have done, or misrepresenting what you have done, is easy when these operations are not clearly defined and distinguished.

The various operations that may be applied to data are listed below, followed by a short SQL illustration:

  • Standardization: Dividing every value by another value or by a constant.  The denominator is the standard to which each individual value is referenced.
  • Rescaling or range scaling: Converting the data from the original range of the group to a fixed range such as -1 to 1 or 0 to 1.
  • Mean centering: Subtracting the mean of the group from each value.
  • Normalization or z-transformation: Subtracting the mean of the group from each value and then dividing each value by the standard deviation of the group.
  • Non-linear transformation: Any non-linear operation such as taking the logarithm, square root, or probit of each data value.
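
As a minimal illustration, the first four of these operations can be computed column-wise in PostgreSQL with window functions (the table and column names here are only illustrative):

select
	sample_id,
	conc,
	conc / sum(conc) over ()                                as standardized_to_total,
	(conc - min(conc) over ())
		/ (max(conc) over () - min(conc) over ())           as rescaled_0_to_1,
	conc - avg(conc) over ()                                as mean_centered,
	(conc - avg(conc) over ()) / stddev_samp(conc) over ()  as z_score
from measurements;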

Contributing to the ambiguity of these terms is that there is no single term that encompasses this group of operations as a whole.  The word “transformation” is most commonly used for this purpose, although such usage potentially conflicts with use of the same term for different linear and non-linear operations upon the data.  The word “normalization” is also often used as a catch-all for various types of operations on data.  Because normalization (as defined above) results in the calculation of standard (Z) scores, normalization is often referred to as simply standardization.  The word “normalization” is also frequently used when the actual operation referenced is either standardization or rescaling.  Non-linear transformations may also be referred to as rescaling because they change the data from a linear to a non-linear scale.

Because of these ambiguities:

  •  Be specific when describing the operations that you have applied to the data.  Use the terms above as the basis for your descriptions.
  •  Be careful when reading others’ descriptions of the operations that they have applied.  If a description is not specific, assume that the author is not sensitive to ambiguities in terminology, and may be using different definitions than those above.

Summary of Data Relationships Under Different Data Transformation Operations

The following table summarizes how data distributions are affected by the different operations.

Operation                           Linearity   Proportionality   Additivity   Sensitivity to statistical
                                    preserved   preserved         preserved    outliers (c)
----------------------------------  ----------  ----------------  -----------  --------------------------
Standardization                     Yes (a)     Yes (a,b)         No           Moderate
Rescaling or range scaling          Yes         No                No           Extreme
Mean centering                      Yes         No                No           Moderate
Normalization or z-transformation   Yes         No                No           Moderate
Non-linear transformation           No          No                No           Operation-dependent


Notes

a) Only within each group for which the same divisor has been applied to all values.

b) Standardization cannot create proportional relationships; the assumption that it does so may be hidden when it is used.

c) None of the operations eliminate outliers.  Range scaling is very sensitive to outliers because an outlier will determine one end of the range.  Some non-linear transformations, such as the log transformation, may reduce the relative weight of outliers.

Guidelines for Application

Some very general guidelines for application of these operations are:

  • If a statistical analysis is to be conducted, non-linear transformation may be needed to ensure that the data satisfy requirements of the method.
  • Normalization (z-transformation) may be appropriate prior to regression analyses when the purpose is to interpret the regression coefficients as measures of the relative importance of different variables.  Regression is insensitive to operations that preserve linear relationships, so standardization or rescaling is not ordinarily needed.
  • If an analysis based on similarity is to be conducted (e.g., clustering), then non-linear transformation should be avoided.  Most similarity measures are based on linear relationships (e.g., Euclidean distance), so non-linear transforms will alter the estimates of similarities between instances.
  • If an analysis based on Pearson correlations is to be conducted, then non-linear transformations should be used with caution, because the results of the analysis will not apply to data in the original scale.
  • When each variable represents a component of a larger quantity (e.g., variables are PCB congeners and total PCBs is the larger quantity), then standardization of each value to the sum is frequently appropriate, to eliminate the effect of differences between instances.  This operation is commonly referred to as "row-sum normalization," although it is not strictly normalization following the definitions above.

Application of more than one of the data transformation operations may be appropriate for some data sets and some analyses.

 

Sunday, April 18, 2021

Data Management Best Practices

There are a lot of web pages that present some interpretation of data management best practices.  Most of them are, in my opinion, uselessly general.  Herewith, then, is my list of specific recommended best practices.

 

1. Know the use of the data
    a. For the overall project
    b. For each data summary request

2. Use a single authoritative data store.

3. Back up important data.

4. Verify or characterize data quality.

5. Control access to sensitive data.

6. Track data changes.

7. Document data management practices.
    a. Default practices
    b. Project- or data-set-specific

8. Preserve original data.

9. Script all data revisions and data summarizations.

10. Use version control for things that may change.
    a. Scripts
    b. Regular or periodic data exports

11. Record metadata
    a. For both incoming and outgoing data
    b. Metadata includes
        1. Provenance: Who created, provided, or produced the data.
        2. Content: What the data set contains
        3. Purpose: What the data are intended to be used for.
        4. Method: How the data were generated or selected and summarized.
        5. History: The history of any revisions made to the data or the data summarization method.
    c. Forms of metadata include:
        1. Copies of emails or other documents that transmit data or request data.
        2. Header notes in scripts.
        3. Metadata pages and glossary pages in data summaries.
        4. Custom log files created by scripted data operations.

12. Date-tag directory and file names where the sequence of changes may affect their validity or interpretability.


Sunday, March 21, 2021

Crosstabbing Data in Postgres with an Indefinite Number of Columns

PostgreSQL's tablefunc extension provides the crosstab function, which can be used to transform normalized data into a crosstabbed format.  However, one of the drawbacks of the crosstab function is that it requires that the names of all of the column headers be listed in one of its arguments.  In some cases (actually, pretty much every case when I have reason to use crosstab), the number of distinct values that are to be transformed into column headers is not fixed.  Finding those and manually creating the appropriate crosstab argument is tedious and error-prone.

To eliminate this step and to substantially automate the task of crosstabbing data in Postgres, I wrote the following script for execsql.  This uses Postgres' string_agg() aggregate function, Postgres' quote_ident() function, and an execsql substitution variable to dynamically create the list of column headers that is to be used as an argument to crosstab.

This script should be run using execsql's execute script metacommand.

 

-- ########################################################################
-- !x! BEGIN SCRIPT PG_XTAB with parameters (dtable, rowid, colhdr, value, value_type, xtable)
--        Purpose:
--            Crosstab any three columns from a data table.
--        Parameters:
--            dtable        : The name of the input data table.
--            rowid        : The name of the column that should be kept as a row identifier.
--            colhdr        : The name of the column containing values that will
--                            be used as column headers in the output.
--            value        : The name of the column containing values that will
--                            appear in each cell of the cross-table.
--            value_type    : The Postgres data type of the values.  Types with spaces
--                            should be double-quoted (e.g., "double precision").
--            xtable        : The name of the temporary table that is to be created.
--            col_order    : Optional: the name of a column in the input table
--                            with values that have a 1:1 relationship with
--                            values of 'colhdr' and that should be used to order
--                            the crosstabbed columns in the output.
--        Notes:
--            1. There must be only one 'value' for each combination of 'rowid'
--                and 'colhdr'.  This constraint will be checked, and if it fails,
--                the failing values will be displayed if a global 'verbose' variable
--                is set, and you will be given an option to continue or halt.
--                If it fails and 'verbose' is not set, the script will halt.
--            2. The output table 'xtable' is deleted at the beginning of the script,
--                so it will not exist if you choose to continue after an error.
-- =========================================================================

-- Drop the table for return values so that it doesn't exist if this exits on error.
drop table if exists !!#xtable!! cascade;

-- ------------------------------------------------------------------------
--        QA Check: Unique combination of row, column, and value.
-- ------------------------------------------------------------------------
drop view if exists pgx_ck1 cascade;
create temporary view pgx_ck1 as
select distinct !!#rowid!!, !!#colhdr!!, !!#value!!
from !!#dtable!!;
drop view if exists pgx_ck2 cascade;
create temporary view pgx_ck2 as
select !!#rowid!!, !!#colhdr!!
from pgx_ck1
group by !!#rowid!!, !!#colhdr!!
having count(*) > 1;
-- !x! if(hasrows(pgx_ck2))
    -- !x! if(sub_defined(verbose) and !!verbose!!)
        -- !x! prompt ask "There are multiple results per cell in the input to PG_XTAB, so the cross-table will not be created.  Do you want to continue anyway?" sub pgx_continue compare pgx_ck1 and pgx_ck2 key (!!#rowid!!, !!#colhdr!!, !!#value!!)
        -- !x! if(not !!pgx_continue!!)
            -- !x! HALT message "There are multiple results per cell in the input to the PG_XTAB script." display pgx_ck1
        -- !x! endif
    -- !x! else
        -- !x! HALT message "There are multiple results per cell in the input to the PG_XTAB script."
    -- !x! endif
-- !x! endif

-- ------------------------------------------------------------------------
--        QA Check: Only one column order value for each column header.
-- ------------------------------------------------------------------------
-- !x! if(sub_defined(#col_order))
    drop view if exists pgx_ck3 cascade;
    create temporary view pgx_ck3 as
    select !!#colhdr!!, count(*) as ordering_values
    from (select distinct !!#colhdr!!, !!#col_order!! from !!#dtable!!) as vs
    group by !!#colhdr!!
    having count(*) > 1;
    -- !x! if(hasrows(pgx_ck3))
        -- !x! HALT "There are multiple ordering values for each column header value in the input to PG_XTAB."
    -- !x! endif
-- !x! endif


-- ------------------------------------------------------------------------
--        Get the list of column headers
-- ------------------------------------------------------------------------
drop view if exists pgx_hdrs cascade;
-- !x! if(sub_defined(#col_order))
    -- !x! sub orderclause order by !!#col_order!!
    -- !x! sub ordercol , !!#col_order!!
-- !x! else
    -- !x! sub_empty orderclause
    -- !x! sub_empty ordercol
-- !x! endif
create temporary view pgx_hdrs as
select hdr
from (
    select distinct quote_ident(!!#colhdr!!) as hdr !!ordercol!!
    from !!#dtable!!
    !!orderclause!!
    ) as hd
!!orderclause!!;

-- ------------------------------------------------------------------------
--        Get the three columns of data in a form to be crosstabbed.
-- ------------------------------------------------------------------------
drop view if exists pgx_xinp cascade;
create temporary view pgx_xinp as
select
    !!#rowid!!,
    quote_ident(!!#colhdr!!) as hdr,
    !!#value!! as value
from
    !!#dtable!!
order by
    !!#rowid!!;

-- ------------------------------------------------------------------------
--        Get the type declarations for the crosstabbed columns.
-- ------------------------------------------------------------------------
drop view if exists pgx_type_decl cascade;
create temporary view pgx_type_decl as
select string_agg(hdr || ' !!#value_type!!', ', ') as coltypes
from pgx_hdrs;
-- !x! subdata ~coltypes pgx_type_decl

-- ------------------------------------------------------------------------
--        Create the crosstab.
-- ------------------------------------------------------------------------
create temporary table !!#xtable!! as
select *
from
    crosstab('select * from pgx_xinp', 'select * from pgx_hdrs')
        as ct(!!#rowid!! text, !!~coltypes!!);


-- !x! END SCRIPT PG_XTAB
-- ########################################################################

Wednesday, March 17, 2021

Checklist for SQL Script Code Review

The following list of items focuses on points of potential weakness and improvement to consider during a code review of a SQL script.

  1. Do the header notes identify the script's purpose, author, and revision history?
  2. Is the code formatted for readability?
  3. Do comments accurately describe the purpose of functional blocks within the script?
  4. Are nullable columns used in joins without accounting for the possibility of nulls?
  5. Are nullable columns used in WHERE clause conditions without accounting for the possibility of nulls?
  6. Are all appropriate columns used in each join expression?
  7. Do any inner joins result in the erroneous exclusion of rows because one table has non-matching rows?
  8. Do any joins result in unintended expansion of rows (e.g., because there is a many-to-many relationship between the tables when the relationship should be one-to-one or one-to-many)?
  9. Are values in imported data sets tested for completeness, consistency, and uniqueness, as appropriate?
  10. Are there undocumented or implicit assumptions about data or relationships in the code?
  11. Are any hard-coded values correct for the database in use (e.g., codes originally to be used with a different database carried over to the current database; capitalization correct)?
  12. Is the logic of AND and OR clauses in WHERE clauses correct?
  13. Is the logic of non-equi-join conditions correct?
  14. Is the logic or algebra of calculations correct?
  15. Are function arguments correct?
  16. Are there any equality comparisons between floating-point numbers when they are not drawn unmodified from the same table column?
  17. Are constant values defined as substitution variables in a configuration section at the head of the script or in a configuration file?
  18. Are table names used in execsql  IMPORT metacommands in the appropriate case for the DBMS (e.g., lowercase for Postgres)?
  19. Are library scripts used where appropriate?
  20. Are transactions used where appropriate, and can committing of transactions be easily enabled or disabled?
  21. Is code repetition minimized through appropriate use of temporary queries or tables, subqueries, CTEs, functions, or execsql INCLUDE or SCRIPT metacommands?
  22. Does the code provide information about progress and status as it runs?
  23. Are any important actions missing?

Saturday, December 5, 2020

Sequentially Writing Data to Zip Files with Python

The write() and writestr() methods of the ZipFile class in Python's zipfile library allow the addition of an entire file or a single string as a member file within a zip file.  However, if a large amount of data will be generated dynamically, these methods do not allow separate data items to be written sequentially to a member file within a zip file.

The StreamableZipfile class in the following code snippet provides this missing capability.

import time
import zipfile

class StreamableZipfile(object):
	def __init__(self, zipfile_name, mode='a'):
		# Compression type and level are only available in Python
		# since versions 3.3 and 3.7, respectively.
		self.zf = zipfile.ZipFile(zipfile_name, mode,
				compression=zipfile.ZIP_BZIP2, compresslevel=9)
	def close(self):
		self.zf.close()
	def member_file(self, member_filename):
		# Creates a ZipInfo object (file) within the zipfile
		# and opens it for writing.
		self.current_zinfo = zipfile.ZipInfo(filename=member_filename,
						date_time=time.localtime(time.time())[:6])
		self.current_zinfo.compress_type = self.zf.compression
		self.current_zinfo._compresslevel = self.zf.compresslevel
		self.current_zinfo.file_size = 0
		self.current_handle = self.zf.open(self.current_zinfo, mode='w')
	def close_member(self):
		self.current_handle.close()
	def write(self, str_data):
		# Writes the given text to the currently open member.
		data = str_data.encode("utf-8")
		with self.zf._lock:
			self.current_zinfo.file_size = self.current_zinfo.file_size + len(data)
			self.current_handle.write(data)

This simple implementation does not include any error checking and always compresses the data with the bzip2 algorithm, using the highest compression level.  A more robust and flexible implementation would eliminate these limitations.  Modifications are also needed for versions of Python prior to 3.7.

Use of the StreamableZipfile class is illustrated by the following code, which creates a zip file containing two separate files, where lines are written sequentially to each of the files.

import os

zfname = "Test.zip"
if os.path.isfile(zfname):
	os.remove(zfname)

# Open the streamable zip file and write several lines to a file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file1.txt')
zfile.write("This is file 1, line 1\n")
zfile.write("This is file 1, line 2\n")
zfile.write("This is file 1, line 3\n")
zfile.close_member()
zfile.close()


# Open the same zip file and write lines to another file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file2.txt')
zfile.write("This is file 2, line 1\n")
zfile.write("This is file 2, line 2\n")
zfile.write("This is file 2, line 3\n")
zfile.close_member()
zfile.close()

Saturday, September 12, 2020

The Three Rules of Budgeting

1. If it's a task that you know how to do, you're going to under-budget it.

2. If it's a task that you don't know how to do, you're going to under-budget it by a lot.

3. The client always follows Rule 2.