Saturday, October 2, 2021

Are We in the Twilight Years of Programming Language Development?

The following figure shows the number of new programming languages created per year since 1970.  The data are from the Wikipedia page Timeline of Programming Languages.


There's a consistent decline since the mid-'90s.  Have most of the best programming languages already been developed, so we don't need any more?



Saturday, August 28, 2021

Data Transformation Operations and a Standard Terminology

Data standardization, normalization, rescaling, and transformation: what they are and when to apply them

There are several different types of operations that are commonly applied to data before further analysis.  These operations are applied to individual data values, but are typically based on the attributes of a group of data values.  In a data matrix where columns are variables and rows are instances, the group that is used may be either all of the values in a row or all of the values in a column.  Which type of group to use, as well as which type of operation to apply, depends on both the data themselves and the needs of the analysis to be conducted.

Different types of operations that are commonly applied to data are standardization, rescaling, normalization, and non-linear transformation.  These terms are often used ambiguously and inconsistently.  Misunderstanding what others have done, or misrepresenting what you have done, is easy when these operations are not clearly defined and distinguished.

The various operations that may be applied to data are listed below; a short SQL sketch following the list illustrates each one:

  • Standardization: Dividing every value by another value or by a constant.  The denominator is the standard to which each individual value is referenced.
  • Rescaling or range scaling: Converting the data from the original range of the group to a fixed range such as -1 to 1 or 0 to 1.
  • Mean centering: Subtracting the mean of the group from each value.
  • Normalization or z-transformation: Subtracting the mean of the group from each value and then dividing the result by the standard deviation of the group.
  • Non-linear transformation: Any non-linear operation such as taking the logarithm, square root, or probit of each data value.
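
To make these definitions concrete, the following sketch applies each operation using Postgres window functions.  The table and column names (measurements, sample_id, analyte, conc) are hypothetical; the expressions are the point, and each group is all of the values of one variable (analyte).

-- Hypothetical input table: one value per instance (sample) and variable (analyte).
-- create table measurements (sample_id text, analyte text, conc double precision);

select
    sample_id,
    analyte,
    conc,
    -- Standardization: divide each value by a reference value (here, the group sum).
    conc / sum(conc) over (partition by analyte) as standardized,
    -- Rescaling (range scaling) to the range 0 to 1 within the group.
    (conc - min(conc) over (partition by analyte))
        / nullif(max(conc) over (partition by analyte)
                 - min(conc) over (partition by analyte), 0) as rescaled_0_1,
    -- Mean centering: subtract the group mean.
    conc - avg(conc) over (partition by analyte) as mean_centered,
    -- Normalization (z-transformation): center, then divide by the group standard deviation.
    (conc - avg(conc) over (partition by analyte))
        / nullif(stddev_samp(conc) over (partition by analyte), 0) as z_score,
    -- Non-linear transformation: natural log (values must be positive).
    ln(conc) as log_transformed
from measurements;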

Contributing to the ambiguity of these terms is the lack of any single term that encompasses this group of operations as a whole.  The word “transformation” is most commonly used for this purpose, although that usage potentially conflicts with the use of the same term for particular linear and non-linear operations on the data.  The word “normalization” is also often used as a catch-all for various types of operations on data.  Because normalization (as defined above) results in the calculation of standard (Z) scores, normalization is often referred to simply as standardization.  The word “normalization” is also frequently used when the actual operation is either standardization or rescaling.  Non-linear transformations may also be referred to as rescaling because they change the data from a linear to a non-linear scale.

Because of these ambiguities:

  •  Be specific when describing the operations that you have applied to the data.  Use the terms above as the basis for your descriptions.
  •  Be careful when reading others’ descriptions of the operations that they have applied.  If a description is not specific, assume that the author is not sensitive to ambiguities in terminology, and may be using different definitions than those above.

Summary of Data Relationships Under Different Data Transformation Operations

The following table summarizes how relationships among data values are affected by the different operations, and how sensitive each operation is to outliers.

Operation                            Linearity    Proportionality    Additivity    Sensitivity to
                                     preserved    preserved          preserved     statistical outliers (c)
-----------------------------------  -----------  -----------------  ------------  ------------------------
Standardization                      Yes (a)      Yes (a,b)          No            Moderate
Rescaling or range scaling           Yes          No                 No            Extreme
Mean centering                       Yes          No                 No            Moderate
Normalization or z-transformation    Yes          No                 No            Moderate
Non-linear transformation            No           No                 No            Operation-dependent


Notes

a) Only within each group for which the same divisor has been applied to all values.

b) Standardization cannot create proportional relationships; the assumption that it does may be hidden when standardization is used.

c) None of the operations eliminate outliers.  Range scaling is very sensitive to outliers because an outlier will determine one end of the range.  Some non-linear transformations, such as the log transformation, may reduce the relative weight of outliers.

Guidelines for Application

Some very general guidelines for application of these operations are:

  • If a statistical analysis is to be conducted, non-linear transformation may be needed to ensure that the data satisfy requirements of the method.
  • Normalization (z-transformation) may be appropriate prior to regression analyses when the purpose is to interpret the regression coefficients as measures of the relative importance of different variables.  Regression is insensitive to operations that preserve linear relationships, so standardization or rescaling is not ordinarily needed.
  • If an analysis based on similarity is to be conducted (e.g., clustering), then non-linear transformation should be avoided.  Most similarity measures are based on linear relationships (e.g., Euclidean distance), so non-linear transforms will alter the estimates of similarities between instances.
  • If an analysis based on Pearson correlations is to be conducted, then non-linear transformations should be used with caution, because the results of the analysis will not apply to data in the original scale.
  • When each variable represents a component of a larger quantity (e.g., variables are PCB congeners and total PCBs is the larger quantity), then standardization of each value to the sum is frequently appropriate, to eliminate the effect of differences between instances.  This operation is commonly referred to as "row-sum normalization," although it is not strictly normalization following the definitions above; a short SQL sketch of this operation follows the list.
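
As an illustration of the last point, the following sketch standardizes each congener concentration to the total for its sample (its row sum).  The table and column names (congener_conc, sample_id, congener, conc) are hypothetical.

-- Hypothetical input: one concentration per instance (sample) and variable (congener).
-- create table congener_conc (sample_id text, congener text, conc double precision);

-- Standardize each value to its row (sample) total, so that the congener values
-- for each sample are expressed as fractions of the sample total.
select
    sample_id,
    congener,
    conc,
    conc / nullif(sum(conc) over (partition by sample_id), 0) as fraction_of_total
from congener_conc;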

Application of more than one of the data transformation operations may be appropriate for some data sets and some analyses.

 

Sunday, April 18, 2021

Data Management Best Practices

There are a lot of web pages that present some interpretation of data management best practices.  Most of them are, in my opinion, uselessly general.  Herewith, then, is my list of specific recommended best practices.

 

1. Know the use of the data
    a. For the overall project
    b. For each data summary request

2. Use a single authoritative data store.

3. Back up important data.

4. Verify or characterize data quality.

5. Control access to sensitive data.

6. Track data changes.

7. Document data management practices.
    a. Default practices
    b. Project- or data-set-specific

8. Preserve original data.

9. Script all data revisions and data summarizations.

10. Use version control for things that may change.
    a. Scripts
    b. Regular or periodic data exports

11. Record metadata
    a. For both incoming and outgoing data
    b. Metadata includes
        1. Provenance: Who created, provided, or produced the data.
        2. Content: What the data set contains.
        3. Purpose: What the data are intended to be used for.
        4. Method: How the data were generated or selected and summarized.
        5. History: The history of any revisions made to the data or the data summarization method.
    c. Forms of metadata include:
        1. Copies of emails or other documents that transmit data or request data.
        2. Header notes in scripts.
        3. Metadata pages and glossary pages in data summaries.
        4. Custom log files created by scripted data operations.

12. Date-tag directory and file names where the sequence of changes may affect their validity or interpretability.
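
As a small illustration of items 9 and 12, a scripted, date-tagged data export from Postgres might look like the following psql command; the table name, query, and file name are hypothetical.

-- Scripted export of a data summary to a date-tagged file (run in psql).
-- The date tag records when the export was produced, so successive versions
-- of the summary can be distinguished.
\copy (select * from water_quality_summary) to 'water_quality_summary_2021-04-18.csv' with (format csv, header)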


Sunday, March 21, 2021

Crosstabbing Data in Postgres with an Indefinite Number of Columns

PostgreSQL's tablefunc extension provides the crosstab function, which can be used to transform normalized data into a crosstabbed format.  However, one of the drawbacks of the crosstab function is that it requires that the names of all of the column headers be listed in one of its arguments.  In some cases (actually, pretty much every case when I have reason to use crosstab), the number of distinct values that are to be transformed into column headers is not fixed.  Finding those and manually creating the appropriate crosstab argument is tedious and error-prone.
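
For example, a basic crosstab call must spell out every output column and its type in the AS clause; the table, column, and analyte names below are hypothetical.

-- Basic use of crosstab() from the tablefunc extension: the output column
-- definitions must be listed explicitly, even though they simply repeat the
-- distinct values returned by the category query.
select *
from crosstab(
        'select sample_id, analyte, conc from lab_results order by 1, 2',
        'select distinct analyte from lab_results order by 1')
    as ct(sample_id text, benzene double precision, toluene double precision, xylene double precision);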

To eliminate this step and to substantially automate the task of crosstabbing data in Postgres, I wrote the following script for execsql.  This uses Postgres' string_agg() aggregate function, Postgres' quote_ident() function, and an execsql substitution variable to dynamically create the list of column headers that is to be used as an argument to crosstab.
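
The core of the approach is to generate that column list from the data.  In plain SQL, the idea looks roughly like the sketch below (the table and column names are again hypothetical); in the script, the generated string is captured into a substitution variable and spliced into the crosstab call.

-- Build the output-column declaration for crosstab() from the distinct values
-- that will become column headers.  quote_ident() double-quotes any header that
-- is not a valid SQL identifier; string_agg() joins the declarations into one string.
select string_agg(quote_ident(analyte) || ' double precision', ', ') as coltypes
from (select distinct analyte from lab_results) as hdrs;
-- Example result: benzene double precision, "2-Butanone" double precision, ...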

This script should be run using execsql's execute script metacommand.

 

-- ########################################################################
-- !x! BEGIN SCRIPT PG_XTAB with parameters (dtable, rowid, colhdr, value, value_type, xtable)
--        Purpose:
--            Crosstab any three columns from a data table.
--        Parameters:
--            dtable        : The name of the input data table.
--            rowid        : The name of the column that should be kept as a row identifier.
--            colhdr        : The name of the column containing values that will
--                            be used as column headers in the output.
--            value        : The name of the column containing values that will
--                            appear in each cell of the cross-table.
--            value_type    : The Postgres data type of the values.  Types with spaces
--                            should be double-quoted (e.g., "double precision").
--            xtable        : The name of the temporary table that is to be created.
--            col_order    : Optional: the name of a column in the input table
--                            with values that have a 1:1 relationship with
--                            values of 'colhdr' and that should be used to order
--                            the crosstabbed columns in the output.
--        Notes:
--            1. There must be only one 'value' for each combination of 'rowid'
--                and 'colhdr'.  This constraint will be checked, and if it fails,
--                the failing values will be displayed if a global 'verbose' variable
--                is set, and you will be given an option to continue or halt.
--                If it fails and 'verbose' is not set, the script will halt.
--            2. The output table 'xtable' is deleted at the beginning of the script,
--                so it will not exist if you choose to continue after an error.
-- =========================================================================

-- Drop the table for return values so that it doesn't exist if this exits on error.
drop table if exists !!#xtable!! cascade;

-- ------------------------------------------------------------------------
--        QA Check: Unique combination of row, column, and value.
-- ------------------------------------------------------------------------
drop view if exists pgx_ck1 cascade;
create temporary view pgx_ck1 as
select distinct !!#rowid!!, !!#colhdr!!, !!#value!!
from !!#dtable!!;
drop view if exists pgx_ck2 cascade;
create temporary view pgx_ck2 as
select !!#rowid!!, !!#colhdr!!
from pgx_ck1
group by !!#rowid!!, !!#colhdr!!
having count(*) > 1;
-- !x! if(hasrows(pgx_ck2))
    -- !x! if(sub_defined(verbose) and !!verbose!!)
        -- !x! prompt ask "There are multiple results per cell in the input to PG_XTAB, so the cross-table will not be created.  Do you want to continue anyway?" sub pgx_continue compare pgx_ck1 and pgx_ck2 key (!!#rowid!!, !!#colhdr!!, !!#value!!)
        -- !x! if(not !!pgx_continue!!)
            -- !x! HALT message "There are multiple results per cell in the input to the PG_XTAB script." display pgx_ck1
        -- !x! endif
    -- !x! else
        -- !x! HALT message "There are multiple results per cell in the input to the PG_XTAB script."
    -- !x! endif
-- !x! endif

-- ------------------------------------------------------------------------
--        QA Check: Only one column order value for each column header.
-- ------------------------------------------------------------------------
-- !x! if(sub_defined(#col_order))
    drop view if exists pgx_ck3 cascade;
    create temporary view pgx_ck3 as
    select !!#colhdr!!, count(*) as ordering_values
    from (select distinct !!#colhdr!!, !!#col_order!! from !!#dtable!!) as vs
    group by !!#colhdr!!
    having count(*) > 1;
    -- !x! if(hasrows(pgx_ck3))
        -- !x! HALT "There are multiple ordering values for each column header value in the input to PG_XTAB."
    -- !x! endif
-- !x! endif


-- ------------------------------------------------------------------------
--        Get the list of column headers
-- ------------------------------------------------------------------------
drop view if exists pgx_hdrs cascade;
-- !x! if(sub_defined(#col_order))
    -- !x! sub orderclause order by !!#col_order!!
    -- !x! sub ordercol , !!#col_order!!
-- !x! else
    -- !x! sub_empty orderclause
    -- !x! sub_empty ordercol
-- !x! endif
create temporary view pgx_hdrs as
select hdr
from (
    select distinct quote_ident(!!#colhdr!!) as hdr !!ordercol!!
    from !!#dtable!!
    !!orderclause!!
    ) as hd
!!orderclause!!;

-- ------------------------------------------------------------------------
--        Get the three columns of data in a form to be crosstabbed.
-- ------------------------------------------------------------------------
drop view if exists pgx_xinp cascade;
create temporary view pgx_xinp as
select
    !!#rowid!!,
    quote_ident(!!#colhdr!!) as hdr,
    !!#value!! as value
from
    !!#dtable!!
order by
    !!#rowid!!;

-- ------------------------------------------------------------------------
--        Get the type declarations for the crosstabbed columns.
-- ------------------------------------------------------------------------
drop view if exists pgx_type_decl cascade;
create temporary view pgx_type_decl as
select string_agg(hdr || ' !!#value_type!!', ', ') as coltypes
from pgx_hdrs;
-- !x! subdata ~coltypes pgx_type_decl

-- ------------------------------------------------------------------------
--        Create the crosstab.
-- ------------------------------------------------------------------------
create temporary table !!#xtable!! as
select *
from
    crosstab('select * from pgx_xinp', 'select * from pgx_hdrs')
        as ct(!!#rowid!! text, !!~coltypes!!);


-- !x! END SCRIPT PG_XTAB
-- ########################################################################
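
A call to this script might look like the following.  The data table and column names are hypothetical, and the invocation assumes execsql's EXECUTE SCRIPT metacommand with named arguments; check the execsql documentation for the exact form supported by your version.

-- !x! EXECUTE SCRIPT PG_XTAB WITH ARGUMENTS (dtable=lab_results, rowid=sample_id, colhdr=analyte, value=conc, value_type="double precision", xtable=xt_results)

select * from xt_results;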

Wednesday, March 17, 2021

Checklist for SQL Script Code Review

The following list of items focuses on points of potential weakness and improvement to consider during a code review of a SQL script.

  1. Do the header notes identify the script's purpose, author, and revision history?
  2. Is the code formatted for readability?
  3. Do comments accurately describe the purpose of functional blocks within the script?
  4. Are nullable columns used in joins without accounting for the possibility of nulls?  (A sketch following this list illustrates one way to account for nulls in joins and in WHERE conditions.)
  5. Are nullable columns used in WHERE clause conditions without accounting for the possibility of nulls?
  6. Are all appropriate columns used in each join expression?
  7. Do any inner joins result in the erroneous exclusion of rows because one table has non-matching rows?
  8. Do any joins result in unintended expansion of rows (e.g., because there is a many-to-many relationship between the tables when the relationship should be one-to-one or one-to-many)?
  9. Are values in imported data sets tested for completeness, consistency, and uniqueness, as appropriate?
  10. Are there undocumented or implicit assumptions about data or relationships in the code?
  11. Are all hard-coded values correct for the database in use (e.g., watch for codes carried over from a different database, and check capitalization)?
  12. Is the logic of AND and OR clauses in WHERE clauses correct?
  13. Is the logic of non-equi-join conditions correct?
  14. Is the logic or algebra of calculations correct?
  15. Are function arguments correct?
  16. Are there any equality comparisons between floating-point numbers when they are not drawn unmodified from the same table column?
  17. Are constant values defined as substitution variables in a configuration section at the head of the script or in a configuration file?
  18. Are table names used in execsql IMPORT metacommands in the appropriate case for the DBMS (e.g., lowercase for Postgres)?
  19. Are library scripts used where appropriate?
  20. Are transactions used where appropriate, and can committing of transactions be easily enabled or disabled?
  21. Is code repetition minimized through appropriate use of temporary queries or tables, subqueries, CTEs, functions, or execsql INCLUDE or SCRIPT metacommands?
  22. Does the code provide information about progress and status as it runs?
  23. Are any important actions missing?
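
As an illustration of items 4 and 5, the sketch below shows one way to account for nullable key columns; the table and column names are hypothetical.

-- Joining on equality alone silently drops rows where location_id is null in
-- either table, because null = null does not evaluate to true.  IS NOT DISTINCT
-- FROM treats two nulls as a match.
select s.sample_id, r.result_value
from samples as s
inner join results as r
    on  s.sample_id = r.sample_id
    and s.location_id is not distinct from r.location_id;

-- In a WHERE clause, a condition on a nullable column excludes rows with nulls
-- unless they are handled explicitly.
select sample_id
from samples
where depth_m < 10 or depth_m is null;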