A three-part series on unmixing using NMF.
Part 1: Introduction and Implementation
Data distributions are often characterized by their range, mean, median, and quartiles. For data sets in SQLite, the range (min and max) and mean are easily calculated with built-in aggregate functions. Calculation of quartiles (the first quartile, the median, and the third quartile) requires custom SQL code, as there are no built-in aggregate functions for these values. This previous post contains code to calculate the median, but that technique cannot be extended to calculate the first and third quartiles.
There are other available examples of calculation of quartiles in SQLite, but (as of this writing) all of them (that I have found) yield a result that exactly matches one of the values in the data set. These results may be only approximate for some data sets, specifically when there is an even number of rows overall or within two adjacent quartiles.
The code below takes the same approach to calculating the first and third quartiles as is standard for the median: when the number of values on either side of the dividing point is unequal, the statistic is a single data value; when the counts on either side are equal, the statistic is the average of the two data values adjacent to the dividing point.
The solution presented here differs from others in accommodating the second of these cases. In addition, it allows for calculation of these statistics for subgroups of the overall data set.
This illustration is designed to be run on a table with two columns: an 'id' column that identifies subgroups of the data, and a 'value' column containing the data values.
Such a table can be produced with the following code:
create table testdata as
select * from
    (select value as id from generate_series(1,8)) as d
    cross join
    (select random() as value from generate_series(1, 10000));
This approach uses SQLite's 'ntile()' function to subdivide the data set (or each subgroup) into equal or nearly-equal subsets. The sizes and bounding values for each of these subsets are then extracted to allow determination of which of the two cases listed above should be used for each quartile and the median.
with nt as (
    select id, value,
        ntile(4) over (partition by id order by value) as ntile_4
    from testdata
    ),
idlist as (select distinct id from nt),
q1 as (
    select id, max(value) as q1_max, count(*) as q1_n
    from nt where ntile_4 = 1 group by id),
q2 as (
    select id, min(value) as q2_min, max(value) as q2_max, count(*) as q2_n
    from nt where ntile_4 = 2 group by id),
q3 as (
    select id, min(value) as q3_min, max(value) as q3_max, count(*) as q3_n
    from nt where ntile_4 = 3 group by id),
q4 as (
    select id, min(value) as q4_min, count(*) as q4_n
    from nt where ntile_4 = 4 group by id),
s as (
    select st.id,
        q1_max, q1_n,
        q2_min, q2_max, q2_n,
        q3_min, q3_max, q3_n,
        q4_min, q4_n
    from idlist as st
        left join q1 on q1.id = st.id
        left join q2 on q2.id = st.id
        left join q3 on q3.id = st.id
        left join q4 on q4.id = st.id)
select id,
    case when q1_n = q2_n then (q1_max + q2_min)/2
        else case when q1_n > q2_n then q1_max else q2_min end
        end as q1,
    case when q1_n + q2_n = q3_n + q4_n then (q2_max + q3_min)/2
        else case when q1_n + q2_n > q3_n + q4_n then q2_max else q3_min end
        end as median,
    case when q3_n = q4_n then (q3_max + q4_min)/2
        else case when q3_n > q4_n then q3_max else q4_min end
        end as q3
from s
order by id;
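The case logic in this query can be checked outside the database. The following Python sketch (an illustration only; the helper names are invented, and it is not part of the SQL solution) reproduces the ntile(4) split and the same boundary rules:

```python
def ntile(sorted_vals, n_tiles=4):
    # Mimic SQL ntile(): earlier tiles get the extra rows when the row
    # count is not evenly divisible by the number of tiles.
    base, extra = divmod(len(sorted_vals), n_tiles)
    tiles, start = [], 0
    for i in range(n_tiles):
        size = base + (1 if i < extra else 0)
        tiles.append(sorted_vals[start:start + size])
        start += size
    return tiles

def pick(n_lo, n_hi, lo_max, hi_min):
    # The case logic from the query: average the two boundary values when
    # the counts on each side are equal; otherwise take the boundary value
    # from the more populous side.
    if n_lo == n_hi:
        return (lo_max + hi_min) / 2
    return lo_max if n_lo > n_hi else hi_min

def quartiles(values):
    t1, t2, t3, t4 = ntile(sorted(values))
    q1 = pick(len(t1), len(t2), t1[-1], t2[0])
    median = pick(len(t1) + len(t2), len(t3) + len(t4), t2[-1], t3[0])
    q3 = pick(len(t3), len(t4), t3[-1], t4[0])
    return q1, median, q3
```

For eight evenly spaced values all three statistics fall between data values (2.5, 4.5, 6.5 for the integers 1 through 8), which is exactly the case that solutions returning only observed values get wrong.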
This is not a new question. There are numerous summaries of reasons why people don't ask questions, and suggestions for how one can become a better question-asker. A lot of existing discussion is focused on general conversations; the focus here is specifically on technical conversations in professional settings, where the goal is to convey information from one party to another (that is, report-talk rather than rapport-talk; Tannen 1990).
This post summarizes commonly suggested reasons why people don't ask more questions, touches briefly on some philosophy of communication, and suggests an additional (partial) answer that doesn't appear to be discussed elsewhere.
Commonly-cited reasons for a person's reluctance to ask questions are:
These are potential explanations for Why don't people ask the questions that they have?, but something else to consider is Why don't people think of questions to ask? Simple explanations are apathy (again) and the lack of critical thinking skills. However, differing contexts of speaker and listener may also play a large role.
The reasons why any questions at all might need to be asked can be attributed to a communication problem: either a lack of clarity on the speaker's part, or a lack of understanding on the questioner's part. The existence of such communication problems seems like it should be an aberration in natural language. Natural languages have evolved, however, to allow ambiguity, and there may even be some advantages to this ambiguity (Piantadosi et al. 2012). Languages can be constructed to minimize or eliminate ambiguity (cf. Lojban), but natural languages have not developed to be free of ambiguity.
Conversational implicature is the term used to describe the disjunction in communication when what is said is not what is meant. Conversational implicature underlies many common forms of communication, ranging from humor to political dog whistles. But it is not limited to such special cases; it is common in everyday conversation. A humdrum example from Paul Grice, who coined the term (reproduced from Benotti and Blackburn [2014]) is:
Man standing by his car: "I am out of petrol."
Passer-by: "There is a garage around the corner."
The effectiveness of conversational implicature in this case rests on common context and inferences, such as:
Implicature simplifies conversation by eliminating the need to explicitly state those things that are part of the participants' shared context, or that can be reasonably inferred.
Much--even most--meaning in conversation does not reside in the words spoken at all, but is filled in by the person listening. -- D. Tannen (1990)
Although the result of conversational implicature can be ambiguity, ambiguity may have important benefits for the simplification of communication (Piantadosi et al. 2012).
A meta-context for the use of conversational implicature is that the conversation is governed by the Cooperative Principle: all parties to the conversation are making a good-faith effort to exchange information reliably (Benotti and Blackburn 2014; Bi 2019). Therefore, underlying all the other contextual and inferential information that is required for communication is the presumption that both parties are engaging in honest communication.
Lack of clarity, lack of understanding, and even misperception arise when the participants do not have a shared context, or when the same inference is not reasonable to, or is not made by, both parties.
Commonality of context and inferential ability is assumed by the speaker unless they are actually attempting to mislead. It is incumbent on the listener—the potential questioner—to identify when the speaker's context or inferences may be different than their own. If the listener does not have any relevant context, and cannot make any relevant inference, the problem should be immediately obvious to the listener. The listener's response in such cases should be either to ask a question or to submit to losing the thread of the conversation.
An even more challenging situation occurs when the listener's context is different from that of the speaker, or when the inference made by the listener is different from that of the speaker. This will lead to misunderstanding that may not be recognized by either the speaker or the listener. If the speaker understands their audience, they should refrain from using implicatures that can be misperceived in this way. However, the listener must be constantly alert for the possibilities of differing contexts and implications. This burden requires a mode of listening that goes beyond the simple acquisition of information, and that engages the ability to imagine the existence of alternate contexts and implications.
Recognition of the consequences of conversational implicature allows the identification of several possible additional reasons why people don't ask questions—specifically, why they don't have questions to ask. These reasons are not commonly discussed:
That is, sometimes people don't ask questions because they don't realize that they don't actually understand what was said. This can result from a failure to see through language's fog of ambiguity, or to 'see' the wrong thing entirely.
These reasons are not necessarily independent of those listed in the first section above. Some may have had the experience of working with others who comprehensively question assumptions—or at least, voice those questions—in some social circumstances but not in others.
For example, although violation of the Cooperative Principle may be a deliberate and acceptable strategy to improve communication (Bi 2019), it may also seem to be transgressive in that it is uncooperative. In combination with other factors listed above, some individuals may feel that the transgressive cost outweighs the informational benefit.
On the other side of the coin are those who frequently ask questions. Some individuals may be less inclined to adhere to the Cooperative Principle, because of their nature or preferred (possibly strategic) mode of interaction. This tendency may also be situational. Alternatively, some individuals may not be very sensitive to the norms of cooperative social conventions.
Conversational implicature and the ambiguity that it encompasses provide a basis for understanding why individuals do not ask questions in some circumstances.
Benotti, L., and P. Blackburn. 2014. Context and Implicature. Context in Computing, Springer New York. pp.419-436.
Bi, M. 2019. Analysis of the Conversational Implicature of Violating Cooperative Principles in Daily Discourse. American Journal of History and Culture, 2:13.
Carter, A. J., A. Croft, D. Lukas, and G. M. Sandstrom. 2018. Women's visibility in academic seminars: Women ask fewer questions than men. PLOS ONE 14:e0212146.
Piantadosi, S. T., H. Tily, and E. Gibson. 2012. The communicative function of ambiguity in language. Cognition 122:280-291.
Schmidt, S. J., and J. R. Davenport. 2017. Who asks questions at astronomy meetings? Nature Astronomy 1:0153.
Tannen, D. 1990. You Just Don't Understand: Women and Men in Conversation. Harper-Collins, New York.
Data that are stored in databases may be measurement results that are inherently imprecise. The precision of a measurement is often represented by the number of significant digits associated with the measurement. Calculations performed with such measurements should preserve, or propagate, the significant digits so that the precision of the calculated result is also represented appropriately.
Numeric data types supported by database management systems (DBMSs) do not have any attribute that allows the precision of each numeric value to be represented. If the precision of a measured value is to be represented by significant digits, and stored with the value, then ordinarily a separate table column must be used for the number of significant digits for each measurement.
There are other approaches to representing precision that can be used. One is to store all measured values as text. This is not a wholly adequate approach because, unless scientific notation is used, when there are multiple trailing zeroes to the left of the decimal point, the precision cannot be represented accurately. Also, this is an unworkable approach if any calculations are to be performed: text values must be cast to numeric values before calculation, which results in loss of the purely visual representation of the number of significant digits, and consequently no way to propagate precision estimates throughout calculation.
Also, precision may be represented in ways other than using significant digits. Recording a standard deviation for each measurement, instead of the significant digits, is one alternate approach. This may be appropriate when calibration data, or counting statistics for radioactivity measurements, provide a statistical estimate of the precision of each value.
This post is focused solely on the case where precision is represented by the number of significant digits for each measured value.
Among commonly used DBMSs, PostgreSQL is uniquely suited for managing imprecise measurement data because of the ease of creating composite data types, and operators and aggregate functions for those data types. The use of composite data types, operators, and aggregate functions for measurement data is illustrated in the following sections.
Postgres allows users to extend the built-in data types either with user-defined types that are defined using C or some other low-level language, or with composite types that are defined using the CREATE TYPE statement. The CREATE TYPE statement provides all the functionality that is needed to implement a measurement type that combines both a floating-point number and an integer value for the significant digits. The definition of this type is simply:
create type meas_value as (
    value double precision,
    sig_figs integer
);
Construction of, and access to, values of composite types like this are well described in the PostgreSQL documentation.
To perform calculations with values of this type, operators or aggregate functions, or both, must be defined. In addition, to handle calculations with significant digits, other supporting functions are also needed.
Some types of calculations using measurement values require tracking of the least significant place of each measured value, where "place" refers to the one's place, 10's place, 100's place, etc. For example, the least significant place (LSP) of a sum of measured values is equal to the highest LSP of any of the summands.
The LSP of a measured value can be found by subtracting the number of significant digits from the most significant place (MSP). The MSP can be found by taking the base-10 logarithm of the value, and converting the result to an integer.
After a sum is calculated and its LSP found, the LSP must be converted back into the number of significant digits. This operation is almost exactly the same as the previous one, subtracting the LSP from the MSP. Functions named 'lsp()' and 'figs_from_lsp()' that perform these operations are:
-- Find the LSP for a measured value and sig. figs.
CREATE FUNCTION lsp(in value double precision, in ppf integer)
RETURNS integer
language sql immutable leakproof strict parallel safe
RETURN floor(log(abs(value))) + 1 - ppf;

-- Calculate the sig. digits for a measured value and LSP
CREATE FUNCTION figs_from_lsp(in value double precision, in ppf integer)
RETURNS integer
language plpgsql immutable leakproof strict parallel safe
as $$
DECLARE
    figs integer;
BEGIN
    figs := floor(log(abs(value))) + 1 - ppf;
    IF figs < 1 then
        figs := 1;
    END IF;
    RETURN figs;
END;
$$;
In Postgres, the 'log()' function is the base-10 logarithm. The logarithm is converted to an integer by taking its floor and adding one. It may seem that these two operations could be replaced by a single 'ceil()' (ceiling) function, but they cannot: for exact powers of ten (e.g., 0.1, 0.01, 0.001), the ceiling function produces integers that are inconsistent with the desired most significant place.
The 'figs_from_lsp()' function performs essentially the same operation as the 'lsp()' function, but guarantees that at least one significant digit will be returned. Operations such as subtraction can lead to a total loss of precision, but as a practical matter even such results are considered (in this implementation) to have at least one significant digit. An alternative approach would be to raise an exception when there is a total loss of precision.
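The place arithmetic is easy to sanity-check outside the database. This Python sketch is a rough translation of the two SQL functions above (not the Postgres code itself):

```python
import math

def lsp(value, sig_figs):
    # Least significant place: most significant place minus the number of
    # significant digits.  E.g., 123.4 with 4 sig. figs -> -1, the tenths
    # place (10^-1); 1200 with 2 sig. figs -> 2, the hundreds place.
    return math.floor(math.log10(abs(value))) + 1 - sig_figs

def figs_from_lsp(value, place):
    # Inverse operation, clamped so that a result is always considered to
    # have at least one significant digit (e.g., after a subtraction that
    # loses all precision).
    figs = math.floor(math.log10(abs(value))) + 1 - place
    return max(figs, 1)
```

Note the clamp in `figs_from_lsp()`: for 0.5 with an LSP of 2 the raw arithmetic yields -2 significant digits, which is clamped to 1, matching the behavior described above.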
Although calculations should ordinarily be carried out using as many digits of precision as are available, rounding a measured value to the proper number of significant digits may be desirable before displaying or reporting the value. The following function will round the value field of a 'meas_value' data type to the given number of significant digits:
CREATE FUNCTION mv_round(mv meas_value)
RETURNS meas_value
language plpgsql immutable leakproof strict parallel safe
AS $$
DECLARE
    f double precision;
BEGIN
    f := 10^lsp(mv.value, mv.sig_figs);
    return cast((round(mv.value / f) * f, mv.sig_figs) as meas_value);
END;
$$;
Note that this function does not do the same thing as Postgres' own 'round()' function. Postgres' function rounds a double-precision number to the nearest integer, whereas 'mv_round()' rounds to the given number of significant digits. However, the 'mv_round()' function uses Postgres' 'round()' function, and therefore follows its tie-breaking behavior.
Also note that because of the inherently imprecise nature of common implementations of floating-point numbers, rounding a measured value to a specific number of significant digits does not necessarily eliminate all 'noise' bits from the floating-point representation. Comparison of two rounded measured values should be done with the same caution as should be used for floating-point numbers in general. The equality operator described in the next section provides a robust means of comparing two data values of the 'meas_value' type.
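In Python terms, the rounding step is just scale, round, and unscale. This hypothetical helper mirrors 'mv_round()' (with the caveat that Python's round() breaks ties to even, which may differ from the Postgres function's tie-breaking):

```python
import math

def round_sig(value, sig_figs):
    # Round to the given number of significant digits by scaling so the
    # least significant digit lands in the one's place, rounding, and
    # scaling back.
    place = math.floor(math.log10(abs(value))) + 1 - sig_figs  # the LSP
    factor = 10.0 ** place
    return round(value / factor) * factor
```

As noted above, the result is still a binary float, so comparisons should use a tolerance rather than strict equality.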
Operators are binary and unary operators such as "+", "-", "*", and "/". A new operator for a custom type has two parts: a function that implements the operation for the operand data types, and a CREATE OPERATOR statement that binds that function to the operator symbol.
The infix "+" (addition) operator for two measurement values ('meas_value' data types) is defined as follows:
CREATE FUNCTION mv_add_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
    rv meas_value;
    pl1 integer;
    pl2 integer;
BEGIN
    rv.value = mv1.value + mv2.value;
    pl1 = lsp(mv1.value, mv1.sig_figs);
    pl2 = lsp(mv2.value, mv2.sig_figs);
    if pl1 > pl2 then
        rv.sig_figs = figs_from_lsp(rv.value, pl1);
    else
        rv.sig_figs = figs_from_lsp(rv.value, pl2);
    end if;
    RETURN rv;
END;
$$;

CREATE OPERATOR + (
    function = mv_add_mv,
    leftarg = meas_value,
    rightarg = meas_value
);
The operator for the difference of two measured values ("-") is similar to that for addition:
CREATE FUNCTION mv_sub_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
    rv meas_value;
    pl1 integer;
    pl2 integer;
BEGIN
    rv.value = mv1.value - mv2.value;
    pl1 = lsp(mv1.value, mv1.sig_figs);
    pl2 = lsp(mv2.value, mv2.sig_figs);
    if pl1 > pl2 then
        rv.sig_figs = figs_from_lsp(rv.value, pl1);
    else
        rv.sig_figs = figs_from_lsp(rv.value, pl2);
    end if;
    return rv;
END;
$$;

CREATE OPERATOR - (
    function = mv_sub_mv,
    leftarg = meas_value,
    rightarg = meas_value
);
The multiplication and division operators only need to compare the number of significant digits for the two multiplicands, without any computation of the LSP. The result of multiplication and division is only as precise (has only the number of significant digits) as the least precise of the multiplicands. The multiplication operator can be defined as follows:
CREATE FUNCTION mv_prod_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
as $$
DECLARE
    rv meas_value;
BEGIN
    rv.value = mv1.value * mv2.value;
    if mv1.sig_figs > mv2.sig_figs then
        rv.sig_figs = mv2.sig_figs;
    else
        rv.sig_figs = mv1.sig_figs;
    end if;
    return rv;
END;
$$;

CREATE OPERATOR * (
    function = mv_prod_mv,
    leftarg = meas_value,
    rightarg = meas_value
);
The division operator is defined similarly to the multiplication operator:
CREATE FUNCTION mv_div_mv(mv1 meas_value, mv2 meas_value)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
    rv meas_value;
BEGIN
    rv.value = mv1.value / mv2.value;
    if mv1.sig_figs > mv2.sig_figs then
        rv.sig_figs = mv2.sig_figs;
    else
        rv.sig_figs = mv1.sig_figs;
    end if;
    RETURN rv;
END;
$$;

CREATE OPERATOR / (
    function = mv_div_mv,
    leftarg = meas_value,
    rightarg = meas_value
);
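The two propagation rules used by these operators (highest LSP for sums and differences, fewest significant digits for products and quotients) can be sketched in Python. This is a rough analogue of the operator functions, not the server-side code; the class name and helpers are invented for illustration:

```python
import math
from dataclasses import dataclass

def _lsp(value, sig_figs):
    # Least significant place of a measured value.
    return math.floor(math.log10(abs(value))) + 1 - sig_figs

def _figs_from_lsp(value, place):
    # Significant digits implied by a value and an LSP, minimum 1.
    return max(math.floor(math.log10(abs(value))) + 1 - place, 1)

@dataclass
class MeasValue:
    value: float
    sig_figs: int

    def __add__(self, other):
        # A sum is significant only down to the higher (coarser) LSP.
        total = self.value + other.value
        place = max(_lsp(self.value, self.sig_figs),
                    _lsp(other.value, other.sig_figs))
        return MeasValue(total, _figs_from_lsp(total, place))

    def __mul__(self, other):
        # A product carries the smaller number of significant digits.
        return MeasValue(self.value * other.value,
                         min(self.sig_figs, other.sig_figs))
```

For example, 123.4 (4 sig. figs) + 0.567 (3 sig. figs) is significant only to the tenths place, so the sum 123.967 carries 4 significant digits, not 6.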
Operators for composite data types can return any data type. An example of such an operator is the equality operator, which returns a boolean data type. Equality between two imprecise measured values (as represented by the 'meas_value' data type) can be evaluated in several different ways:
The following implementation of an equality operator considers two measured values to be equal if a) they are equal up to the number of significant digits for each, and b) they both have the same number of significant digits. This implementation uses a double equal sign ("==") as an equality operator, rather than the single equal sign that is used for native data types. A single equal sign might be used by an alternate equality operator, such as one that does not require both values to have the same number of significant digits.
CREATE FUNCTION mv_eq_mv(mv1 meas_value, mv2 meas_value)
RETURNS boolean
language plpgsql
as $$
DECLARE
    rv boolean;
BEGIN
    if mv1.sig_figs <> mv2.sig_figs then
        rv := False;
    else
        if round(mv1.value/(10^lsp(mv1.value, mv1.sig_figs)))
                = round(mv2.value/(10^lsp(mv2.value, mv2.sig_figs))) then
            rv := True;
        else
            rv := False;
        end if;
    end if;
    return rv;
end;
$$;

CREATE OPERATOR == (
    function = mv_eq_mv,
    leftarg = meas_value,
    rightarg = meas_value
);
Note that this equality test scales both double-precision values to integers prior to comparing them. This is similar to the 'mv_round()' function, but omits an unnecessary multiplication for the sake of better performance and robustness of the comparison.
Other operators can be defined in the same way. Comprehensive support for calculations with measured values should include equivalents to the "+", "-", "*", and "/" binary operators that take one 'meas_value' data type and one (precise) floating point number, in either order.
Although binary operations like addition and multiplication are important, similar operations across multiple rows of a data table are often called for. Native SQL aggregate functions such as 'sum()' and 'avg()' can be extended to use composite data types such as the 'meas_value' type. The names of the native functions can be used with the 'meas_value' data type because Postgres allows function overloading.
The 'sum()' and 'avg()' aggregate functions for the 'meas_value' type require that the highest LSP be tracked across all of the rows that are operated on. To minimize conversions back and forth between numbers of significant digits and LSP values, a second custom composite type is used to track ('accumulate') both the running sum and the highest LSP. This type is defined as follows:
create type meas_value_accum as (
    value double precision,
    hlsp integer,
    nrows integer
);
In this composite type, the 'value' field is used to store the running sum, the 'hlsp' field is used to record the highest LSP, and the 'nrows' field is used to record the number of rows operated on.
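The fold that the aggregate performs over this accumulator can be sketched in Python (a rough analogue for illustration, not the server-side code):

```python
import math

def _lsp(value, sig_figs):
    # Least significant place of a measured value.
    return math.floor(math.log10(abs(value))) + 1 - sig_figs

def sum_measurements(pairs):
    # pairs: iterable of (value, sig_figs) tuples.  Track the running sum
    # and the highest LSP, as the meas_value_accum state type does, then
    # convert the final LSP back into significant digits.
    total, hlsp = 0.0, None
    for value, figs in pairs:
        if value is None:
            continue
        total += value
        if figs is not None:
            place = _lsp(value, figs)
            if hlsp is None or place > hlsp:
                hlsp = place
    figs = max(math.floor(math.log10(abs(total))) + 1 - hlsp, 1)
    return total, figs
```

For example, summing 1.25 (3 sig. figs, LSP -2) and 2.5 (2 sig. figs, LSP -1) gives 3.75, significant only to the tenths place, so the result carries 2 significant digits.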
The definition of aggregate functions in PostgreSQL consists of several different parts, similar to the definition of operators. For aggregate functions, these parts are: a state transition function that is called for each input row, a state data type that accumulates the intermediate results, and (optionally) a final function that converts the accumulated state into the aggregate's result.
The 'sum()' aggregate function for the 'meas_value' data type requires all three of these elements.
The code for these is as follows:
CREATE FUNCTION mv_sum(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
    lsp integer;
BEGIN
    if mv_next.value is not null then
        mv_curr.value = mv_curr.value + mv_next.value;
        if mv_next.sig_figs is not null then
            lsp = lsp(mv_next.value, mv_next.sig_figs);
            if mv_curr.hlsp is null then
                mv_curr.hlsp = lsp;
            elsif lsp > mv_curr.hlsp then
                mv_curr.hlsp = lsp;
            end if;
        end if;
    end if;
    RETURN mv_curr;
END;
$$;

CREATE FUNCTION mv_sum_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
    sfigs integer;
BEGIN
    sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
    RETURN cast((mv_curr.value, sfigs) as meas_value);
END;
$$;

CREATE AGGREGATE sum(meas_value) (
    SFUNC = mv_sum,
    STYPE = meas_value_accum,
    FINALFUNC = mv_sum_final,
    INITCOND = '(0, -99999, 0)'
);
The definition of the 'avg()' aggregate function is almost identical to that of the 'sum()' aggregate function. The only additional actions that are required are to track the number of rows processed in the accumulator, and to divide the sum by the number of rows in the final function. The code for the 'avg()' aggregate function is as follows:
CREATE FUNCTION mv_avg(mv_curr meas_value_accum, mv_next meas_value)
RETURNS meas_value_accum
language plpgsql
AS $$
DECLARE
    lsp integer;
BEGIN
    if mv_next.value is not null then
        mv_curr.value = mv_curr.value + mv_next.value;
        if mv_next.sig_figs is not null then
            lsp = lsp(mv_next.value, mv_next.sig_figs);
            if mv_curr.hlsp is null then
                mv_curr.hlsp = lsp;
            elsif lsp > mv_curr.hlsp then
                mv_curr.hlsp = lsp;
            end if;
        end if;
        mv_curr.nrows = mv_curr.nrows + 1;
    end if;
    RETURN mv_curr;
END;
$$;

CREATE FUNCTION mv_avg_final(mv_curr meas_value_accum)
RETURNS meas_value
language plpgsql
AS $$
DECLARE
    sfigs integer;
BEGIN
    sfigs = figs_from_lsp(mv_curr.value, mv_curr.hlsp);
    RETURN cast((mv_curr.value / mv_curr.nrows, sfigs) as meas_value);
END;
$$;

CREATE AGGREGATE avg(meas_value) (
    SFUNC = mv_avg,
    STYPE = meas_value_accum,
    FINALFUNC = mv_avg_final,
    INITCOND = '(0, -99999, 0)'
);
Measurement data are ordinarily imprecise, their precision is frequently represented by their number of significant digits, and calculations using measurements should propagate precision estimates to the final result. This post provides core code for propagating significant digits through calculations in PostgreSQL.
PostgreSQL allows straightforward creation of composite data types, operators, and aggregate functions to handle measurement data with associated significant digits. Overloading of these custom operators and aggregate functions on the native database features allows measurement data, including significant digits, to be operated on as easily as any native data type.
In combination with SQL's other features for data selection and summarization, the ability to easily perform calculations with imprecise measurement data enhances the value of PostgreSQL as a data analysis tool.
The following figure shows the number of new programming languages created per year since 1970. The data are from the Wikipedia page Timeline of Programming Languages.
There's a consistent decline since the mid-'90s. Have most of the best programming languages already been developed, so we don't need any more?
There are several different types of operations that are commonly applied to data before further analysis. These operations are applied to individual data values, but are typically based on the attributes of a group of data values. In a data matrix where columns are variables and rows are instances, the group that is used may be either all of the values in a row or all of the values in a column. Which type of group to use, as well as which type of operation to apply, depends on both the data themselves and the needs of the analysis to be conducted.
Different types of operations that are commonly applied to data are standardization, rescaling, normalization, and non-linear transformation. These terms are often used ambiguously and inconsistently. Misunderstanding what others have done, or misrepresenting what you have done, is easy when these operations are not clearly defined and distinguished.
The various operations that may be applied to data are:
Contributing to the ambiguity of these terms is that there is no single term that encompasses this group of operations as a whole. The word “transformation” is most commonly used for this purpose, although such usage potentially conflicts with use of the same term for different linear and non-linear operations upon the data. The word “normalization” is also often used as a catch-all for various types of operations on data. Because normalization (as defined above) results in the calculation of standard (Z) scores, normalization is often referred to as simply standardization. The word “normalization” is also frequently used when the actual operation referenced is either standardization or rescaling. Non-linear transformations may also be referred to as rescaling because they change the data from a linear to a non-linear scale.
Because of these ambiguities:
The following table summarizes how data distributions are affected by the different operations.
| Operation | Linearity preserved | Proportionality preserved | Additivity preserved | Sensitivity to statistical outliers (c) |
|---|---|---|---|---|
| Standardization | Yes (a) | Yes (a,b) | No | Moderate |
| Rescaling or range scaling | Yes | No | No | Extreme |
| Mean centering | Yes | No | No | Moderate |
| Normalization or z-transformation | Yes | No | No | Moderate |
| Non-linear transformation | No | No | No | Operation-dependent |
Notes
a) Only within each group for which the same divisor has been applied to all values.
b) Standardization cannot create proportional relationships; the assumption that it does may be hidden when standardization is used.
c) None of the operations eliminate outliers. Range scaling is very sensitive to outliers because an outlier will determine one end of the range. Some non-linear transformations, such as the log transformation, may reduce the relative weight of outliers.
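The row and column operations in the table above can be illustrated concretely. The following Python sketch uses one plausible definition of each operation (the exact definitions intended here may differ; the population standard deviation is assumed for the z-transformation):

```python
def mean(xs):
    return sum(xs) / len(xs)

def center(xs):
    # Mean centering: subtract the mean.  Linearity is preserved;
    # proportionality is not.
    m = mean(xs)
    return [x - m for x in xs]

def rescale(xs):
    # Range scaling to [0, 1].  A single outlier fixes one endpoint.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def zscores(xs):
    # Normalization (z-transformation): center, then divide by the
    # standard deviation (population form used here).
    m = mean(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]
```

Note that the centered and z-transformed results sum to zero, while the rescaled results always span exactly [0, 1]; neither preserves ratios between values, which is why proportionality is marked "No" for these operations.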
Some very general guidelines for application of these operations are:
Application of more than one of the data transformation operations may be appropriate for some data sets and some analyses.
There are a lot of web pages that present some interpretation of data management best practices. Most of them are, in my opinion, uselessly general. Herewith, then, is my list of specific recommended best practices.
1. Know the use of the data
a. For the overall project
b. For each data summary request
2. Use a single authoritative data store.
3. Back up important data.
4. Verify or characterize data quality.
5. Control access to sensitive data.
6. Track data changes.
7. Document data management practices.
a. Default practices
b. Project- or data-set-specific
8. Preserve original data.
9. Script all data revisions and data summarizations.
10. Use version control for things that may change.
a. Scripts
b. Regular or periodic data exports
11. Record metadata
a. For both incoming and outgoing data
b. Metadata includes
1. Provenance: Who created, provided, or produced the data.
2. Content: What the data set contains
3. Purpose: What the data are intended to be used for.
4. Method: How the data were generated or selected and summarized.
5. History: The history of any revisions made to the data or the data summarization method.
c. Forms of metadata include:
1. Copies of emails or other documents that transmit data or request data.
2. Header notes in scripts.
3. Metadata pages and glossary pages in data summaries.
4. Custom log files created by scripted data operations.
12. Date-tag directory and file names where the sequence of changes may affect their validity or interpretability.
PostgreSQL's tablefunc extension provides the crosstab function, which can be used to transform normalized data into a crosstabbed format. However, one of the drawbacks of the crosstab function is that it requires that the names of all of the column headers be listed in one of its arguments. In some cases (actually, pretty much every case when I have reason to use crosstab), the number of distinct values that are to be transformed into column headers is not fixed. Finding those and manually creating the appropriate crosstab argument is tedious and error-prone.
To eliminate this step and to substantially automate the task of crosstabbing data in Postgres, I wrote the following script for execsql. This uses Postgres' string_agg() aggregate function, Postgres' quote_ident() function, and an execsql substitution variable to dynamically create the list of column headers that is to be used as an argument to crosstab.
This script should be run using execsql's execute script metacommand.
-- ########################################################################
-- !x! BEGIN SCRIPT PG_XTAB with parameters (dtable, rowid, colhdr, value, value_type, xtable)
-- Purpose:
--    Crosstab any three columns from a data table.
-- Parameters:
--    dtable     : The name of the input data table.
--    rowid      : The name of the column that should be kept as a row identifier.
--    colhdr     : The name of the column containing values that will
--                 be used as column headers in the output.
--    value      : The name of the column containing values that will
--                 appear in each cell of the cross-table.
--    value_type : The Postgres data type of the values.  Types with spaces
--                 should be double-quoted (e.g., "double precision").
--    xtable     : The name of the temporary table that is to be created.
--    col_order  : Optional: the name of a column in the input table
--                 with values that have a 1:1 relationship with
--                 values of 'colhdr' and that should be used to order
--                 the crosstabbed columns in the output.
-- Notes:
--    1. There must be only one 'value' for each combination of 'rowid'
--       and 'colhdr'.  This constraint will be checked, and if it fails,
--       the failing values will be displayed if a global 'verbose' variable
--       is set, and you will be given an option to continue or halt.
--       If it fails and 'verbose' is not set, the script will halt.
--    2. The output table 'xtable' is deleted at the beginning of the script,
--       so it will not exist if you choose to continue after an error.
-- =========================================================================

-- Drop the table for return values so that it doesn't exist if this exits on error.
drop table if exists !!#xtable!! cascade;

-- ------------------------------------------------------------------------
-- QA Check: Unique combination of row, column, and value.
-- ------------------------------------------------------------------------
drop view if exists pgx_ck1 cascade;
create temporary view pgx_ck1 as
select distinct !!#rowid!!, !!#colhdr!!, !!#value!!
from !!#dtable!!;
drop view if exists pgx_ck2 cascade;
create temporary view pgx_ck2 as
select !!#rowid!!, !!#colhdr!!
from pgx_ck1
group by !!#rowid!!, !!#colhdr!!
having count(*) > 1;
-- !x! if(hasrows(pgx_ck2))
-- !x!   if(sub_defined(verbose) and !!verbose!!)
-- !x!     prompt ask "There are multiple results per cell in the input to PG_XTAB, so the cross-table will not be created.  Do you want to continue anyway?" sub pgx_continue compare pgx_ck1 and pgx_ck2 key (!!#rowid!!, !!#colhdr!!, !!#value!!)
-- !x!     if(not !!pgx_continue!!)
-- !x!       HALT message "There are multiple results per cell in the input to the PG_XTAB script." display pgx_ck1
-- !x!     endif
-- !x!   else
-- !x!     HALT message "There are multiple results per cell in the input to the PG_XTAB script."
-- !x!   endif
-- !x! endif

-- ------------------------------------------------------------------------
-- QA Check: Only one column order value for each column header.
-- ------------------------------------------------------------------------
-- !x! if(sub_defined(#col_order))
drop view if exists pgx_ck3 cascade;
create temporary view pgx_ck3 as
select !!#colhdr!!, count(*) as ordering_values
from (select distinct !!#colhdr!!, !!#col_order!! from !!#dtable!!) as vs
group by !!#colhdr!!
having count(*) > 1;
-- !x!   if(hasrows(pgx_ck3))
-- !x!     HALT "There are multiple ordering values for each column header value in the input to PG_XTAB."
-- !x!   endif
-- !x! endif

-- ------------------------------------------------------------------------
-- Get the list of column headers.
-- ------------------------------------------------------------------------
drop view if exists pgx_hdrs cascade;
-- !x! if(sub_defined(#col_order))
-- !x!   sub orderclause order by !!#col_order!!
-- !x!   sub ordercol , !!#col_order!!
-- !x! else
-- !x!   sub_empty orderclause
-- !x!   sub_empty ordercol
-- !x! endif
create temporary view pgx_hdrs as
select hdr
from (select distinct quote_ident(!!#colhdr!!) as hdr !!ordercol!!
      from !!#dtable!!
      !!orderclause!!) as hd
!!orderclause!!;

-- ------------------------------------------------------------------------
-- Get the three columns of data in a form to be crosstabbed.
-- ------------------------------------------------------------------------
drop view if exists pgx_xinp cascade;
create temporary view pgx_xinp as
select !!#rowid!!, quote_ident(!!#colhdr!!) as hdr, !!#value!! as value
from !!#dtable!!
order by !!#rowid!!;

-- ------------------------------------------------------------------------
-- Get the type declarations for the crosstabbed columns.
-- ------------------------------------------------------------------------
drop view if exists pgx_type_decl cascade;
create temporary view pgx_type_decl as
select string_agg(hdr || ' !!#value_type!!', ', ') as coltypes
from pgx_hdrs;
-- !x! subdata ~coltypes pgx_type_decl

-- ------------------------------------------------------------------------
-- Create the crosstab.
-- ------------------------------------------------------------------------
create temporary table !!#xtable!! as
select * from crosstab('select * from pgx_xinp', 'select * from pgx_hdrs')
    as ct(!!#rowid!! text, !!~coltypes!!);

-- !x! END SCRIPT PG_XTAB
-- ########################################################################
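As an illustration, the script might be called as follows using execsql's EXECUTE SCRIPT metacommand. The table and column names here (a table of lab results) are hypothetical, chosen only to show how the parameters map to an actual data set:

```sql
-- Crosstab a (hypothetical) table of lab results so that each analyte
-- becomes a column of "double precision" values, one row per sample.
-- !x! EXECUTE SCRIPT PG_XTAB WITH ARGUMENTS (dtable=labresults, rowid=sample_id, colhdr=analyte, value=concentration, value_type="double precision", xtable=xt_results)
select * from xt_results;
```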
The following list of items focuses on points of potential weakness, and on possible improvements, to consider during a code review of a SQL script.
The write() and writestr() methods of the ZipFile class in Python's zipfile library allow the addition of an entire file or a single string as a member file within a zip file. However, if a large amount of data will be generated dynamically, these methods do not allow separate data items to be written sequentially to a member file within a zip file.
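For comparison, the standard interface looks like this: the complete content of each member file must exist before the call, either on disk for write() or in memory for writestr(). (The file names here are illustrative only.)

```python
import zipfile

# Standard usage: the member's entire content must be assembled
# up front and passed to writestr() in a single call.
with zipfile.ZipFile("example.zip", "w") as zf:
    zf.writestr("notes.txt", "line 1\nline 2\nline 3\n")

# Reading the member back shows it was stored as a whole.
with zipfile.ZipFile("example.zip") as zf:
    content = zf.read("notes.txt").decode("utf-8")
print(content)
```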
The StreamableZipfile class in the following code snippet provides this missing capability.
import time
import zipfile
class StreamableZipfile(object):
    def __init__(self, zipfile_name, mode='a'):
        # Compression type and level are only available in Python
        # since versions 3.3 and 3.7, respectively.
        self.zf = zipfile.ZipFile(zipfile_name, mode,
                                  compression=zipfile.ZIP_BZIP2,
                                  compresslevel=9)

    def close(self):
        self.zf.close()

    def member_file(self, member_filename):
        # Creates a ZipInfo object (file) within the zip file
        # and opens it for writing.
        self.current_zinfo = zipfile.ZipInfo(
            filename=member_filename,
            date_time=time.localtime(time.time())[:6])
        self.current_zinfo.compress_type = self.zf.compression
        self.current_zinfo._compresslevel = self.zf.compresslevel
        self.current_zinfo.file_size = 0
        self.current_handle = self.zf.open(self.current_zinfo, mode='w')

    def close_member(self):
        self.current_handle.close()

    def write(self, str_data):
        # Writes the given text to the currently open member.
        data = str_data.encode("utf-8")
        with self.zf._lock:
            self.current_zinfo.file_size += len(data)
            self.current_handle.write(data)
This simple implementation does not include any error checking and always compresses the data with the bzip2 algorithm, using the highest compression level. A more robust and flexible implementation would eliminate these limitations. Modifications are also needed for versions of Python prior to 3.7.
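One way to lift the compression restriction, for instance, would be to pass the compression settings through to ZipFile rather than hard-coding them. This is a minimal sketch of such a variant (the class name and parameter defaults are choices made here, not part of the original code), assuming Python 3.7+ for the compresslevel argument:

```python
import zipfile

class FlexibleStreamableZipfile:
    # Hypothetical variant of StreamableZipfile: compression and
    # compresslevel are constructor parameters instead of being
    # hard-coded to bzip2 at level 9.
    def __init__(self, zipfile_name, mode='a',
                 compression=zipfile.ZIP_DEFLATED, compresslevel=None):
        self.zf = zipfile.ZipFile(zipfile_name, mode,
                                  compression=compression,
                                  compresslevel=compresslevel)

    def close(self):
        self.zf.close()
```

The remaining methods would be unchanged from StreamableZipfile, since member_file() already copies the compression settings from the enclosing ZipFile object.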
Use of the StreamableZipfile class is illustrated by the following code, which creates a zip file containing two separate files, where lines are written sequentially to each of the files.
import os
zfname = "Test.zip"
if os.path.isfile(zfname):
    os.remove(zfname)
# Open the streamable zip file and write several lines to a file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file1.txt')
zfile.write("This is file 1, line 1\n")
zfile.write("This is file 1, line 2\n")
zfile.write("This is file 1, line 3\n")
zfile.close_member()
zfile.close()
# Open the same zip file and write lines to another file within it.
zfile = StreamableZipfile(zfname)
zfile.member_file('file2.txt')
zfile.write("This is file 2, line 1\n")
zfile.write("This is file 2, line 2\n")
zfile.write("This is file 2, line 3\n")
zfile.close_member()
zfile.close()
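It is worth noting that StreamableZipfile is a convenience wrapper around ZipFile.open() in write mode, which is the underlying streaming interface (available since Python 3.6) that member_file() uses. The same effect can be obtained directly, at the cost of the bookkeeping that the class encapsulates; a minimal sketch (file names illustrative):

```python
import zipfile

# ZipFile.open(name, mode='w') returns a writable file-like handle,
# so bytes can be streamed to a member without a wrapper class.
with zipfile.ZipFile("Direct.zip", "w", compression=zipfile.ZIP_BZIP2) as zf:
    with zf.open("file1.txt", mode="w") as member:
        member.write(b"This is file 1, line 1\n")
        member.write(b"This is file 1, line 2\n")

# Read the member back to confirm both writes landed in one file.
with zipfile.ZipFile("Direct.zip") as zf:
    content = zf.read("file1.txt").decode("utf-8")
print(content)
```

Note that this raw interface takes bytes rather than text, and a member cannot be reopened for appending once closed, which is the bookkeeping that the class's write() and member_file() methods smooth over.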