Saturday, August 28, 2021

Data Transformation Operations and a Standard Terminology

Data standardization, normalization, rescaling, and transformation: what they are and when to apply them

There are several different types of operations that are commonly applied to data before further analysis.  These operations are applied to individual data values, but are typically based on the attributes of a group of data values.  In a data matrix where columns are variables and rows are instances, the group that is used may be either all of the values in a row or all of the values in a column.  Which type of group to use, as well as which type of operation to apply, depends on both the data themselves and the needs of the analysis to be conducted.
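The distinction between column groups and row groups can be sketched in a few lines of Python.  This is only an illustration: it assumes NumPy, and the small matrix and variable names are made up for the example.

```python
import numpy as np

# Hypothetical data matrix: 3 instances (rows) x 2 variables (columns).
data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# Group = all of the values in a column: one statistic per variable.
column_means = data.mean(axis=0)

# Group = all of the values in a row: one statistic per instance.
row_sums = data.sum(axis=1)

print(column_means)
print(row_sums)
```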

The operations most commonly applied to data are standardization, rescaling, normalization, and non-linear transformation.  These terms are often used ambiguously and inconsistently.  When the operations are not clearly defined and distinguished, it is easy to misunderstand what others have done or to misrepresent what you have done.

For the purposes of this discussion, these operations are defined as follows:

  • Standardization: Dividing every value by another value or by a constant.  The denominator is the standard to which each individual value is referenced.
  • Rescaling or range scaling: Converting the data from the original range of the group to a fixed range such as -1 to 1 or 0 to 1.
  • Mean centering: Subtracting the mean of the group from each value.
  • Normalization or z-transformation: Subtracting the mean of the group from each value and then dividing each value by the standard deviation of the group.
  • Non-linear transformation: Any non-linear operation such as taking the logarithm, square root, or probit of each data value.
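The definitions above can be sketched as small Python functions applied to a single group of values (one row or one column).  This is a minimal illustration, not a reference implementation.  The function names are hypothetical, NumPy is assumed, the use of the sample standard deviation (ddof=1) is an assumption because the definition above does not say which form to use, and the base-10 logarithm is just one example of a non-linear transformation.

```python
import numpy as np

def standardize(values, divisor):
    """Standardization: divide every value by a reference value or constant."""
    return np.asarray(values, dtype=float) / divisor

def rescale(values, new_min=0.0, new_max=1.0):
    """Range scaling: map the group's original range onto [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    # Assumes the group is not constant (old_max > old_min).
    unit = (values - old_min) / (old_max - old_min)
    return new_min + unit * (new_max - new_min)

def mean_center(values):
    """Mean centering: subtract the group mean from each value."""
    values = np.asarray(values, dtype=float)
    return values - values.mean()

def z_transform(values):
    """Normalization: mean-center, then divide by the group standard deviation."""
    values = np.asarray(values, dtype=float)
    # ddof=1 (sample standard deviation) is an assumption of this sketch.
    return (values - values.mean()) / values.std(ddof=1)

def log_transform(values):
    """One example of a non-linear transformation (requires positive values)."""
    return np.log10(np.asarray(values, dtype=float))

group = np.array([2.0, 4.0, 6.0, 8.0])
print(standardize(group, group.sum()))  # each value as a fraction of the total
print(rescale(group, -1.0, 1.0))
print(mean_center(group))
print(z_transform(group))
print(log_transform(group))
```

For a data matrix, each function would be applied to every column or to every row, depending on which group is appropriate for the analysis.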

Adding to the ambiguity, no single term encompasses this group of operations as a whole.  The word “transformation” is most commonly used for this purpose, although that usage can conflict with the use of the same term for specific linear and non-linear operations on the data.  The word “normalization” is also often used as a catch-all for various types of operations on data.  Because normalization (as defined above) produces standard (Z) scores, it is often referred to simply as standardization.  Conversely, the word “normalization” is frequently used when the operation actually performed is either standardization or rescaling.  Non-linear transformations may also be referred to as rescaling because they change the data from a linear to a non-linear scale.

Because of these ambiguities:

  •  Be specific when describing the operations that you have applied to the data.  Use the terms above as the basis for your descriptions.
  •  Be careful when reading others’ descriptions of the operations that they have applied.  If a description is not specific, assume that the author may not be sensitive to ambiguities in terminology and may be using definitions that differ from those above.

Summary of Data Relationships Under Different Data Transformation Operations

The following table summarizes how relationships among data values are affected by the different operations.

| Operation | Linearity preserved | Proportionality preserved | Additivity preserved | Sensitivity to statistical outliers (c) |
| --- | --- | --- | --- | --- |
| Standardization | Yes (a) | Yes (a,b) | No | Moderate |
| Rescaling or range scaling | Yes | No | No | Extreme |
| Mean centering | Yes | No | No | Moderate |
| Normalization or z-transformation | Yes | No | No | Moderate |
| Non-linear transformation | No | No | No | Operation-dependent |


Notes

a) Only within each group for which the same divisor has been applied to all values.

b) Standardization cannot create proportional relationships that do not already exist; the belief that it does can be a hidden assumption when it is used.

c) None of the operations eliminates outliers.  Range scaling is very sensitive to outliers because an outlier will determine one end of the range.  Some non-linear transformations, such as the log transformation, may reduce the relative weight of outliers.
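Note (c) can be illustrated with a small numerical sketch.  The data values and the helper name range_scale are invented for the example, and NumPy is assumed.  The sketch compares how much of the 0 to 1 range five ordinary values occupy with and without a single outlier, and then compares the outlier's relative distance from the other values before and after a base-10 log transformation.

```python
import numpy as np

def range_scale(values):
    """Rescale a group of values to the range 0 to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Portion of the 0-1 range occupied by the five ordinary values.
scaled_clean = range_scale(clean)
scaled_outlier = range_scale(with_outlier)[:5]
spread_clean = scaled_clean.max() - scaled_clean.min()
spread_outlier = scaled_outlier.max() - scaled_outlier.min()

def outlier_gap_ratio(values):
    """Gap between the largest value and the next largest, relative to
    the spread of the remaining values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    gap = ordered[-1] - ordered[-2]
    rest_spread = ordered[-2] - ordered[0]
    return gap / rest_spread

ratio_raw = outlier_gap_ratio(with_outlier)
ratio_log = outlier_gap_ratio(np.log10(with_outlier))

print(spread_clean, spread_outlier)
print(ratio_raw, ratio_log)
```

With the outlier present, the ordinary values are compressed into a small fraction of the scaled range.  The outlier is still present after the log transformation, but its distance from the other values is smaller relative to their spread.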

Guidelines for Application

Some very general guidelines for application of these operations are:

  • If a statistical analysis is to be conducted, non-linear transformation may be needed to ensure that the data satisfy requirements of the method.
  • Normalization (z-transformation) may be appropriate prior to regression analyses when the purpose is to interpret the regression coefficients as measures of the relative importance of different variables.  Regression is insensitive to operations that preserve linear relationships, so standardization or rescaling is not ordinarily needed.
  • If an analysis based on similarity is to be conducted (e.g., clustering), then non-linear transformation should be avoided.  Most similarity measures are based on linear relationships (e.g., Euclidean distance), so non-linear transforms will alter the estimates of similarities between instances.
  • If an analysis based on Pearson correlations is to be conducted, then non-linear transformations should be used with caution, because the results of the analysis will not apply to data in the original scale.
  • When each variable represents a component of a larger quantity (e.g., the variables are PCB congeners and total PCBs is the larger quantity), then standardization of each value to the row sum is frequently appropriate, to eliminate the effect of differences in total magnitude between instances.  This operation is commonly referred to as “row-sum normalization,” although it is not strictly normalization following the definitions above.
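The last guideline, standardization of each value to the row sum, might look like the following sketch.  The concentration values are hypothetical, NumPy is assumed, and the sketch assumes that no row total is zero.

```python
import numpy as np

# Hypothetical concentrations: rows are samples (instances),
# columns are components of a larger total (e.g., individual congeners).
concentrations = np.array([
    [1.0, 2.0, 1.0],
    [10.0, 20.0, 10.0],
    [3.0, 3.0, 6.0],
])

# Divide each value by the total for its own row.
# keepdims=True keeps the totals as a column so they broadcast across each row.
row_totals = concentrations.sum(axis=1, keepdims=True)
proportions = concentrations / row_totals

print(proportions)
```

The first two samples differ tenfold in total concentration but have the same composition, so their rows are identical after the operation.  Every row sums to 1.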

Application of more than one of the data transformation operations may be appropriate for some data sets and some analyses.
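As one illustration of combining operations, a non-linear transformation can be followed by normalization.  The sketch below is not a recommendation for any particular data set.  The values are made up, NumPy is assumed, and the choice of a base-10 logarithm and of the sample standard deviation (ddof=1) are assumptions of the example.

```python
import numpy as np

# Hypothetical positive values spanning several orders of magnitude.
values = np.array([1.0, 10.0, 100.0, 1000.0])

# Step 1: non-linear transformation (base-10 logarithm).
logged = np.log10(values)

# Step 2: normalization (z-transformation) of the log-transformed values.
z_scores = (logged - logged.mean()) / logged.std(ddof=1)

print(logged)
print(z_scores)
```

When operations are combined, the description of the analysis should state each operation and the order in which they were applied.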