Sunday, May 1, 2011

Surrogate Keys and Data Integrity Errors

Discussions of the use of surrogate keys versus natural keys rarely address the full range of data integrity errors that can occur when surrogate keys are used. This post illustrates how surrogate keys can lead to loss of data integrity in a way that is apparently not widely recognized.

There is one well-known type of data integrity error that can occur with surrogate keys: when a table with a surrogate primary key also has a column (or combination of columns) that makes up a natural key, duplicate rows with the same natural key can be entered, violating entity integrity. This potential problem can be easily avoided, however, by creating a unique index on the columns making up the natural key.
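
For example, a minimal sketch of this approach, using SQL Server's identity syntax (as in Listing 2 below) and a hypothetical part table:

-- Hypothetical table: a surrogate primary key plus a unique index on
-- the natural key (part_no) to prevent duplicate rows for the same part.
create table part (
 part_id integer identity(1,1) not null primary key,
 part_no character varying (20) not null
 );
create unique index ix_nk_part on part (part_no);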

The other, and less widely recognized, problem can lead to loss of relational integrity between tables. This problem can be illustrated using the data structure shown in Figure 1. This data structure represents a series of measurements made by a single organization (laboratory) using multiple instruments, as might be done when testing materials or procedures under a variety of different circumstances. Figure 1 shows the model data structure, which is the most direct representation of real-world data and relationships, using natural keys as primary keys. Only the primary key columns of each table are shown in the figure; in practice these tables would have additional attribute columns.

The DDL that would create this data structure is shown in Listing 1. Figure 1 and Listing 1 are provided primarily for reference, and for contrast with the approach to implementing this data model using surrogate keys.

Figure 1. Model data structure

Listing 1.  DDL for model data structure
create table laboratory (
 lab character varying (10) not null primary key
 );

create table instrument (
 lab character varying (10) not null,
 instrument character varying (10) not null,
 constraint pk_instrument primary key (lab, instrument),
 constraint fk_instrumentlab foreign key (lab)
  references laboratory (lab)
  on update cascade on delete cascade
 );

create table run (
 lab character varying (10) not null,
 run_no character varying (10) not null,
 constraint pk_run primary key (lab, run_no),
 constraint fk_run_lab foreign key (lab)
  references laboratory (lab)
  on update cascade on delete cascade
 );

create table labdata (
 lab character varying (10) not null,
 run_no character varying (10) not null,
 instrument character varying (10) not null,
 material character varying (10) not null,
 constraint pk_labdata primary key (lab, run_no, instrument, material),
 constraint fk_labdata_run foreign key (lab, run_no)
  references run (lab, run_no)
  on update cascade on delete cascade,
 constraint fk_labdata_instr foreign key (lab, instrument)
  references instrument (lab, instrument)
  on update cascade on delete cascade
 );

The implementation of this model using surrogate keys is shown in Figure 2. Columns making up both the primary (surrogate) key and the candidate (natural) key are shown. Attribute columns, again, are not shown. The calibration table is not shown in Figure 2, as it is not needed for this example.

Figure 2. Model data structure with surrogate keys


DDL to implement this model using surrogate keys is shown in Listing 2. Because surrogate keys are not part of the SQL standard, the implementation of surrogate keys is DBMS-specific. In this case (Figure 2 and Listing 2), the implementation is specific to SQL Server. SQL Server is used for this example because it cannot cascade updates and deletions through multiple pathways, and surrogate keys can eliminate the need for cascading updates. Thus, a SQL Server implementation of the model data structure is particularly likely to make use of surrogate keys. Three notable features of this DDL are:
  • Each table has a unique index on the natural key to prevent data integrity errors resulting from creation of duplicate rows with the same natural key.
  • There is no specification that updates be cascaded, in the expectation that the use of surrogate keys makes this unnecessary.
  • All deletions are cascaded using triggers instead of "on delete cascade" clauses. SQL Server cannot cascade deletions into the labdata table through both of its parent tables, so at least one of those cascading deletions has to be handled by a trigger. Because SQL Server does not support "before delete" triggers, that trigger must be an "instead of delete" trigger, and SQL Server will not combine an automatic cascading deletion from the laboratory table with an "instead of delete" trigger on the child table, because it cannot guarantee that the cascading deletion would be completed successfully. Consequently, all deletions are cascaded using triggers.

Listing 2. DDL for model data structure with surrogate keys
create table laboratory (
 lab_id integer identity(1,1) not null primary key,
 lab character varying (10) not null
 );
create unique index ix_nk_lab on laboratory (lab);
GO

create table instrument (
 instrument_id integer identity(1,1) not null primary key,
 lab_id integer not null,
 instrument character varying (10) not null,
 constraint fk_instrlab foreign key (lab_id)
  references laboratory (lab_id)
 );
create unique index ix_nk_instrument on instrument (lab_id, instrument);
GO

create table run (
 run_id integer identity(1,1) not null primary key,
 lab_id integer not null,
 run_no character varying (10) not null,
 constraint fk_runlab foreign key (lab_id)
  references laboratory (lab_id)
 );
create unique index ix_nk_run on run (lab_id, run_no);
GO

create trigger tr_lab_del on laboratory instead of delete as
 begin
 delete from instrument where lab_id in (select lab_id from Deleted);
 delete from run where lab_id in (select lab_id from Deleted);
 delete from laboratory where lab_id in (select lab_id from Deleted);
 end;
GO

create table labdata (
 labdata_id integer identity(1,1) not null primary key,
 run_id integer not null,
 instrument_id integer not null,
 material character varying (10) not null,
 constraint fk_labdatarun foreign key (run_id)
  references run (run_id) on delete cascade,
 constraint fk_labdatainst foreign key (instrument_id)
  references instrument (instrument_id)
 );
create unique index ix_nk_labdata on labdata (run_id, instrument_id, material);
GO

create trigger tr_instr_del on instrument instead of delete as
 begin
 delete from labdata where instrument_id in (select instrument_id from Deleted);
 delete from instrument where instrument_id in (select instrument_id from Deleted);
 end;
GO

create trigger tr_run_del on run instead of delete as
 begin
 delete from labdata where run_id in (select run_id from Deleted);
 delete from run where run_id in (select run_id from Deleted);
 end;

Listing 3 contains DML to insert several rows into the laboratory, instrument, and run tables of the structure shown in Figure 2.

Listing 3. Load lab, instrument, and run data
insert into laboratory (lab) values ('LabA');
insert into laboratory (lab) values ('LabB');

insert into instrument (lab_id, instrument) values (
 (select lab_id from laboratory where lab='LabA'),
 'Instr1');

insert into run (lab_id, run_no) values (
 (select lab_id from laboratory where lab='LabB'),
 'Run001');


At this point the data tables have been created using surrogate keys, and valid data have been added to three of the tables. Data can now be added to the labdata table. Adding data to that table will illustrate how surrogate keys allow the introduction of data integrity errors.

Listing 4 shows the addition of a row to the labdata table. This insert statement should fail, because the row being added references a run conducted by laboratory B but an instrument located at laboratory A. Because surrogate keys have been used, however, the insert statement succeeds and introduces invalid data into the labdata table. The same statement would fail if natural keys were used for these tables.

Listing 4. Insert lab data
insert into labdata (run_id, instrument_id, material) values (
 (select run_id from run 
  where lab_id=(select lab_id from laboratory where lab='LabB') 
  and run_no='Run001'),
 (select instrument_id from instrument 
  where lab_id=(select lab_id from laboratory where lab='LabA')
  and instrument='Instr1'),
 'MaterialX'
 );
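
For contrast, the corresponding insert against the natural-key schema of Listing 1 (assuming those tables had been loaded with data equivalent to Listing 3) would be rejected by the foreign key constraints:

-- Fails against the Listing 1 schema: ('LabB', 'Instr1') does not exist
-- in the instrument table, so fk_labdata_instr is violated.
insert into labdata (lab, run_no, instrument, material)
 values ('LabB', 'Run001', 'Instr1', 'MaterialX');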

To prevent this problem, a trigger must be created on the labdata table that, on insert or update, compares the lab_id values of the referenced rows in the run and instrument tables and cancels the insert or update statement if the lab_id values do not match. In this case the trigger only needs to look back one level; in other data models the trigger may need to look back through multiple levels of tables to check the data.
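
A minimal sketch of such a trigger, in SQL Server syntax (the trigger name and error message here are illustrative):

create trigger tr_labdata_check on labdata after insert, update as
 begin
 -- Cancel the statement if a new or changed row pairs a run and an
 -- instrument that belong to different laboratories.
 if exists (
  select 1
  from inserted i
   inner join run r on r.run_id = i.run_id
   inner join instrument ins on ins.instrument_id = i.instrument_id
  where r.lab_id <> ins.lab_id
  )
  begin
  raiserror ('Run and instrument belong to different laboratories.', 16, 1);
  rollback transaction;
  end
 end;
GO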

However, addition of such a trigger on the labdata table does not eliminate all the ways in which data integrity errors can be introduced. Data integrity errors can also be introduced via update statements to the instrument and run tables. Listing 5 shows an update to values in the run table that will introduce data integrity errors into the labdata table despite the new trigger on the labdata table. Updates to the instrument table can have the same effect.


Listing 5. Update data
insert into laboratory (lab) values ('LabC');
update run 
set lab_id=(select lab_id from laboratory where lab='LabC') 
where 
 lab_id=(select lab_id from laboratory where lab='LabB')
 and run_no='Run001';

The problem of update statements introducing data integrity errors can also be addressed by adding triggers. Conceptually, a trigger should be defined on the run table (and another on the instrument table) that looks into the labdata table to check that conflicting data are not created. For this data model, assuming that the labdata table does not already contain invalid data, an update like that shown in Listing 5 would always create a conflict within the affected rows of the labdata table, and therefore introduce invalid data. As a practical matter, then, the trigger on the run table (and on the instrument table) should simply prevent updates like that shown in Listing 5.
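
A minimal sketch of such a trigger for the run table, in SQL Server syntax (the trigger name and error message here are illustrative):

create trigger tr_run_upd_check on run after update as
 begin
 -- Block changes to a run's lab_id when the run is referenced in labdata.
 if update(lab_id)
  and exists (
   select 1
   from inserted i
    inner join deleted d on d.run_id = i.run_id
    inner join labdata ld on ld.run_id = i.run_id
   where i.lab_id <> d.lab_id
   )
  begin
  raiserror ('Cannot change the laboratory of a run that has lab data.', 16, 1);
  rollback transaction;
  end
 end;
GO

A corresponding trigger would be needed on the instrument table.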

Updates like that shown in Listing 5 do not need to be prevented if natural keys are used. If natural keys are used, the update would fail if Lab C is not also referenced in the instrument table, but it would succeed if the instrument table contains compatible data.
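
For comparison, a sketch of the natural-key equivalent of Listing 5, run against the Listing 1 schema in a DBMS that can create it (e.g., PostgreSQL):

-- The change to the run's key cascades into labdata, where fk_labdata_instr
-- then requires that ('LabC', instrument) exist in the instrument table for
-- every affected row; if it does not, the update is rejected.
insert into laboratory (lab) values ('LabC');
update run
 set lab='LabC'
 where lab='LabB' and run_no='Run001';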

This example illustrates that the use of surrogate keys with certain types of data models potentially allows the introduction of data integrity errors, and that guarding against the problem (a) requires the creation of additional triggers for insert and update operations and (b) may prevent the execution of legitimate update operations.

Sunday, April 24, 2011

Converging and Cyclic Cascades in Database Systems

Enforcement of referential integrity is an essential feature of a database management system that will be used for non-trivial applications. However, among DBMSs that implement referential integrity features there are important differences in their ability to propagate cascading updates and deletes through multiple pathways. Some commonly used DBMSs provide only partial support, or none at all, for cascading updates and deletes through multiple pathways. This lack of support means that the database cannot be used to accurately model some types of real-world relationships, or that artificial data structures must be used to work around the limitation.

There are two different basic data structures in which updates and deletes may be cascaded through multiple pathways. These two situations need to be distinguished because some DBMSs will cascade updates in one situation but not in the other. The two data structures are:

  • The primary key defined in one table (first level) is referenced in two child tables (second level), and both of those child tables serve as parents of a fourth table (third level) that has a foreign key to each of the second-level tables.
  • The primary key defined in one table is referenced by two different columns of a child table.

Both of these situations are illustrated in the following figure.


This is a simplified model of environmental sampling information. Samples may be collected as part of several different studies. Any study may be subdivided into different elements (representing, for example, different phases, teams, locations, or sample types). Some studies do not have distinct elements, so the study element table will not be populated. There are therefore two pathways from the study table to the sample table: one direct pathway and one indirect pathway via the study element table, where the latter pathway may or may not be complete. An update or deletion of a study identifier in the study table needs to be propagated through both pathways. The relationship between the study table and the sample table illustrates the first of the two types of structures where updates and deletions must be cascaded through multiple pathways.

The relationship between the length unit lookup table and the sample table illustrates the second of the two types of structures. Here the sample table has two distinct foreign keys into the length unit lookup table, each making up a distinct pathway between the two tables.

Several common and widely used DBMSs vary in their abilities to handle these types of data structures. The differences in support for cascading updates and deletions in data structures containing these features, including testing code and test results, are presented below for:

  • PostgreSQL 9.0.3,
  • MySQL 5.1.49,
  • SQL Server 2008 R2,
  • SQLite 3.7.2,
  • MS-Access 2007, and
  • LibreOffice Base 3.3.2.

Initializing the Databases

The DDL used to create the tables used for this test is:

create table d_study (study character varying(10) not null primary key);


create table d_studyelement (study character varying (10) not null,
  element character varying(10) not null,
  constraint pk_element primary key (study, element),
  constraint fk_elementstudy foreign key (study) references d_study (study)
    on update cascade on delete cascade);


create table l_lunit (length_unit character varying(10) not null primary key);


create table d_sample (study character varying (10) not null,
  sample character varying (10) not null,
  element character varying (10) null,
  surface_elevation double precision,
  elevation_units character varying (10),
  sample_depth double precision,
  depth_units character varying (10),
  constraint pk_sample primary key (study, sample),
  constraint fk_samplestudy foreign key (study) references d_study (study)
    on update cascade on delete cascade,
  constraint fk_sampleelem foreign key (study, element) references d_studyelement
    (study, element) on update cascade on delete cascade,
  constraint fk_sampleelev foreign key (elevation_units) references l_lunit (length_unit)
    on update cascade on delete cascade,
  constraint fk_sampledepth foreign key (depth_units) references l_lunit (length_unit)
    on update cascade on delete cascade);


This DDL can be submitted without modification to PostgreSQL, MySQL, SQLite, LibreOffice Base, and SQL Server (although, as described below, SQL Server refuses to create the d_sample table). For MS-Access, the foreign key constraint specifications must be removed and the relationships created using MS-Access's GUI instead. Also, for MS-Access, the character varying type must be replaced with a text type.

SQLite enforces foreign keys (and cascading updates and deletions) only if PRAGMA foreign_keys = ON is specified. MySQL enforces foreign keys only for tables that use the InnoDB storage engine.
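
For example (a minimal sketch, not part of the test scripts):

-- SQLite: enable foreign key enforcement for the current connection.
PRAGMA foreign_keys = ON;

-- MySQL: request the InnoDB storage engine so that foreign keys are enforced.
create table d_study (study character varying(10) not null primary key) engine=InnoDB;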

The DML used to initialize these tables with testing data is:

insert into d_study (study) values ('StudyA');
insert into d_study (study) values ('StudyB');


insert into d_studyelement (study, element) values ('StudyA', 'Elem1');
insert into d_studyelement (study, element) values ('StudyA', 'Elem2');


insert into l_lunit (length_unit) values ('meters');
insert into l_lunit (length_unit) values ('cm');
insert into l_lunit (length_unit) values ('ft');


insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyA', 'Sample1', 'Elem1', 1234, 'meters', 2, 'cm');
insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyA', 'Sample2', 'Elem1', 1234, 'meters', 50, 'cm'); 
insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyA', 'Sample4', 'Elem1', 1234, 'meters', 1.2, 'meters');
insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyA', 'Sample5', 'Elem2', 602, 'ft', 2, 'meters');
insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyA', 'Sample6', 'Elem2', 602, 'ft', 4, 'meters');
insert into d_sample (study, sample, element, surface_elevation,
  elevation_units, sample_depth, depth_units)
  values ('StudyB', 'Sample1', null, 900, 'meters', 2, 'cm');


The same DML can be used for all databases, except for those cases (described below) where DBMS limitations do not allow data to be inserted.

Tests

The ability of each DBMS to cascade updates and deletes through multiple pathways can be illustrated using the following three test statements applied to databases initialized as described above.

1. Cascade an update through multiple paths

Changing the name of a study in the study table should result in that change being cascaded to the sample table through two pathways, both directly and via the study element table.

update d_study set study='StudyZ' where study='StudyA';


2. Cascade an update of multiple columns

Changing the code used for a length unit in the length unit table should result in that change being cascaded to both of the length unit columns in the sample table.

update l_lunit set length_unit='m' where length_unit='meters';


3. Cascade a deletion through multiple paths

Deleting a study from the d_study table should result in that change being cascaded to the sample table.

delete from d_study where study='StudyZ';


The statement above is appropriate when the first test (cascading an update through multiple paths) has completed successfully. If the first test was not passed, the where clause for this test should reference 'StudyA' instead of 'StudyZ'.


DBMS Support for Converging Cascading Updates and Deletions

The following table summarizes the capabilities of the different DBMSs to cascade updates and deletions through multiple paths.

DBMS              Create         Add data            Cascade update    Cascade update    Cascade
                  relationships                      through multiple  of multiple       deletions
                                                     paths             columns
PostgreSQL        Yes            Yes                 Yes               Yes               Yes
MySQL (InnoDB)    Yes            Yes                 No                Yes               Yes
SQL Server        No             No                  No                No                No
SQLite            Yes            Yes                 Yes               Yes               Yes
MS-Access         Yes            Partial (see text)  Yes               Yes               Yes
LibreOffice Base  Yes            Yes                 No                Yes               Yes


PostgreSQL and SQLite are the only DBMSs that correctly handle both cascading updates and deletes for both types of multiple pathways.

MySQL and LibreOffice Base both fail to cascade updates through multiple pathways, but can cascade updates of multiple columns, and can cascade deletions through multiple pathways.

Microsoft Access can't add all of the example data. The sample for StudyB with a null element can't be added because MS-Access does not allow null columns in a foreign key relationship. This limitation prohibits the use of data structures in which one table is linked to several different parent tables by different columns, and at least two of those links contain data derived from a common parent. To ensure that every row in the table is linked to only one of the parents, all but one of the foreign keys in a row must be null, and this condition will prohibit data addition in MS-Access. In the case where the data does not contain nulls (as for 'StudyA' and 'Elem1' in the example), MS-Access successfully propagates updates and deletes through both types of data structures.

SQL Server is notable among the DBMSs tested here because it does not allow creation of the data structure to test cascading updates and deletions. Executing the DDL to create the sample table causes SQL Server to issue the error message:

Introducing FOREIGN KEY constraint 'fk_sampleelem' on table 'd_sample' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

If the foreign key from the sample table to the study element table is not specified (removed from the DDL or moved to the end of the CREATE TABLE statement), then SQL Server issues the error message:

Introducing FOREIGN KEY constraint 'fk_sampledepth' on table 'd_sample' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

SQL Server therefore does not allow either type of relationship to be created with foreign keys that automatically maintain referential integrity. This behavior is by design. Commonly recommended workarounds are to use surrogate keys and disable cascading updates, or to create custom triggers to propagate updates and deletions. One rationale for this limitation of SQL Server is that it protects the user from cascading operations through data structures containing cyclic relationships; presumably this protection is needed because such cascading operations would fail or would introduce data integrity errors.
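
Following the suggestion in the error message, SQL Server will create the d_sample table only if the cascade specifications are removed from enough of its foreign keys that no table can reach d_sample through more than one cascading path. A sketch is shown below; the choice of which keys to demote is arbitrary, and referential integrity along the demoted paths would then have to be maintained by triggers or application code.

create table d_sample (study character varying (10) not null,
  sample character varying (10) not null,
  element character varying (10) null,
  surface_elevation double precision,
  elevation_units character varying (10),
  sample_depth double precision,
  depth_units character varying (10),
  constraint pk_sample primary key (study, sample),
  constraint fk_samplestudy foreign key (study) references d_study (study)
    on update cascade on delete cascade,
  constraint fk_sampleelem foreign key (study, element) references d_studyelement
    (study, element) on update no action on delete no action,
  constraint fk_sampleelev foreign key (elevation_units) references l_lunit (length_unit)
    on update cascade on delete cascade,
  constraint fk_sampledepth foreign key (depth_units) references l_lunit (length_unit)
    on update no action on delete no action);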

Data Structures with Cyclic Relationships

Are limitations on cascading updates and deletions through multiple pathways necessary to protect against infinite loops or data integrity errors? Such effects should not occur in the previous examples because there are no cycles in those data structures. However, perhaps the DBMSs that allow cascading through multiple paths are vulnerable to problems when updates and deletions are cascaded through data structures that actually do contain cycles. Tests with cyclic data were performed for all the DBMSs that allow the specification of cascading updates and deletions for such structures.

The data structure used for these tests is a single self-referential table, illustrated below.


The column names used in this example are representative of a data structure used for tree-structured data, but such self-referential structures are appropriate for a variety of types of data.

With this simple structure, two different types of data cycles were tested:

  1. A completely self-referential cycle, where a node is its own parent.
  2. An intertwined cross-reference, where two nodes each reference the other as their parent.

Both of these types of cycles may in some circumstances represent a logical error in the data. However, for those DBMSs that allow a self-referential structure to be created with automatic cascading of updates and deletions, these situations should nevertheless be handled sensibly.

Cascading updates and deletions were tested for all DBMSs except for SQL Server, which will not create this structure. Two deletion options were tested: "on delete cascade" and "on delete set null". All the DBMSs tested support the "on delete set null" option except for MS-Access.

Code to create the table and carry out cycle test #1 is below:

create table d_node (
  node_name character varying(10) not null,
  parent_node character varying(10),
  constraint pk_node primary key (node_name),
  constraint fk_node foreign key (parent_node)
    references d_node (node_name)
    on update cascade on delete cascade
  );


-- **** Test update
insert into d_node (node_name, parent_node) values ('NodeA', 'NodeA');
insert into d_node (node_name, parent_node) values ('NodeB', 'NodeA');
update d_node set node_name='NodeZ' where node_name='NodeA';
-- Check result
select * from d_node;


-- **** Test deletion with cascading deletions on deletes
delete from d_node;
insert into d_node (node_name, parent_node) values ('NodeA', 'NodeA');
insert into d_node (node_name, parent_node) values ('NodeB', 'NodeA');
delete from d_node where node_name='NodeA';
-- Check result
select * from d_node;


-- **** Test deletion with set null on deletes
delete from d_node;
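-- Note: the syntax for dropping and re-creating a foreign key constraint
-- varies by DBMS; for example, MySQL uses "alter table ... drop foreign key",
-- and SQLite does not support dropping constraints, so the table must be
-- re-created with the new constraint.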
alter table d_node drop constraint fk_node;
alter table d_node
add constraint fk_node foreign key (parent_node)
    references d_node (node_name)
    on update cascade on delete set null;
insert into d_node (node_name, parent_node) values ('NodeA', 'NodeA');
insert into d_node (node_name, parent_node) values ('NodeB', 'NodeA');
delete from d_node where node_name='NodeA';
-- Check result
select * from d_node;


The code to carry out cycle test #2 is similar, except that the following statements are used to create the two mutually referencing nodes:

insert into d_node (node_name, parent_node) values ('NodeA', null);
insert into d_node (node_name, parent_node) values ('NodeB', 'NodeA');
update d_node set parent_node='NodeB' where node_name='NodeA';

The results of tests on the two types of cyclic data relationships are shown in the following table. A notation of "Yes" indicates that cascading updates and deletions were completed without error. The two entries for the deletion tests represent the results for "on delete cascade" and "on delete set null" options, respectively.

DBMS              Cycle Test 1   Cycle Test 1   Cycle Test 2   Cycle Test 2
                  update         delete         update         delete
PostgreSQL        Yes            Yes/Yes        Yes            Yes/Yes
MySQL (InnoDB)    No             Yes/Yes        No             Yes/Yes
SQLite            Yes            Yes/Yes        Yes            Yes/Yes
MS-Access         Yes            Yes/No         Yes            Yes/No
LibreOffice Base  Yes            Yes/Yes        Yes            No/Yes


Both PostgreSQL and SQLite complete all cascading updates and deletions without error.

In MySQL, the update operation fails for both tests with an error message indicating that a foreign key constraint is violated. However, deletions are cascaded correctly for both deletion options.

MS-Access correctly executes some update and deletion cascades for both tests. However, MS-Access does not support, and therefore cannot successfully complete, cascading deletions with a "set null" option.

LibreOffice Base correctly executes the update and delete operations of test 1, and the update operation of test 2. LibreOffice Base fails to complete the cascading deletion in test 2 with a stack overflow error. However, if the deletion cascade option is changed to "set null", LibreOffice Base completes the deletion without error. The results for cascading updates are interesting because although LibreOffice Base fails to cascade updates through multiple pathways, it successfully cascades updates through cyclic data structures.

Although usability assessments are not the focus of these tests, LibreOffice's user interface is notably ill suited to handling the cyclic data structure used here: the UI does not allow interactive creation of the self-join, and when data are edited interactively, the GUI display is not immediately updated to show the cascaded updates, with the consequence that the update appears not to have been performed correctly. There is no option to refresh the display, so the table must be closed and re-opened to see the effect of interactive changes that are cascaded through a self-join.

Conclusion

Although cascading updates and deletions are supported by many DBMSs, there are important variations in the robustness with which these features are implemented, particularly when there are multiple cascade pathways and cyclic structures. The differing capabilities may be important when selecting a DBMS with which to implement a particular data model. Most of the commonly used DBMSs tested failed to support cascading updates and deletions in one or more ways. However, there is currently at least one desktop DBMS (SQLite) and one client-server DBMS (PostgreSQL) that will successfully propagate cascading updates and deletions through multiple pathways and through cyclic data structures.