Saturday, December 9, 2017

Ordering Database Tables by Foreign Key Dependencies

Some database operations that affect multiple tables must touch the affected tables in an order corresponding to their dependencies. For example, data must be loaded into parent tables before data can be loaded into child (dependent) tables. Some deletion and update operations may also need to be carried out in dependency order. A list of all tables in dependency order can be useful to automate such operations. Several methods are shown here for generating a list of tables in dependency order; these methods are:
  • Python code
  • A CTE
  • An execsql script.

All of these methods take, as input, a table of dependencies in which each row contains a pair of table names corresponding to a parent:child relationship. The Python code produces a Python list containing tables listed in dependency order, and the other two methods produce a table consisting of one column containing table names and another column containing integers that specify the dependency order. The first items in these lists are the tables that have no parents, and the last items in these lists are the tables with the longest chain of dependencies.

Getting the Dependencies


When using a DBMS that supports INFORMATION_SCHEMA tables, the following SQL can be used to obtain a table containing all direct parent:child pairs

create table dependencies as
select 
        tc.table_name as child,
        tu.table_name as parent
from 
        information_schema.table_constraints as tc
        inner join information_schema.constraint_table_usage as tu
             on tu.constraint_name = tc.constraint_name
where 
        tc.constraint_type = 'FOREIGN KEY'
        and tc.table_name <> tu.table_name;

Additional constraints may be used, or needed, in the WHERE clause to limit the set of tables returned--for example, to eliminate system tables, or to select only the tables in a particular schema. If there are cyclic dependency relationships among tables, this SQL will not complete, and so in those cases at least one of the tables in the cycle of mutual dependencies should be omitted using an appropriate specification in the WHERE clause.

The table produced by this SQL will include only those tables that are part of some foreign key relationship. Standalone tables that are neither a parent nor a child will not be included. Standalone tables and tables that are omitted because of cyclic dependency relationships can be added to the output of the dependency-ordering step.

Ordering Tables by Dependency


Converting the set of parent:child dependencies into a list of tables in dependency order requires an iterative or recursive traversal of the tree of relationships that is rooted at the tables that have no parents. The table-ordering routines that follow use different approaches:
  • A loop over a list of unprocessed dependencies in Python, where that list is modified within the loop.
  • A recursive traversal of the tree using a CTE, terminating when the farthest leaves of the tree have been reached.
  • Looping using end recursion in the execsql script to traverse the tree in a manner similar to the CTE.

The algorithm used in the Python code is distinct, whereas the algorithms used with the recursive CTE and the execsql script are very similar. (The algorithm used by the Python code can also be implemented using an execsql script, but the code is considerably longer than the implementation shown below.)

Python


The input for the Python code to generate a dependency-ordered list of tables is a table of dependencies consisting of a list of two-element lists or tuples, each containing a child table name and a parent table name. This list might be generated by querying the database (e.g., using the SQL above) or from static configuration data. Given this table of dependencies, a Python list of all of the tables in dependency order can be generated with the following function

def dependency_order(dep_list):
    rem_tables = list(set([t[0] for t in dep_list] + [t[1] for t in dep_list]))
    rem_dep = copy.copy(dep_list)
    sortkey = 1
    ret_list = []
    while len(rem_dep) > 0:
        tbls = [tbl for tbl in rem_tables if tbl not in [dep[0] for dep in rem_dep] ]
        ret_list.extend([ (tb, sortkey) for tb in tbls ])
        rem_tables = [ tbl for tbl in rem_tables if tbl not in tbls ]
        rem_dep = [ dep for dep in rem_dep if dep[1] not in tbls ]
        sortkey += 1
    if len(rem_tables) > 0:
        ret_list.extend([(tb, sortkey) for tb in rem_tables])
    ret_list.sort(cmp=lambda x,y: cmp(x[1], y[1]))
    return [ item[0] for item in ret_list ]

SQL CTE


For DBMSs that support them, a recursive CTE can be used to convert a table of parent:child dependencies into a list of tables in dependency order. The input for the following code should be a table of such dependencies; that table should be named "dependencies". The output of the recursive CTE contains all parent tables, and the remaining tables (that are not parents to any other table) are added in the SELECT statement that uses the CTE.

with recursive dep_depth as (
 select
  dep.child,
  dep.parent,
  1 as lvl
 from
  dependencies as dep
 union all
 select
  dep.child,
  dep.parent,
  dd.lvl + 1 as lvl
 from
  dep_depth as dd
  inner join dependencies as dep on dep.parent = dd.child
 )
select
 table_name,
 table_order
from (
 select
  dd.parent as table_name,
  max(lvl) as table_order
 from
  dep_depth as dd
 group by
  table_name
 union
 select
  dd.child as table_name,
  max(lvl) + 1 as level
 from
  dep_depth as dd
  left join dependencies as dp on dp.parent = dd.child
 where
  dp.parent is null
 group by
  dd.child
 ) as all_levels;

Execsql Script


For DBMSs that don't support recursive CTEs, and when use of a client language like Python is not desired, the metacommands provided by execsql allow recursive traversal of the tree of dependencies, as shown in the following code. Explanations of the metacommands used in this code can be found in the on-line documentation.

As with the CTE implementation, the input for the following code should be a table of parent:child dependencies that is named "dependencies".

The following code increments a counter to track and assign the successive levels of recursion during traversal from the root to the leaves of the dependency tree. Because automatically-generated sequences and variables are DBMS-specific extensions to SQL, for the sake of generality, this implementation uses execsql counter variables and substitution variables. Thus, but for minor differences in SQL syntax, the following code should run in any DBMS.

-- ====================================================================
--  Initialize the tables used to summarize dependency order.
--  Table created:
--    dep_level: A copy of "dependencies" with an additional column
--               to store the level in the hierarchy.  The dependency
--               level is set by an execsql counter variable for
--               generality.
-- ====================================================================
-- !x! sub current_level !!$counter_530!!
select
    child,
    parent,
    !!current_level!! as lvl
into
    temporary table dep_level
from
    dependencies;


-- ====================================================================
--  Create a view to evaluate whether there are any remaining
--  dependencies to evaluate.
--  View created:
--    unprocessed: The number of parent tables whose children are
--                 not already listed as parents in the 'dep_level' table.
--  Tables used:
--    dep_level
--    dependencies
-- ====================================================================
create temporary view unprocessed as
select count(distinct dep.child) as unproc
from
    dep_level as dl
    inner join dependencies as dep on dl.child = dep.parent
where
    dl.lvl = (select max(lvl) from dep_level);


-- ====================================================================
--  Define an execsql sub-script to increment the dependency level.
-- ====================================================================
-- !x! begin script add_new_level
-- !x! sub last_level !!current_level!!
-- !x! sub current_level !!$counter_530!!
insert into dep_level
    (child, parent, lvl)
select distinct
    dep.child,
    dl.child as parent,
    !!current_level!!
from
    dep_level as dl
    inner join dependencies as dep on dl.child = dep.parent
where
    dl.lvl = !!last_level!!;
-- !x! end script


-- ====================================================================
--  Define and execute an execsql sub-script to increment the dependency
--  level as many times as necessary.
-- ====================================================================
-- !x! begin script add_levels
-- !x! subdata remaining unprocessed
-- !x! execute script add_new_level
-- !x! subdata remaining unprocessed
-- !x! if(is_gt(!!remaining!!, 0)) { execute script add_levels }
-- !x! end script

-- !x! execute script add_levels


-- ====================================================================
--  Convert the dependency levels into a table order.
-- ====================================================================
create temporary table dependency_order as
select
 table_name,
 table_order
from (
 select
  dd.parent as table_name,
  max(lvl) as table_order
 from
  dep_level as dd
 group by
  table_name
 union
 select
  dd.child as table_name,
  max(lvl) + 1 as level
 from
  dep_level as dd
  left join dependencies as dp on dp.parent = dd.child
 where
  dp.parent is null
 group by
  dd.child
 ) as all_levels;