Sunday, May 22, 2016

Using a Makefile with a Database

A makefile is a convenient and efficient way to run a multi-step process that creates or modifies files.  However, when one step of that process is to extract data from a server-based database, make has no inherent ability to determine when the data in the database has last been changed, and therefore, whether the data extraction step needs to be re-run.

The following Python program ( addresses this problem. It updates the modification date of a specific flag file (named db_updated.flag by default), creating the file if it does not already exist. This code is written to be used with a PostgreSQL database, but can be easily modified for other databases.

The program requires that the database contain at least one table with a column that contains a revision date (and time), either for that table or for the database as a whole. Command-line options can be used to specify the table(s) to be checked, and the name of the column to be checked. If neither is specified, a default table name of "auditlog" is used, and a default column name of "rev_time" is used.

Command-line options and arguments are: [-t table] [-c column] [-u user] server database

Multiple -t options may be specified; the latest revision date of any of the tables will be used to set the modification date of the db_updated.flag file. If a user name is specified, the program will prompt for a password.
import argparse
import psycopg2
import getpass
import os
import os.path
import time

FLAG_FILE = 'db_updated.flag'

def get_user(server_name, database_name):
 user_name = raw_input("User name for %s database %s: " % (server_name, database_name))
 return user_name

def get_password(server_name, database_name, user_name):
 passwd = getpass.getpass("Password for user %s on %s database %s: " % (user_name, server_name, database_name))
 return passwd

def clparser():
 desc_msg = "Create or update the  database flag file if the given database has been recently modified."
 parser = argparse.ArgumentParser(description=desc_msg)
 parser.add_argument('-u', '--db_user', dest='db_user', action='store')
 parser.add_argument('-t', '--table', dest='table', action='append')
 parser.add_argument('-c', '--column', dest='column', action='store')
 parser.add_argument('server', action='store')
 parser.add_argument('database', action='store')
 return parser

def main():
 parser = clparser()
 arg = parser.parse_args()
 if not os.path.exists(FLAG_FILE):
  open(FLAG_FILE, 'a').close()
 if arg.db_user:
  db_user = arg.db_user
  pw = get_password(arg.server, arg.database, db_user)
  conn = psycopg2.connect(host=arg.server, database=arg.database, port=5432, user=unicode(db_user), password=unicode(pw))
  conn = psycopg2.connect(host=arg.server, database=arg.database, port=5432)
 curs = conn.cursor()
 column = arg.column or "rev_time"
 if arg.table:
  tables = 0
  sql = ''
  for t in arg.table:
   if tables > 0:
    sql = sql + " union "
   tables += 1
   sql = sql + "select max(%s) as latest from %s" % (column, t)
  if tables > 1:
   sql = "select max(latest) from (%s) as all_t" % sql
  sql = sql + ";"
  sql = "select max(%s) from auditlog;" % column
 revtime = time.mktime(curs.fetchone()[0].timetuple())
 os.utime(FLAG_FILE, (revtime, revtime))


This can be used in a makefile like the example below. This example extracts data from a database and then runs an R script to further summarize the data.  The input and output file names for the R script are shown on the command line here, though encoding them in the script itself may be more convenient in many cases than parsing the command line in R.

.PHONY: all

all: db_updated.flag data_summary.csv -u user_name server_name db_name

data_summary.csv: data.csv summary_stats.R
 Rscript --vanilla summary_stats.R data.csv data_summary.csv

data.csv: db_updated.flag get_data.sql -tp -u user_name get_data.sql server_name db_name

db_updated.flag: -u user_name server_name db_name

In this example, neither a table name nor a column name are included as arguments to These may be needed in some circumstances, or the program modified so that the defaults are appropriate for the circumstances in which it is used.