Sunday, May 22, 2016

Using a Makefile with a Database

A makefile is a convenient and efficient way to run a multi-step process that creates or modifies files.  However, when one step of that process is to extract data from a server-based database, make has no inherent ability to determine when the data in the database has last been changed, and therefore, whether the data extraction step needs to be re-run.

The following Python program (db_updated.py) addresses this problem. It updates the modification date of a specific flag file (named db_updated.flag by default), creating the file if it does not already exist. This code is written to be used with a PostgreSQL database, but can be easily modified for other databases.

The program requires that the database contain at least one table with a column that contains a revision date (and time), either for that table or for the database as a whole. Command-line options can be used to specify the table(s) to be checked, and the name of the column to be checked. If neither is specified, a default table name of "auditlog" is used, and a default column name of "rev_time" is used.

Command-line options and arguments are:
db_updated.py [-t table] [-c column] [-u user] server database

Multiple -t options may be specified; the latest revision date of any of the tables will be used to set the modification date of the db_updated.flag file. If a user name is specified, the program will prompt for a password.


db_updated.py:
import argparse
import psycopg2
import getpass
import os
import os.path
import time

FLAG_FILE = 'db_updated.flag'

def get_user(server_name, database_name):
 user_name = raw_input("User name for %s database %s: " % (server_name, database_name))
 return user_name

def get_password(server_name, database_name, user_name):
 passwd = getpass.getpass("Password for user %s on %s database %s: " % (user_name, server_name, database_name))
 return passwd

def clparser():
 desc_msg = "Create or update the  database flag file if the given database has been recently modified."
 parser = argparse.ArgumentParser(description=desc_msg)
 parser.add_argument('-u', '--db_user', dest='db_user', action='store')
 parser.add_argument('-t', '--table', dest='table', action='append')
 parser.add_argument('-c', '--column', dest='column', action='store')
 parser.add_argument('server', action='store')
 parser.add_argument('database', action='store')
 return parser

def main():
 parser = clparser()
 arg = parser.parse_args()
 if not os.path.exists(FLAG_FILE):
  open(FLAG_FILE, 'a').close()
 if arg.db_user:
  db_user = arg.db_user
  pw = get_password(arg.server, arg.database, db_user)
  conn = psycopg2.connect(host=arg.server, database=arg.database, port=5432, user=unicode(db_user), password=unicode(pw))
 else:
  conn = psycopg2.connect(host=arg.server, database=arg.database, port=5432)
 curs = conn.cursor()
 column = arg.column or "rev_time"
 if arg.table:
  tables = 0
  sql = ''
  for t in arg.table:
   if tables > 0:
    sql = sql + " union "
   tables += 1
   sql = sql + "select max(%s) as latest from %s" % (column, t)
  if tables > 1:
   sql = "select max(latest) from (%s) as all_t" % sql
  sql = sql + ";"
 else:
  sql = "select max(%s) from auditlog;" % column
 curs.execute(sql)
 revtime = time.mktime(curs.fetchone()[0].timetuple())
 conn.close()
 os.utime(FLAG_FILE, (revtime, revtime))

main()

This can be used in a makefile like the example below. This example extracts data from a database and then runs an R script to further summarize the data.  The input and output file names for the R script are shown on the command line here, though encoding them in the script itself may be more convenient in many cases than parsing the command line in R.

.PHONY: all

all: db_updated.flag data_summary.csv
 db_updated.py -u user_name server_name db_name

data_summary.csv: data.csv summary_stats.R
 Rscript --vanilla summary_stats.R data.csv data_summary.csv

data.csv: db_updated.flag get_data.sql
 execsql.py -tp -u user_name get_data.sql server_name db_name

db_updated.flag:
 db_updated.py -u user_name server_name db_name

In this example, neither a table name nor a column name are included as arguments to db_updated.py. These may be needed in some circumstances, or the program modified so that the defaults are appropriate for the circumstances in which it is used.