Splinter of the Singularity: 2015

Friday, December 4, 2015

Likelihood Ratios to Assess Weight of Evidence for Site Recovery

Audio and slides of my 2015 SETAC presentation Quantitative Weight of Evidence for Sediment Recovery in the Portland Harbor Superfund Site are here: http://setac.sclivelearningcenter.com/index.aspx?PID=9484&SID=215420.

Monday, November 23, 2015

execsql

execsql is a Python application that reads a text file of SQL commands and executes them against a database. The supported databases are:

PostgreSQL
MS-Access
MS-SQL Server
SQLite
MySQL or MariaDB
Firebird
ODBC DSN connections.

In addition to executing SQL statements, execsql also implements a number of metacommands that allow:

Import of data from text files and OpenDocument spreadsheets
Export of data to delimited text, HTML, JSON, LaTeX tables, and OpenDocument spreadsheets
Copying of data between databases, even of different DBMS types
Conditional execution of SQL code and metacommands based on data values or user input.

execsql allows variables to be defined and used to dynamically modify SQL code and metacommands. An INCLUDE metacommand can be used to modularize SQL scripts and facilitate code re-use. Variables can be used to parameterize included scripts. Conditional inclusion of scripts can be used to implement loops. Automatically incrementing counter variables can be used to control loops or generate unique values to be used in input or output.

execsql is fundamentally a command-line application that can be used to automate database operations, either by itself or as part of a toolchain that may include other steps to either pre-process data before loading, or post-process or analyze data that are extracted from a database. However, execsql also has the ability to present data to the user in dialog boxes, request several types of input from the user either on the terminal or in GUI dialogs, and display messages to the user either on the terminal or in a GUI console. These features allow flexible interactive applications to be created.

A guiding principle in the development of execsql is documentation of all data operations. The use of SQL scripts, rather than ad-hoc interactive operations in a GUI, is key to meeting that goal. In addition, execsql logs all usage information, including databases used, script files executed (including script file version information), variable assignments, user input, and errors. Chain-of-custody procedures are used in some disciplines to ensure traceability from collection through (typically) laboratory analysis. However, no formal chain of custody procedures ordinarily are applied to data during or after entry into a database. SQL script files and execsql log files can be used to produce a complete record of data operations so that traceability of data can be extended into the database environment.

execsql and its documentation are available from the Python Package Index (PyPI). Note that the search functionality in PyPI and in pip search is broken, and will return links to obsolete versions of execsql that are no longer available. The direct link will take you to the latest available version.

Sunday, May 31, 2015

Calendar Figure for Zim Journal Page

The daily journal page in Zim is a useful place to record one's meeting schedule and task list during the morning planning session, and update these, plus other activities, during the day. I use a custom template for the journal page that has first-level headings of "Calendar", "Tasks", and "Activities". At work, where Outlook is used for calendaring, the "Calendar" section can be initialized by preparing to e-mail a copy of the daily calendar in Outlook, and then copying and pasting the table of schedule data from the e-mail to Zim. However, a graphic representation of the daily calendar is more useful than the textual representation obtained in this way.

The following R script can be used to embed a graphic representation of a daily calendar into a Zim page. The R script plugin must be enabled to use this script.

mtg.names <- c('Dummy meeting xyz', 'Dummy meeting abc', 'Dummy meeting pqr')
start.times <- c(8, 10, 13)
end.times <- c(8.5, 11, 14.5)
# HEIGHT = 100
# WIDTH = 640
mtg.len <- end.times - start.times
fill.times <- 17 - end.times
times <- matrix(c(rev(start.times), rev(mtg.len), rev(fill.times)), byrow=TRUE, ncol=length(start.times))
par(oma=c(0,0,0,0), mar=c(2,12,0,0))
barplot(height=times, space=0, horiz=TRUE, names.arg=rev(mtg.names), las=1, col=c("white", "skyblue", "white"), 
        xlim=c(8,18), xpd=FALSE, xaxp=c(8,17,9), xaxs="i", yaxs="i"
        )
abline(v=9:16)
abline(v=seq(8.5,16.5,1), lty=2, col="gray25")

The first four lines of this script, specifying the meeting names, start and end times, and the figure height should be edited as necessary. If more space for meeting names is needed, the left margin specification of "12" on line 9 should be increased as needed. Depending on your monitor resolution and how much screen space you devote to Zim, you may also want to adjust the "WIDTH" specification on line 5; the value is in pixels.

This script produces a figure like this:

Saturday, May 23, 2015

Changing Bullets to Checklist Items in Zim

If a checklist template is stored in Zim as a bulleted list, when the template is copied to a new page to create an actual checklist, the bulleted items need to be converted to checklist items. The custom tool script below will automatically perform this conversion for all bullets on the page.

#!/usr/bin/python
# zimbulletchecklist.py
#
# PURPOSE
# Convert leading bullets to checklist items in a Zim page. 
#
# NOTES
# 1. The name of the temporary Zim page file must be passed
#  as the (only) command-line argument.
# 2. All bullets (asterisks) with only leading whitespace
#  and followed by at least one whitespace character
#  will be converted to checklists.
#
# AUTHOR
# Dreas Nielsen (RDN)
#
# COPYRIGHT AND LICENSE
# Copyright (c) 2015, Dreas Nielsen
# License: GPL3
#
# HISTORY
#  Date   Remarks
# ---------- -----------------------------------------------------
# 2015-05-23 Created.  RDN.
# ==================================================================

_vdate = "2015-05-23"
_version = "1.0"

import sys
import os
import fileinput
import re

if len(sys.argv) != 2:
 sys.stderr.write("You must provide the temporary Zim page file name as a command-line argument.")
 sys.exit(1)
zimpage = sys.argv[1]
if not os.path.exists(zimpage):
 sys.stderr.write("The file %s does not exist." % zimpage)
 sys.exit(1)

bullet_rx = re.compile(r'^(\s*)\*\s(.*)$')

for line in fileinput.input(inplace=1):
 m = bullet_rx.match(line) 
 if m:
  sys.stdout.write(m.group(1)+'[ ] '+m.group(2)+'\n')
 else:
  sys.stdout.write(line)

Friday, May 15, 2015

A Markdown Table Generator for Zim

Zim's interface for table creation requires tables to be built up row by row. If you know how many rows you need in a table, Zim's 'Custom Tools' feature can be used to simplify the process of creating a table. Following is a Python script that can be used to create an empty Markdown table, with the desired number of rows and columns, at the bottom of a Zim wiki page. A custom tool should be created in Zim that calls this script with the "%f" parameter as the sole command-line argument. After this script is run you will see the table in the page as Markdown text; type Ctrl-R to reload the page, and Zim will format it like a table created with Zim's own table-creation tool.

#!/usr/bin/python
# zimtable.py
#
# PURPOSE
# Add an empty Markdown-formatted table to the end of a Zim
# page.
#
# NOTES
# 1. The name of the temporary Zim page file must be passed
#  as the (only) command-line argument.
# 2. This program will prompt for the number of rows and columns,
#  and the column width in tabs, and insert a pipe table
#  with those dimensions, plus one row for headers.
# 3. The separator line between headers and table cells
#  assumes 6 characters per tab.
#
# AUTHOR
# Dreas Nielsen (RDN)
#
# COPYRIGHT and LICENSE
# Copyright (c) 2015, Dreas Nielsen
# License: GPL3
#
# HISTORY
#  Date   Remarks
# ---------- -----------------------------------------------------
# 2015-05-13 Created without GUI to prompt for table size. RDN.
# 2015-05-15 Added GUI.  RDN.
# =====================================================================

_vdate = "2013-05-15"
_version = "1.0"

import sys
import os
import Tkinter as tk
import ttk


if len(sys.argv) != 2:
 sys.stderr.write("You must provide the temporary Zim page file name as a command-line argument.")
 sys.exit(1)
zimpage = sys.argv[1]
if not os.path.exists(zimpage):
 sys.stderr.write("The file %s does not exist." % zimpage)
 sys.exit(1)

def add_pipe_table(fn, rows, cols, colwidth=3):
 f = open(fn, "a")
 tabs = '\t' * colwidth
 sep = '-' * (6 * colwidth - 1)
 datarow = "|" + (cols) * ("%s%s" % (tabs, "|")) + '\n'
 seprow = "|" + (cols) * ("%s%s" % (sep, "|")) + '\n'
 f.write('\n')
 f.write(datarow)
 f.write(seprow)
 for r in range(rows):
  f.write(datarow)
 f.close()

def cancel_table(*args):
 ui.destroy()

def make_table(*args):
 add_pipe_table(zimpage, int(row_val.get()), int(col_val.get()), int(width_val.get()))
 ui.destroy()


ui = tk.Tk()
ui.title("Create Markdown Table in Zim")
# Frames
optframe = ttk.Frame(master=ui, padding="4 4 4 4")
btnframe = ttk.Frame(master=ui, padding="4 4 4 4")
optframe.grid(column=0, row=1, sticky=tk.EW)
btnframe.grid(column=0, row=2, sticky=tk.EW)
# Input frame contents
row_label = ttk.Label(master=optframe, text="Rows:")
row_val = tk.StringVar(optframe, value=10)
row_inp = tk.Spinbox(optframe, from_=1, to=50, textvariable=row_val)
col_label = ttk.Label(master=optframe, text="Columns:")
col_val = tk.StringVar(optframe, value=4)
col_inp = tk.Spinbox(optframe, from_=1, to=20, textvariable=col_val)
width_label = ttk.Label(master=optframe, text="Column width in tabs:")
width_val = tk.StringVar(optframe, value=3)
width_inp = tk.Spinbox(optframe, from_=1, to=5, textvariable=width_val)
row_label.grid(column=0, row=0, sticky=tk.E)
row_inp.grid(column=1, row=0, sticky=tk.W)
col_label.grid(column=0, row=1, sticky=tk.E)
col_inp.grid(column=1, row=1, sticky=tk.W)
width_label.grid(column=0, row=2, sticky=tk.E)
width_inp.grid(column=1, row=2, sticky=tk.W)
# Button frame contents
cancel_button = ttk.Button(btnframe, text="Cancel", command=cancel_table)
cancel_button.grid(column=0, row=0, sticky=tk.E, padx=3)
maketable_button = ttk.Button(btnframe, text="Make Table", command=make_table)
maketable_button.grid(column=1, row=0, sticky=tk.E, padx=3)

ui.bind("", cancel_table)
ui.bind("", make_table)

ui.eval('tk::PlaceWindow %s center' % ui.winfo_pathname(ui.winfo_id()))

ui.mainloop()

Sunday, April 12, 2015

An alternate Python CSV sniffer

I recently came across a case where the format sniffer in Python's csv module was not satisfactory. The problem is that the format sniffer always identifies a quote character, even when there is no quote character in the CSV file (e.g., in the case of a tab-delimited file). By default the format sniffer reports that a double-quote character is used. This is not a problem if you subsequently use the csv module itself to read the file. However, I wanted to use the csv module only if the file contained quote characters, and use a different method (the copy_from method of the psycopg module) if the file contained no quote characters. I could not make this distinction using the csv module's format sniffer, so I wrote the following one to use instead.

import re

class CsvDiagError(Exception):
 def __init__(self, msg):
  self.value = msg
 def __str__(self):
  return self.value

class CsvLine():
 escchar = u"\\"
 def __init__(self, line_text):
  self.text = line_text
  self.delim_counts = {}
  self.item_errors = []  # A list of error messages.
 def __str__(self):
  return u"; ".join([u"Text: <<%s>>" % self.text, \
   u"Delimiter counts: <<%s>>" % ", ".join([u"%s: %d" % (k, self.delim_counts[k]) for k in self.delim_counts.keys()]) ])
 def count_delim(self, delim):
  # If the delimiter is a space, consider multiple spaces to be equivalent
  # to a single delimiter, split on the space(s), and consider the delimiter
  # count to be one fewer than the items returned.
  if delim == u" ":
   self.delim_counts[delim] = max(0, len(re.split(r' +', self.text)) - 1)
  else:
   self.delim_counts[delim] = self.text.count(delim)
 def delim_count(self, delim):
  return self.delim_counts[delim]
 def _well_quoted(self, element, qchar):
  # A well-quoted element has either no quotes, a quote on each end and none
  # in the middle, or quotes on both ends and every internal quote is either
  # doubled or escaped.
  # Returns a tuple of three booleans; the first indicates whether the element is
  # well-quoted, the second indicates whether the quote character is used
  # at all, and the third indicates whether the escape character is used.
  if qchar not in element:
   return (True, False, False)
  if len(element) == 0:
   return (True, False, False)
  if element[0] == qchar and element[-1] == qchar and qchar not in element[1:-1]:
   return (True, True, False)
  # The element has quotes; if it doesn't have one on each end, it is not well-quoted.
  if not (element[0] == qchar and element[-1] == qchar):
   return (False, True, False)
  e = element[1:-1]
  # If there are no quotes left after removing doubled quotes, this is well-quoted.
  if qchar not in e.replace(qchar+qchar, u''):
   return (True, True, False)
  # if there are no quotes left after removing escaped quotes, this is well-quoted.
  if qchar not in e.replace(self.escchar+qchar, u''):
   return (True, True, True)
  return (False, True, False)
 def record_format_error(self, pos_no, errmsg):
  self.item_errors.append(u"%s in position %d." % (errmsg, pos_no))
 def items(self, delim, qchar):
  # Parses the line into a list of items, breaking it at delimiters that are not
  # within quoted stretches.  (This is a almost CSV parser, for valid delim and qchar,
  # except that it does not eliminate quote characters or reduce escaped quotes.)
  self.item_errors = []
  if qchar is None:
   if delim is None:
    return self.text
   else:
    if delim == u" ":
     return re.split(r' +', self.text)
    else:
     return self.text.split(delim)
  elements = []  # The list of items on the line that will be returned.
  eat_multiple_delims = delim == u" "
  # States of the FSM:
  # _IN_QUOTED: An opening quote has been seen, but no closing quote encountered.
  #  Actions / transition:
  #   quote: save char in escape buffer / _ESCAPED
  #   esc_char : save char in escape buffer / _ESCAPED
  #   delimiter: save char in element buffer / _IN_QUOTED
  #   other: save char in element buffer / _IN_QUOTED
  # _ESCAPED: An escape character has been seen while _IN_QUOTED (and is in the escape buffer).
  #  Actions / transitions
  #   quote: save escape buffer in element buffer, empty escape buffer, 
  #    save char in element buffer / _IN_QUOTED
  #   delimiter: save escape buffer in element buffer, empty escape buffer,
  #    save element buffer, empty element buffer / _BETWEEN
  #   other: save escape buffer in element buffer, empty escape buffer,
  #    save char in element buffer / _IN_QUOTED
  # _QUOTE_IN_QUOTED: A quote has been seen while _IN_QUOTED (and is in the escape buffer).
  #  Actions / transitions
  #   quote: save escape buffer in element buffer, empty escape buffer, 
  #    save char in element buffer / _IN_QUOTED
  #   delimiter: save escape buffer in element buffer, empty escape buffer,
  #    save element buffer, empty element buffer / _DELIMITED
  #   other: save escape buffer in element buffer, empty escape buffer,
  #    save char in element buffer / _IN_QUOTED
  #     (An 'other' character in this position represents a bad format:
  #     a quote not followed by another quote or a delimiter.)
  # _IN_UNQUOTED: A non-delimiter, non-quote has been seen.
  #  Actions / transitions
  #   quote: save char in element buffer / _IN_UNQUOTED
  #    (This represents a bad format.)
  #   delimiter: save element buffer, empty element buffer / _DELIMITED
  #   other: save char in element buffer / _IN_UNQUOTED
  # _BETWEEN: Not in an element, and a delimiter not seen.  This is the starting state,
  #   and the state following a closing quote but before a delimiter is seen.
  #  Actions / transition:
  #   quote: save char in element buffer / _IN_QUOTED
  #   delimiter: save element buffer, empty element buffer / _DELIMITED
  #    (The element buffer should be empty, representing a null data item.)
  #   other: save char in element buffer / _IN_UNQUOTED
  # _DELIMITED: A delimiter has been seen while not in a quoted item.
  #  Actions / transition:
  #   quote: save char in element buffer / _IN_QUOTED
  #   delimiter: if eat_multiple: no action / _DELIMITED
  #     if not eat_multiple: save element buffer, empty element buffer / _DELIMITED
  #   other: save char in element buffer / _IN_UNQUOTED
  # At end of line: save escape buffer in element buffer, save element buffer.  For a well-formed
  # line, these should be empty, but they may not be.
  #
  # Define the state constants, which will also be used as indexes into an execution vector.
  _IN_QUOTED, _ESCAPED, _QUOTE_IN_QUOTED, _IN_UNQUOTED, _BETWEEN, _DELIMITED = range(6)
  #
  # Because of Python 2.7's scoping rules:
  # * The escape buffer and current element are defined as mutable objects that will have their
  #  first elements modified, rather than as string variables.  (Python 2.x does not allow
  #  modification of a variable in an enclosing scope that is not the global scope, but
  #  mutable objects like lists can be altered.  Another approach would be to implement this
  #  as a class and use instance variables.)
  # * The action functions return the next state rather than assigning it directly to the 'state' variable.
  esc_buf = [u'']
  current_element = [u'']
  def in_quoted():
   if c == self.escchar:
    esc_buf[0] = c
    return _ESCAPED
   elif c == qchar:
    esc_buf[0] = c
    return _QUOTE_IN_QUOTED
   else:
    current_element[0] += c
    return _IN_QUOTED
  def escaped():
   if c == delim:
    current_element[0] += esc_buf[0]
    esc_buf[0] = u''
    elements.append(current_element[0])
    current_element[0] = u''
    return _BETWEEN
   else:
    current_element[0] += esc_buf[0]
    esc_buf[0] = u''
    current_element[0] += c
    return _IN_QUOTED
  def quote_in_quoted():
   if c == qchar:
    current_element[0] += esc_buf[0]
    esc_buf[0] = u''
    current_element[0] += c
    return _IN_QUOTED
   elif c == delim:
    current_element[0] += esc_buf[0]
    esc_buf[0] = u''
    elements.append(current_element[0])
    current_element[0] = u''
    return _DELIMITED
   else:
    current_element[0] += esc_buf[0]
    esc_buf[0] = u''
    current_element[0] += c
    self.record_format_error(i+1, "Unexpected character following a closing quote")
    return _IN_QUOTED
  def in_unquoted():
   if c == delim:
    elements.append(current_element[0])
    current_element[0] = u''
    return _DELIMITED
   else:
    current_element[0] += c
    return _IN_UNQUOTED
  def between():
   if c == qchar:
    current_element[0] += c
    return _IN_QUOTED
   elif c == delim:
    elements.append(current_element[0])
    current_element[0] = u''
    return _DELIMITED
   else:
    current_element[0] += c
    return _IN_UNQUOTED
  def delimited():
   if c == qchar:
    current_element[0] += c
    return _IN_QUOTED
   elif c == delim:
    if not eat_multiple_delims:
     elements.append(current_element[0])
     current_element[0] = u''
    return _DELIMITED
   else:
    current_element[0] += c
    return _IN_UNQUOTED
  # Functions in the execution vector must be ordered identically to the
  # indexes represented by the state constants.
  exec_vector = [ in_quoted, escaped, quote_in_quoted, in_unquoted, between, delimited ]
  # Set the starting state.
  state = _BETWEEN
  # Process the line of text.
  for i, c in enumerate(self.text):
   state = exec_vector[state]()
  # Process the end-of-line condition.
  if len(esc_buf[0]) > 0:
   current_element[0] += esc_buf[0]
  if len(current_element[0]) > 0:
   elements.append(current_element[0])
  return elements
 def well_quoted_line(self, delim, qchar):
  # Returns a tuple of boolean, int, and boolean, indicating: 1) whether the line is
  # well-quoted, 2) the number of elements for which the quote character is used,
  # and 3) whether the escape character is used.
  wq = [ self._well_quoted(el, qchar) for el in self.items(delim, qchar) ]
  return ( all([b[0] for b in wq]), sum([b[1] for b in wq]), any([b[2] for b in wq]) )


def diagnose_delim(linestream, possible_delimiters=None, possible_quotechars=None):
 # Returns a tuple consisting of the delimiter, quote character, and escape
 # character for quote characters within elements of a line.  All may be None.
 # If the escape character is not None, it will be u"\".
 # Arguments:
 # * linestream: An iterable file-like object with a 'next()' method that returns lines of text
 #  as bytes or unicode.
 # * possible_delimiters: A list of single characters that might be used to separate items on
 #  a line.  If not specified, the default consists of tab, comma, semicolon, and vertical rule.
 #  If a space character is included, multiple space characters will be treated as a single
 #  delimiter--so it's best if there are no missing values on space-delimited lines, though
 #  that is not necessarily a fatal flaw unless there is a very high fraction of missing values.
 # * possible_quotechars: A list of single characters that might be used to quote items on
 #  a line.  If not specified, the default consists of single and double quotes.
 if not possible_delimiters:
  possible_delimiters = [ u"\t", u",", u";", u"|"]
 if not possible_quotechars:
  possible_quotechars = [ u'"', u"'"]
 lines = []
 for i in range(100):
  try:
   ln = linestream.next()
  except StopIteration:
   break
  except:
   raise
  while len(ln) > 0 and ln[-1] in (u"\n", u"\r"):
   ln = ln[:-1]
  if len(ln) > 0:
   lines.append(CsvLine(ln))
 if len(lines) == 0:
  raise CsvDiagError(u"CSV diagnosis error: no lines read")
 for ln in lines:
  for d in possible_delimiters:
   ln.count_delim(d)
 # For each delimiter, find the minimum number of delimiters found on any line, and the number of lines 
 # with that minimum number
 delim_stats = {}
 for d in possible_delimiters:
  dcounts = [ ln.delim_count(d) for ln in lines ]
  min_count = min(dcounts)
  delim_stats[d] = (min_count, dcounts.count(min_count))
 # Remove delimiters that were never found.
 for k in delim_stats.keys():
  if delim_stats[k][0] == 0:
   del(delim_stats[k])
 def all_well_quoted(delim, qchar):
  # Returns a tuple of boolean, int, and boolean indicating: 1) whether the line is
  # well-quoted, 2) the total number of lines and elements for which the quote character
  # is used, and 3) the escape character used.
  wq = [ l.well_quoted_line(delim, qchar) for l in lines ]
  return ( all([b[0] for b in wq]), sum([b[1] for b in wq]), CsvLine.escchar if any([b[2] for b in wq]) else None )
 def eval_quotes(delim):
  # Returns a tuple of the form to be returned by 'diagnose_delim()'.
  ok_quotes = {}
  for q in possible_quotechars:
   allwq = all_well_quoted(delim, q)
   if allwq[0]:
    ok_quotes[q] = (allwq[1], allwq[2])
  if len(ok_quotes) == 0:
   return (delim, None, None) # No quotes, no escapechar
  else:
   max_use = max([ v[0] for v in ok_quotes.values() ])
   if max_use == 0:
    return (delim, None, None)
   # If multiple quote characters have the same usage, return (arbitrarily) the first one.
   for q in ok_quotes.keys():
    if ok_quotes[q][0] == max_use:
     return (delim, q, ok_quotes[q][1])
 if len(delim_stats) == 0:
  # None of the delimiters were found.  Some other delimiter may apply,
  # or the input may contain a single value on each line.
  # Identify possible quote characters.
  return eval_quotes(None)
 else:
  if len(delim_stats) > 1:
   # If one of them is a space, prefer the non-space
   if u" " in delim_stats.keys():
    del(delim_stats[u" "])
  if len(delim_stats) == 1:
   return eval_quotes(delim_stats.keys()[0])
  # Assign weights to the delimiters.  The weight is the square of the minimum number of delimiters
  # on a line times the number of lines with that delimiter.
  delim_wts = {}
  for d in delim_stats.keys():
   delim_wts[d] = delim_stats[d][0]**2 * delim_stats[d][1]
  # Evaluate quote usage for each delimiter, from most heavily weighted to least.
  # Return the first good pair where the quote character is used.
  delim_order = sorted(delim_wts, key=delim_wts.get, reverse=True)
  for d in delim_order:
   quote_check = eval_quotes(d)
   if quote_check[0] and quote_check[1]:
    return quote_check
  # There are no delimiters for which quotes are OK.
  return (delim_order[0], None, None)
 # Should never get here
 raise CsvDiagError(u"CSV diagnosis error: an untested set of conditions are present")

Splinter of the Singularity