Software Projects

Project Status

All of the research code that I write is available on GitHub under open source licenses. Here is a link to all of my software projects:

GitHub Repository


Miscellaneous Tools

In addition, certain tools or modules are worth mentioning:

Modules and reusable classes for machine learning that conform to the scikit-learn API.
Neutral Model in C++
Highly optimized C++ code that uses OpenMP to simulate neutral genetic or cultural transmission. Makefiles are provided for the Intel C++ compiler, Clang, and GCC 5.X. The code uses memory alignment and OpenMP directives to achieve high performance on Xeon processors. It is meant to serve as a building block: an engine for generating neutral variation in Approximate Bayesian Computation simulations.
Slatkin Exact in Python
A Python interface to a modified version of Montgomery Slatkin's original code for testing a set of item/allele counts against the Ewens Sampling Distribution for evidence of neutrality. Slatkin's original code has been modified only to make it suitable for use as a software library instead of a single-shot command line program (i.e., I worked on memory management and the data input interface). This module should compile on any Linux system with GCC or Clang installed, and on OS X if the Xcode command line tools are present. Tested only against Python 2.7.
Numerical Functions for Apache Pig
Apache Pig is a high-level scripting language for data analysis which uses Hadoop to scale simple processing operations to very large datasets. Pig takes simple procedures which look like SQL, along with standard data formats (e.g., CSV), and transforms data step by step. If the data are very large, this process is transparently handed off to other computing nodes via Hadoop. These software modules are some useful numerical functions that I've been using to process simulation output across many projects. One example ensures that floating-point numbers are not written in scientific notation before they are read into R.
Greycite plugin for Jekyll
Greycite is a web service which takes any URL and produces a bibliographic citation for that URL, in a number of formats including BibTeX and RIS. This allows easy inclusion of web citations in publications. I use Jekyll for my web-based lab notebook, and this little plugin creates a Greycite citation on the fly for every page of the notebook. The formatting is controlled in your templates, so if you use Jekyll, this will be useful for you as well.
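To illustrate the scientific-notation issue mentioned under the Pig numerical functions, here is a minimal Python sketch of the same idea. It is not the code in the repository (the Pig functions themselves are not shown here), and the function name to_plain_decimal is hypothetical. It assumes finite input values; NaN and infinity are not handled.

```python
from decimal import Decimal


def to_plain_decimal(value: float) -> str:
    """Render a finite float as a plain decimal string.

    Hypothetical helper for illustration only. repr() gives the shortest
    string that round-trips the float, and formatting the resulting Decimal
    with "f" forces fixed-point notation, so no exponent is ever emitted.
    """
    return format(Decimal(repr(value)), "f")


if __name__ == "__main__":
    # str() switches to scientific notation for very small or large values.
    print(str(1e-07))               # 1e-07
    print(to_plain_decimal(1e-07))  # 0.0000001
```

Writing values this way before export avoids depending on how a downstream reader parses exponents, at the cost of longer strings for very large or very small numbers.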

Other software projects in my repository are specific to research projects. Please see the README for each project. Repositories starting with "experiment-" are generally analyses of data from specific simulation software (always located in a separate repository).

Reproducible Research Tools

Most of my research work involves a cycle of:
  • Defining a range of models whose behavior I intend to study
  • Generating model and simulation specifications for them
  • Simulating synthetic data from those models
  • Postprocessing the synthetic data sets
  • Performing statistical/ML analysis of those data
  • Preprocessing empirical data sets to match the format of synthetic data
  • Assessing goodness of fit
Once this loop is underway, there is usually an accompanying paper and presentation. The experiment-template repository is a ready-to-use setup for the analysis loop above. Please use and customize it as you see fit!

Note about Microsoft Windows

None of the software projects at GitHub Repository has been tested on Microsoft Windows. I know of no reason why they should not work, especially those projects written in pure Python, R, or Ruby. I can advise by email if you are having trouble, but I will not routinely be able to test each release or check-in in a Windows environment.

The main challenges in making one of the major simulation projects, such as Axelrod-CT, work on Windows are as follows:

  • The usual care with specifying filesystem paths, although Python ought to handle this well by default if you specify paths in configuration files correctly.
  • You will need a command line C/C++ compiler environment to compile some dependencies. Good options include MinGW, which I have not used, and Cygwin, which I have used and found excellent. If you have Visual Studio installed, you can also use the native Microsoft compiler easily from the command line. Simple "getting started" instructions can be found starting here on MSDN.
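
As a hedged illustration of the first point above, the sketch below builds a path from separate components with Python's standard pathlib module instead of hard-coding a separator. The function name data_path and the directory names are hypothetical and do not come from any of the projects. The flavour argument exists only so that both conventions can be demonstrated on a single machine.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def data_path(base, *parts, flavour=Path):
    """Join path components without hard-coding "/" or "\\".

    Hypothetical helper for illustration only. With the default flavour,
    pathlib picks the convention of the operating system it runs on.
    """
    return flavour(base).joinpath(*parts)


if __name__ == "__main__":
    # The same components rendered under each convention.
    print(data_path("sims", "run1", "out.csv", flavour=PurePosixPath))
    print(data_path("sims", "run1", "out.csv", flavour=PureWindowsPath))
```

If paths in configuration files are stored as components, or always joined this way, the same configuration should behave the same on Linux, OS X, and Windows.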

Otherwise, there are no obstacles to running any of the simulation models discussed here in a Windows environment, especially since all output is logged to a MongoDB database instance, available for free with a Windows installer.