Kyle's Logbook

Kyle Barbary edited this page Jun 12, 2017 · 3 revisions

Converting from CVS repository to separate Git repositories

We want to exclude large data files from the import so that they don't end up in the git repository history. A useful command for finding the largest files across all subdirectories:

find . -type f -printf '%s %P\n' | sort -nr
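
To focus on candidates for exclusion, the same listing can be restricted to files over a size threshold. A sketch (the 10 MB cutoff is an arbitrary illustration, not the threshold used in the actual import):

```shell
# List only files larger than 10 MB, biggest first.
# The 10 MB threshold is an arbitrary example cutoff.
find . -type f -size +10M -printf '%s %P\n' | sort -nr
```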

Here's an example command to actually do the import:

git cvsimport -v -d USER@CVSSERVER:/cvs/snovae/IFU -k -C ccd -S ".*\.fits" Ccd

This converts the CVS submodule (top-level directory) Ccd, creating a new git repository in a new directory ccd. The -S option skips all files matching the regex .*\.fits, which the file-size listing above showed was all we needed to exclude.

Problem with git cvsimport

I've realized that git cvsimport actually does a pretty bad job importing our CVS repositories. I noticed this after importing the IFU/IFU_C_iolibs CVS module into a git repository. What ended up being on the master branch was not the same as what you get with a simple cvs checkout. At least for some files on the master branch, there was a more recent commit on a different branch. With cvs checkout, you seem to get the most recent version of all files.

From looking at this post it sounds like git cvsimport just doesn't work that well for anything more than a simple linear CVS repository. The post recommends the cvs2git (aka cvs2svn) tool. However, that tool requires access to the root CVS repository, which I've already requested from IN2P3 and was refused. (They apparently have no mechanism for this.)

At this point, it seems like I should just ditch the history, at least for IFU_C_iolibs. The full history doesn't seem to actually be in CVS anyway, because at some point the code was imported (simply copied) from a different project.

IFU I/O C library imported into git

I'm starting with the IFU_C_iolibs code, because it is a dependency for the preprocessing code, which is the first step of the pipeline. I initially considered removing the dependency on IFU_C_iolibs in favor of just using raw cfitsio and wcslib calls. However, data structures from IFU_C_iolibs are integrated pretty deeply into the preprocessing codebase, so we do need IFU_C_iolibs, at least initially.

As noted above, I decided to ditch CVS history: I created a new repository https://github.com/snfactory/ifuio and copied the code in. I made some changes in the process:

  • flattened the directory structure
  • use a simple makefile rather than Autotools (we'll see how well this works)
  • change cfitsio and wcslib includes from, e.g., #include "fitsio.h" to #include <fitsio.h> so that installed headers are used
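
The include change in the last bullet can be applied mechanically. A minimal sketch (fitsio.h shown; the wcslib headers get the same treatment; `demo.c` is a placeholder file name):

```shell
# Demo source file with the old-style quoted include ('demo.c' is a
# placeholder standing in for the real IFU_C_iolibs sources):
printf '#include "fitsio.h"\nint main(void) { return 0; }\n' > demo.c

# Rewrite the quoted cfitsio include to angle-bracket form so the
# installed system header is used:
sed -i 's|#include "fitsio.h"|#include <fitsio.h>|' demo.c
```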

I did not import the code for reading and writing the "MIDAS" format. It doesn't seem to be used in preprocessing, so I hope we can do without it.

Choosing a workflow tool for a prototype

In working on integrating cubefit and extract_star, I thought it would be useful to use an existing workflow tool to run this specific (small) subsection of the pipeline at scale on NERSC. The idea is to start from calibrated cubes (brought over from CC) and run the cubefit and extract_star steps from that starting point. Then we could try integrating the two codes and compare the full output of the un-integrated and integrated codes. I brought the calibrated cubes over to NERSC under [...]/processing/0203-CABALLO/cubes-cal-corr/YY/DDD/...

There are many available tools (see, e.g., https://github.com/pditommaso/awesome-pipeline). I've read and played around with a few:

  • Makeflow: Appealing because it has make-inspired syntax and targets SLURM (among other backends). However, the language doesn't support pattern-matching rules, so each concrete file rule has to be written out separately. The website suggests using "your favorite scripting language" to generate the makeflow file when you have many similar files, but to me this defeats the purpose: I would essentially have to invent my own front-end language and write a translator to the makeflow language. At that point, makeflow is just a SLURM scheduler for a given DAG. OK, but maybe there's something better.

  • Nextflow: Also targets SLURM and has a relatively compact DSL for specifying the workflow, with pattern-matching features. It has the concept of "channels" for intermediate data between steps (it's not clear whether there is a way to mark some of these data for keeping). However, it crashed the first time I tried its examples on Edison and Cori. Probably a missing Java library?

  • Common Workflow Language: Sounds like a good idea (to have a common language), but the actual language is incredibly verbose. By their own example, the equivalent of an 8-line makefile becomes over 80 lines long.

  • Swift: Supported at NERSC (module load swift on Cori, module load swift/0.96.2 on Edison) and has a compact and flexible language. Downsides: (1) It took over 10 minutes to run a simple example script, though this might be due to needing to specify the submission node's host IP address (see the config note for Edison). (2) It seems not to use cached file products, which I think I really want; it will restart from midway through execution, but only if the previous job failed and a restart is explicitly requested. (3) I think I want a make-style ability to build only certain targets, which I don't think Swift can do.

  • Tigres: Questions and Notes: (1) Does it use cached file products by default? I can't tell by reading the docs. (2) Open source but development is still not open. Why not? (3) The SLURM backend looks overly simple (I think it is just one sbatch command per task).

  • Make: make v4.0 (available on Cori, but not Edison, which has v3.81) can use a custom "shell" to run each task, and this can be set to srun. I tried a simple example and it seemed to work. However, the log contained many messages from srun, such as srun: Job step creation temporarily disabled, retrying and srun: Job step created. This seems to be caused by issuing too many srun commands in too short a time frame, which would be a problem at very large scale when a run starts with a highly parallel step.

The short answer: so far I don't really like any of the available tools. The Cori nodes have 32 cores and 128 GB of memory, so for testing cubefit/extract_star I might just use a single Cori node and run Make with make -j 32.
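
A minimal sketch of the Make-as-scheduler idea from the last bullet (untested at scale; `process_cube` and the file layout are made-up placeholders). GNU make 3.82+ supports .SHELLFLAGS and .RECIPEPREFIX, which lets each recipe line be pushed through srun:

```shell
# Write a Makefile whose recipes run under srun.  'process_cube' is a
# hypothetical command.  .SHELLFLAGS must be overridden because make's
# default '-c' would be misread by srun as --cpus-per-task;
# .RECIPEPREFIX avoids the literal-tab requirement in recipes.
cat > Makefile.srun <<'EOF'
SHELL = srun
.SHELLFLAGS = bash -c
.RECIPEPREFIX = >

out/%.fits: in/%.fits
>process_cube $< > $@
EOF

# Dry-run (-n) prints the recipe without invoking srun:
mkdir -p in out
touch in/example.fits
make -n -f Makefile.srun out/example.fits
```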

Getting the extra information contained in the current DB

With all raw data in hand and the preprocess code compiling, I'm starting from the beginning of the existing pipeline with the step plan_file_quality. It adds some information to the headers of the raw FITS files and then runs preprocess. The extra header information comes from the database (the Target, Run, Exposure and Pose tables). This data seems to mostly originate from log files, which are parsed in SnfRun.py by the constructors of the classes SnfTarget, SnfRun, SnfExp and SnfPose. (The full chain is: SnfUpdate.py calls SnfData() which calls SnfRead() which calls the constructors). This all happens in the snf_db_make step of the pipeline.

So, to run preprocess, we need at least some data (about 25 keywords' worth) from these database tables. I could try to skip the database and generate the headers directly from the log files, but there are a lot of subtle details and hacks in the log-file-parsing code. So to start, I'm going to try regenerating these tables in an sqlite3 database, starting from the existing code in SnfRun.py; later, I can decide whether the database step is really needed. This will also give me a better idea of the structure of the Run/Exp/Pose/Target tables. I'm creating a Python package snfpipe in this repo that will contain at least the log-parsing and db-filling code.
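
As a starting point for the sqlite3 version of those tables, a schema sketch. The column names here are illustrative guesses; the real columns would be read off the existing Target/Run/Exposure/Pose tables:

```shell
# Create a throwaway sqlite3 database with skeleton tables mirroring the
# Target/Run/Exposure/Pose hierarchy.  All column names are hypothetical.
sqlite3 snf.db <<'EOF'
CREATE TABLE IF NOT EXISTS target   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS run      (id INTEGER PRIMARY KEY,
                                     target_id INTEGER REFERENCES target(id));
CREATE TABLE IF NOT EXISTS exposure (id INTEGER PRIMARY KEY,
                                     run_id INTEGER REFERENCES run(id));
CREATE TABLE IF NOT EXISTS pose     (id INTEGER PRIMARY KEY,
                                     exposure_id INTEGER REFERENCES exposure(id));
EOF

# Confirm the tables exist:
sqlite3 snf.db '.tables'
```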

A different approach: running the existing (suboptimal) pipeline at NERSC

Given how big an effort it is to do a full rewrite of the pipeline, we wanted to look at what it would take to simply install all the current pipeline software and database at NERSC and run there. The benefit would mainly be that NERSC has better user support and doesn't have the (frankly insane) "czar" system.

Possible roadblocks:

  1. Installing all the software:

    • For the Python code, this means determining all the Python code being used at CC. We could literally do the same thing as at CC: get the code to NERSC in a haphazard way and just add paths to the PYTHONPATH environment variable. But it would be nicer to install all the production code into a consistent virtual environment (or similar) instead of having it scattered all about. My idea so far is to use a conda environment to install everything (Python and non-Python) in a self-contained manner.
    • For compiled code, it is unclear if some code will compile at NERSC (32 bit vs 64 bit issues), but presumably we can do the same as at CC and compile everything in 32 bit compatibility mode. Still, if configure; make doesn't just work, there could be digging required.
  2. Everything in the current pipeline assumes that PBS/Torque (I think) is the queue-managing software, but NERSC uses SLURM. For example, the plan_*.py scripts write batch job scripts with #PBS comments to specify job options, but SLURM uses #SBATCH, and some of the actual options differ. The higher-level scripts that run the plan_*.py scripts probably also assume jobs are submitted with qsub rather than sbatch, but I'd have to check.

  3. One part of the pipeline (I forget which) I think submits jobs from a job running on the cluster. NERSC may have guards against this or get mad about it. Or they may not.

  4. One would have to set up the django front-end webserver to run on the NERSC science gateways in order to see the processing output on the web. This isn't necessarily a stumbling block, but is one more thing for someone to learn.
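
For point 2 above, the directive translation could at least be started mechanically. A rough sketch (only two common options shown; resource-request lines and the surrounding submission logic would still need hand conversion, and 'job.pbs' is a placeholder file name):

```shell
# Example PBS header standing in for what a plan_*.py script writes:
printf '#PBS -N myjob\n#PBS -q regular\necho hello\n' > job.pbs

# Rewrite the most common PBS directives into SBATCH form.  Option
# coverage here is far from complete -- this is only a starting point.
sed -e 's/^#PBS -N /#SBATCH --job-name=/' \
    -e 's/^#PBS -q /#SBATCH --partition=/' \
    job.pbs > job.slurm
```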

Understanding Python production software environment at CC

How is the current software stack "installed" at CC? For Python code, it seems to be largely by adding paths to snprod's PYTHONPATH, which includes:

  • .../cubefit/lib/python2.7/site-packages: Cubefit

  • .../SNFactory/Tasks/SNF-02-03/Processing/database/django: django site (processing -> processing_095, www -> www_095)

  • .../SNFactory/Tasks/SNF-02-03/lib/python/site-packages: symlinks to "installed" python modules in .../SNFactory/Tasks/SNF-02-03/[Processing,Calibration,IDR].

  • .../SNFactory/Offline/SNF-02-03/lib/python: Stuff installed with build script in Offline: Toolbox, Prospect, PySNIFS modules, several other modules.

  • .../IFU/SNF-02-03/Snifs/user/bin: A few random Python modules, such as libExtinction.py.

  • .../throng/.../local/dislin-8.2/dislin/python: python wrappers around dislin (appears to be in source code dir)

  • .../throng/.../lib/python2.7/site-packages: external stuff, such as fitsio, PyFFTW, numpy, scipy, pyfits, matplotlib, and hacked in django 0.97.

Using a conda environment to install code

Rather than using PYTHONPATH, it would be nice to install all the code into a single location. conda seems like an appropriate tool for this, as it has the ability to install any software (not just Python) and manage multiple environments. One could name the conda environments by the processing tag, such as snf-02-03. Following are some commands to set up a conda environment with some of the dependencies. The build_* scripts in the CVS repository sometimes do similar things but are too specific to the environment at CC to be used directly.

TAG=snf-02-03

# create environment and switch to it
conda create -n $TAG python=2.7
source activate $TAG

# conda and pip packages
conda install numpy scipy matplotlib=1.5
conda install -c openastronomy fitsio
conda install -c conda-forge pyfftw
pip install pyfits==3.1.2

# download and patch old django
# (patch prevents md5 deprecation warning)
wget "http://snfactory.in2p3.fr/soft/PackageArchive/snf_django_svn_01.tar.gz"
tar -xzf snf_django_svn_01.tar.gz
rm snf_django_svn_01.tar.gz

wget "http://snfactory.in2p3.fr/soft/PackageArchive/django_hashlib.diff"
cd snf_django_svn_01; patch -p1 < ../django_hashlib.diff; cd ..
rm django_hashlib.diff

# install patched django into conda env by copying it to site-packages.
cp -r snf_django_svn_01/django $HOME/.conda/envs/$TAG/lib/python2.7/site-packages
rm -rf snf_django_svn_01

Setting up the database at NERSC

The Postgres database is central to the pipeline, so we need to set that up at NERSC. The general approach is to mirror the database schema from CC but not copy any of the data. There is now an (empty) snfactory postgres database running at NERSC. The Django code in the snfactory CVS repository that defines the models may be able to set up the database: the SNFactory/Tasks/build_tasks script can initialize the database via django, and it can also "optimize" it (adding some indices) and fill it.

Converting SNFactory/Tasks/processing CVS module to git

Based on discussion with Saul and Greg, one useful thing will be to convert more of the CVS repository to git. In short, this will be useful because it will make it easier to (1) change the build/install scripts to be more sane and (2) work on changing code and testing it locally. To this end, I started looking at converting stuff in SNFactory/Tasks, which is used in most (if not all) steps of the pipeline.

What is the right unit to split into a git repo? All of SNFactory/Tasks is 1.3G, which is too big for a git repo. It includes a lot of stuff we don't want or need, like DDT. Most of the stuff that actually gets installed is in SNFactory/Tasks/Processing, so I think it makes sense to git-ize that.

Unfortunately, I can't make the repo public immediately because there might be passwords stored in code (particularly in the DB part of the code), or even in the CVS history. We need to ask around to make sure there are not passwords that are still in use, or read the code pretty exhaustively. ☹️