Kyle's Logbook
We want to exclude large data files from the import, so they're not in the git repository history. Useful command to find the largest files in all subdirectories:
```
find . -type f -printf '%s %P\n' | sort -nr
```
Here's an example command to actually do the import:
```
git cvsimport -v -d USER@CVSSERVER:/cvs/snovae/IFU -k -C ccd -S ".*\.fits" Ccd
```
This converts the CVS module (top-level directory) `Ccd`, creating a new git repository in a new directory `ccd`. It skips all files ending in `.fits`, which we had determined was all we needed to skip in the above step.
I've realized that `git cvsimport` actually does a pretty bad job importing our CVS repositories. I noticed this after importing the `IFU/IFU_C_iolibs` CVS module into a git repository. What ended up on the master branch was not the same as what you get with a simple `cvs checkout`. At least for some files on the master branch, there was a more recent commit on a different branch. With `cvs checkout`, you seem to get the most recent version of all files.
From looking at this post, it sounds like `git cvsimport` just doesn't work that well for anything more than a simple linear CVS repository. The post recommends the `cvs2git` (aka `cvs2svn`) tool. However, that tool requires access to the root CVS repository, which I've already requested from IN2P3 and was refused. (They apparently have no mechanism for this.)
At this point, it seems like I should just ditch the history, at least for `IFU_C_iolibs`. The full history doesn't seem to actually be in CVS anyway, because at some point the code was imported (simply copied) from a different project.
I'm starting with the `IFU_C_iolibs` code, because it is a dependency for the preprocessing code, which is the first step of the pipeline. I initially considered removing the dependency on `IFU_C_iolibs` in favor of just using raw `cfitsio` and `wcslib` calls. However, data structures from `IFU_C_iolibs` are integrated pretty deeply into the preprocessing codebase, so we do need `IFU_C_iolibs`, at least initially.
As noted above, I decided to ditch CVS history: I created a new repository https://github.com/snfactory/ifuio and copied the code in. I made some changes in the process:
- flattened the directory structure
- use a simple makefile rather than Autotools (we'll see how well this works)
- change cfitsio and wcslib includes from, e.g., `#include "fitsio.h"` to `#include <fitsio.h>`, so that installed headers are used (a sketch of this substitution follows the list)
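For the include change, a one-off substitution along these lines should do it (a sketch only; the wcslib headers would get the same treatment, and the resulting diff should be reviewed before committing):

```
# Switch quoted cfitsio includes to the angle-bracket form so that the
# installed headers are used; repeat for the wcslib headers.
grep -rl '#include "fitsio.h"' . \
    | xargs sed -i 's|#include "fitsio.h"|#include <fitsio.h>|'
```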
I did not import the code for reading and writing the "MIDAS" format. It doesn't seem to be used in preprocessing, so I hope we can do without it.
In working on integrating `cubefit` and `extract_star`, I thought it would be useful to use an existing workflow tool to run this specific (small) subsection of the pipeline at scale on NERSC. The idea is to start from calibrated cubes (brought over from CC) and run the `cubefit` and `extract_star` steps from that starting point. Then we could try integrating the two codes and compare the full output of the un-integrated and integrated codes. I brought the calibrated cubes over to NERSC under `[...]/processing/0203-CABALLO/cubes-cal-corr/YY/DDD/...`
There are many available tools (see, e.g., https://github.com/pditommaso/awesome-pipeline). I've read and played around with a few:
- Makeflow: Appealing because it has make-inspired syntax and targets SLURM (among other backends). However, the language doesn't support pattern-matching rules, meaning that each concrete file rule needs to be written out separately. The website suggests using "your favorite scripting language" to generate the makeflow file in cases where you have a lot of similar files. However, this seems to me to defeat the purpose: essentially I have to come up with my own front-end language and also write a translator to the makeflow language. At that point, makeflow is only a SLURM scheduler for a given DAG. OK, but maybe there's something better.
- Nextflow: Also targets SLURM and has a relatively compact DSL for specifying the workflow, with pattern-matching features. Has the concept of "channels" for intermediate data between steps (not clear if there is a way to specify that you want to keep some of these data). However, it crashed the first time I tried using it on Edison and Cori on their examples. Probably a missing Java library?
- Common Workflow Language: Sounds like a good idea (to have a common language), but the actual language is incredibly verbose. By their own example, the equivalent of an 8-line makefile becomes over 80 lines long.
- Swift: Supported at NERSC (`module load swift` on Cori, `module load swift/0.96.2` on Edison) and has a compact and flexible language. Downsides: (1) It took over 10 minutes to run a simple example script, but this might be due to needing to specify the submission node's host IP address (see the config note for Edison here). (2) It seems to not use cached file products, which I think I really want. It will restart from midway through execution, but only if the previous job failed and a restart is explicitly requested. (3) I think I want a make-style ability to create only certain targets, which I don't think Swift can do.
- Tigres: Questions and notes: (1) Does it use cached file products by default? I can't tell from reading the docs. (2) It's open source, but development is still not open. Why not? (3) The SLURM backend looks overly simple (I think it is just one sbatch command per task).
- Make: make v4.0 (available on Cori, but not Edison, which has v3.81) has the ability to use a custom "shell" to run each task, and this can be set to `srun`. I tried a simple example and it seemed to work. However, I got a lot of messages from `srun` in the log, such as `srun: Job step creation temporarily disabled, retrying` and `srun: Job step created`. This seems to be due to too many srun commands in too short a time frame. This would be a problem running at very large scale when starting with a very parallel step.
The short answer: so far I don't really like any of the available tools. The Cori nodes have 32 cores and 128 GB of memory. For testing cubefit/extract_star, I might just use a single Cori node and run Make with `make -j 32`.
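A minimal sketch of what I have in mind (the directory layout and the `cubefit` invocation are placeholders, not the real command-line options), run from inside a single-node allocation:

```
# Hypothetical single-node sketch; paths and program options are placeholders.
cat > Makefile <<'EOF'
# GNU make >= 4.0 (the Cori version) lets us avoid literal tabs in recipes.
.RECIPEPREFIX = >

CUBES := $(wildcard cubes-cal-corr/*/*/*.fits)
FITS  := $(patsubst cubes-cal-corr/%.fits,results/%_fit.fits,$(CUBES))

all: $(FITS)

# one output per input cube; make -j fills the node's cores with these tasks
results/%_fit.fits: cubes-cal-corr/%.fits
> mkdir -p $(dir $@)
> cubefit $< $@
EOF

# use all 32 cores of the node
make -j 32
```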
With all raw data in hand and the `preprocess` code compiling, I'm starting from the beginning of the existing pipeline with the step `plan_file_quality`. It adds some information to the headers of the raw FITS files and then runs `preprocess`. The extra header information comes from the database (the `Target`, `Run`, `Exposure` and `Pose` tables). This data seems to mostly originate from log files, which are parsed in `SnfRun.py` by the constructors of the classes `SnfTarget`, `SnfRun`, `SnfExp` and `SnfPose`. (The full chain is: `SnfUpdate.py` calls `SnfData()`, which calls `SnfRead()`, which calls the constructors.) This all happens in the `snf_db_make` step of the pipeline.
So, to run preprocess, we need at least some data (about 25 keywords' worth) from these database tables. I could try to skip the database and just generate the headers directly from the log files, but there are a lot of subtle details and hacks in the log-file-parsing code. So to start out, I'm going to try regenerating these tables in an sqlite3 database, starting from the existing code in `SnfRun.py`. Later, I can decide whether the database step is really needed. This will additionally give me a better idea of the structure of the Run/Exp/Pose/Target tables. I'm creating a Python package `snfpipe` in this repo that will contain at least the log-parsing and db-filling code.
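A minimal sketch of the direction, with illustrative column names only (the real schema will come from `SnfRun.py` and the django models, not from this guess):

```
# Hypothetical stub tables in a local sqlite3 database; columns are placeholders.
sqlite3 snf.db <<'EOF'
CREATE TABLE IF NOT EXISTS target   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS run      (id INTEGER PRIMARY KEY, target_id INTEGER, night TEXT);
CREATE TABLE IF NOT EXISTS exposure (id INTEGER PRIMARY KEY, run_id INTEGER, fits_path TEXT);
CREATE TABLE IF NOT EXISTS pose     (id INTEGER PRIMARY KEY, exposure_id INTEGER, channel TEXT);
EOF
```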
Given how big an effort it is to do a full rewrite of the pipeline, we wanted to look at what it would take to simply install all the current pipeline software and database at NERSC and run there. The benefit would mainly be that NERSC has better user support and doesn't have the (frankly insane) "czar" system.
Possible roadblocks:
- Installing all the software:
  - For the Python code, this consists of determining all the Python code that is being used at CC. We could literally do the same thing as at CC: get the code to NERSC in a haphazard way and just add paths to the `PYTHONPATH` environment variable. But it would be nicer to actually install all the production code into a consistent virtual environment (or similar), instead of having it scattered all about. My idea so far was to use a conda environment to install everything (Python and not-Python) in a self-contained manner.
  - For compiled code, it is unclear if some code will compile at NERSC (32-bit vs. 64-bit issues), but presumably we can do the same as at CC and compile everything in 32-bit compatibility mode. Still, if `configure; make` doesn't just work, there could be digging required.
- Everything in the current pipeline assumes that PBS/Torque (I think) is the queue-managing software, but NERSC uses SLURM. For example, the `plan_*.py` scripts write batch job scripts with `#PBS` comments to specify job options, but SLURM uses `#SBATCH`. Some of the actual options may be different (see the sketch after this list). The higher-level scripts that run the `plan_*.py` scripts probably assume the calls are `qsub`, not `sbatch`. But I'd have to check.
- One part of the pipeline (I forget which) I think submits jobs from a job running on the cluster. NERSC may have guards against this or get mad about it. Or they may not.
- One would have to set up the django front-end webserver to run on the NERSC science gateways in order to see the processing output on the web. This isn't necessarily a stumbling block, but is one more thing for someone to learn.
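For reference, the kind of translation involved (standard PBS/SLURM equivalents; the values are illustrative, and the options actually emitted by the `plan_*.py` scripts would need checking):

```
# Common PBS -> SLURM correspondences for the generated job scripts:
#
#   #PBS -N jobname            ->  #SBATCH --job-name=jobname
#   #PBS -l walltime=02:00:00  ->  #SBATCH --time=02:00:00
#   #PBS -l nodes=1:ppn=32     ->  #SBATCH --nodes=1 --ntasks-per-node=32
#   #PBS -q regular            ->  #SBATCH -p regular
#   #PBS -o out.log            ->  #SBATCH -o out.log
#
# and the submission command itself:
#   qsub job.sh                ->  sbatch job.sh
```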
How is the current software stack "installed" at CC? For Python code, it seems to be largely by adding paths to `snprod`'s `PYTHONPATH`, which includes:
- `.../cubefit/lib/python2.7/site-packages`: Cubefit
- `.../SNFactory/Tasks/SNF-02-03/Processing/database/django`: django site (`processing` -> `processing_095`, `www` -> `www_095`)
- `.../SNFactory/Tasks/SNF-02-03/lib/python/site-packages`: symlinks to "installed" python modules in `.../SNFactory/Tasks/SNF-02-03/[Processing,Calibration,IDR]`.
- `.../SNFactory/Offline/SNF-02-03/lib/python`: stuff installed with the build script in `Offline`: Toolbox, Prospect, PySNIFS modules, several other modules.
- `.../IFU/SNF-02-03/Snifs/user/bin`: a few random Python modules, such as `libExtinction.py`.
- `.../throng/.../local/dislin-8.2/dislin/python`: python wrappers around dislin (appears to be in the source code dir)
- `.../throng/.../lib/python2.7/site-packages`: external stuff, such as fitsio, PyFFTW, numpy, scipy, pyfits, matplotlib, and a hacked-in django 0.97.
Rather than using `PYTHONPATH`, it would be nice to install all the code into a single location. conda seems like an appropriate tool for this, as it has the ability to install any software (not just Python) and manage multiple environments. One could name the conda environments by the processing tag, such as `snf-02-03`. Following are some commands to set up a conda environment with some of the dependencies. The `build_*` scripts in the CVS repository sometimes do similar things, but are too specific to the environment at CC to be used directly.
```
TAG=snf-02-03

# create environment and switch to it
conda create -n $TAG python=2.7
source activate $TAG

# conda and pip packages
conda install numpy scipy matplotlib=1.5
conda install -c openastronomy fitsio
conda install -c conda-forge pyfftw
pip install pyfits==3.1.2

# download and patch old django
# (patch prevents md5 deprecation warning)
wget "http://snfactory.in2p3.fr/soft/PackageArchive/snf_django_svn_01.tar.gz"
tar -xzf snf_django_svn_01.tar.gz
rm snf_django_svn_01.tar.gz
wget "http://snfactory.in2p3.fr/soft/PackageArchive/django_hashlib.diff"
cd snf_django_svn_01; patch -p1 < ../django_hashlib.diff; cd ..
rm django_hashlib.diff

# install patched django into conda env by copying it to site-packages.
cp -r snf_django_svn_01/django $HOME/.conda/envs/$TAG/lib/python2.7/site-packages
rm -rf snf_django_svn_01
```
The Postgres database is central to the pipeline, so we need to set that up at NERSC. The general approach is to mirror the database schema from CC, but not copy any of the data. There is now an (empty) `snfactry` postgres database running at NERSC. The django code in the snfactory CVS repository that defines the models may be able to set up the database. The `SNFactory/Tasks/build_tasks` script has the ability to initialize the database via django. It can also "optimize" it (adding some indices) and fill it.
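If the django route works, the initialization would presumably look roughly like the following (hedged: the location of manage.py and whether this django version uses `syncdb` should be confirmed against `build_tasks`):

```
# Hypothetical sketch: let django create the tables in the empty NERSC database
# from its models, roughly what build_tasks appears to automate.
source activate snf-02-03
cd .../SNFactory/Tasks/SNF-02-03/Processing/database/django   # path as laid out at CC
python manage.py syncdb   # creates tables for the installed models
```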
Based on discussion with Saul and Greg, one useful thing will be to convert more of the CVS repository to git. In short, this will be useful because it will make it easier to (1) change the build/install scripts to be more sane and (2) work on changing code and testing it locally. To this end, I started looking at converting stuff in `SNFactory/Tasks`, which is used in most (if not all) steps of the pipeline.
What is the right unit to split into a git repo? All of `SNFactory/Tasks` is 1.3G, which is too big for a git repo. It includes a lot of stuff we don't want or need, like DDT. Most of the stuff that actually gets installed is in `SNFactory/Tasks/Processing`, so I think it makes sense to git-ize that.
Unfortunately, I can't make the repo public immediately because there might be passwords stored in code (particularly in the DB part of the code), or even in the CVS history. We need to ask around to make sure there are no passwords that are still in use, or read the code pretty exhaustively.
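Before flipping the repository public, a quick (and certainly not exhaustive) scan of the converted repository and its history might look like this; it is no substitute for actually reading the DB-access code:

```
# Crude credential scan over the working tree and the full git history.
git grep -inE 'password|passwd' -- '*.py' '*.cfg' '*.sh'
git log -p | grep -inE 'password|passwd' | less
```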