ngsarchiver
is a small Python package that
provides a utility for managing the archiving of
directories containing Next Generation Sequencing
(NGS) data within the local data repository of
the Bioinformatics Core Facility (BCF) at the
University of Manchester.
Specifically it provides a single executable called
archiver
which can be used to:
- Archive and compress sequencing and analysis data on a per-run (i.e. per-directory) basis, to produce compressed archive directories
- Copy arbitrary directories to an archive location without compression or other manipulations, to produce copy archive directories
In addition, for compressed archive directories, it can also be used to:
- Interrogate contents of archived directories
- Restore archived runs and recover subsets of archived data
The formats of the two types of archive directory are described in the relevant sections below.
Note that this README
comprises the whole of the
usage documentation.
ngsarchiver
exists primarily as a tool for
archiving data within the BCF datastore.
This datastore exists as a hierarchy of
directories and files residing across a number of
Linux filesystems. The top levels of the hierarchy
act to differentiate data from different years and
from different sequencing platforms or data sources
(e.g. external facilities); however, the exact
structure is not relevant within the context of
ngsarchiver: the key point is that
within this hierarchy each distinct sequencing run or
external dataset has its own directory, which is
referred to as a "run directory", and archiving is
performed at the run directory level.
"Archiving" in this context implies that the run directory has reached a point in its lifecycle where no new data will be added to it in future.
Two types of archiving operation are supported by ngsarchiver, either:
- Compressing the data where possible (to reduce the overall space required to store it), and converting to an essentially read-only format which can be restored in whole or in part (so archived data can be recovered back to its original form), or
- Copying the data to an archive location (optionally also transforming it in order to allow as full a copy as is possible).
There are broad conventions within the BCF about how
data is structured within each run directory; however,
these conventions can vary over time, between platforms
and even individual analysts. As a result there is a
degree of non-homogeneity across run directories
within the datastore. For this reason ngsarchiver
adopts a largely agnostic approach which makes minimal
assumptions about the internal structure of data within
each run directory. Specifically each run directory
it encounters is automatically classified as one of
three types:
- GenericRun is a directory with a mixture of files and subdirectories at the top level (and which isn't one of the other types); all content is placed in a single subarchive.
- MultiSubdirRun is a directory with only subdirectories at the top level; each subdirectory has its own subarchive.
- MultiProjectRun is a directory with a projects.info file at the top level along with a mixture of other files and subdirectories; each project subdirectory has its own subarchive, with the non-project content grouped into an additional subarchive.
ngsarchiver recognises two additional types:
- ArchiveDirectory is a compressed archive directory created by ngsarchiver
- ArchiveCopyDirectory is a copy archive directory created by ngsarchiver
These last two types cannot be further archived; their formats are described in the relevant sections below (Compressed archive directory format and Copy archive directory format respectively).
The ngsarchiver
package is intended to provide
a set of simple zero-configuration tools with minimal
dependencies, that can be used to create archive
directories, and to check and restore data from
them. However (in the case of compressed archives) it
should also be possible to verify and recover data
manually with some additional effort using just the
standard Linux command line tools tar, gzip and md5sum.
Finally, no functionality is provided for the management of the wider datastore beyond the creation of the archive directories; it simply creates archive directories and enables their verification and the recovery of data from them. It is up to the user to manage where the generated archive directories are stored and what should happen to the original run directories after they have been archived.
See the CHANGES file for version history.
The minimal installation is to unpack the tar.gz
archive file containing the Python files; it should be
possible to run the archiver
utility without any
further configuration by specifying the path to
the executable in the unpacked bin
subdirectory.
The only requirement is that python3
should be
available on the PATH.
Alternatively the code can be installed locally,
system-wide, or into a virtual environment using
pip
with the appropriate options.
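For example, a minimal sketch of an installation into a virtual environment (the archive file name below is a placeholder for the actual downloaded version):

python3 -m venv venv
source venv/bin/activate
pip install ./ngsarchiver-X.Y.Z.tar.gz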
Archiver operations are invoked by specifying the relevant subcommand, using command lines of the form:
archiver SUBCOMMAND args
for example:
archiver info /data/runs/230608_SB1266_0008_AHXXBYA
The specific required and optional arguments will
depend on the subcommand and are outlined in the sections
below; alternatively, specifying -h or --help with
a subcommand will list all available options for that
command (or, with no subcommand, all the available
subcommands).
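For example, to list the available subcommands, and then the options for the archive subcommand:

archiver --help
archiver archive -h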
Note
The archiver is expected to be run by a non-root user.
Examines one or more directories and reports characteristics such as total size, type (as outlined in the section Archive directory format below) and whether the directory contains external and/or broken symbolic links, hard-linked files and so on.
The simplest form of usage is:
archiver info /PATH/TO/DIR
Multiple directories can be supplied:
archiver info /PATH/TO/DIR1 /PATH/TO/DIR2 ...
Including the --list
argument will provide more
detailed information on any "problem" files found
within the directory, which can then be addressed
prior to archiving.
Alternatively the --tsv
argument will print the
basic information in a single tab-delimited line
for each directory. (Note that this option is not
compatible with the --list
option).
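For example, a quick survey of several run directories in tab-delimited form, followed by a detailed check of a single directory (the paths are illustrative):

archiver info --tsv /data/runs/RUN1 /data/runs/RUN2
archiver info --list /data/runs/RUN1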
Makes a compressed archive directory from the specified directory, for example in its simplest form:
archiver archive /PATH/TO/DIR
The resulting archive directory will be named
DIR.archive
and will be created in the current
working directory by default. Note that an existing
archive directory will not be overwritten.
The source directory is unchanged by the creation of the archive directory and must pass a number of checks before the archive is created. These checks are to identify potential issues that could arise later with the generated archive (see the section Problem situations below).
If any check fails then the archive will not be
created unless the --force
argument is also
specified (in which case the archive will be
created regardless of the checks). Specifying the
-c
argument performs the checks without the
archive creation.
When --force
is specified then unreadable files
and directories will be omitted from the archive.
The format of the archive directory is described below in a separate section (see Compressed archive directory format). The archiver will refuse to make an archive of an archive directory.
By default there is no limit on the size of tar.gz
files created within the archive; the --size
argument allows a limit to be set (e.g.
--size 50G), in which case multiple tar.gz
files will be created, none of which will exceed
this size (aka "multi-volume archives").
By default the archiving uses gzip compression
level 6 (the same default as Linux gzip); this
has been found to give a reasonable trade-off
between speed and amount of compression. The
--compress-level
argument allows the
compression level to be explicitly set on the
command line if a higher or lower level of
compression is required.
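Putting these together, a sketch of a typical invocation which first runs the checks alone and then creates a multi-volume archive with a higher compression level (the size limit and compression level shown are illustrative choices, not defaults):

archiver archive --check /PATH/TO/DIR
archiver archive --size 25G --compress-level 9 /PATH/TO/DIR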
Makes a copy archive directory from the specified directory, essentially copying the directory and its contents to another location without performing compression.
At its most basic this is a straight copy of the source directory, with some metadata files added.
For example:
archiver copy /PATH/TO/SRC_DIR /PATH/TO/ARCHIVES_DIR
This will create a copy of SRC_DIR
as
/PATH/TO/ARCHIVES_DIR/SRC_DIR
and verify the
contents against the original version.
The format of the archive directory is described below in a separate section (see Copy archive directory format). The archiver will refuse to make an archive of a copy archive directory.
A directory called SRC_DIR
must not already exist in
the target location.
If the destination directory is not explicitly specified on the command line then it defaults to the current directory.
The copy will be aborted unconditionally for the following cases:
- The original directory contains files or directories which cannot be read by the user running the copy operation
- The original directory contains files or directories where case sensitivity is required to differentiate them (e.g. myfile.txt and myFile.txt), but the target filesystem doesn't support case-sensitive file names.
- The original directory is already some form of archive directory
There is no way to override this behaviour; for
unreadable files, the solution is to fix the permissions
in the source directory. For case-sensitive filenames,
either use a target filesystem which does support case
sensitivity, rename the files in the source directory,
or use compressed archives (via the archive
command)
instead.
Other situations will also prevent the copy from being performed but can be overridden:
- The source directory contains broken or otherwise unresolvable symlinks, or symlinks to files outside the source directory (unresolvable symlinks include things like symlink loops)
- The source directory contains hard linked files
- The source directory contains files or directories where the owner or group IDs don't match a user or group on the current system.
In these cases the archiver can still be forced to
perform the copy by specifying the --force
option:
- Symlinks will be copied as-is (i.e. preserving their targets); this may result in broken symlinks in the copy
- Each instance of a hard linked file will be copied as a separate file (i.e. hard links are not preserved); this may result in multiple identical copies of each hard linked file
The --check
option will check for the above problems
without attempting to perform the copy.
There is also a set of options for handling symbolic links:
- --replace-symlinks will replace symlinks by their targets, provided that the target exists (i.e. is not a broken link, see --transform-broken-symlinks below) and that it's not a directory (see --follow-dirlinks)
- --transform-broken-symlinks will replace broken and unresolvable symbolic links with a file containing the name of the link target
- --follow-dirlinks will replace symlinked directories with actual directories, and recursively copy the contents of each directory
Symlink replacement may be necessary when copying to a file system which doesn't support the creation of symbolic links.
Note that if --follow-dirlinks is used, the copied
directories are not checked before the copy operation
starts, and so may contain "problem" entities which
can cause the operation to fail.
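For example, a sketch of a copy where working symlinks are replaced by their targets and broken or unresolvable links are transformed into placeholder files (the paths are illustrative):

archiver copy --replace-symlinks --transform-broken-symlinks /PATH/TO/SRC_DIR /PATH/TO/ARCHIVES_DIR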
Checks the integrity of an archive directory created
by the archive
command, for example:
archiver verify /PATH/TO/ARCHIVE_DIR
Restores a complete copy of the original directory from an archive directory, for example in its simplest form:
archiver unpack /PATH/TO/ARCHIVE_DIR
By default the restored copy will be created in the current working directory. Note that an existing directory with the same name will not be overwritten.
The restored archive contents are also verified using their original checksums as part of the unpacking.
The timestamps and permissions of the contents are also restored (with the caveat that all restored content will have read-write permission added for the user unpacking the archive, regardless of the permissions of the original files).
Ownership information is not restored (unless the archiving and unpacking operations are both performed by the superuser).
If only a subset of files need to be restored from
the archive then the extract
command is recommended
instead of the full unpack
.
Locates files within one or more compressed archive
directories using shell-style pattern matching based
loosely on that available in the Linux find
command.
For example to search for all gzipped Fastq files:
archiver search -name "*.fastq.gz" /PATH/TO/ARCHIVE_DIR
Using -name
only considers the filename part of
the archived files; alternatively -path
can be
used to include whole paths, for example:
archiver search -path "*/*.fastq.gz" /PATH/TO/ARCHIVE_DIR
Multiple archive directories can also be specified in
a single search
command invocation, in which case
the search will be performed across all the specified
archives.
Restores a subset of files from a compressed archive directory using shell-style pattern matching.
For example to extract all gzipped Fastq files:
archiver extract -name "*.fastq.gz" /PATH/TO/ARCHIVE_DIR
By default the matching files will be extracted to
the current working directory with their leading
paths removed; to keep the full paths for the
extracted files use the -k
option.
Note that existing files with the same name will not be overwritten.
Note also that the -name option operates slightly
differently to its counterpart in the search command,
as in this case it will match both filenames and paths.
Extracted files will have the same timestamps and permissions as the originals (with the caveat that all restored content will have read-write permission added for the user extracting the files, regardless of the permissions of the originals).
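For example, to extract all gzipped Fastq files while keeping their leading paths:

archiver extract -k -name "*.fastq.gz" /PATH/TO/ARCHIVE_DIR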
Compares the contents of two directories against each other, and raises an error if the contents differ.
compare
is provided primarily to enable a restored
archive to be checked against the original directory
(for example before it is removed from the system):
archiver compare /PATH/TO/DIR1 /PATH/TO/DIR2
The comparison checks for missing and extra files, and that files have the same checksums.
(Note however that it doesn't check timestamps, permissions or ownership.)
Compressed archive directories are regular directories
named after the source directory with the suffix
.archive
appended, which are created using the
archive
command.
Individual files and directories within the source
directory are bundled together in one or more tar
files which are compressed using gzip
, and MD5
checksums are generated both for the original files (so
they can be checked when restored) and for the
archive components (providing an integrity check
on the archive itself).
Within a compressed archive directory there will be:
- one or more .tar.gz archive files;
- zero or more regular files;
- a set of MD5 checksum files with the file extension .md5, with one checksum file for each .tar.gz file and each regular file (these checksum files contain the MD5 sums for each of the files inside the .tar.gz files);
- a subdirectory called ARCHIVE_METADATA (or a hidden subdirectory .ngsarchiver, for legacy compressed archives) which contains additional metadata files (for example a JSON file with metadata items, an MD5 file with checksums for each of the "visible" archive components for integrity verification, and a file which lists the original username and group associated with each file); if files were excluded from the archive (e.g. because they were unreadable) then these will be listed in an additional file;
- a file called ARCHIVE_README.txt which provides a human-readable explanation of the archive directory structure and contents, and how to recover the original data;
- a file called ARCHIVE_FILELIST.txt which lists the relative paths of all the files and directories in the source directory;
- a file called ARCHIVE_TREE.txt which provides a visual tree representation of the source directory structure (similar to that produced by the Linux tree command line utility).
The .tar.gz
archives and regular files together
are sufficient to recover the contents of the original
directory; the MD5 checksum files can be used to verify
that the recovered files match the originals when they
are unpacked.
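For example, a minimal sketch of manual recovery and verification using only the standard tools (the subarchive name subdir1 is hypothetical, and it is assumed here that the paths in the .md5 file are relative to the unpacking directory):

cd /path/to/recovery/area
tar xzf /PATH/TO/ARCHIVE_DIR/subdir1.tar.gz
md5sum -c /PATH/TO/ARCHIVE_DIR/subdir1.md5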
.tar.gz
files with the same basename are referred
to as subarchives. A subarchive can consist of a
single .tar.gz
file (e.g. subdir.tar.gz
), or
a collection of .tar.gz
files with an incrementing
number component (e.g. subdir.00.tar.gz
,
subdir.01.tar.gz
etc), referred to as a
multi-volume archive.
The exact number and naming of the .tar.gz
files
and the presence or otherwise of additional regular files
depends on both the archiving mode used to create the
archive directory and the "type" of the source directory.
Multi-volume archives are created when the archive
command is run specifying a maximum volume size, and
are intended to mitigate potential issues with creating
extremely large .tar.gz
archives.
Copy archive directories are created using the
copy
command. A copy archive will have the same
name as the source directory.
Individual files and directories from the source directory are copied directly as-is to the archive directory. By default symbolic links are also copied as-is; alternatively they may be transformed in the copy, depending on the options specified when the copy is made.
A copy archive directory will contain an additional
subdirectory created by the archiver called
ARCHIVE_METADATA, which in turn contains the
following files:
- manifest: a manifest file listing the owner and group associated with the original files
- checksums.md5: MD5 checksum file with checksums generated from the source files
- archiver_metadata.json: metadata about the archiver, user and creation date of the copy.
Additional files may be present dependent on the contents of the source directory:
- symlinks: a tab-delimited file listing each of the symlinks in the source directory along with their targets, and the path that the target resolved to (only present if the source contained symlinks)
- broken_symlinks: a file with the same format as the symlinks file above, but only containing information on the broken symlinks in the source directory (only present if the source contained broken symlinks)
- unresolvable_symlinks: a tab-delimited file listing each of the unresolvable symlinks in the source directory along with their targets (only present if the source contained unresolvable symlinks)
By default a copy archive directory should be the same
as the source directory (with the addition of the
ARCHIVE_METADATA
subdirectory). However the
copy may differ if any of the --replace-symlinks,
--transform-broken-symlinks or --follow-dirlinks
options were specified (see the copy command for
more details).
There are a number of problems that can be encountered when creating an archive:
- Unreadable files: the presence of files or directories in the source where the user running the archiving doesn't have read access means that those files cannot be included in the archive.
- Hard links: depending on the archiving mode, the presence of hard links can result in bloating of the archive directory, as the hard linked file may be included multiple times either within different subarchives or within different volumes of a single subarchive (or both). The worst case scenario in this case means that both the archive and the unpacked version could be substantially larger than the source.
Additionally the following situations may cause issues when archives are restored:
- External symlinks: these are symbolic links which point to files or directories which are outside of the source directory, which can potentially result in broken links when the symlinks are restored from the archive.
Other situations are highlighted but are unlikely to cause problems in themselves when data are restored:
- Broken symlinks: these are symbolic links which point to targets that no longer exist on the filesystem.
- Unresolvable symlinks: these are symbolic links which cannot be resolved for some reason (for example, a symlink loop where following the link eventually leads back to itself).
- Unknown user IDs: where the user name is replaced by a number (user ID aka UID) which doesn't correspond to a known user on the system.
There are currently no workarounds within the archiver for
any of these issues when using the archive
command: it
is recommended that where possible steps are taken to address
them in the source directory prior to creating the archive;
alternatively they can be ignored using the --force
option of the archive
command (with the consequences
outlined above).
Similarly the --force option is also available for the
archiver's copy command; in addition, mitigations are
available for some of the issues:
- Working symbolic links can be replaced by their target files or directories using the --replace-symlinks and --follow-dirlinks options respectively;
- Broken and unresolvable symlinks can be replaced with placeholder files using the --transform-broken-symlinks option.
Note that replacing symbolic links and following directory links can result in significant bloating of the size of the copy compared to the original.
The following bash script provides an example recipe for archiving to the compressed format:
#!/usr/bin/bash
# Move to scratch area
cd /scratch/$USER
# Set environment variables
export RUN_DIR=/path/to/run_dir
export ARCHIVE_DIR=$(pwd)/$(basename $RUN_DIR).archive
# Check run directory
archiver archive --check $RUN_DIR
if [ $? -ne 0 ] ; then
    echo Checks failed >&2
    exit 1
fi
# Create archive directory in scratch
archiver archive $RUN_DIR
if [ $? -ne 0 ] ; then
    echo Failed to create archive dir >&2
    exit 1
fi
# Unpack and check against original
archiver unpack $ARCHIVE_DIR
archiver compare $RUN_DIR $(pwd)/$(basename $RUN_DIR)
if [ $? -ne 0 ] ; then
    echo Unpacked archive differs from original >&2
    exit 1
fi
# Relocate archive dir to final location
mv $ARCHIVE_DIR /path/to/final/dir/
# Verify relocated archive directory
archiver verify /path/to/final/dir/$(basename $ARCHIVE_DIR)
if [ $? -ne 0 ] ; then
    echo Failed to verify archive dir >&2
    exit 1
fi
The code was tested on a set of real runs and the following initial observations have been made:
- Typically we saw that compressed archive directories were around 70-80% of the size of the original run. A significant number showed greater reductions, evenly distributed in the range 30-70% of the original size.
- There was no difference in the final size between single-volume and multi-volume archives in the benchmarking data, indicating that choice of volume size doesn't significantly affect the amount of compression overall.
- There is relatively little correlation between the amount of compression versus the size of the original run.
- As a rule of thumb it appeared that the percentage of pre-existing compressed content in a run predicted the minimum degree of overall compression. For example, for a run where 80% of the contents are already compressed we would expect the final archive to be no larger than 80% of the original size (although the actual compression could be greater). However, it is not clear why this is, or whether it is generally true.
Data from running the archiver (with the run names redacted) can be found in the file benchmarking_redacted.tsv; this gives details of the sequencing platform, total size of the run (and amount of those data that are already compressed), the time taken to create archives for different choices of volume sizes along with the archive size and compression ratio, and the time taken to restore the data from each archive.
Note
These data are from running the code on our systems at Manchester; timings etc. are likely to differ on other systems.
This software is licensed under the 3-Clause BSD License (BSD-3-Clause).