
A command-line manager for bespoke reference datasets used in bioinformatic analyses

nrminor/refman

RefMan: a simple biological reference manager

Bioinformatics involves juggling lots of files, particularly reference datasets (FASTA, GenBank, EMBL, Oh My!) with associated annotation and genomic range data. refman evolved out of our internal desire to simplify accessing references from many sources--both official and bespoke--as well as my own desire to write more Rust🦀.

refman can be thought of as a simpler and less general-purpose implementation of what SciDataFlow does. For use cases beyond getting a few reference datasets from disparate places, I highly recommend giving SciDataFlow a try. But if you're like me and your head is spinning with all the different combinations of reference datasets each of your projects needs, and you want a fast way of pulling these combinations together, refman is for you!

Installation

Precompiled Binary Releases - Coming soon!

Precompiled static binaries for a variety of platforms will soon be available in refman's GitHub releases.

Crates.io

Since v1.0.0, refman has been available on crates.io. Assuming you have the Rust toolchain installed, simply install it with cargo install refman.

Build from source

If you have git available, are on a Unix-like system, and have the Rust toolchain installed, you can download and compile refman from source with the following:

# download the source code with git clone
git clone https://github.com/nrminor/refman

# change into the project root directory
cd refman

# compile and install it onto your $PATH with cargo
cargo install --path="."

Use it through Python

refman also has a Python API that can soon be installed with pip install py-refman. See the py-refman GitHub repository for more information about its usage.

Quick Start

refman centers around a workflow of three subcommands: refman init, refman register, and refman download. These commands initialize a project with metadata, register dataset URLs, and download datasets respectively. Use --help or -h on each subcommand to explore the command line interface. The top-level interface will look like this when you run refman -h:


░       ░░░        ░░        ░░  ░░░░  ░░░      ░░░   ░░░  ░
▒  ▒▒▒▒  ▒▒  ▒▒▒▒▒▒▒▒  ▒▒▒▒▒▒▒▒   ▒▒   ▒▒  ▒▒▒▒  ▒▒    ▒▒  ▒
▓       ▓▓▓      ▓▓▓▓      ▓▓▓▓        ▓▓  ▓▓▓▓  ▓▓  ▓  ▓  ▓
█  ███  ███  ████████  ████████  █  █  ██        ██  ██    █
█  ████  ██        ██  ████████  ████  ██  ████  ██  ███   █

refman (v1.0.0)
------------------------------------------------------------
`refman` is a simple command-line tool for managing biological reference datasets often
used in bioinformatics. These datasets may include raw sequence files, files encoding
annotations on those sequences, etc. `refman` makes it easier to manage and download
these kinds of files globally on the user's machine, or on a per-project basis. It
uses a human-readable TOML file to track which files it's managing, which can be shared
between users to aid scientific reproducibility.


Usage: refman [OPTIONS] [COMMAND]

Commands:
  init      Initialize a registry for the current project without registering any datasets. [aliases: i, new]
  register  Register a new file or set of files with a given dataset label. [aliases: r, reg]
  remove    Remove the files associated with a given dataset label [aliases: rm]
  list      List all previously registered reference datasets [aliases: l]
  download  Download one or many reference datasets registered in the refman registry. [aliases: d, dl, down, get, fetch]
  help      Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose...  Increase logging verbosity
  -q, --quiet...    Decrease logging verbosity
  -h, --help        Print help (see more with '--help')
  -V, --version     Print version

Detailed Usage

refman's first deployment was in the bioinformatic pipeline oneroof, which is run routinely with different reference datasets depending on the input data. These datasets were registered in the pipeline's refman.toml file with the same series of commands that would be used with any project. To demonstrate this workflow, those commands are reproduced here.

Project Initialization

First, to use refman as part of oneroof's setup, we initialize a project, like so:

refman init -t oneroof -d "Reference files needed for routine runs on viral amplicon data from SARS-CoV-2 and H5N1"

This will create a refman.toml that looks like this:

[project]
title = "oneroof"
description = "Reference files needed for routine runs on viral amplicon data from SARS-CoV-2 and H5N1"
last_modified = "2025-03-19T17:24:04.673076Z"
global = false
datasets = []

No datasets have been registered yet. This refman.toml also uses the tool's default behavior, which includes making each refman.toml local to a project instead of global to a machine. Support for global usage will increase with time, but in general we recommend that refman is used on a per-project basis to avoid confusion.

Registering URLs with Datasets

Next, datasets for a few oneroof configurations were registered, like so:

# first, datasets for running oneroof on SARS-CoV-2 amplicons with the QIASeq Direct Enhanced Boosted primer set
refman register -l "sc2-qiaseq" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3.fasta" \
--genbank "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3.gbk" \
--gff "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/MN908947.3_corrected_orf1.gff" \
--bed "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/sars-cov-2/qiaseq_direct_boosted.bed"

# second, datasets for H5N1 amplicons from our own bespoke H5N1 tiled primer set
refman register -l "h5n1-B-custom" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/custom_reference.fasta?contentDisposition=attachment" \
--genbank "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/annotation-custom.gbk?contentDisposition=attachment" \
--bed "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/final_truth_no_dashes.bed?contentDisposition=attachment"

# and third, a simpler dataset for H5N1 whole-segment amplicons for sequencing on Oxford Nanopore instruments
refman register -l "h5n1-B-segmental" \
--fasta "https://dholk.primate.wisc.edu/_webdav/dho/public/DHO%20Lab%20Bespoke%20Reference%20Dataset%20Registry/Pathogen%20Genomics/%40files/H5N1-B.3.13/h5_cattle_genome_root_segments.fasta?contentDisposition=attachment"

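Note that, as documented in the help menu for refman, reg and r are aliases for the register subcommand. All refman subcommands have shorthand aliases. Also, URLs must be provided between quotes so that the shell passes them through unchanged; some of the URLs above contain a ?, which an unquoted shell word would treat as a glob pattern.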
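If you generate register commands from a script, the quoting can be handled for you. The sketch below is an illustration only: it builds an argument list from the flags shown above (-l, --fasta, and so on) and uses Python's shlex module to print a shell-safe command line. The helper name register_command and the example.org URL are hypothetical.

```python
import shlex


def register_command(label: str, files: dict[str, str]) -> list[str]:
    """Build an argument list for `refman register` from flag-name/URL pairs."""
    argv = ["refman", "register", "-l", label]
    for flag, url in files.items():
        argv += [f"--{flag}", url]
    return argv


# Hypothetical URL with a query string, similar in shape to the H5N1 examples.
url = "https://example.org/refs/custom_reference.fasta?contentDisposition=attachment"
argv = register_command("h5n1-B-segmental", {"fasta": url})

# shlex.join quotes any argument containing shell-special characters such as `?`.
print(shlex.join(argv))
```

Passing the list directly to subprocess.run avoids the shell entirely, so no quoting is needed in that case.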

Before completing the registration process, refman uses the Rust lychee library to check that each provided URL is valid and points to a resource that exists. This prevents invalid entries from being written to refman.toml when it is managed through the command-line interface.
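To give a rough sense of what such a check involves, here is a simplified sketch using only Python's standard library. It is not how lychee works internally and is not refman's code; the function names are hypothetical. It first checks that the URL is syntactically an http(s) URL, then sends a HEAD request to see whether the server reports that the resource exists.

```python
from http.client import HTTPException
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def is_well_formed(url: str) -> bool:
    """Return True if the string parses as an http(s) URL with a host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def resource_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the URL succeeds (network required)."""
    if not is_well_formed(url):
        return False
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, HTTPException):
        # Covers HTTP error statuses, DNS failures, timeouts, and bad responses.
        return False


print(is_well_formed("https://github.com/nrminor/refman"))
print(is_well_formed("refman.toml"))
```

Some servers reject HEAD requests, so a more robust checker would fall back to a GET request; this sketch leaves that out.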

Downloading Datasets

Once these datasets are registered, they can later be deserialized from refman.toml and used to download those resources. For oneroof, this most often involves downloading datasets for a SARS-CoV-2 run, like so:

refman download sc2-qiaseq -d assets

This will download all the files in the dataset labeled "sc2-qiaseq" (registered above) and place them in a destination directory called "assets". As in the register subcommand, URLs will be checked for validity before being used to download files. Note that the dataset label is case-sensitive and must exactly match a dataset registered with refman.
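The two ideas in this step, exact label matching and placing files in a destination directory, can be sketched as follows. This is an illustration under stated assumptions, not refman's implementation: the in-memory registry layout, the example.org URL, and the rule of naming each file after the last path segment of its URL are all hypothetical, and refman's actual file naming may differ.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

# Hypothetical in-memory registry: label -> {file type -> URL}.
registry = {
    "sc2-qiaseq": {
        "fasta": "https://example.org/refs/%40files/sars-cov-2/MN908947.3.fasta",
    },
}


def find_dataset(label: str) -> dict[str, str]:
    """Look up a dataset by exact, case-sensitive label."""
    if label not in registry:
        raise KeyError(f"no dataset labeled {label!r}; labels are case-sensitive")
    return registry[label]


def destination_for(url: str, dest_dir: str) -> Path:
    """Derive a local path from the URL's last path segment, ignoring any query string."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return Path(dest_dir) / name


for file_type, url in find_dataset("sc2-qiaseq").items():
    print(file_type, destination_for(url, "assets"))
```

With this layout, asking for "SC2-QIASEQ" raises a KeyError, mirroring the exact-match requirement described above.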

Keep in mind that if you're coming to a new project with datasets managed with refman, you can always list what's available with refman list, and list full URLs for a particular dataset with refman list <LABEL>.

Roadmap

refman reached v1.0.0 as a minimum viable product, but it's still a work in progress. Features on the roadmap include:

  • more black-box-, white-box-, and especially doc-tests
  • link-checking with lychee_lib
  • download progress bars
  • a GitHub workflow for generating releases so that static binaries are available
  • publication on crates.io
  • the ability to symlink locations in the local filesystem as opposed to only pulling from the internet
  • expanded metadata fields or file formats, e.g., VCFs
  • validation that an entry in a given file format is actually that format
  • tasks/rules that tell refman to do some operation on a file once it's downloaded, potentially in an embedded scripting language like Lua or Gluon
  • stable config file format
  • a global dotfile format with higher precedence than the tool's current defaults
  • a python API with a slightly restricted feature set that is pip-installable

If you're interested in speeding any of these or other features along, or find any bugs, please reach out in the repo's issues!

Non-goals

  1. Maximal performance. The refman code contains a few clones here and there where it's convenient, though never for large amounts of data.
  2. Minimal dependencies. I've used this project in part to explore some interesting crates from the Rust ecosystem that I haven't used previously.

Citation(s)

Coming soon.
