diff --git a/README.md b/README.md
index f868a65..36065db 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,32 @@
 # biocommons.seqrepo

-SeqRepo is a Python package for storing and reading a local collection of biological sequences. The
-repository is non-redundant, compressed, and journalled, making it efficient to store and transfer
-multiple snapshots.
+SeqRepo is a Python package for storing and reading a local collection of
+biological sequences. The repository is non-redundant, compressed, and
+journalled, making it efficient to store and transfer multiple snapshots.

 ## Introduction

-Specific, named biological sequences provide the reference and coordinate sysstem for communicating
-variation and consequential phenotypic changes. Several databases of sequences exist, with
-significant overlap, all using distinct names. Furthermore, these systems are often difficult to
-install locally.
-
-SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences.
-Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1,
-or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions
-of large sequences.
-
-A "fully-qualified" identifier includes a namespace to disambiguate accessions from different
-origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo
-uses it as-is; if the namespace is not provided and the unqualified identifier refers to a unique
-sequence, it is returned; otherwise, the use of ambiguous identifiers raise an error.
-
-SeqRepo favors namespaces from [identifiers.org](https://identifiers.org) whenever available.
-Examples include [refseq]() and
+Specific, named biological sequences provide the reference and coordinate
+system for communicating variation and consequential phenotypic changes.
+Several databases of sequences exist, with significant overlap, all using
+distinct names. Furthermore, these systems are often difficult to install
+locally.
+
+SeqRepo provides an efficient, non-redundant and indexed storage system for
+biological sequences. Clients refer to sequences and metadata using familiar
+identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based
+identifiers. The interface supports fast slicing of arbitrary regions of large
+sequences.
+
+A "fully-qualified" identifier includes a namespace to disambiguate accessions
+from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the
+namespace is provided, seqrepo uses it as-is; if the namespace is not provided
+and the unqualified identifier refers to a unique sequence, it is returned;
+otherwise, the use of ambiguous identifiers raise an error.
+
+SeqRepo favors namespaces from [identifiers.org](https://identifiers.org)
+whenever available. Examples include
+[refseq]() and
 [ensembl]().

 [seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service) provides a REST interface
@@ -39,82 +43,82 @@ Released under the Apache License, 2.0.

 ## Citation

-Hart RK, Prlić A (2020). **SeqRepo: A system for managing local collections of biological
-sequences.** PLoS ONE 15(12): e0239883.
+Hart RK, Prlić A (2020). **SeqRepo: A system for managing local collections of
+biological sequences.** PLoS ONE 15(12): e0239883.
+

 ## Features

-- Timestamped, read-only snapshots.
-- Space-efficient storage of sequences within a single snapshot and across snapshots.
-- Bandwidth-efficient transfer incremental updates.
-- Fast fetching of sequence slices on chromosome-scale sequences.
-- Precomputed digests that may be used as sequence aliases.
-- Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.
+- Timestamped, read-only snapshots.
+- Space-efficient storage of sequences within a single snapshot and across snapshots.
+- Bandwidth-efficient transfer incremental updates.
+- Fast fetching of sequence slices on chromosome-scale sequences.
+- Precomputed digests that may be used as sequence aliases.
+- Mappings of external aliases (i.e., accessions or identifiers like
+  `NM_013305.4`) to sequences.

 ## Deployments Scenarios

-- Local read-only archive, mirrored from public site, accessed via Python API (see [Mirroring
-  documentation](docs/mirror.rst))
-- Local read-write archive, maintained with command line utility
-  and/or API (see [Command Line Interface
-  documentation](docs/cli.rst)).
-- Docker data-only container that may be linked to application container.
-- SeqRepo and refget REST API for local or remote access (see
+- Local read-only archive, mirrored from public site, accessed via Python API
+  (see [Mirroring documentation](docs/mirror.rst))
+- Local read-write archive, maintained with command line utility and/or API (see
+  [Command Line Interface documentation](docs/cli.rst)).
+- Docker data-only container that may be linked to application container.
+- SeqRepo and refget REST API for local or remote access (see
   [seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service))

 ## Technical Quick Peek

-Within a single snapshot, sequences are stored *non-redundantly* and *compressed* in an add-only
-journalled filesystem structure. A truncated SHA-512 hash is used to assess uniquness and as an
-internal id. (The digest is truncated for space efficiency.)
+Within a single snapshot, sequences are stored *non-redundantly* and
+*compressed* in an add-only journalled filesystem structure. A truncated SHA-512
+hash is used to assess uniquness and as an internal id. (The digest is truncated
+for space efficiency.)

 Sequences are compressed using the Block GZipped Format
-([BGZF](https://samtools.github.io/hts-specs/SAMv1.pdf))), which enables pysam to provide fast
-random access to compressed sequences. (Variable compression typically makes random access
-impossible.)
+([BGZF](https://samtools.github.io/hts-specs/SAMv1.pdf))), which enables pysam
+to provide fast random access to compressed sequences. (Variable compression
+typically makes random access impossible.)

-Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating
-redundant transfers (e.g., with rsync).
+Sequence files are immutable, thereby enabling the use of hardlinks across
+snapshots and eliminating redundant transfers (e.g., with `rsync`).

-Each sequence id is associated with a namespaced alias in a sqlite database. Such as
-``, ``, ``,
-``, ``. The sqlite database is mutable
-across releases.
+Each sequence id is associated with a namespaced alias in a sqlite database.
+Such as ``, ``,
+``, ``, ``.
+The sqlite database is mutable across releases.

-For calibration, recent releases that include 3 human genome assemblies (including patches), and
-full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes approximately 8GB. The minimum marginal size
-for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).
+For calibration, recent releases that include 3 human genome assemblies
+(including patches), and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes
+approximately 8GB. The minimum marginal size for additional snapshots is
+approximately 2GB (for the sqlite database, which is not hardlinked).

 For more information, see [docs/design.rst](docs/design.rst).

 ## Requirements

-Reading a sequence repository requires several Python packages, all of which are available from
-pypi. Installation should be as simple as [pip install biocommons.seqrepo]{.title-ref}.
+Reading a sequence repository requires several Python packages, all of which are
+available from pypi. Installation should be as simple as `pip install
+biocommons.seqrepo`.

 *Writing* sequence files also requires `bgzip`, which provided in the
-[htslib](https://github.com/samtools/htslib) repo. Ubuntu users should install the `tabix` package
-with `sudo apt install tabix`.
-
-Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get
-other systems working would be welcomed.
+[htslib](https://github.com/samtools/htslib) repo. Ubuntu users should install
+the `tabix` package with `sudo apt install tabix`.

-**Mac Developers** If you get "xcrun: error: invalid active developer path", you need to install
-XCode. See this [StackOverflow
-answer](https://apple.stackexchange.com/questions/254380/why-am-i-getting-an-invalid-active-developer-path-when-attempting-to-use-git-a).
+Development and deployments are on Ubuntu. Other systems may work but are not
+tested. Patches to get other systems working would be welcomed.

 ## Quick Start

-### OSX
+### OS X

 $ brew install python libpq

-### On Ubuntu 16.04
+### Ubuntu

 $ sudo apt install -y python3-dev gcc zlib1g-dev tabix

 ### All platforms
-
+
 $ python -m venv venv
 $ source venv/bin/activate
 $ pip install seqrepo
@@ -155,21 +159,25 @@ See [Installation](docs/installation.rst) and

 ## Environment Variables

-SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite
-query response caching. It defaults to 1 million but can also be set to
-"none" to be unlimited.
+SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query
+response caching. It defaults to 1 million but can also be set to "none" to be
+unlimited.

-SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during FASTA sequence retrievals.
-It defaults to 0 to disable any caching, but can be set to a specific value or "none" to be unlimited. Using
-a moderate value (>10) will greatly increase performance of sequence retrieval.
+SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during
+FASTA sequence retrievals. It defaults to 0 to disable any caching, but can be
+set to a specific value or "none" to be unlimited. Using a moderate value (>10)
+will greatly increase performance of sequence retrieval.

 ## Developing

-### OSX
+### Developing on OS X

 brew install python libpq bash

-### Ubuntu
+If you get "xcrun: error: invalid active developer path", you need to install
+XCode. See this [StackOverflow answer](https://apple.stackexchange.com/questions/254380/why-am-i-getting-an-invalid-active-developer-path-when-attempting-to-use-git-a).
+
+### Developing on Ubuntu

 sudo apt install -y python3-dev gcc zlib1g-dev tabix

@@ -181,11 +189,13 @@ Here's how to get started developing:

 ## Building a docker image

-Docker images are available at https://hub.docker.com/r/biocommons/seqrepo. Tags correspond to the
-version of data, not the version of seqrepo, because the intent is to make it easy to depend on a
-local version of seqrepo *files*. Each docker image is an installation of seqrepo that downloads
-the corresponding version of seqrepo data. When used in conjunction with docker volumes for
-persistence, this provides an easy way to incorporate seqrepo data into a docker stack.
+Docker images are available at https://hub.docker.com/r/biocommons/seqrepo.
+Tags correspond to the version of data, not the version of seqrepo, because the
+intent is to make it easy to depend on a local version of seqrepo *files*. Each
+docker image is an installation of seqrepo that downloads the corresponding
+version of seqrepo data. When used in conjunction with docker volumes for
+persistence, this provides an easy way to incorporate seqrepo data into a docker
+stack.

 ### Building