Merge pull request #41 from compbiocore/develop
Develop
fernandogelin authored Sep 4, 2019
2 parents 205c31e + f66584a commit 04e216e
Showing 29 changed files with 3,757 additions and 911 deletions.
160 changes: 160 additions & 0 deletions docs/assets/git_commit_usecase.svg
197 changes: 197 additions & 0 deletions docs/assets/git_push_usecase.svg
Binary file removed docs/assets/menu-filtered.png
Binary file not shown.
Binary file removed docs/assets/menu-full.png
Binary file not shown.
215 changes: 215 additions & 0 deletions docs/assets/newyaml_usecase.svg
250 changes: 250 additions & 0 deletions docs/assets/noexecute_usecase.svg
1,354 changes: 1,354 additions & 0 deletions docs/assets/refchef-cook_and_refchef-menu.svg
1 change: 0 additions & 1 deletion docs/assets/refchef-diagram.svg

This file was deleted.

Binary file modified docs/assets/refchef-serve.png
732 changes: 732 additions & 0 deletions docs/assets/refchef_overview.svg
174 changes: 174 additions & 0 deletions docs/assets/refchefmenu_usecase.svg
68 changes: 68 additions & 0 deletions docs/folders.md
@@ -0,0 +1,68 @@

RefChef creates folders to store your references. The names of these folders are based on:

1. The [`master.yaml`](./specs.md#master.yaml) key (which should match the 'name' entry under 'metadata' in `master.yaml`).

2. The 'component' entry under 'levels' in [`master.yaml`](./specs.md#master.yaml).

Here is the collapsed file tree that refchef created from the Tutorial part of the documentation and what the directory names are based on:

```bash
./Users/jwalla12/references #this directory is specified in refchef-cook or the config files
└── S_cerevisiae #this is named after the 'key' and the 'name' entry under 'metadata' in master.yaml
    ├── bowtie2_index #this folder is created in the master.yaml `commands` section.
    ├── bwa_index #this folder is created in the master.yaml `commands` section.
    ├── gtf #this folder is created in the master.yaml `commands` section.
    └── primary #this is named after the 'component' entry under 'levels' in master.yaml
```

Here is the expanded file tree:

```bash
./Users/jwalla12/references
└── S_cerevisiae
    ├── bowtie2_index
    │   └── metadata.txt
    ├── bwa_index
    │   └── metadata.txt
    ├── gtf
    │   ├── CHECKSUMS
    │   ├── Saccharomyces_cerevisiae.R64-1-1.87.gtf
    │   ├── final_checksums.md5
    │   ├── metadata.txt
    │   └── postdownload-checksums.md5
    └── primary
        ├── CHECKSUMS
        ├── Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
        ├── bowtie2_index -> /Users/jwalla12/references/S_cerevisiae/bowtie2_index
        ├── bwa_index -> /Users/jwalla12/references/S_cerevisiae/bwa_index
        ├── final_checksums.md5
        ├── metadata.txt
        └── postdownload-checksums.md5
```
This indicates that refchef has created symlinked directories for bowtie2 and bwa indices in `/Users/jwalla12/references/S_cerevisiae/primary`. This process (linking reference and index) is triggered by:
1. The addition of the `src:` line in bowtie2.yaml and bwa.yaml
2. Specifying `indices:` as the entry under `levels` in the YAML file

If we look at the output from [`refchef-menu`](./usage.md#refchef-menu), we see the UUID for the primary reference file, which is `dff337a6-9a1d-3313-8ced-dc6f3bfc9689`.

```bash
┌ 🐶 RefChef Menu ────────────────────────┬───────────┬───────────────────────────────────────────┬──────────────────────────────────────┐
│ name         │ organism                 │ component │ description                               │ uuid                                 │
├──────────────┼──────────────────────────┼───────────┼───────────────────────────────────────────┼──────────────────────────────────────┤
│ S_cerevisiae │ Saccharomyces cerevisiae │ primary   │ corresponds to genbank id GCA_000146045.2 │ dff337a6-9a1d-3313-8ced-dc6f3bfc9689 │
└──────────────┴──────────────────────────┴───────────┴───────────────────────────────────────────┴──────────────────────────────────────┘
```
In this clipping from bowtie2.yaml, note that the UUID is given in the `src:` entry of the `component` item, which is nested under `levels` and `indices`.

```yaml
S_cerevisiae:
  levels:
    indices:
      - component: bowtie2_index
        complete:
          status: false
        src: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
```
This indicates which primary reference was used to create the index file.
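
The linking step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not RefChef's actual implementation: `link_index_into_reference` is a hypothetical helper, and the directories are created in a temporary location so that no real references are touched.

```python
import os
import tempfile


def link_index_into_reference(reference_dir, index_dir):
    """Create a symlink inside reference_dir that points at index_dir.

    Hypothetical helper: it mirrors the layout shown above, where
    primary/bowtie2_index -> <root>/S_cerevisiae/bowtie2_index.
    """
    link_path = os.path.join(reference_dir, os.path.basename(index_dir))
    if not os.path.islink(link_path):
        os.symlink(index_dir, link_path, target_is_directory=True)
    return link_path


# Build a throwaway tree so the example does not touch real references.
root = tempfile.mkdtemp()
primary = os.path.join(root, "S_cerevisiae", "primary")
bowtie2 = os.path.join(root, "S_cerevisiae", "bowtie2_index")
os.makedirs(primary)
os.makedirs(bowtie2)

link = link_index_into_reference(primary, bowtie2)
print(link, "->", os.readlink(link))
```

The link is named after the index folder, which is why `primary` in the expanded tree contains entries called `bowtie2_index` and `bwa_index`.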
9 changes: 8 additions & 1 deletion docs/index.md
@@ -8,5 +8,12 @@

---

`RefChef` is a reference management system that includes additional tools to record the provenance of reference sequences, indices, and annotations. It was created to enable reproducible research.

RefChef is a reference management tool used to: (1) document the exact steps undertaken in the retrieval of genomic references; (2) maintain the associated metadata; (3) provide a mechanism for automatically reproducing retrieval and creation of an exact copy of genomic references.
`RefChef` will:

1. Document the exact steps undertaken in the retrieval and processing of genomic references
2. Maintain the associated metadata
3. Provide a mechanism for automatically reproducing retrieval and creation of an exact copy of genomic references

![Diagram](assets/refchef_overview.svg)
112 changes: 112 additions & 0 deletions docs/inputs.md
@@ -0,0 +1,112 @@
---

###**master.yaml** <a name="master.yaml"></a>

**overview**
RefChef uses YAML files composed of nested entry and value pairs, for example the entry and value pair `common_name`: `yeast`. The spacing and indentation of the entries and values are meaningful: RefChef's convention is to indent each subsequent level by 2 spaces and to separate each entry from its value with a `:` and a space. Some entries in the YAML are preceded by a `-` and a space (such as `- component:` and the commands under the `commands` header); these are required for RefChef to properly process the YAML.

See the [`master.yaml` file specifications](./specs.md#master.yaml) for more information.
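
Because indentation carries meaning, a quick check before running a recipe can catch stray spaces. The sketch below is a simple lint written for illustration, not a YAML parser and not part of RefChef; `find_bad_indentation` is a hypothetical name, and the default width of 2 follows the convention described above.

```python
def find_bad_indentation(yaml_text, width=2):
    """Return 1-based numbers of lines indented with tabs or with a
    number of leading spaces that is not a multiple of `width`."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no structure
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent or len(indent) % width != 0:
            bad.append(number)
    return bad


sample = (
    "S_cerevisiae:\n"
    "  metadata:\n"
    "    name: S_cerevisiae\n"
    "   common_name: yeast\n"  # three spaces: flagged
)
print(find_bad_indentation(sample))
```

A check like this only flags suspicious spacing; whether the nesting itself is correct still has to be verified against the specifications.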

Example `master.yaml` before processing:
```yaml
S_cerevisiae:
  metadata:
    name: S_cerevisiae
    common_name: yeast
    ncbi_taxon_id: 4932
    organism: Saccharomyces cerevisiae
    organization: ensembl
    custom: no
    description: corresponds to genbank id GCA_000146045.2
    downloader: joselynn wallace
    ensembl_release_number: 87
    accession:
      genbank:
      refseq:
  levels:
    references:
      - component: primary
        complete:
          status: false
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
```

The string of text entered as the `key` (`S_cerevisiae` in the above example) is used to create a folder inside the output directory you specify in your config file (`cfg.ini` or `cfg.yaml`) or in the `refchef-cook` arguments. In the quickstart example, we used `/Users/jwalla12/references` as the output directory for `refchef-cook`. Here is the collapsed file tree that RefChef created. Note that the folder containing the primary reference is nested inside a folder named `S_cerevisiae`, based on the `key`.

```bash
./Users/jwalla12/references #this directory is specified in refchef-cook or the config files
└── S_cerevisiae
    ├── bowtie2_index
    ├── bwa_index
    ├── gtf
    └── primary
```
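
The naming rule above amounts to joining the output directory, the `key`, and the `component`. A minimal sketch, assuming that rule and nothing more; `component_dir` is a hypothetical helper, not a RefChef function.

```python
from pathlib import PurePosixPath


def component_dir(output_dir, key, component):
    """Return <output_dir>/<key>/<component> (illustrative helper only)."""
    return PurePosixPath(output_dir) / key / component


path = component_dir("/Users/jwalla12/references", "S_cerevisiae", "primary")
print(path)
```

`PurePosixPath` is used so the example prints the same forward-slash path on any platform.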

**master.yaml metadata**
The `metadata` section of `master.yaml` contains information about the references, including the organism name, taxon_id, etc.

!!! Caution
When running a new YAML file to add additional information to a primary reference, metadata entries present in the initial [`master.yaml`](#master.yaml) file can be omitted (for example, `ncbi_taxon_id:`, `common_name:`). When adding indices or annotations to a primary reference already in [`master.yaml`](#master.yaml), the metadata in [`master.yaml`](#master.yaml) will be overwritten by the metadata in the new.yaml file. This could be helpful in situations where you want to update the metadata fields.

**master.yaml levels**
The `levels` section contains higher level information about the references, including when they were downloaded and the exact commands used to download and process the references.

!!! Caution
The entry `status` must be set to `false` for RefChef to execute the commands in the code block. If it is set to `true`, the code will not execute (even if the `-e` flag is set). After a code block is executed, the `false` flag will flip to `true` automatically and a `time:` entry will appear under the `complete` header, next to `status`. The `time:` entry will be populated with the datetime stamp the reference was downloaded.
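
The gating described above can be sketched as follows. This is a minimal illustration, not RefChef's implementation: `should_run` and `mark_complete` are hypothetical names, and `entry` is a plain dictionary standing in for one parsed `component` item.

```python
from datetime import datetime


def should_run(entry, execute):
    """Commands run only when execution was requested and status is still false."""
    return bool(execute) and entry["complete"]["status"] is False


def mark_complete(entry, now=None):
    """Flip status to true and record a timestamp next to it."""
    entry["complete"]["status"] = True
    entry["complete"]["time"] = str(now or datetime.now())
    return entry


entry = {"component": "primary", "complete": {"status": False}, "commands": ["gunzip *.gz"]}

first = should_run(entry, execute=True)   # status is false, so the block would run
mark_complete(entry, now=datetime(2019, 7, 25, 9, 8, 37, 478553))
second = should_run(entry, execute=True)  # status is now true, so the block is skipped
print(first, second, entry["complete"]["time"])
```

To re-run a block under this scheme, `status` would have to be set back to `false` by hand.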

**master.yaml commands**
This portion of the `master.yaml` should be populated with the specific commands you want to execute to download and process your reference. Each command should be prepended with a `-` and a space.

!!! Caution
Each time files are processed using a set of commands in the YAML, the last command must run `md5` on all of the files and direct the output to a file called `final_checksums.md5`.
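
For readers who want to see what the final checksum step produces, here is a Python sketch using `hashlib`. It is an illustration only: RefChef recipes call the `md5` command directly, `write_final_checksums` is a hypothetical helper, and the `<digest>  <filename>` line layout is an assumption that may differ from the output of your platform's `md5` tool.

```python
import hashlib
import tempfile
from pathlib import Path


def write_final_checksums(directory, output_name="final_checksums.md5"):
    """Write one '<md5>  <filename>' line per file in `directory`."""
    directory = Path(directory)
    lines = []
    for path in sorted(directory.iterdir()):
        # Skip the checksum file itself so reruns stay stable.
        if path.is_file() and path.name != output_name:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.name}")
    (directory / output_name).write_text("\n".join(lines) + "\n")
    return lines


workdir = Path(tempfile.mkdtemp())
(workdir / "genome.fa").write_bytes(b">chrI\nACGT\n")
checksums = write_final_checksums(workdir)
print(checksums)
```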

---

### **cfg.yaml** <a name="cfg.yaml"></a>
**overview**
RefChef requires configuration information, which can be passed as arguments or specified in a configuration file. A `cfg.yaml` is one option for configuration and should contain the fields shown in the example below. The specifications linked below give each field's expected format and a brief description of its contents.


See the [`cfg.yaml` file specifications](./specs.md#cfg.yaml) for more information.

**example:**
```yaml
config-yaml:
  path-settings:
    reference-directory: /Users/jwalla12/references
    git-directory: /Users/jwalla12/remote_references
    remote-repository: jrwallace/remote_references
  log-settings:
    log: 'yes'
```
---
### **cfg.ini** <a name="cfg.ini"></a>
**overview**
RefChef requires configuration information, which can be passed as arguments or specified in a configuration file. A `cfg.ini` is one option for configuration and should contain the fields shown in the example below. The specifications linked below give each field's expected format and a brief description of its contents.

See the [`cfg.ini` file specifications](./specs.md#cfg.ini) for more information.

**example:**

```ini
[path-settings]
reference-directory=/Users/jwalla12/references
git-directory=/Users/jwalla12/remote_references
remote-repository=jrwallace/remote_references
[log-settings]
log=yes
[runtime-settings]
break-on-error=yes
verbose=yes
```
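
A file in this format can be read with Python's standard `configparser` module. The sketch below only shows how the sections and fields map to values; it is not a description of how RefChef itself loads the file.

```python
import configparser

# Same content as the cfg.ini example above, inlined so the sketch is self-contained.
CFG_INI = """
[path-settings]
reference-directory=/Users/jwalla12/references
git-directory=/Users/jwalla12/remote_references
remote-repository=jrwallace/remote_references
[log-settings]
log=yes
[runtime-settings]
break-on-error=yes
verbose=yes
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_INI)

reference_dir = parser["path-settings"]["reference-directory"]
log_enabled = parser.getboolean("log-settings", "log")  # 'yes' is read as True
break_on_error = parser.getboolean("runtime-settings", "break-on-error")
print(reference_dir, log_enabled, break_on_error)
```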


53 changes: 41 additions & 12 deletions docs/installation.md
@@ -1,10 +1,48 @@
### Install RefChef

To install from PyPI using **pip**:
`pip install refchef`

To install using **Anaconda Python**:
`conda install -c compbiocore refchef`

### Set up Git and GitHub
RefChef uses Git repositories for version control of the `master.yaml` file, which contains a list of all the references on the system and their provenance. You can also use GitHub to remotely host your repositories, but this is optional.

Before using RefChef, set up [git](https://help.github.com/en/articles/set-up-git).

If you want to use GitHub to host your repositories, create a GitHub account and set up an [access token](https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line).
![](assets/github_token.png)

Additionally, create a [`.gitignore` file](https://help.github.com/en/articles/ignoring-files)...

```bash
touch .gitignore
```

...and add `.env` to the `.gitignore` by pasting the following into the `.gitignore` file.

```bash
# ignore env files
*.env
```

Now create a `.env` file...
```bash
touch .env
```

... and paste the contents of the `.env.template` file in the `RefChef` home directory into the `.env` file, which will now look like this:

```bash
GITHUB_TOKEN=
```

Then, paste the GitHub access token into the `GITHUB_TOKEN=` line copied over from the `.env.template` file. For example, your `.env` file might now look like this:

```bash
GITHUB_TOKEN=5c25370fcf7db4a676d98d72700e2922654485ed
```
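
To confirm that the `.env` file is filled in, its `KEY=VALUE` lines can be checked with a few lines of Python. This is a sketch for illustration; `parse_env` is a hypothetical helper, the token shown is a placeholder, and how RefChef itself reads the file is not covered here.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


env = parse_env("# ignore me\nGITHUB_TOKEN=example-token-not-real\n")
token = env.get("GITHUB_TOKEN", "")
if not token:
    raise SystemExit("GITHUB_TOKEN is empty; paste your access token into .env")
print("token loaded:", bool(token))
```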
### Development
To install a **development version** from the current directory:
```bash
@@ -16,14 +54,6 @@ pip install -e .
Run unit tests as:
`python setup.py test`

### Set up `.env` file with GitHub Access Token
Sensitive environment variables are stored in the .env file. This file is included in .gitignore intentionally, so that it is never committed.
- Create a `.env` file and copy into it the contents of `.env.template`
- Get your [GitHub Access Token](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/) and add to the `.env` file.
- Make sure to add the GH_TOKEN variable to the environment of the CI provider you use.

![](assets/github_token.png)

## Contributing

Contributions consistent with the style and quality of existing code are
@@ -33,7 +63,6 @@ Check the issues page of this repository for available work.

### Committing


This project uses [commitizen](https://pypi.org/project/commitizen/)
to ensure that commit messages remain well-formatted and consistent
across different contributors.
@@ -47,17 +76,17 @@ pip install commitizen
```

To start work on a new change, pull the latest `develop` and create a
new *topic branch* (e.g. feature-resume-model`,
new *topic branch* (e.g. `feature-resume-model`,
`chore-test-update`, `bugfix-bad-bug`).

Add your changes to the current branch.
```bash
git add .
```

To commit, run the following command (instead of ``git commit``) and
To commit your changes, run the following command (instead of `git commit`) and
follow the directions:


```bash
cz commit
```
26 changes: 26 additions & 0 deletions docs/overview.md
@@ -0,0 +1,26 @@
**RefChef comes with two commands:**

[**`refchef-cook`**](./usage.md#refchef-cook):
Will read recipes and execute the commands that will retrieve the references, indices, or annotations based on the contents of [`master.yaml`](./inputs.md#master.yaml).

[**`refchef-menu`**](./usage.md#refchef-menu):
Provides a way for the user to list all references present in the system, based on [`master.yaml`](./inputs.md#master.yaml), as well as filter the list of references based on metadata options.

![Diagram](assets/refchef-cook_and_refchef-menu.svg)

**RefChef requires a [`master.yaml`](./inputs.md#master.yaml) file:**

In addition to the [`refchef-cook`](./usage.md#refchef-cook) and [`refchef-menu`](./usage.md#refchef-menu) commands, RefChef requires a [`master.yaml`](./inputs.md#master.yaml) containing a list of references, indices, annotations, and metadata, as well as the commands necessary to download and process the files.
When [`refchef-cook`](./usage.md#refchef-cook) is executed, RefChef will update the [`master.yaml`](./inputs.md#master.yaml) to change the `complete` status from `false` to `true` and will also add a `uuid` for each reference, the date the files were downloaded and their location, as well as a complete list of files downloaded.
Based on the arguments you pass to [`refchef-cook`](./usage.md#refchef-cook), it will either commit those changes to [`master.yaml`](./inputs.md#master.yaml) to a local repository or commit and push the changes to a remote repository.

**RefChef requires configuration information:**

[`refchef-cook`](./usage.md#refchef-cook) and [`refchef-menu`](./usage.md#refchef-menu) both require some configuration information, including:

1. Where you'd like the references to be saved
2. The local git repository for version control of references
3. The remote GitHub repository for version control of reference sequences (optional).

This information can be specified in a [`cfg.yaml`](./inputs.md#cfg.yaml) file, a [`cfg.ini`](./inputs.md#cfg.ini) file, or it can be passed as arguments to [`refchef-cook`](./usage.md#refchef-cook).
255 changes: 255 additions & 0 deletions docs/quickstart.md
@@ -0,0 +1,255 @@
This quickstart assumes that [bwa](http://bio-bwa.sourceforge.net/) and [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) are installed and in your current path.

Create a [remote repository](https://help.github.com/en/articles/creating-a-new-repository) and [clone it](https://help.github.com/en/articles/cloning-a-repository).

Create a directory for refchef to save your references.

Create a [`master.yaml`](./inputs.md#master.yaml) file and save it in your local git repository directory. Here is a [`master.yaml`](./inputs.md#master.yaml) file that will download a yeast genome from Ensembl:

```yaml
S_cerevisiae:
  metadata:
    name: S_cerevisiae
    common_name: yeast
    ncbi_taxon_id: 4932
    organism: Saccharomyces cerevisiae
    organization: ensembl
    custom: no
    description: corresponds to genbank id GCA_000146045.2
    downloader: joselynn wallace
    ensembl_release_number: 87
    accession:
      genbank:
      refseq:
  levels:
    references:
      - component: primary
        complete:
          status: false
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
```
Pass the configuration arguments in a config file or directly to [`refchef-cook`](./usage.md#refchef-cook) (as seen in the following example):

```
refchef-cook -e -o /Users/jwalla12/references -gl /Users/jwalla12/remote_references -gr jrwallace/remote_references --git commit -l
```
After [`refchef-cook`](./usage.md#refchef-cook) is run, [`master.yaml`](./inputs.md#master.yaml) will reflect that you have downloaded the reference and it will now look like this:
```yaml
S_cerevisiae:
  metadata:
    name: S_cerevisiae
    common_name: yeast
    ncbi_taxon_id: 4932
    organism: Saccharomyces cerevisiae
    organization: ensembl
    custom: false
    description: corresponds to genbank id GCA_000146045.2
    downloader: joselynn wallace
    ensembl_release_number: 87
    accession:
      genbank: null
      refseq: null
  levels:
    references:
      - component: primary
        complete:
          status: true
          time: '2019-07-25 09:08:37.478553'
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
        location: /Users/jwalla12/references/S_cerevisiae/primary
        files:
          - metadata.txt
          - postdownload-checksums.md5
          - CHECKSUMS
          - final_checksums.md5
          - Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
        uuid: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
```

Make another .yaml file to create a bowtie2 index of this genome and call the file `bowtie2.yaml`.

```yaml
S_cerevisiae:
  levels:
    indices:
      - component: bowtie2_index
        complete:
          status: false
        src: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
        commands:
          - mkdir /Users/jwalla12/references/S_cerevisiae/bowtie2_index
          - cd /Users/jwalla12/references/S_cerevisiae/bowtie2_index
          - ln -s /Users/jwalla12/references/S_cerevisiae/primary/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa ./Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
          - bowtie2-build Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa S_cerevisiae
          - md5 ./*.* > ./final_checksums.md5
```
Then use [`refchef-cook`](./usage.md#refchef-cook) and specify the new yaml to add to [`master.yaml`](./inputs.md#master.yaml).

```
refchef-cook -e -o /Users/jwalla12/references -gl /Users/jwalla12/remote_references -gr jrwallace/remote_references -n /Users/jwalla12/remote_references/bowtie2.yaml -g commit -l
```
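
When the same invocation is repeated for several recipe files, it can help to assemble it programmatically. The sketch below only reuses the flags that appear in the commands of this quickstart (`-e`, `-o`, `-gl`, `-gr`, `-n`, `-g commit`, `-l`) and makes no claim about other options; `cook_command` is a hypothetical helper, and the result is printed rather than executed.

```python
import shlex


def cook_command(output_dir, git_local, git_remote, new_yaml=None):
    """Assemble the refchef-cook invocation used in this quickstart as an argv list."""
    argv = ["refchef-cook", "-e", "-o", output_dir, "-gl", git_local, "-gr", git_remote]
    if new_yaml:
        argv += ["-n", new_yaml]  # path of the additional recipe file
    argv += ["-g", "commit", "-l"]
    return argv


argv = cook_command(
    "/Users/jwalla12/references",
    "/Users/jwalla12/remote_references",
    "jrwallace/remote_references",
    new_yaml="/Users/jwalla12/remote_references/bowtie2.yaml",
)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would execute it, assuming `refchef-cook` is installed and on your path.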
Make another .yaml file to create a bwa index of this genome and call the file `bwa.yaml`.
```yaml
S_cerevisiae:
  levels:
    indices:
      - component: bwa_index
        complete:
          status: false
        src: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
        commands:
          - mkdir /Users/jwalla12/references/S_cerevisiae/bwa_index
          - cd /Users/jwalla12/references/S_cerevisiae/bwa_index
          - ln -s /Users/jwalla12/references/S_cerevisiae/primary/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa ./Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
          - bwa index Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa -p S_cerevisiae
          - md5 ./*.* > ./final_checksums.md5
```

Then use [`refchef-cook`](./usage.md#refchef-cook) and specify the new yaml to add to [`master.yaml`](./inputs.md#master.yaml).

```
refchef-cook -e -o /Users/jwalla12/references -gl /Users/jwalla12/remote_references -gr jrwallace/remote_references -n /Users/jwalla12/remote_references/bwa.yaml -g commit -l
```

We can also track annotation files for the reference genome. Make the following .yaml file and call it `gtf.yaml`:

```yaml
S_cerevisiae:
  levels:
    annotations:
      - component: gtf
        complete:
          status: false
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.87.gtf.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/gtf/saccharomyces_cerevisiae/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
```
Then use [`refchef-cook`](./usage.md#refchef-cook) and specify the new yaml to add to [`master.yaml`](./inputs.md#master.yaml).

```
refchef-cook -e -o /Users/jwalla12/references -gl /Users/jwalla12/remote_references -gr jrwallace/remote_references -n /Users/jwalla12/remote_references/gtf.yaml -g commit -l
```
We can see what references are available using [`refchef-menu`](./usage.md#refchef-menu):
```
refchef-menu -f /Users/jwalla12/remote_references/master.yaml
```
```
┌ 🐶 RefChef Menu ────────────────────────┬───────────────┬───────────────────────────────────────────┬──────────────────────────────────────┐
│ name         │ organism                 │ component     │ description                               │ uuid                                 │
├──────────────┼──────────────────────────┼───────────────┼───────────────────────────────────────────┼──────────────────────────────────────┤
│ S_cerevisiae │ Saccharomyces cerevisiae │ gtf           │ corresponds to genbank id GCA_000146045.2 │ 5f7ae94c-2e51-3cc6-bcbf-6e251c75ef2f │
│ S_cerevisiae │ Saccharomyces cerevisiae │ bowtie2_index │ corresponds to genbank id GCA_000146045.2 │ 93393699-cb40-3ad7-ac07-ae4bdb1efd3e │
│ S_cerevisiae │ Saccharomyces cerevisiae │ bwa_index     │ corresponds to genbank id GCA_000146045.2 │ dff337a6-9a1d-3313-8ced-dc6f3bfc9689 │
│ S_cerevisiae │ Saccharomyces cerevisiae │ primary       │ corresponds to genbank id GCA_000146045.2 │ dff337a6-9a1d-3313-8ced-dc6f3bfc9689 │
└──────────────┴──────────────────────────┴───────────────┴───────────────────────────────────────────┴──────────────────────────────────────┘
```
We can also get this information if we look at [`master.yaml`](./inputs.md#master.yaml):
```yaml
S_cerevisiae:
  metadata:
    name: S_cerevisiae
    common_name: yeast
    ncbi_taxon_id: 4932
    organism: Saccharomyces cerevisiae
    organization: ensembl
    custom: false
    description: corresponds to genbank id GCA_000146045.2
    downloader: joselynn wallace
    ensembl_release_number: 87
    accession:
      genbank: null
      refseq: null
  levels:
    references:
      - component: primary
        complete:
          status: true
          time: '2019-07-25 16:26:42.700668'
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/fasta/saccharomyces_cerevisiae/dna/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
        location: /Users/jwalla12/references/S_cerevisiae/primary
        files:
          - metadata.txt
          - postdownload-checksums.md5
          - CHECKSUMS
          - final_checksums.md5
          - Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
        uuid: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
    indices:
      - component: bowtie2_index
        complete:
          status: true
          time: '2019-07-25 16:26:43.971349'
        src: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
        commands:
          - mkdir /Users/jwalla12/references/yeast_refs/bowtie2_index
          - cd /Users/jwalla12/references/yeast_refs/bowtie2_index
          - ln -s /Users/jwalla12/references/yeast_refs/primary/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
            /Users/jwalla12/references/yeast_refs/bowtie2_index/
          - bowtie2-build /Users/jwalla12/references/yeast_refs/bowtie2_index/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
            S_cerevisiae
          - md5 /Users/jwalla12/references/yeast_refs/bowtie2_index/*.* > /Users/jwalla12/references/yeast_refs/bowtie2_index/final_checksums.md5
        location: /Users/jwalla12/references/S_cerevisiae/bowtie2_index
        files:
          - metadata.txt
        uuid: 84928c3e-af1a-11e9-a45e-8c8590bd206d
      - component: bwa_index
        complete:
          status: true
          time: '2019-07-25 16:26:45.183284'
        src: dff337a6-9a1d-3313-8ced-dc6f3bfc9689
        commands:
          - mkdir /Users/jwalla12/references/yeast_refs/bwa_index
          - cd /Users/jwalla12/references/yeast_refs/bwa_index
          - ln -s /Users/jwalla12/references/yeast_refs/primary/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
            /Users/jwalla12/references/yeast_refs/bwa_index/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
          - bwa index /Users/jwalla12/references/yeast_refs/bwa_index/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
            > /Users/jwalla12/references/yeast_refs/bwa_index/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
          - md5 /Users/jwalla12/references/yeast_refs/bwa_index/*.* > /Users/jwalla12/references/yeast_refs/bwa_index/final_checksums.md5
        location: /Users/jwalla12/references/S_cerevisiae/bwa_index
        files:
          - metadata.txt
        uuid: 854b7780-af1a-11e9-a9f8-8c8590bd206d
    annotations:
      - component: gtf
        complete:
          status: true
          time: '2019-07-25 16:26:54.326082'
        commands:
          - wget ftp://ftp.ensembl.org/pub/release-87/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.87.gtf.gz
          - wget ftp://ftp.ensembl.org/pub/release-87/gtf/saccharomyces_cerevisiae/CHECKSUMS
          - md5 *.gz > postdownload-checksums.md5
          - gunzip *.gz
          - md5 *.* > final_checksums.md5
        location: /Users/jwalla12/references/S_cerevisiae/gtf
        files:
          - metadata.txt
          - postdownload-checksums.md5
          - Saccharomyces_cerevisiae.R64-1-1.87.gtf
          - CHECKSUMS
          - final_checksums.md5
        uuid: 5f7ae94c-2e51-3cc6-bcbf-6e251c75ef2f
```
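
The relationship between `master.yaml` and the menu table can be sketched in Python: each component listed under `levels` becomes one row, and metadata values can be used as filters. This is an illustration under stated assumptions, not the `refchef-menu` implementation; `menu_rows` is a hypothetical function, and `MASTER` is a trimmed dictionary standing in for the parsed file.

```python
# Trimmed stand-in for a parsed master.yaml (only the fields used below).
MASTER = {
    "S_cerevisiae": {
        "metadata": {"name": "S_cerevisiae", "organism": "Saccharomyces cerevisiae"},
        "levels": {
            "references": [{"component": "primary"}],
            "indices": [{"component": "bowtie2_index"}, {"component": "bwa_index"}],
            "annotations": [{"component": "gtf"}],
        },
    }
}


def menu_rows(master, **filters):
    """Yield one row per component, keeping only entries whose metadata match `filters`."""
    for entry in master.values():
        meta = entry.get("metadata", {})
        if any(meta.get(field) != wanted for field, wanted in filters.items()):
            continue
        for components in entry.get("levels", {}).values():
            for item in components:
                yield {
                    "name": meta.get("name"),
                    "organism": meta.get("organism"),
                    "component": item["component"],
                }


rows = list(menu_rows(MASTER, organism="Saccharomyces cerevisiae"))
for row in rows:
    print(row)
```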
169 changes: 113 additions & 56 deletions docs/specs.md
@@ -1,71 +1,128 @@
# Specifications for `master.yaml`
### `master.yaml` <a name="master.yaml"></a>

The [`master.yaml`](./inputs.md#master.yaml) file is the main source of information that RefChef uses to retrieve references, indices, and annotations. It is composed of sequences of code blocks that correspond to each reference. Each code block in [`master.yaml`](./inputs.md#master.yaml) starts with a `key`, followed by `metadata` and `levels`.

See the [`master.yaml` overview and usage](./inputs.md#master.yaml) for more information.

---
```yaml
reference_test1:
  metadata:
    name: reference_test1
    species: mouse
    organization: ucsc
    downloader: fgelin
  levels:
    references:
      - component: primary
        complete:
          status: false
        commands:
          - wget -nv https://s3.us-east-2.amazonaws.com/refchef-tests/chr1.fa.gz
          - md5 *.fa.gz > postdownload_checksums.md5
          - gunzip *.gz
          - md5 *.fa > final_checksums.md5
```
The `master.yaml` file is the main source of information that RefChef uses to retrieve references, indices, and annotations.

### Specifications

The `key` section consists of:

`<reference_name>:`
Expected format: String where <reference_name\> is the name of the reference.

---

Each block has a key with the name of the reference, index, or annotation.
The `metadata` section consists of:

>`metadata.name`
>Expected format: <reference_name\> string, should be the same as the block's `key`
>`metadata.common_name`
>Expected format: string
>`metadata.ncbi_taxon_id`
>Expected format: integer, based on [NCBI taxon ID](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)
>`metadata.organism`
>Expected format: string
>`metadata.organization`
>Expected format: string
>`metadata.custom`
>Expected format: string
>`metadata.description`
>Expected format: string
>`metadata.downloader`
>Expected format: string
>`metadata.ensembl_release_number`
>Expected format: integer
>>`metadata.accession.genbank`
>>Expected format: string
>>`metadata.accession.refseq`
>>Expected format: string
---

`reference_name.metadata`
Expected format: key - value mapping
The `levels` section consists of:

`reference_name.metadata.name`
Expected format: <reference_name> string, should be the same as the block's key
>`levels.<type>`
>Where <type\>: `references`, `annotations`, or `indices`
>>`levels.<type>.- component`
>>Expected format: string
>>>`levels.<type>.complete.status`
>>>Expected format: boolean (note that if `complete.status` is set to `true` RefChef will skip the current block and not retrieve any file. RefChef automatically changes the status to `true` after retrieving files for the first time.)
>>`levels.<type>.src`
Expected format: UUID string from an existing reference. When adding an index file for a reference, RefChef will create a symlink to the index files in the reference folder.

>>`levels.<type>.commands`
Expected format: A list of commands to download and process each reference. Each command should start with `- `.

After [`refchef-cook`](./usage.md#refchef-cook) is run and references are downloaded, `levels.<type>.complete.status: false` will change to `levels.<type>.complete.status: true` and the following fields will be added to `master.yaml`:

>>>`levels.<type>.complete.time`
>>>Expected format: RefChef will autopopulate this field with the date and time stamp the reference was downloaded if `levels.<type>.complete.status: true`
>>`levels.<type>.location`
Expected format: Refchef will autopopulate this field with the directory where downloaded files are stored if `levels.<type>.complete.status: true`
>>`levels.<type>.files`
Expected format: Refchef will autopopulate this field with a list of files that were downloaded if `levels.<type>.complete.status: true`
>>`levels.<type>.uuid`
Expected format: Refchef will autopopulate this field with a UUID for your reference file if `levels.<type>.complete.status: true`
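
A few of the rules above can be checked mechanically before running a recipe. The sketch below covers only the name/key match, the allowed `levels` types, and the types of `complete.status` and `commands`; it is an illustration, not RefChef's own validation, and `validate_block` is a hypothetical function operating on an already parsed block.

```python
def validate_block(key, block):
    """Return a list of problems found in one master.yaml block (illustrative checks only)."""
    problems = []
    name = block.get("metadata", {}).get("name")
    if name != key:
        problems.append(f"metadata.name {name!r} does not match key {key!r}")
    for level, components in block.get("levels", {}).items():
        if level not in ("references", "annotations", "indices"):
            problems.append(f"unknown level type {level!r}")
            continue
        for item in components:
            label = item.get("component", "<missing component>")
            if not isinstance(item.get("complete", {}).get("status"), bool):
                problems.append(f"{label}: complete.status must be a boolean")
            if not isinstance(item.get("commands"), list):
                problems.append(f"{label}: commands must be a list")
    return problems


block = {
    "metadata": {"name": "yeast"},
    "levels": {
        "references": [
            {"component": "primary", "complete": {"status": "false"}, "commands": ["gunzip *.gz"]}
        ]
    },
}
for problem in validate_block("S_cerevisiae", block):
    print(problem)
```

Here the quoted `"false"` is a string rather than a boolean, which is the kind of slip such a check is meant to surface.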
---

### `cfg.yaml` <a name="cfg.yaml"></a>

If using a `cfg.yaml` file, it should follow these specs:

>>`config-yaml.path-settings.reference-directory`
Expected format: String, path to reference storage directory

>>`config-yaml.path-settings.git-directory`
Expected format: String, path to local git repository

>>`config-yaml.path-settings.remote-repository`
Expected format: String, remote git repository, should be in the format of `user/repo`

>>`config-yaml.log-settings.log`
Expected format: String, should be either 'yes' or 'no' in single quotes, indicating whether or not log files will be made

Also see the [`cfg.yaml` overview and example.](./usage.md#cfg.yaml)

---
### `cfg.ini` <a name="cfg.ini"></a>

-`reference_name.metadata.species`
-Expected format: string
+If using a `cfg.ini` file, it should follow these specs:

-`reference_name.metadata.organization`
-Expected format: string
+`[path-settings].reference-directory=`
+Expected format: String, path to reference storage directory

-`reference_name.metadata.downloader`
-Expected format: string
+`[path-settings].git-directory=`
+Expected format: String, path to local git repository

-`reference_name.levels`
-Expected format: key - value mapping
+`[path-settings].remote-repository=`
+Expected format: String, remote git repository, should be in the format of `user/repo`

-`reference_name.levels.<type>`
-Where <type\>: `references`, `annotations`, or `indices`
-Expected format: list of key - value mappings
+`[log-settings].log=`
+Expected format: String, should be either 'yes' or 'no', indicating whether or not log files will be made

-> `reference_name.levels.<type>.-`
+`[runtime-settings].break-on-error=`
+Expected format: String, should be either 'yes' or 'no', indicating how RefChef should respond when encountering an error

-> `component`
-Expected format: string
-`complete.status`
-Expected formate: boolean (note that if `complete.status` is set to `true` RefChef will skip the current block and not retrieve any file. RefChef automatically changes the status to true after retrieving files for the first time.)
-`src`
-Expected format: UUID v4, or string. If a UUID of an existing reference is entered, RefChef will create a symlink to the index files from the reference folder.
-`commands`
-Expected format: list of strings
+`[runtime-settings].verbose=`
+Expected format: String, should be either 'yes' or 'no', toggles between verbosity output settings

-After RefChef runs and retrieves the files, the following fields will be appended the following fields to `master.yaml`:
+Also see the [`cfg.ini` overview and example.](./usage.md#cfg.ini)
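
The `cfg.ini` keys above can be read with Python's standard `configparser` module. The sketch below is only an illustration under stated assumptions: the paths and values are placeholders, `load_settings` is a hypothetical helper, and this is not necessarily how RefChef itself parses the file.

```python
import configparser

# Placeholder values, not defaults shipped with RefChef.
EXAMPLE_CFG = """
[path-settings]
reference-directory = /path/to/references
git-directory = /path/to/git/local
remote-repository = user/repo

[log-settings]
log = yes

[runtime-settings]
break-on-error = yes
verbose = no
"""


def load_settings(text):
    """Parse cfg.ini text into a plain dict, converting yes/no to booleans."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "reference_directory": parser.get("path-settings", "reference-directory"),
        "git_directory": parser.get("path-settings", "git-directory"),
        "remote_repository": parser.get("path-settings", "remote-repository"),
        # getboolean() accepts yes/no, on/off, true/false and 1/0.
        "log": parser.getboolean("log-settings", "log"),
        "break_on_error": parser.getboolean("runtime-settings", "break-on-error"),
        "verbose": parser.getboolean("runtime-settings", "verbose"),
    }


settings = load_settings(EXAMPLE_CFG)
print(settings)
```

Converting the 'yes'/'no' strings to booleans once, at load time, keeps later checks such as `if settings["verbose"]:` simple.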

->`reference_name.levels.<type>.-`
-
-> `location`
-Expected format: string
-`files`
-Expected format: list of strings
-`uuid`
-Expected format: UUID v4
301 changes: 0 additions & 301 deletions docs/tutorials/quickstart.md

This file was deleted.

527 changes: 9 additions & 518 deletions docs/usage.md

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions docs/usecases.md
@@ -0,0 +1,10 @@
###**Download reference, local repository `master.yaml` version control:**
![Diagram](assets/git_commit_usecase.svg)
###**Download reference, remote repository `master.yaml` version control:**
![Diagram](assets/git_push_usecase.svg)
###**Download new reference, local repository `master.yaml` version control:**
![Diagram](assets/newyaml_usecase.svg)
###**Add manually downloaded reference, append commands to master.yaml, do not execute commands, local repository `master.yaml` version control:**
![Diagram](assets/noexecute_usecase.svg)
###**refchef-menu to view references available on the system:**
![Diagram](assets/refchefmenu_usecase.svg)
9 changes: 6 additions & 3 deletions mkdocs.yml
@@ -28,8 +28,11 @@ markdown_extensions:
 nav:
   - Home: 'index.md'
   - Installation: 'installation.md'
+  - Overview: 'overview.md'
   - Usage: 'usage.md'
-  - YAML specs: 'specs.md'
+  - Inputs: 'inputs.md'
+  - Folders: 'folders.md'
+  - File specifications: 'specs.md'
+  - Quickstart: 'quickstart.md'
   - RefChef serve: 'serve.md'
-  - Tutorials:
-    - QuickStart: tutorials/quickstart.md
+  - Refchef use cases: 'usecases.md'
28 changes: 16 additions & 12 deletions scripts/refchef-cook
@@ -108,18 +108,22 @@ def main():
     master = read_menu(conf)

     for r in master.keys():
-        for i in master[r]['levels']['references']:
-            if not i['complete']['status']:
-                logging.info(u"""
-                -------------------------------------------
-                The folowing references will be downloaded:
-                - {0}
-                ===========================================
-                """.format(r))
-            else:
-                logging.info("""
-                No references to download.
-                """)
+        for type in ['references', 'indices', 'annotations']:
+            try:
+                for i in master[r]['levels'][type]:
+                    if not i['complete']['status']:
+                        logging.info(u"""
+                        -------------------------------------------
+                        The folowing references will be downloaded:
+                        - {0}
+                        ===========================================
+                        """.format(r))
+                    else:
+                        logging.info("""
+                        No references to download.
+                        """)
+            except:
+                pass

     ## Execute, commit and push steps.
     if arguments.execute:
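
The loop in the hunk above walks every level type and reports the components whose `complete.status` is still false. A minimal standalone sketch of the same idea follows. It is not the committed code: `pending_components` and the sample `master` dictionary are hypothetical, and the sketch uses `dict.get` where the commit uses a bare `except`, so a missing level type is skipped without hiding unrelated errors.

```python
def pending_components(master):
    """Yield (reference, level type, component) for entries not yet completed."""
    for name, entry in master.items():
        levels = entry.get("levels", {})
        for level_type in ("references", "indices", "annotations"):
            # A level type that is absent simply yields nothing.
            for item in levels.get(level_type, []):
                if not item["complete"]["status"]:
                    yield name, level_type, item["component"]


# Hypothetical in-memory equivalent of a parsed master.yaml.
master = {
    "reference_test1": {
        "levels": {
            "references": [
                {"component": "primary", "complete": {"status": True}},
            ],
            "indices": [
                {"component": "bwa_index", "complete": {"status": False}},
            ],
        }
    }
}

for name, level_type, component in pending_components(master):
    print("{0}: {1}/{2} will be downloaded".format(name, level_type, component))
```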
2 changes: 1 addition & 1 deletion scripts/refchef-serve
@@ -8,7 +8,7 @@ from refchef import config

parser = argparse.ArgumentParser(description='Get and filter references available in the system.')

-parser.add_argument('--master', '-m', type=str, help='Path do to master.yaml')
+parser.add_argument('--master', '-f', type=str, help='Path do to master.yaml')
parser.add_argument('--config', '-c', type=str, help='Path do to config file in .yaml or .ini format.')

arguments = parser.parse_args()
4 changes: 2 additions & 2 deletions scripts/templates/table.html
@@ -99,7 +99,7 @@ <h1 class="title">{{ title }}</h1>
<input type="text" id="referenceInput" list="references" onkeyup="search('referenceInput', 0)" placeholder="Reference">
</th>
<th>
-<input type="text" id="speciesInput" list="species" onkeyup="search('speciesInput', 1)" placeholder="Species">
+<input type="text" id="speciesInput" list="species" onkeyup="search('speciesInput', 1)" placeholder="Organism">
</th>
<th>
<input type="text" id="orgInput" list="organizations" onkeyup="search('orgInput', 2)" placeholder="Organization">
@@ -120,7 +120,7 @@ <h1 class="title">{{ title }}</h1>
{% for entry in items %}
<tr>
<td><p>{{ entry.name }}</p></td>
-<td><p>{{ entry.species }}</p></td>
+<td><p>{{ entry.organism }}</p></td>
<td><p>{{ entry.organization }}</p></td>
<td><p>{{ entry.component }}</p></td>
<td><code>{{ entry.location }}</code></td>
2 changes: 2 additions & 0 deletions tests/data/new_linux.yaml
@@ -22,6 +22,7 @@ index_test1:
- md5sum *.fa.gz > postdownload_checksums.md5
- gunzip *.gz
- md5sum *.fa.gz > final_checksums.md5
+src: 8040b09f-3844-3c42-b765-1f6a32614895
- component: bwa_index_2
complete:
status: false
@@ -30,3 +31,4 @@ index_test1:
- md5sum *.fa.gz > postdownload_checksums.md5
- gunzip *.gz
- md5sum *.fa.gz > final_checksums.md5
+src: 'web'
2 changes: 2 additions & 0 deletions tests/data/new_osx.yaml
@@ -22,6 +22,7 @@ index_test1:
- md5sum *.fa.gz > postdownload_checksums.md5
- gunzip *.gz
- md5sum *.fa.gz > final_checksums.md5
+src: 8040b09f-3844-3c42-b765-1f6a32614895
- component: bwa_index_2
complete:
status: false
@@ -30,3 +31,4 @@ index_test1:
- md5sum *.fa.gz > postdownload_checksums.md5
- gunzip *.gz
- md5sum *.fa.gz > final_checksums.md5
+src: 'web'
2 changes: 1 addition & 1 deletion tests/test_references.py
@@ -111,7 +111,7 @@ def test_index_ref_link(conf, master):
file_name = 'new_linux.yaml'
ori = os.path.join(conf.git_local, file_name)
des = os.path.join(conf.git_local, master)
-append_yaml(ori, des)
+merge_yaml(des, ori)

execute(conf, master)
path_1 = os.path.join(conf.reference_dir, 'reference_test1', 'primary', 'bwa_index')
6 changes: 3 additions & 3 deletions tests/test_table_utils.py
@@ -23,11 +23,11 @@ def test_split_filter():
assert t[1] == "2"

def test_table_columns(menu): #takes the fixture created above as an argument.
-assert menu.shape == (2,14)
+assert menu.shape == (3,14)

def test_filter(menu):
filtered = filter_menu(menu, "organism", "mouse")
-assert filtered.shape == (2,14)
+assert filtered.shape == (3,14)
for i in list(filtered["organism"]):
assert i == "mouse"

@@ -40,7 +40,7 @@ def test_multiple_filter(menu):
s2 = "organism:mouse,type:references"

f1 = multiple_filter(menu, s1)
-assert f1.shape == (2,14)
+assert f1.shape == (3,14)
for i in list(f1["organism"]):
assert i == "mouse"
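
The tests above pass filter strings such as `organism:mouse,type:references`. The sketch below shows one way such a string could be parsed and applied. It is an assumption-laden illustration, not the project's implementation: the real helpers work on a table object with a `.shape`, whereas `parse_filter_spec`, `filter_rows`, and the sample rows here are hypothetical and use plain dictionaries.

```python
def parse_filter_spec(spec):
    """Split 'key:value,key:value' into a list of (key, value) pairs."""
    return [tuple(part.split(":", 1)) for part in spec.split(",")]


def filter_rows(rows, spec):
    """Keep only the rows that match every key:value pair in `spec`."""
    pairs = parse_filter_spec(spec)
    return [row for row in rows if all(row.get(k) == v for k, v in pairs)]


# Hypothetical menu rows; the real menu has more columns.
rows = [
    {"name": "ref_a", "organism": "mouse", "type": "references"},
    {"name": "ref_a", "organism": "mouse", "type": "indices"},
    {"name": "ref_b", "organism": "human", "type": "references"},
]

print(filter_rows(rows, "organism:mouse,type:references"))
```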
