02 Generating Data

Vineet Bansal edited this page Aug 24, 2024 · 7 revisions

For development and testing purposes, the sacCer3 organism is sufficient, since it has the smallest database/index to download or generate. This document therefore uses only sacCer3 as an example.

A Snakemake workflow is used to initialize databases and generate required data for this project. These are the steps run in the workflow:

[Figure: Snakemake DAG]

Run Snakemake workflow

Navigate to the docker/snakemake folder.

  • To generate the full databases, use the following command:

     snakemake --cores 1 --use-conda

    Running this command will generate BAM files for all available organisms and enzymes listed in config.json. Keep in mind that this process is time- and resource-intensive because of the large amount of data involved (it can take days!).

  • Alternatively, you can customize the databases using the --config flag. Specify the desired organisms, enzymes, and max_kmers to generate partial databases:

    • max_kmers (int): Defines the number of kmers to generate. The default is inf.
    • organisms (list): Generates database(s) for the specified organism(s). The default is ["sacCer3", "hg38", "ce11", "dm6", "mm10", "mm39", "rn6", "t2t_chm13"].
    • enzymes (list): Generates database(s) for the specified enzyme(s). The default is ["cas9", "cpf1"].

    For example, if you want to generate the sacCer3/cas9 databases with only the first 1000 kmers, use the following command:

     snakemake --cores 1 --use-conda --config max_kmers=1000 organisms=[\"sacCer3\"] enzymes=[\"cas9\"]
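The escaped list syntax above is easy to mistype when several organisms or enzymes are involved. The following is a minimal sketch, not part of the project, of a hypothetical helper (to_config_list) that builds the list literal from plain names and prints the resulting command instead of running it. It assumes a bash-compatible shell.

```shell
# Hypothetical helper (not part of the project): build a list literal
# such as ["sacCer3","ce11"] from plain arguments.
to_config_list() {
    local out="" item
    for item in "$@"; do
        # Append a comma before every item except the first.
        out="${out:+$out,}\"$item\""
    done
    printf '[%s]' "$out"
}

organisms=$(to_config_list sacCer3)
enzymes=$(to_config_list cas9)

# Print the command for inspection; remove the leading `echo` to run it.
echo snakemake --cores 1 --use-conda --config max_kmers=1000 "organisms=$organisms" "enzymes=$enzymes"
```

Quoting the organisms=... and enzymes=... arguments keeps the shell from interpreting the square brackets, so no backslash escapes are needed.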

Folder Structure

After running the workflow with --config organisms=[\"sacCer3\"], the output data folder will have the following structure. In this case, set GUIDESCAN_BAM_PATH to the absolute path to the databases folder, and GUIDESCAN_INDEX_PATH to the absolute path to the indices folder.

├── databases
│   ├── cas9
│   │   ├── sacCer3.bam
│   │   ├── sacCer3.bam.bai
│   │   ├── sacCer3.bam.sorted
│   │   ├── sacCer3.bam.sorted.bai
│   │   └── sacCer3.sam
│   └── cpf1
│       ├── sacCer3.bam
│       ├── sacCer3.bam.bai
│       ├── sacCer3.bam.sorted
│       ├── sacCer3.bam.sorted.bai
│       └── sacCer3.sam
├── indices
│   ├── sacCer3.index.forward
│   ├──
│   └── sacCer3.index.reverse
├── job_status
│   ├── add_sacCer3.txt
│   └── init_db.txt
├── kmers
│   ├── cas9
│   │   └── sacCer3.csv
│   └── cpf1
│       └── sacCer3.csv
└── raw
    ├── sacCer3_chr2acc
    ├── sacCer3.fna
    ├── sacCer3.fna.forward.dna
    ├── sacCer3.fna.reverse.dna
    └── sacCer3.gtf.gz
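
Setting the two environment variables for the structure above could look like the sketch below. The function name set_guidescan_paths and the ./data location in the usage comment are assumptions made for illustration; substitute the actual location of your output data folder.

```shell
# Hypothetical helper: export GUIDESCAN_BAM_PATH and GUIDESCAN_INDEX_PATH
# as absolute paths, given the folder that contains databases/ and indices/.
set_guidescan_paths() {
    local data_dir="$1"
    if [ ! -d "$data_dir/databases" ] || [ ! -d "$data_dir/indices" ]; then
        echo "expected databases/ and indices/ under $data_dir" >&2
        return 1
    fi
    # cd followed by pwd turns a relative path into an absolute one.
    GUIDESCAN_BAM_PATH="$(cd "$data_dir/databases" && pwd)"
    GUIDESCAN_INDEX_PATH="$(cd "$data_dir/indices" && pwd)"
    export GUIDESCAN_BAM_PATH GUIDESCAN_INDEX_PATH
}

# Usage (./data is an assumed location):
#   set_guidescan_paths ./data && echo "$GUIDESCAN_BAM_PATH"
```

The function returns a non-zero status without touching the variables when either folder is missing, which makes a wrong path visible early.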

Downloading pre-generated data

As an alternative to generating data, you can use guidescan download to download data directly from our website.

  • To see what is available to download:

     guidescan download --type database --show item
     guidescan download --type index --show item

  • To download the BAM file for a particular organism/enzyme combination:

     guidescan download --type database --item sacCer3_cas9

  • To download index files for a particular organism:

     guidescan download --type index --item sacCer3
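
When databases for several enzymes are needed, the download commands can be generated in a loop. The sketch below only prints the commands. It assumes that item names follow the <organism>_<enzyme> pattern of sacCer3_cas9, which should be confirmed against the output of --show item before running anything. The function name database_download_commands is hypothetical.

```shell
# Hypothetical helper: print one database download command per enzyme.
# Assumes item names follow the <organism>_<enzyme> pattern (e.g. sacCer3_cas9).
database_download_commands() {
    local organism="$1" enzyme
    shift
    for enzyme in "$@"; do
        echo "guidescan download --type database --item ${organism}_${enzyme}"
    done
}

# Prints the commands for review; nothing is downloaded at this point.
database_download_commands sacCer3 cas9 cpf1
```

Once the printed commands look right, they can be executed by piping the output to sh.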

Folder Structure

After downloading/unzipping the required files, you will want to rename files and arrange them in the following folder structure:

├── databases
│   └── cas9
│       └── sacCer3.bam.sorted
└── indices
    ├── sacCer3.index.forward
    └── sacCer3.index.reverse

In particular, BAM files should be named <organism>.bam.sorted and arranged by <enzyme> folder. Index files should be named <organism>.index.<extension>. In the case shown above, set GUIDESCAN_BAM_PATH to the absolute path to the databases folder, and GUIDESCAN_INDEX_PATH to the absolute path to the indices folder.
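
The renaming rules above can be sketched as two small shell functions. The function names and the source file names in the usage comments are hypothetical, since the names of the downloaded files depend on what guidescan download produces.

```shell
# Hypothetical helper: move a BAM file to
# <dest>/databases/<enzyme>/<organism>.bam.sorted
place_bam() {
    local src="$1" dest="$2" organism="$3" enzyme="$4"
    mkdir -p "$dest/databases/$enzyme"
    mv "$src" "$dest/databases/$enzyme/$organism.bam.sorted"
}

# Hypothetical helper: move an index file to
# <dest>/indices/<organism>.index.<extension>
place_index() {
    local src="$1" dest="$2" organism="$3" extension="$4"
    mkdir -p "$dest/indices"
    mv "$src" "$dest/indices/$organism.index.$extension"
}

# Usage (source file names are placeholders, not real download names):
#   place_bam   downloads/some_database_file data sacCer3 cas9
#   place_index downloads/some_index_file    data sacCer3 forward
```

With data as the destination, GUIDESCAN_BAM_PATH would then be the absolute path of data/databases and GUIDESCAN_INDEX_PATH the absolute path of data/indices.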