DanteV19/adaptive_sampling

BOSS-RUNS and Readfish: adaptive sampling algorithms for Nanopore sequencing

Several adaptive sampling algorithms have the potential to be used for pediatric acute myeloid leukemia (AML).

For this project, we are going to test BOSS-RUNS and Readfish using playback sequencing.

Requirements

Performing adaptive sampling on playback sequencing requires:

  • Server with GPU(s) available (for GPU basecalling)
  • Human reference genome (.mmi)
  • Bed and toml files (for Readfish and BOSS-RUNS)
  • Bulk file (generated from earlier sequencing experiments)

Requesting GPUs

BOSS-RUNS and Readfish use GPUs for basecalling, so you need a shell with GPU access. When requesting a node, request GPUs as well as sufficient temporary space for the installations.

To request a GPU on the HPC, enter the command:

srun --partition=gpu --nodes=1 --gpus=1 --gres=tmpspace:20G --ntasks=1 --mem=100GB --time=02:00:00 --pty bash

If on a Spider server, use the following command instead:

srun --partition=gpu_a100_7c --time=06:00:00 --gpus a100:4 --pty bash

Preparing human reference data

You can download the most recent human reference genome (FASTA) from the NCBI:

. /miniconda3/bin/activate ncbi_datasets
datasets download genome accession GCF_000001405.40 --include genome
unzip ncbi_dataset.zip

After the download is finished, the location of the genome assembly will be: ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna

Adapt the headers of the FASTA assembly file so that the sequences are recognized as the reference when executing BOSS-RUNS. BOSS-RUNS reads each header only up to the first space, so a sequence with the header ">chr1 NC_12345.1 ..." is recognized only as "chr1".

To adapt the reference you can execute the bash scripts:

bash /copy_to_container/get_sequence_per_chromosome.bash ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna > pre_small_genome.fa
bash /copy_to_container/add_chromosome_name_to_header.bash pre_small_genome.fa > small_genome.fa
rm pre_small_genome.fa
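To check the result, it can help to list the sequence names the way a tool that stops at the first space would see them. The following is a minimal sketch; the function name fasta_names is hypothetical and is not part of BOSS-RUNS or of the scripts in this repository.

```shell
# Print each FASTA sequence name as seen by a tool that reads the header
# only up to the first whitespace (">chr1 NC_12345.1 ..." -> "chr1").
fasta_names() {
  awk '/^>/ { name = $1; sub(/^>/, "", name); print name }' "$1"
}

# Usage (file name taken from the steps above):
#   fasta_names small_genome.fa | head
```

If the names printed here are not the chromosome names you expect (for example "chr1"), the headers probably still need to be adapted.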

Bed and toml files

A bed file contains the genomic coordinates of the regions of interest. The bed file that targets four AML-associated genes is already available at: copy_to_container/targets_4genes.bed
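Since the reference sequences are matched by name, the chromosome names in the first column of the bed file should also occur as sequence names in the reference FASTA. The sketch below lists bed chromosome names that are missing from a FASTA file. The function name is hypothetical, and it assumes a plain whitespace-separated bed file with at least three columns.

```shell
# List chromosome names used in a BED file (column 1) that do not occur
# as sequence names in a FASTA file. No output means every name matches.
bed_chroms_missing_in_fasta() {
  bed="$1"; fasta="$2"
  awk '
    # First file (FASTA): remember every name up to the first whitespace.
    NR == FNR { if (/^>/) { n = $1; sub(/^>/, "", n); seen[n] = 1 }; next }
    # Second file (BED): skip comments, track/browser lines, short lines.
    /^(#|track|browser)/ || NF < 3 { next }
    # Report each unknown chromosome name once.
    !($1 in seen) && !reported[$1]++ { print $1 }
  ' "$fasta" "$bed"
}

# Usage (paths taken from this README):
#   bed_chroms_missing_in_fasta copy_to_container/targets_4genes.bed small_genome.fa
```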

In addition, two toml files are available to configure adaptive sampling:

  • copy_to_container/R10_prom_4genes_boss_and_readfish.toml (for Readfish config)
  • copy_to_container/boss_prom_rf_and_boss.toml (for BOSS-RUNS config)

You will need to give the correct path to the human reference genome/index in the tomls so the reference genome can be recognized during adaptive sampling:

  • copy_to_container/boss_prom_rf_and_boss.toml (TODO: provide correct absolute path to reference/index file for ref=[path/to/reference] and/or mmi=[path/to/index])
  • copy_to_container/R10_prom_4genes_boss_and_readfish.toml (TODO: provide correct absolute path to reference/index file for fn_idx_in=[path/to/reference_or_index])
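Before editing the tomls, you may want to confirm that the path you are about to enter is absolute and points to a readable file. A minimal sketch follows; the function name check_reference_path is hypothetical.

```shell
# Return 0 if the argument is an absolute path to a readable file,
# 1 if the path is not absolute, 2 if the file cannot be read.
check_reference_path() {
  case "$1" in
    /*) ;;
    *) echo "not an absolute path: $1" >&2; return 1 ;;
  esac
  if [ ! -r "$1" ]; then
    echo "file not readable: $1" >&2
    return 2
  fi
  return 0
}

# Usage (example path, adjust to your own reference or index):
#   check_reference_path "$(pwd)/small_genome.fa" && echo "path looks fine"
```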

Bulk data

To run playback sequencing you need a bulk file (FAST5) with previously recorded sequencing data, which playback replays to generate new reads. A FAST5 file contains raw sequencing data, such as the raw electrical signal corresponding to the ONT reads that were sequenced by the nanopore.

The Readfish GitHub page provides links to download a bulk R9 or R10 file. The files consist of the raw electrical data including metadata generated by the respective nanopore version (R9 or R10). The MinION bulk R10 file can be downloaded with:

wget -O R10_bulk.fast5 https://s3.amazonaws.com/nanopore-human-wgs/bulkfile/GXB02001_20230509_1250_FAW79338_X3_sequencing_run_NA12878_B1_19382aa5_ef4362cd.fast5

While the PromethION bulk R10 file can be downloaded with:

wget https://s3.amazonaws.com/nanopore-human-wgs/bulkfile/PC24B243_20220512_1516_PAK21362_3H_sequencing_run_NA12878_sheared20kb_3d5147fc.fast5

If you prefer to generate your own bulk files instead of using the available ones, select "bulk" in the sequencing settings of an ONT device. To generate a valid bulk file for playback sequencing, select all three advanced options:

  • Events
  • Read table
  • Raw

Once the bulk file is available, its path must be referenced to launch playback within the container.
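FAST5 files are HDF5 containers, so a quick sanity check on a downloaded bulk file is to look for the HDF5 signature in its first 8 bytes. The sketch below does only that. The function name looks_like_hdf5 is hypothetical, and a pass only shows that the file is HDF5, not that it is a valid bulk file. The HDF5 format also allows the signature at later offsets, so a failure is a hint rather than proof.

```shell
# Return 0 if the file starts with the 8-byte HDF5 signature
# (\211 H D F \r \n \032 \n), non-zero otherwise.
looks_like_hdf5() {
  sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "894844460d0a1a0a" ]
}

# Usage (file name taken from the download command above):
#   looks_like_hdf5 R10_bulk.fast5 && echo "HDF5 signature found"
```

A truncated download or a saved HTML error page would fail this check.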

Start building a singularity container from the definition file

I have developed a definition file (bossruns.def) to build a Singularity container (https://apptainer.org/docs/user/latest/) to ensure reproducibility when running BOSS-RUNS or Readfish.

To build the container enter the following:

singularity build -F bossruns.sif bossruns.def

Additionally, to set configurations for playback you should use an overlay filesystem, which can be created with:

singularity overlay create --size 10240 overlay.img

The shell for the container can then be run with:

singularity shell --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif

The --nv argument makes the requested GPU(s) available inside the shell of the container.

Start adaptive sampling on playback sequencing

Playback sequencing is a method where previously recorded sequencing data is replayed through a sequencing platform. It can be used for evaluation of the performance of bioinformatics tools or sequencing strategies. In the context of nanopore sequencing, playback sequencing involves re-analyzing sequencing data from a previous experiment to assess the efficiency and accuracy of new tools or algorithms. This approach allows researchers to compare different tools under controlled conditions and evaluate their performance without the need for additional sequencing experiments.

The code from https://github.com/W-L/simION/tree/main has been adapted to be used to launch playback with a simulated PromethION device.

Once the container has been built you can perform adaptive sampling using playback sequencing in two ways: sbatch jobs or manually.

Regardless of which approach you take make sure the MinKNOW configurations are in order first:

singularity exec --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif /opt/ont/minknow/bin/config_editor --conf user --filename /opt/ont/minknow/conf/user_conf --set output_dirs.logs="logs" --set output_dirs.base="$(pwd)/minknow_run"

Sbatch jobs for adaptive sampling on playback sequencing

Sbatch jobs are the easier route, since all the scripts to run adaptive sampling and its analysis are already available. Before submitting sbatch jobs, make sure there are GPUs available on your server.

Submit the available sbatch scripts one at a time, from the directory that contains the container (bossruns.sif), in the following order:

# Performing adaptive sampling using playback with the given path to the bulk file
sbatch copy_to_container/sbatch_playback_7c.sh [path/to/bulk]
# Generating readfish stats results from adaptive sampling output with the given paths to the output directory of adaptive sampling
sbatch copy_to_container/sbatch_readfish_stats.sh [paths/to/outputdirs_of_adaptive_sampling]
# Generating plots for analysis of adaptive sampling results with the given path to the human reference genome (.mmi) and paths to the output directory of adaptive sampling
sbatch copy_to_container/sbatch_all_plots_boss_readfish.sh [path/to/humanref] [paths/to/outputdirs_of_adaptive_sampling]

NOTE: you may need to edit the sbatch scripts depending on the GPU partitions available on your server

Manual adaptive sampling on playback sequencing

Another approach to adaptive sampling with a container is by entering commands manually within the shell of the container:

singularity shell --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif

Set ONT configs

Before you can start playback you have to specify the location for the output directory of MinKNOW with the user configuration and set up ONT configs while you are in the shell of the container:

/opt/ont/minknow/bin/config_editor --conf user --filename /opt/ont/minknow/conf/user_conf --set output_dirs.logs="logs" --set output_dirs.base="$(pwd)/minknow_run"
/opt/ont/minknow/bin/config_editor --conf application --filename /opt/ont/minknow/conf/app_conf --set disk_space_warnings.reads.minimum_space_mb=1000

Running playback ONT sequencing

Playback ONT sequencing is usually done through the MinKNOW GUI. Since we are working with a Singularity container, playback is instead run from the terminal, in a shell spawned from the container. See the simION GitHub page for more information: https://github.com/W-L/simION/tree/main.

To launch playback enter the following command:

bash /simION/code/launch_playback_prom.sh [path/to/bulk]

If everything goes well, you should see the following output on the terminal:

...

PHASE_INITIALISING

PHASE_MUX_SCAN

This continues until you see the message indicating that the actual sequencing has started:

PHASE_SEQUENCING

Now playback sequencing is running!
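If you would rather not watch the terminal, you could wait for the phase message in a log file. The sketch below assumes you saved the terminal output of the launch command to a file yourself (for example with tee). The function name wait_for_phase and the log file name are hypothetical.

```shell
# wait_for_phase LOG PHASE [TIMEOUT_SECONDS]
# Poll LOG every 5 seconds until it contains PHASE.
# Returns 0 once the phase appears, 1 if the timeout (default 600 s) passes.
wait_for_phase() {
  log="$1"; phase="$2"; timeout="${3:-600}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -f "$log" ] && grep -qF "$phase" "$log"; then
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  return 1
}

# Usage, assuming the launch output was captured in playback.log:
#   wait_for_phase playback.log PHASE_SEQUENCING 1800 && echo "sequencing started"
```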

Note: once you are done using BOSS-RUNS or Readfish, you can stop the playback with the command:

python /simION/code/stop_minknow_run.py

Warning: do not stop playback sequencing before using BOSS-RUNS, since BOSS-RUNS operates on the sequence data generated by the playback.

Performing targeted enrichment on four AML-associated genes with BOSS-RUNS/Readfish

While sequencing data is being generated by playback sequencing, you can change the behaviour of data acquisition by performing adaptive sampling with BOSS-RUNS/Readfish.

Once playback is running, you can perform adaptive sampling using a bed file and two toml files:

  • Bed file with gene coordinates
  • Toml for BOSS-RUNS
  • Toml for Readfish

Perform the targeted enrichment with BOSS-RUNS:

boss --toml /copy_to_container/boss_prom_rf_and_boss.toml --toml_readfish /copy_to_container/R10_prom_4genes_boss_and_readfish.toml

or with Readfish (the log file name uses a shell variable, monthday, which has to be set beforehand):

readfish targets --toml /copy_to_container/R10_prom_4genes_only_readfish.toml --device 1A --log-file test_${monthday}_targets.log --experiment-name hum_test
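The Readfish command expands ${monthday} in the log file name, but this README does not define that variable. One way to set it, assuming it is meant to hold the current month and day, is:

```shell
# Assumption: monthday holds the current month and day as MMDD (e.g. 0101).
monthday=$(date +%m%d)

# Show the log file name that the readfish command would then use.
echo "log file: test_${monthday}_targets.log"
```

If the variable is left unset, the shell expands it to an empty string and the log file is named test__targets.log.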

Once you have sufficient sequencing data you can end playback with:

python3 /simION/code/stop_minknow_run.py

Analysis of sequencing data with readfish stats

Finally, you will have sequencing data enriched by BOSS-RUNS/Readfish. The sequencing data is either in FASTQ format (in fastq_pass/) or in POD5 format (in pod5_pass/). The output is located at:

minknow_run/no_group/no_sample/[DATE]_[TIME]_A1_[IDENTIFIER]/

where the date and time are those at which playback sequencing started.

You can generate results from the adaptive sampling with the readfish stats subcommand of Readfish. For example:

readfish stats --toml copy_to_container/R10_prom_4genes_only_readfish.toml --fastq-directory minknow_run/no_group/no_sample/20250101_1234_A1__abcd1234/fastq_pass/
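Because the run directory name depends on when playback started, it can be convenient to look it up rather than type it. The sketch below picks the run directory that sorts last by name, which is the most recent one as long as the names start with the date and time as described above. The function name latest_run_dir is hypothetical.

```shell
# Print the run directory under BASE/no_group/no_sample/ whose name sorts
# last; returns non-zero if there is none. BASE defaults to minknow_run.
latest_run_dir() {
  base="${1:-minknow_run}"
  latest=""
  # Glob results are sorted, so the last directory seen is the newest.
  for d in "$base"/no_group/no_sample/*/; do
    [ -d "$d" ] || continue
    latest="${d%/}"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage with the readfish stats command shown above:
#   readfish stats --toml copy_to_container/R10_prom_4genes_only_readfish.toml \
#     --fastq-directory "$(latest_run_dir)/fastq_pass/"
```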

Troubleshooting

Several possible errors are listed below, each with its solution.

The build of the container from the definition file could not be completed due to error

If an error occurs while building the container, then, depending on the error, the version of Readfish or BOSS-RUNS may have been deprecated because of newer MinKNOW software. In that case, the respective authors need to be contacted.

Error during playback

If you get the following error during the launch of playback sequencing:

Traceback (most recent call last):
  File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/tools/protocols.py", line 54, in find_protocol
    response = device_connection.protocol.list_protocols(force_reload=force_reload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/protocol_service.py", line 701, in list_protocols
    return run_with_retry(self._stub.list_protocols,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/protocol_service.py", line 120, in run_with_retry
    result = MessageWrapper(method(message, timeout=timeout), unwraps=unwraps)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/simion/lib/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/simion/lib/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Failed to parse response from protocol selector: Connection timed out"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:8002 {created_time:"2024-11-13T12:05:14.33745037+01:00", grpc_status:2, grpc_message:"Failed to parse response from protocol selector: Connection timed out"}"

Setting the ONT configs (again) may solve this. It can take some time before the configurations have been set.

If the error persists, the absolute path to the bulk file may not be referenced correctly, or the bulk file may be in the wrong format.

Corrupted overlay

If an error message includes the following:

The following error occurred loading a configuration file: rapidjson internal assertion failure: IsObject()

the overlay has been corrupted. You will need to create a new overlay and open a shell with the new overlay instead.
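Before creating the new overlay, you may want to move the corrupted one out of the way instead of deleting it, so that it can still be inspected later. A minimal sketch follows; the function name retire_overlay and the backup naming scheme are hypothetical.

```shell
# Rename an overlay image to <name>.corrupt.<timestamp> and print the
# new name. Returns non-zero if the overlay does not exist.
retire_overlay() {
  overlay="${1:-overlay.img}"
  if [ ! -f "$overlay" ]; then
    echo "no such overlay: $overlay" >&2
    return 1
  fi
  backup="${overlay}.corrupt.$(date +%Y%m%d_%H%M%S)"
  mv "$overlay" "$backup" && printf '%s\n' "$backup"
}

# Afterwards, recreate the overlay with the command from the build section:
#   singularity overlay create --size 10240 overlay.img
```

Settings stored in the old overlay, such as the MinKNOW configurations, are not carried over, so the ONT configs have to be set again in the new one.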
