Several adaptive sampling algorithms could be applied to pediatric acute myeloid leukemia (AML), including:
- BOSS-RUNS https://doi.org/10.1038/s41587-022-01580-z
- Readfish https://doi.org/10.1038/s41587-020-00746-x
For this project, we will test BOSS-RUNS and Readfish using playback sequencing.
Performing adaptive sampling on playback sequencing has several requirements:
- Server with GPU(s) available (for GPU basecalling)
- Human reference genome (.mmi)
- Bed and toml files (for Readfish and BOSS-RUNS)
- Bulk file (generated from earlier sequencing experiments)
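Before requesting a node, it can help to confirm that the input files from the list above are in place. The sketch below is not part of the repository: the helper name `missing_inputs` is hypothetical, and the paths in the example call are placeholders.

```shell
# Print every path that is not an existing regular file.
# Returns 1 if at least one path is missing, 0 otherwise.
# Hypothetical helper, not part of BOSS-RUNS, Readfish or the container.
missing_inputs() {
    local status=0 path
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Example call (placeholder paths, adjust them to your setup):
# missing_inputs small_genome.mmi copy_to_container/targets_4genes.bed R10_bulk.fast5
```

The function only checks that the files exist; it does not check their contents.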
Because BOSS-RUNS and Readfish use GPUs for basecalling, first open a shell with GPU access. When requesting nodes, ask for GPUs and for sufficient temporary space for the installations.
To request a GPU on the HPC, enter the command:
srun --partition=gpu --nodes=1 --gpus=1 --gres=tmpspace:20G --ntasks=1 --mem=100GB --time=02:00:00 --pty bash
If on a Spider server, use the following command instead:
srun --partition=gpu_a100_7c --time=06:00:00 --gpus a100:4 --pty bash
You can download the human reference genome GRCh38.p14 (FASTA) from the NCBI:
. /miniconda3/bin/activate ncbi_datasets
datasets download genome accession GCF_000001405.40 --include genome
unzip ncbi_dataset.zip
After the download is finished, the location of the genome assembly will be: ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna
You should adapt the headers of the FASTA assembly file so that the sequences are recognized as the reference when executing BOSS-RUNS. BOSS-RUNS reads only the start of each header, up to the first space. For example, a sequence with the header ">chr1 NC_12345.1 ..." is recognized only as "chr1".
To adapt the reference you can execute the bash scripts:
bash /copy_to_container/get_sequence_per_chromosome.bash ncbi_dataset/data/GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna > pre_small_genome.fa
bash /copy_to_container/add_chromosome_name_to_header.bash pre_small_genome.fa > small_genome.fa
rm pre_small_genome.fa
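To check the result, you can list the sequence names the way the header rule above describes them (everything up to the first space). This is a minimal sketch for inspection only; `fasta_ids` is a hypothetical helper and not part of BOSS-RUNS.

```shell
# List sequence names as a parser that stops at the first space would see them.
# Hypothetical helper for inspection only.
fasta_ids() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//'
}

# Example call:
# fasta_ids small_genome.fa | head
```

If the names printed here differ from the chromosome names used in your bed file, the targets are unlikely to be matched to the reference.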
A bed file contains the genomic coordinates of the regions of interest. The bed file to target four AML-associated genes is already available at: /copy_to_container/targets_4genes.bed
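If you write your own bed file, a quick sanity check can catch common mistakes. The sketch below assumes the standard tab-separated BED layout (chromosome, 0-based start, exclusive end, optional further columns); `validate_bed` is a hypothetical helper, and it has not been checked against the exact parsing rules of Readfish or BOSS-RUNS.

```shell
# Report lines that do not look like valid BED records.
# Returns 1 if any problem is found, 0 otherwise.
validate_bed() {
    awk -F '\t' '
        NF == 0 { next }                        # skip empty lines
        /^(#|track|browser)/ { next }           # skip comment and header lines
        NF < 3 {
            printf "line %d: fewer than 3 columns\n", NR; bad = 1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: non-numeric coordinates\n", NR; bad = 1; next
        }
        $2 + 0 >= $3 + 0 {
            printf "line %d: start is not smaller than end\n", NR; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call:
# validate_bed /copy_to_container/targets_4genes.bed
```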
In addition, two toml files are available to configure adaptive sampling:
- /copy_to_container/R10_prom_4genes_boss_and_readfish.toml (for Readfish config)
- /copy_to_container/boss_prom_rf_and_boss.toml (for BOSS-RUNS config)
You will need to set the correct path to the human reference genome/index in the toml files so that the reference genome can be found during adaptive sampling:
- copy_to_container/boss_prom_rf_and_boss.toml (TODO: provide correct absolute path to reference/index file for ref=[path/to/reference] and/or mmi=[path/to/index])
- copy_to_container/R10_prom_4genes_boss_and_readfish.toml (TODO: provide correct absolute path to reference/index file for fn_idx_in=[path/to/reference_or_index])
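After editing the toml files, you can verify that the configured paths exist. The sketch below assumes each key sits on its own line in the simple form `key = "value"`; it is not a full TOML parser, and the helper names `toml_value` and `check_toml_path` are hypothetical.

```shell
# Print the quoted value assigned to a key in a TOML file (first match only).
# Assumes simple lines of the form: key = "value"
toml_value() {
    local key=$1 file=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | head -n 1
}

# Report whether the path stored under a key exists on disk.
check_toml_path() {
    local key=$1 file=$2 value
    value=$(toml_value "$key" "$file")
    if [ -n "$value" ] && [ -e "$value" ]; then
        echo "ok: $key = $value"
    else
        echo "not found: $key = ${value:-<unset>}"
        return 1
    fi
}

# Example calls, using the keys named above:
# check_toml_path fn_idx_in copy_to_container/R10_prom_4genes_boss_and_readfish.toml
# check_toml_path mmi copy_to_container/boss_prom_rf_and_boss.toml
```

Run the check from the same environment in which adaptive sampling will run, because paths bound into the container can differ from paths on the host.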
To run playback sequencing you need a bulk file (FAST5) with previously recorded sequencing data, which playback replays to generate new reads. A FAST5 file contains raw sequencing data, such as the raw electrical signal of the ONT reads sequenced by the nanopore.
The Readfish GitHub page provides links to download a bulk R9 or R10 file. The files consist of the raw electrical data including metadata generated by the respective nanopore version (R9 or R10). The MinION bulk R10 file can be downloaded with:
wget -O R10_bulk.fast5 https://s3.amazonaws.com/nanopore-human-wgs/bulkfile/GXB02001_20230509_1250_FAW79338_X3_sequencing_run_NA12878_B1_19382aa5_ef4362cd.fast5
The PromethION bulk R10 file can be downloaded with:
wget https://s3.amazonaws.com/nanopore-human-wgs/bulkfile/PC24B243_20220512_1516_PAK21362_3H_sequencing_run_NA12878_sheared20kb_3d5147fc.fast5
If you prefer to generate bulk files yourself instead of using the available ones, select "bulk" in the sequencing settings on an ONT device. To generate a valid bulk file for playback sequencing you need to select all three advanced options:
- Events
- Read table
- Raw
Once the bulk file is available, its path must be provided when launching playback within the container.
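FAST5 files are HDF5 containers, so a truncated or mislabeled download can be detected by looking at the first 8 bytes. The sketch below checks only the HDF5 signature at the start of the file (assuming no HDF5 user block); it does not verify that the bulk contents are valid for playback. `is_hdf5` is a hypothetical helper.

```shell
# Return 0 if the file starts with the 8-byte HDF5 signature
# (0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'), 1 otherwise.
is_hdf5() {
    local magic
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "894844460d0a1a0a" ]
}

# Example call:
# is_hdf5 R10_bulk.fast5 && echo "looks like HDF5" || echo "not an HDF5 file"
```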
I have developed a definition file (bossruns.def) to build a Singularity container (https://apptainer.org/docs/user/latest/) to ensure reproducibility when running BOSS-RUNS or Readfish.
To build the container enter the following:
singularity build -F bossruns.sif bossruns.def
Additionally, to set configurations for playback you should use an overlay filesystem, which can be created with:
singularity overlay create --size 10240 overlay.img
The shell for the container can then be run with:
singularity shell --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif
The --nv argument makes the requested GPU(s) available inside the shell of the container.
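To confirm that the GPUs are visible inside the container shell, you can count the devices reported by `nvidia-smi -L`. This is a minimal sketch; `visible_gpus` is a hypothetical helper, and its optional argument (an alternative command name) exists only to make the function easy to test.

```shell
# Count the GPUs listed by "nvidia-smi -L"; prints 0 if the tool is unavailable.
# Optional first argument: name of the command to query (default: nvidia-smi).
visible_gpus() {
    local smi=${1:-nvidia-smi}
    if command -v "$smi" >/dev/null 2>&1; then
        # Each device is listed on a line starting with "GPU <index>:"
        "$smi" -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

# Example call inside the container shell:
# visible_gpus
```

A result of 0 suggests that the node has no GPU allocated or that the shell was started without --nv.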
Playback sequencing replays previously recorded sequencing data through a sequencing platform. In the context of nanopore sequencing, this means re-analyzing the data from a previous experiment to assess the efficiency and accuracy of new tools or algorithms. Researchers can thus compare tools or sequencing strategies under controlled conditions without additional sequencing experiments.
The code from https://github.com/W-L/simION/tree/main has been adapted to be used to launch playback with a simulated PromethION device.
Once the container has been built you can perform adaptive sampling using playback sequencing in two ways: sbatch jobs or manually.
Whichever approach you take, first make sure the MinKNOW configuration is set:
singularity exec --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif /opt/ont/minknow/bin/config_editor --conf user --filename /opt/ont/minknow/conf/user_conf --set output_dirs.logs="logs" --set output_dirs.base="$(pwd)/minknow_run"
Sbatch jobs are easier for adaptive sampling since all the scripts to run adaptive sampling and its analysis are already available. Before creating sbatch jobs make sure there are GPUs available on your server.
Submit the available sbatch scripts one at a time, from the directory that contains the container (bossruns.sif), in the following order:
# Performing adaptive sampling using playback with the given path to the bulk file
sbatch copy_to_container/sbatch_playback_7c.sh [path/to/bulk]
# Generating readfish stats results from adaptive sampling output with the given paths to the output directory of adaptive sampling
sbatch copy_to_container/sbatch_readfish_stats.sh [paths/to/outputdirs_of_adaptive_sampling]
# Generating plots for analysis of adaptive sampling results with the given path to the human reference genome (.mmi) and paths to the output directory of adaptive sampling
sbatch copy_to_container/sbatch_all_plots_boss_readfish.sh [path/to/humanref] [paths/to/outputdirs_of_adaptive_sampling]
NOTE: you may need to edit the sbatch scripts depending on the GPU partitions available on your server
Alternatively, you can run adaptive sampling by entering the commands manually in the shell of the container:
singularity shell --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif
Before starting playback, set the MinKNOW output directory in the user configuration and set up the ONT application configuration from within the shell of the container:
/opt/ont/minknow/bin/config_editor --conf user --filename /opt/ont/minknow/conf/user_conf --set output_dirs.logs="logs" --set output_dirs.base="$(pwd)/minknow_run"
/opt/ont/minknow/bin/config_editor --conf application --filename /opt/ont/minknow/conf/app_conf --set disk_space_warnings.reads.minimum_space_mb=1000
Playback ONT sequencing is usually run from the MinKNOW GUI. Because we are using a Singularity container, playback is run from the terminal instead, in a shell spawned from the container.
See the simION GitHub page for more information: https://github.com/W-L/simION/tree/main
To launch playback enter the following command:
bash /simION/code/launch_playback_prom.sh [path/to/bulk]
If everything works, you should see the following output in the terminal:
...
PHASE_INITIALISING
PHASE_MUX_SCAN
The output continues until you see the message indicating that the actual sequencing has started:
PHASE_SEQUENCING
Now playback sequencing is running!
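If you would rather not watch the terminal, the phase messages can be polled from a log file. The sketch below assumes that the output of the launch script was redirected to a file; the file name `playback.log` and the helper `wait_for_sequencing` are hypothetical, and whether the launch script can be run in the background this way has not been verified.

```shell
# Poll a log file until it contains PHASE_SEQUENCING or the timeout expires.
# Arguments: log file, timeout in seconds (default 600).
wait_for_sequencing() {
    local log=$1 timeout=${2:-600} waited=0
    until grep -q 'PHASE_SEQUENCING' "$log" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s"
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "playback is sequencing"
}

# Example (playback.log is a hypothetical file name):
# bash /simION/code/launch_playback_prom.sh [path/to/bulk] > playback.log 2>&1 &
# wait_for_sequencing playback.log 900
```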
Note: If you are done using BOSS-RUNS, you can stop the playback by entering the command:
python /simION/code/stop_minknow_run.py
!!! Do not stop playback sequencing before running BOSS-RUNS, because BOSS-RUNS operates on the sequencing data generated by the playback.
While sequencing data is being generated by playback sequencing, you can change the behaviour of data acquisition by performing adaptive sampling with BOSS-RUNS/Readfish.
Once playback is running, you can perform adaptive sampling using a bed file and two toml files:
- Bed file with gene coordinates
- Toml for BOSS-RUNS
- Toml for Readfish
Perform the targeted enrichment with BOSS-RUNS:
boss --toml /copy_to_container/boss_prom_rf_and_boss.toml --toml_readfish /copy_to_container/R10_prom_4genes_boss_and_readfish.toml
or with Readfish:
readfish targets --toml /copy_to_container/R10_prom_4genes_only_readfish.toml --device 1A --log-file test_${monthday}_targets.log --experiment-name hum_test
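The Readfish command above uses a `${monthday}` variable that is not defined in this document. A minimal sketch, assuming it is meant to hold a month and day stamp for the log file name (this is an assumption, not something the source confirms):

```shell
# Assumption: monthday is a month+day stamp such as 0113 used to label the log file.
monthday=$(date +%m%d)

# Show the log file name that the readfish command would then use.
echo "test_${monthday}_targets.log"
```

Set the variable in the same shell before running `readfish targets`; otherwise the log file is simply named test__targets.log.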
Once you have sufficient sequencing data you can end playback with:
python3 /simION/code/stop_minknow_run.py
Finally, you will have sequencing data enriched by BOSS-RUNS/Readfish. The sequencing data is in either FASTQ format (in fastq_pass/) or POD5 format (in pod5_pass/). The output is located at: minknow_run/no_group/no_sample/[DATE]_[TIME]_A1_[IDENTIFIER]/ where the date and time are those at which playback sequencing started.
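Because the run directory name is only known once playback has started, it can be convenient to look it up. The sketch below relies on the directory layout described above; the helper names `latest_run_dir` and `count_fastq_reads` are hypothetical, and the read count assumes standard 4-line FASTQ records.

```shell
# Print the most recent run directory under the MinKNOW output base.
# Directory names start with [DATE]_[TIME], so a plain sort is chronological.
latest_run_dir() {
    local base=${1:-minknow_run/no_group/no_sample}
    find "$base" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1
}

# Count reads in all FASTQ files below a directory (plain or gzipped).
count_fastq_reads() {
    local dir=$1 lines
    lines=$(
        {
            find "$dir" -type f -name '*.fastq' -exec cat {} +
            find "$dir" -type f -name '*.fastq.gz' -exec gzip -dc {} +
        } | wc -l
    )
    echo $((lines / 4))
}

# Example calls:
# run=$(latest_run_dir)
# count_fastq_reads "$run/fastq_pass"
```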
You can summarize the results of the adaptive sampling with the readfish stats subcommand. For example:
readfish stats --toml copy_to_container/R10_prom_4genes_only_readfish.toml --fastq-directory minknow_run/no_group/no_sample/20250101_1234_A1__abcd1234/fastq_pass/
Several possible errors and their solutions are listed below.
If an error occurs while building the container, then, depending on the error, the version of Readfish or BOSS-RUNS may have been deprecated because of newer MinKNOW software. In that case, the respective authors need to be contacted.
If you get the following error during the launch of playback sequencing:
Traceback (most recent call last):
File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/tools/protocols.py", line 54, in find_protocol
response = device_connection.protocol.list_protocols(force_reload=force_reload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/protocol_service.py", line 701, in list_protocols
return run_with_retry(self._stub.list_protocols,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/simion/lib/python3.11/site-packages/minknow_api/protocol_service.py", line 120, in run_with_retry
result = MessageWrapper(method(message, timeout=timeout), unwraps=unwraps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/simion/lib/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
return _end_unary_response_blocking(state, call, False, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/simion/lib/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Failed to parse response from protocol selector: Connection timed out"
debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:8002 {created_time:"2024-11-13T12:05:14.33745037+01:00", grpc_status:2, grpc_message:"Failed to parse response from protocol selector: Connection timed out"}"
You can solve this by setting the ONT configs again. It can take some time before the configurations have been set.
If the error persists, the absolute path to the bulk file may be incorrect or the bulk file may be in the wrong format.
If the error message includes the following:
The following error occurred loading a configuration file: rapidjson internal assertion failure: IsObject()
the overlay has been corrupted. You will need to create a new overlay and open a shell with the new overlay instead.
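Replacing the overlay can be sketched as follows, reusing the overlay size from the build instructions. The helper name `recreate_overlay` and the naming of the backup file are hypothetical; keeping the old image is optional and only done here so that nothing is deleted by accident.

```shell
# Keep the corrupted overlay under a new name and create a fresh one.
# Argument: overlay image path (default: overlay.img).
recreate_overlay() {
    local img=${1:-overlay.img}
    if [ -e "$img" ]; then
        # Rename instead of deleting, so the old image can still be inspected.
        mv "$img" "$img.corrupt.$(date +%Y%m%d_%H%M%S)"
    fi
    singularity overlay create --size 10240 "$img"
}

# Example call, followed by opening a shell with the new overlay:
# recreate_overlay overlay.img
# singularity shell --userns --bind $TMPDIR:/tmp --nv --overlay overlay.img bossruns.sif
```

A new overlay starts empty, so the MinKNOW and ONT configurations most likely need to be set again before launching playback.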