test: #7 read sequencer #27

ninsch3000 · 2023-10-27T09:35:40Z

README description

Read Sequencer

Overview

Read Sequencer is a Python package that simulates sequencing.
It reads FASTA files, simulates sequencing with a specified read length, and writes the resulting sequences to a new FASTA file.

Installation from GitHub

Read Sequencer requires Python 3.9 or later.

Install Read Sequencer from GitHub using:

git clone https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer.git
cd read-sequencer
pip install .

Usage

usage: read-sequencer [-h] [-i INPUT] [-r READ_LENGTH] [-n N_RANDOM] [-s CHUNK_SIZE] output 
Simulates sequencing of DNA sequences specified by an FASTA file.

positional arguments:
  output                path to FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to FASTA file
  -r READ_LENGTH, --read-length READ_LENGTH
                        read length for sequencing
  -n N_RANDOM, --n_random N_RANDOM
                        n random sequences. Just used if input fasta file is not specified.
  -s CHUNK_SIZE, --chunk-size CHUNK_SIZE
                        chunk_size for batch processing
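
For illustration, a parser accepting the interface shown above could be declared with `argparse` roughly as follows. This is a sketch, not the package's actual declaration: the `build_parser` name and the `int` argument types are assumptions, and no defaults are set because the help text does not state any.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help text above."""
    parser = argparse.ArgumentParser(
        prog="read-sequencer",
        description="Simulates sequencing of DNA sequences specified by an FASTA file.",
    )
    parser.add_argument("output", help="path to FASTA file")
    parser.add_argument("-i", "--input", help="path to FASTA file")
    # Assumption: numeric options are parsed as integers.
    parser.add_argument("-r", "--read-length", type=int, help="read length for sequencing")
    parser.add_argument(
        "-n",
        "--n_random",
        type=int,
        help="n random sequences. Just used if input fasta file is not specified.",
    )
    parser.add_argument("-s", "--chunk-size", type=int, help="chunk_size for batch processing")
    return parser


# Example: no input file, so 100 random sequences are requested instead.
args = build_parser().parse_args(["-r", "50", "-n", "100", "reads.fasta"])
print(args.read_length, args.n_random, args.output)
```

Note that `argparse` turns `--read-length` and `--chunk-size` into the attributes `read_length` and `chunk_size`.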

Docker

The Docker image is available on Docker Hub: https://hub.docker.com/r/grrchrr/readsequencer

docker pull grrchrr/readsequencer
docker run grrchrr/readsequencer readsequencer --help

Contributors and Contact Information

Christoph Harmel - [email protected]
Michael Sandholzer - [email protected]
Clara Serger - [email protected]

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/7

Read sequencing

Simulate the sequencing of reads on the template of terminal fragments. Reads are copies of fixed length starting from the 5' end of fragments. If the desired read length is larger than the fragment length, sequencing would in principle proceed into the 3' adaptor and then would perhaps yield random bases. For simplicity, here we assume that random nucleotides are introduced in this case.

Input:

1. FASTA-formatted file of sequences of terminal fragments from transcripts
2. Number of reads to sample
3. Read length (number of sequencing cycles)
4. Dictionary of nucleotide frequencies used to pad the read if the input fragment is too short

Output:
FASTA-formatted file of reads of identical length, representing 5' ends of the terminal fragments.

To generate each read, a terminal fragment is chosen from input 1, with replacement. Then a segment of the specified read length (input 3) is extracted from the 5' end of the terminal fragment. If the terminal fragment is shorter than the read length, random nucleotides are added to the 3' end according to the probabilities given in input 4, until the read length is reached. A unique name should be created for each read, and the name and read should be written to the output file in FASTA format. The process is repeated for the specified number of reads (input 2).
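
The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function names `simulate_reads` and `format_fasta`, the `read_<n>` naming scheme, the uniform choice of fragments, and the `seed` parameter are all made up for the example.

```python
import random


def simulate_reads(fragments, n_reads, read_length, nucleotide_freqs, seed=None):
    """Return a list of (name, read) tuples, each read exactly read_length long.

    fragments: list of terminal fragment sequences (input 1)
    n_reads: number of reads to sample (input 2)
    read_length: number of sequencing cycles (input 3)
    nucleotide_freqs: dict mapping nucleotide -> relative frequency (input 4)
    """
    rng = random.Random(seed)
    bases = list(nucleotide_freqs)
    weights = [nucleotide_freqs[base] for base in bases]
    reads = []
    for i in range(n_reads):
        # Assumption: fragments are chosen uniformly at random, with replacement.
        fragment = rng.choice(fragments)
        # Copy up to read_length bases starting at the 5' end.
        read = fragment[:read_length]
        # Pad the 3' end with random bases if the fragment was too short.
        missing = read_length - len(read)
        if missing > 0:
            read += "".join(rng.choices(bases, weights=weights, k=missing))
        # Hypothetical naming scheme: a running index keeps names unique.
        reads.append((f"read_{i + 1}", read))
    return reads


def format_fasta(reads):
    """Render (name, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in reads)


example = simulate_reads(
    fragments=["ACGTACGTAC", "TTG"],
    n_reads=4,
    read_length=6,
    nucleotide_freqs={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    seed=0,
)
print(format_fasta(example), end="")
```

Reads drawn from the long fragment are always `ACGTAC`, while reads drawn from `TTG` start with `TTG` followed by three random bases.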

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
The terminal fragments from the previous step are sampled according to input #5, to pick a fragment for sequencing. Then a piece of length input #8 is taken from the 5' end of the fragment to form a read. If the fragment is shorter than the read length (input #8), the fragment is padded with random sequence, given a vector of relative probabilities for A, C, G, T to appear in the random sequence (input #8). The output of this step is a FASTA file with "sequenced reads", which is the output of the simulation.

Project design plan

https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer/-/issues/1

Project design: read_sequencer

Input:

- FASTA: terminal fragment sequences
- total number of reads
- read length
- padding nucleotide frequencies

Output:

- FASTA: sequenced reads

Function design:

- read_in_fasta(file_path)
    - reads lines of the FASTA into dictionary of strings or pandas dataframe
    - option to generate synthetic sequences that include primers and variable length
    
- simulate_sequencing(n_reads, sequences, read_length, padding_probabilities):
    - initiate results dict
    - wrapper function that iterates over reads:
        - per read do read_sequence():
            - sample one sequence from the pool of given sequences according to relative 
            - locate position of primer sequence
            - from this position read sequence
                - if: read_length > length_sequence:
                    - add random nucleotides according to padding_probabilities to the end of
                      the sequence until read_length is reached
                - will the 'sequencing' be affected by leading/lagging nucleotides (Markov chains etc.)
                  which can affect the correct sequencing result?
            - store sequenced reads as string to result dictionary
        - return results dict which contains all sequencing reads as a FASTA file 

- needed lower level functions:
    - generate_dummy_data() / load_dummy_data()
    - read_sequence()
    - add_nucleotides_to_end()
    - sample_sequence()
    - locate_primer_site()
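
Three of the helpers named in the plan could look roughly like the sketch below. The function names come from the plan, but everything else is an assumption for illustration: the signatures, the dict return type, the exact-match primer search, and the `seq_<n>` naming of dummy records. The dummy data here varies in length but does not embed primers.

```python
import random


def read_in_fasta(file_path):
    """Parse a FASTA file into a dict mapping header text -> sequence."""
    sequences = {}
    name = None
    with open(file_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Header line: everything after '>' is used as the record name.
                name = line[1:].strip()
                sequences[name] = ""
            elif name is not None:
                # Sequence lines may be wrapped, so concatenate them.
                sequences[name] += line
    return sequences


def generate_dummy_data(n_sequences, min_length, max_length, seed=None):
    """Generate random DNA sequences of variable length for testing."""
    rng = random.Random(seed)
    data = {}
    for i in range(n_sequences):
        length = rng.randint(min_length, max_length)
        data[f"seq_{i + 1}"] = "".join(rng.choices("ACGT", k=length))
    return data


def locate_primer_site(sequence, primer):
    """Return the 0-based index of the first exact primer match, or -1 if absent."""
    return sequence.find(primer)


print(locate_primer_site("GGACGTTT", "ACG"))  # prints 2
```

Returning a plain dict keeps the sketch free of dependencies; the plan also mentions a pandas dataframe as an alternative container.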

ninsch3000 commented Oct 27, 2023 • edited by balajtimate
