Read Sequencer is a Python package to simulate sequencing.
It reads FASTA files, simulates sequencing with a specified read length, and writes the resulting sequences to a new FASTA file.
Installation from GitHub
Read Sequencer requires Python 3.9 or later.
Install Read Sequencer from GitHub using:
git clone https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer.git
cd read-sequencer
pip install .
Usage
usage: read-sequencer [-h] [-i INPUT] [-r READ_LENGTH] [-n N_RANDOM] [-s CHUNK_SIZE] output

Simulates sequencing of DNA sequences specified by an FASTA file.

positional arguments:
  output                path to FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to FASTA file
  -r READ_LENGTH, --read-length READ_LENGTH
                        read length for sequencing
  -n N_RANDOM, --n_random N_RANDOM
                        n random sequences. Just used if input fasta file is not specified.
  -s CHUNK_SIZE, --chunk-size CHUNK_SIZE
                        chunk_size for batch processing
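The help text above maps onto a command-line parser roughly like the one below. This is a minimal sketch using Python's standard argparse module, not the package's actual source: the function name build_parser, the argument types, and all default values are assumptions made for illustration. Only the flag names and help strings are taken from the usage output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the usage text; types and defaults are assumed."""
    parser = argparse.ArgumentParser(
        prog="read-sequencer",
        description="Simulates sequencing of DNA sequences specified by an FASTA file.",
    )
    # Positional argument: where the simulated reads are written.
    parser.add_argument("output", help="path to FASTA file")
    # Optional input; per the help text, random sequences are used when it is absent.
    parser.add_argument("-i", "--input", default=None, help="path to FASTA file")
    parser.add_argument("-r", "--read-length", type=int, default=100,
                        help="read length for sequencing")  # default is an assumption
    parser.add_argument("-n", "--n_random", type=int, default=100,
                        help="n random sequences. Just used if input fasta file is not specified.")
    parser.add_argument("-s", "--chunk-size", type=int, default=10000,
                        help="chunk_size for batch processing")  # default is an assumption
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "fragments.fasta", "-r", "80", "reads.fasta"])
print(args.input, args.read_length, args.output)
```

The argument list in the sketch corresponds to running `read-sequencer -i fragments.fasta -r 80 reads.fasta`, where both file names are placeholders.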
Read sequencing
Simulate the sequencing of reads on the template of terminal fragments. Reads are copies of fixed length starting from the 5' end of fragments. If the desired read length is larger than the fragment length, sequencing would in principle proceed into the 3' adaptor and then would perhaps yield random bases. For simplicity, here we assume that random nucleotides are introduced in this case.
Input:
1. Fasta-formatted file of sequences of terminal fragments from transcripts
2. Number of reads to sample
3. Read length (number of sequencing cycles)
4. Dictionary of nucleotide frequencies used to pad the read if the input fragment is too short.
Output:
Fasta-formatted file of reads of identical length, representing 5’ ends of the terminal fragments.
To generate each read, a terminal fragment is chosen from input 1, with replacement. Then a segment of the specified read length (input 3) is extracted from the terminal fragment. If the terminal fragment is shorter than the read length, then random nucleotides are added to the 3' end according to the probabilities given in input 4, until the read length is reached. A unique name should be created for each read, and the name and read should be written to the output file in fasta format. The process is repeated for the specified number of reads (input 2).
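The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: the function name simulate_reads, the read naming scheme, and the seed parameter are hypothetical, and fragments are chosen uniformly at random because the description only says "with replacement".

```python
import random


def simulate_reads(fragments, n_reads, read_length, nucleotide_freqs, seed=None):
    """Return {read_name: read}, every read exactly read_length bases long.

    fragments: dict mapping fragment name -> sequence (input 1)
    n_reads: number of reads to sample (input 2)
    read_length: number of sequencing cycles (input 3)
    nucleotide_freqs: dict mapping nucleotide -> relative frequency (input 4)
    """
    rng = random.Random(seed)
    names = list(fragments)
    bases = list(nucleotide_freqs)
    weights = [nucleotide_freqs[base] for base in bases]
    reads = {}
    for i in range(n_reads):
        # Sampling with replacement: each draw is independent of earlier draws.
        # Uniform choice is an assumption of this sketch.
        name = rng.choice(names)
        # Copy up to read_length bases starting at the 5' end.
        read = fragments[name][:read_length]
        missing = read_length - len(read)
        if missing > 0:
            # Fragment too short: pad the 3' end with random nucleotides.
            read += "".join(rng.choices(bases, weights=weights, k=missing))
        # The running index keeps names unique (hypothetical naming scheme).
        reads[f"read_{i + 1}_{name}"] = read
    return reads


fragments = {"frag_a": "ACGTACGTAC", "frag_b": "GGC"}
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
reads = simulate_reads(fragments, n_reads=5, read_length=6, nucleotide_freqs=freqs, seed=0)
for name, read in reads.items():
    print(f">{name}\n{read}")
```

Reads drawn from frag_a are the first six bases of the fragment, while reads drawn from frag_b start with GGC followed by three random bases.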
Pipeline overview description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
The terminal fragments from the previous step are sampled according to input #5 to pick a fragment for sequencing. Then a piece of length input #8 is taken from the 5' end of the fragment to form a read. If the fragment is shorter than the read length (input #8), the fragment is padded with random sequence, given a vector of relative probabilities for A, C, G, T to appear in the random sequence (input #8). The output of this step is a FASTA file of "sequenced reads", which is the output of the simulation.
Project design: read_sequencer
Input:
- FASTA: terminal fragment sequences
- total number of reads
- read length
- padding nucleotide frequencies
Output:
- FASTA: sequenced reads
Function design:
- read_in_fasta(file_path)
  - reads lines of the FASTA into dictionary of strings or pandas dataframe
  - option to generate synthetic sequences that include primers and variable length
- simulate_sequencing(n_reads, sequences, read_length, padding_probabilities):
  - initiate results dict
  - wrapper function that iterates over reads:
    - per read do read_sequence():
      - sample one sequence from the pool of given sequences according to relative
      - locate position of primer sequence
      - from this position read sequence
      - if: read_length > length_sequence:
        - add random nucleotides according to padding_probabilities to the end of the sequence until read_length is reached
      - will the 'sequencing' be affected by leading/lagging nucleotides (Markov chains etc.), which can affect the correct sequencing result?
      - store sequenced reads as string to result dictionary
  - return results dict which contains all sequencing reads as a FASTA file
- needed lower level functions:
  - generate_dummy_data() / load_dummy_data()
  - read_sequence()
  - add_nucleotides_to_end()
  - sample_sequence()
  - locate_primer_site()
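As an illustration of the I/O side of this design, here is a minimal sketch of reading FASTA into a dictionary of strings and writing a results dictionary back out. It is not the package's code: write_fasta is a hypothetical helper, and unlike the read_in_fasta(file_path) signature in the plan, this version takes an open file-like object so it can be demonstrated without touching disk. The full header line (without the leading ">") is used as the sequence name, which is also an assumption.

```python
import io


def read_in_fasta(handle):
    """Parse FASTA text from an open file-like object into {name: sequence}."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line starts a new record; keep the text after '>' as the name.
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped; concatenate them per record.
            sequences[name] += line
    return sequences


def write_fasta(sequences, handle):
    """Write {name: sequence} as FASTA, one unwrapped sequence line per record."""
    for name, sequence in sequences.items():
        handle.write(f">{name}\n{sequence}\n")


fasta_text = ">frag_a\nACGT\nACGT\n>frag_b\nGGC\n"
parsed = read_in_fasta(io.StringIO(fasta_text))

out = io.StringIO()
write_fasta(parsed, out)
print(out.getvalue(), end="")
```

With a real file, the same functions could be called as `read_in_fasta(open(path))` and `write_fasta(reads, open(out_path, "w"))`, ideally inside `with` blocks.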
Docker
The Docker image is available on Docker Hub: https://hub.docker.com/r/grrchrr/readsequencer
Contributors and Contact Information
Christoph Harmel - [email protected]
Michael Sandholzer - [email protected]
Clara Serger - [email protected]
Original issue description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/7
Project design plan
https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer/-/issues/1