By convention, the name of a top-level pipeline/workflow module should begin with "wf_".
It should:
1. load the config files
2. read the additional arguments
3. delegate to sub-level modules, which are tool sets divided by usage, such as alignment tools, quantification tools, DE detectors, etc. The final tool is decided by the sub-level tool set.
4. provide input checks and output checks for the sub-level modules, that is, a pre-module function and a post-module function.
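The four responsibilities above can be sketched roughly as follows. This is only an illustration under assumptions: the function names (load_config, parse_extra_args, run_step) and the option names (--config, --aligner) are hypothetical and not taken from the actual code.

```python
import argparse
import configparser


def load_config(paths):
    """Read one or more INI-style config files into a ConfigParser."""
    config = configparser.ConfigParser()
    config.read(paths)  # ConfigParser.read skips files that do not exist
    return config


def parse_extra_args(argv):
    """Read the additional command-line arguments (hypothetical options)."""
    parser = argparse.ArgumentParser(prog="wf_example")
    parser.add_argument("--config", action="append", default=[])
    parser.add_argument("--aligner", default="BWA")
    return parser.parse_args(argv)


def run_step(step, settings, pre_check=None, post_check=None):
    """Run one sub-level step between its pre-module and post-module checks."""
    if pre_check is not None:
        pre_check(settings)      # input check, may raise on bad settings
    result = step(settings)
    if post_check is not None:
        post_check(result)       # output check, may raise on a bad result
    return result
```

Passing the checks in as plain callables keeps the top-level module in charge of validation while the sub-level tool set only has to supply the step itself.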
A frequently seen job is the so-called "bunch", that is, a bunch of read files that share the same index.
So we need an input file, which contains the locations of these reads, and a setting file, which holds the settings for all reads in this bunch.
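A minimal sketch of reading such a bunch, assuming the input file lists one read-file path per line (blank lines and lines starting with "#" are ignored) and the setting file is INI-style. Both formats and the name read_bunch are assumptions made for illustration.

```python
import configparser


def read_bunch(list_file, setting_file):
    """Pair every read file listed in list_file with the shared settings.

    Both arguments are open text file objects.
    """
    settings = configparser.ConfigParser()
    settings.read_file(setting_file)
    # One settings dict per section, shared by all reads in the bunch.
    shared = {name: dict(settings[name]) for name in settings.sections()}
    reads = [
        line.strip()
        for line in list_file
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return [(path, shared) for path in reads]
```

Each returned pair holds one read path and the same settings dict, so every read in the bunch is processed with identical options.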
The config file keeps nearly all of the original CLI options. A controller/checker should be established, and it should provide the following features:
1. check the options before invoking external tools, and show a descriptive message when an error occurs
2. provide a query/info interface
Every module that contains an option checker should define the constant OPTION_CHECKERS, which is a list of opt-checker objects.
Yes, the output file is the logging file, and it should be written with the standard logging module. This is not done yet, which is mostly the fault of the Python logging manual.
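For reference, a file logger built on the standard logging module can be quite small. The sketch below is one possible setup, not the project's implementation; the function name get_workflow_logger and the record format are assumptions.

```python
import logging


def get_workflow_logger(name, log_path):
    """Return a logger that appends timestamped records to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A workflow would call it once, for example get_workflow_logger("wf_example", "wf_example.log"), and then use logger.info(...) around each external tool invocation.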
A mechanism for checking options and printing help is needed.
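One way such a mechanism could look is sketched below. The class name OptChecker, its fields, and the two example options (threads, reference) are hypothetical; only the constant name OPTION_CHECKERS comes from the notes above.

```python
class OptChecker:
    """Hypothetical checker for a single option."""

    def __init__(self, name, description, is_valid=lambda value: True,
                 required=False):
        self.name = name
        self.description = description
        self.is_valid = is_valid
        self.required = required

    def check(self, options):
        """Return an error message, or None when the option is acceptable."""
        if self.name not in options:
            if self.required:
                return "missing option '%s': %s" % (self.name, self.description)
            return None
        if not self.is_valid(options[self.name]):
            return "bad value for '%s': %s" % (self.name, self.description)
        return None


OPTION_CHECKERS = [
    OptChecker("threads", "number of worker threads, a positive integer",
               is_valid=lambda v: str(v).isdigit() and int(v) > 0),
    OptChecker("reference", "path to the reference genome", required=True),
]


def check_opts(options, checkers=OPTION_CHECKERS):
    """Collect every error message before any external tool is invoked."""
    errors = [checker.check(options) for checker in checkers]
    return [error for error in errors if error is not None]


def describe_opts(checkers=OPTION_CHECKERS):
    """Build the help text for the query/info interface."""
    return "\n".join("%s: %s" % (c.name, c.description) for c in checkers)
```

Because each checker carries its own description, the same list serves both the error messages and the help output.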
Here, we assume the QC tool is FastQC.
A QC tool module should have the following parts:
- run_qc
a function/method to invoke the QC pipeline
- parse_qc_result
a function to parse the QC result
- SECTION_QC_SETTING
the name of the QC setting section in the config file
- OPTION_CHECKERS
a list of option checkers, which is used to generate the default description string
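A rough sketch of a FastQC-based module with these parts follows. The section name "FASTQC", the settings keys "outdir" and "reads", and the helper get_qc_cmd are assumptions. parse_qc_result assumes it is fed the lines of FastQC's summary.txt, which holds tab-separated status, module name, and file name.

```python
import subprocess

SECTION_QC_SETTING = "FASTQC"  # assumed section name in the config file


def get_qc_cmd(settings):
    """Build the fastqc command line from a settings dict (assumed keys)."""
    return ["fastqc", "--outdir", settings["outdir"]] + list(settings["reads"])


def run_qc(settings):
    """Invoke FastQC; raises CalledProcessError if the tool fails."""
    subprocess.check_call(get_qc_cmd(settings))


def parse_qc_result(summary_lines):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    result = {}
    for line in summary_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            result[fields[1]] = fields[0]
    return result
```

Keeping the command construction separate from run_qc makes it possible to inspect or test the command without FastQC being installed.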
Most aligners need some kind of index to run the mapping,
so every aligner wrapper should have the following features:
- index
function interface for building the index
input:
output: file path of the index, in a form that the aligner itself understands
if the index already exists, just return the path.
- align
function interface for mapping sequence reads to the reference genome
input: sequence reads in fasta/fastq
output: a dict containing several entries
- is_path_contain_index
function for checking whether the given path holds an index. This is done by checking the existence of the related files, and the list of related files should be provided by the aligner module itself.
input: file path to a possible index
output: bool value indicating whether an index is there
- SECTION_INDEX:
a string indicating that the index parameters are under configure[SECTION_INDEX]
- SECTION_ALIGN:
a string indicating that the align parameters are under configure[SECTION_ALIGN]
- interpret_index_path:
function: translate a given index path into a dict that is compatible with a config file section
input: a string containing a possible index path
output: a dict, which can be accepted by ChainMap to build the input dict of index/align
when the input is None, return {}
- interpret_seq_files
function: translate a string containing the path or paths of sequencing read files into a dict that is compatible with a config file section
input: a string containing one or more sequencing read file paths
output: a dict, which can be accepted by ChainMap to build the input dict of the align function
if the input is None, return {}
- is_map_result_already_exists
function for checking whether the alignment result already exists
input: file path, or path/prefix specifying the location
output: bool value indicating whether a mapped result already exists
- get_align_result_path
function to get the path of the mapped result file
input: the same input as the align function
output: path or path/prefix showing where the output file should be
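To make the ChainMap idea concrete, here is a small sketch of the helper functions for a BWA-like aligner. The section names, the dict keys "index_prefix" and "reads", and the helper merge_settings are assumptions. The suffix list matches the files that bwa index writes next to its prefix, but each aligner module would supply its own list.

```python
import os
from collections import ChainMap

SECTION_INDEX = "BWA_INDEX"   # assumed section names
SECTION_ALIGN = "BWA_ALIGN"
INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def is_path_contain_index(path):
    """True when every index file for this prefix exists."""
    return all(os.path.isfile(path + suffix) for suffix in INDEX_SUFFIXES)


def interpret_index_path(path):
    """Translate an index path into a config-section-like dict."""
    return {} if path is None else {"index_prefix": path}


def interpret_seq_files(files):
    """Translate a whitespace-separated string of read paths into a dict."""
    return {} if files is None else {"reads": files.split()}


def merge_settings(config_section, index_path=None, seq_files=None):
    """Command-line values take priority over the config file section."""
    return ChainMap(
        interpret_seq_files(seq_files),
        interpret_index_path(index_path),
        dict(config_section),
    )
```

Returning {} for None means an argument that was not given simply contributes nothing, so the value from the config file section shows through the ChainMap.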
def index/quant(config, ...)
get_dict(...)
check_opts
get_cmd
run_cmd
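The four-step outline above (get_dict, check_opts, get_cmd, run_cmd) could be fleshed out along these lines for an index function. This is a sketch under assumptions: the section name "BWA_INDEX", the option keys "reference" and "index_prefix", and the injectable runner argument are hypothetical. The command uses bwa index with its -p prefix option.

```python
import subprocess
from collections import ChainMap


def get_dict(config, section, overrides=None):
    """Merge command-line overrides on top of the config file section."""
    base = dict(config[section]) if section in config else {}
    return dict(ChainMap(overrides or {}, base))


def check_opts(opts, required=("reference", "index_prefix")):
    """Fail early, before the external tool is invoked."""
    missing = [name for name in required if name not in opts]
    if missing:
        raise ValueError("missing options: " + ", ".join(missing))


def get_cmd(opts):
    """Build the command line: bwa index -p <prefix> <reference>."""
    return ["bwa", "index", "-p", opts["index_prefix"], opts["reference"]]


def run_cmd(cmd, runner=subprocess.check_call):
    """Run the command; the runner is injectable so it can be replaced."""
    return runner(cmd)


def index(config, section="BWA_INDEX", overrides=None,
          runner=subprocess.check_call):
    opts = get_dict(config, section, overrides)
    check_opts(opts)
    run_cmd(get_cmd(opts), runner)
    return opts["index_prefix"]
```

A quant function would follow the same four steps with a different get_cmd.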
Similar to aligners, quantification tools also need an index, so the main framework of a quantification tool is the same as that of an aligner; only the names differ.
- index
function interface for building the index
input: ... some massive dict
output: the file path of the index
- quantify
function interface for calculating the expression
input: sequence reads in fasta/fastq
output: a dict including the path of the expression file
- is_path_contain_index:
function for checking whether the given path holds an index
input: file path to a possible index
output: bool value indicating whether an index is there
- SECTION_INDEX:
a string indicating that the index parameters are under configure[SECTION_INDEX]
- SECTION_QUANTIFY:
a string indicating that the quantify parameters are under configure[SECTION_QUANTIFY]
- interpret_index_path:
function: translate a given index path into a dict that is compatible with a config file section
input: a string containing a possible index path
output: a dict, which can be accepted by ChainMap to build the input dict of index/quant
return {} if None is given
- interpret_seq_files
function: translate a string containing the path or paths of sequencing read files into a dict that is compatible with a config file section
input: a string containing one or more sequencing read file paths
output: a dict, which can be accepted by ChainMap to build the input dict of the quant function
return {} if None is given
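Since every quantification module is expected to expose the same set of names, the workflow could verify this before using a module. The check below is a sketch; the tuple REQUIRED_QUANT_MEMBERS simply repeats the list above, and the function name missing_members is an assumption.

```python
REQUIRED_QUANT_MEMBERS = (
    "index",
    "quantify",
    "is_path_contain_index",
    "SECTION_INDEX",
    "SECTION_QUANTIFY",
    "interpret_index_path",
    "interpret_seq_files",
)


def missing_members(module, required=REQUIRED_QUANT_MEMBERS):
    """Return the required names that the given module does not define."""
    return [name for name in required if not hasattr(module, name)]
```

An empty list means the module satisfies the interface; otherwise the returned names can go straight into an error message.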
A detection tool may need some sequence reads as input, which will be the aligner's output. It seems this kind of tool does not need an index phase, or the indexing can be done manually.
- detect
main function interface
input: sequence_file, path to sequence read files in fastq/fa format
output: a dict (or other container data type), which should contain a path to the report of the circular RNA detection
- export_as_bed
function to transform the detection report into a .bed file
input: a dict containing the settings for this detection job
output: 1. a bed file containing the information of the circRNA detection
2. optional gene-mapping file
- SECTION_DETECT:
a string indicating where the settings of this detection tool are in the config file
- interpret_seq_files
function designed to work with files assigned by arguments
- which_external_aligner
function to tell whether an external mapping should be performed before the detecting phase. Return "" if no aligner is needed; otherwise return the config section name of the aligner, such as "BWA", "STAR", etc.
- bed_out
[optional] file path to the bed-type result, summarized by the module's to_bed function
The bed file of circRNAs is compatible with the default BED file format described in http://www.ensembl.org/info/website/upload/bed.html
Meaning of each column (the first 3 columns are required fields):
- chrom: name of the chromosome
- chromStart: start position of the circRNA
- chromEnd: end position of the circRNA
- name: should follow the pattern rna_name@gene_name_or_id; the string after the first @ symbol is taken as the description
- score: number of junction reads detected by the detection tool
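The five columns described above can be written and read back with a few lines. This is an illustrative sketch; the function names format_bed_line and split_bed_name are assumptions, and the fields are joined with tabs.

```python
def format_bed_line(chrom, start, end, rna_name, gene, junction_reads):
    """Build one tab-separated bed line: chrom, start, end, name, score."""
    name = "%s@%s" % (rna_name, gene)
    return "\t".join([chrom, str(start), str(end), name, str(junction_reads)])


def split_bed_name(name):
    """Split the name column at the first @ into (rna_name, description)."""
    rna_name, _, description = name.partition("@")
    return rna_name, description
```

Using str.partition keeps everything after the first @ together, which matches the rule that the remainder is taken as the description even if it contains further @ symbols.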