We propose a method for identifying de novo a set of representative genes, termed signature genes (SGs), which can be used with high precision both to measure the relative abundance of each metagenomic species (MGS) and as phylogenetic markers. An initial set of 100 genes whose abundance correlates with the median gene abundance profile of the MGS is selected. However, even in samples with high sequencing depth and high species abundance, some genes in the initial set may go undetected, leading to inconsistent estimates of MGS abundance. We use a variant of the coupon collector’s problem to evaluate the probability of detecting a given number of genes in a sample, given that they are present, and to score the performance of a gene set. This allows us to reject abundance measurements in which the number of detected genes deviates significantly from the number expected for the set. Within each sample, the expected read count per gene can be approximated by the discrete negative binomial (NB) distribution, as reads are assumed to map in proportion to gene length and to show biological variability. A rank-based negative binomial model is used to assess the performance of different gene sets across a large set of samples, facilitating identification of an optimal signature gene set for the MGS.
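The two probabilistic ingredients named above can be illustrated with a minimal sketch. It is not the code from this repository: the function names, the equal-gene-length simplification in `detected_count_distribution`, and the mean/size parameterization of the negative binomial are all assumptions made for illustration. The first two functions treat gene detection as a coupon collector's process (each read "collects" one gene), and the third evaluates an NB probability for a per-gene read count.

```python
import math


def expected_detected_genes(gene_lengths, n_reads):
    """Expected number of genes hit by at least one read.

    Assumes each read lands on gene i with probability proportional to
    its length (coupon collector with unequal probabilities).
    """
    total = float(sum(gene_lengths))
    return sum(1.0 - (1.0 - length / total) ** n_reads for length in gene_lengths)


def detected_count_distribution(n_genes, n_reads):
    """Distribution of the number of distinct genes detected.

    Simplifying assumption (hypothetical, for illustration): all genes are
    equally likely to receive a read. Returns a list `dist` where dist[k]
    is the probability that exactly k genes are detected.
    """
    dist = [0.0] * (n_genes + 1)
    dist[0] = 1.0
    for _ in range(n_reads):
        nxt = [0.0] * (n_genes + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            # The read hits one of the k genes that were already detected.
            nxt[k] += p * k / n_genes
            # The read hits one of the n_genes - k genes not yet detected.
            if k < n_genes:
                nxt[k + 1] += p * (n_genes - k) / n_genes
        dist = nxt
    return dist


def nb_pmf(k, mean, size):
    """Negative binomial probability of observing k reads.

    Parameterized by mean > 0 and size > 0, so that the variance is
    mean + mean**2 / size (overdispersed relative to Poisson).
    """
    log_p = (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


if __name__ == "__main__":
    # Hypothetical sample: 100 signature genes, 300 reads mapped to the set.
    dist = detected_count_distribution(n_genes=100, n_reads=300)
    # Lower-tail probability of detecting at most 80 of the 100 genes.
    print(sum(dist[:81]))
    print(expected_detected_genes([1000] * 100, 300))
    print(nb_pmf(3, mean=3.0, size=2.0))
```

A sample whose observed number of detected genes falls far into the lower tail of `detected_count_distribution` would be a candidate for rejection in the sense described above; the cutoff to apply is a design choice this sketch does not fix.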
trinezac/SG_optimization
About
De novo identification of species-specific genes for microbial profiling.