This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

How to interpret SATAY data in order to have meaningful information from it? #27

Open
Wteunisse opened this issue Oct 16, 2020 · 10 comments

Comments

@Wteunisse
Collaborator

Some additional comments on how to interpret the data were made in the meeting with Werner.

  • There is a sequencing bias in the number of reads. We probably cannot do anything about this, but the sequencing may add an extra layer of variance to the number of reads.
  • Also, during sequencing, a given transposon may or may not be observed. I don't think I fully understand this problem yet, but Werner suggested that we look into a 'negative binomial distribution'. This is because we only know the observed number of transposons, which might not be equal to the actual number of transposons.
@leilaicruz leilaicruz changed the title Things to keep in mind in preprocessing and interpreting SATAY data How to interpret SATAY data in order to have meaningful information from it? Oct 16, 2020
@wdaalman
Collaborator

wdaalman commented Oct 20, 2020

To help you further along: the actual cells with transposons are converted into reads such that the reads follow a binomial distribution. We have the inverse problem: if you know the reads, and if you knew the probability that a cell with a transposon turns into a read, the actual number of cells with a transposon (including the unobserved ones) would follow a negative binomial distribution.
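As a minimal sketch of the forward and inverse problem described above (not the script from this thread): the function names, the example numbers, and the flat prior on the true count are all assumptions made for illustration. Under that flat prior, the number of unobserved cells given k reads follows a negative binomial distribution with k + 1 "successes" and success probability p.

```python
import math
import random


def simulate_reads(n_cells, p_read, rng):
    """Forward model: each transposon-carrying cell independently becomes a read."""
    return sum(rng.random() < p_read for _ in range(n_cells))


def unobserved_pmf(m, k_reads, p_read):
    """P(m unobserved cells | k_reads reads), ASSUMING a flat prior on the true count.

    This is a negative binomial pmf with r = k_reads + 1 successes and m failures.
    """
    r = k_reads + 1
    return math.comb(m + r - 1, m) * p_read**r * (1 - p_read) ** m


def expected_true_count(k_reads, p_read):
    """Posterior mean of the true count: observed reads plus expected unobserved cells."""
    return k_reads + (k_reads + 1) * (1 - p_read) / p_read


# Hypothetical numbers, chosen only to exercise the functions.
rng = random.Random(0)
k = simulate_reads(n_cells=100, p_read=0.4, rng=rng)
print("observed reads:", k)
print("expected true count given reads:", expected_true_count(k, 0.4))
```

The inversion only works if p_read is known, which is exactly the difficulty raised in the rest of this comment.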

However, we cannot easily use the negative binomial distribution to invert reads to actual transposons, since Wessel mentioned today that we don't know that probability. I thought you could try something else, namely finding the best-fitting binomial distribution. Unfortunately, Matlab's mle wants to have the probability parameter fixed, so I wrote a small Matlab script that uses the generalized method of moments instead to fit simulated read data. This works reasonably well (run Reads_transposon_conversion_simulation_v2.m in the zip file).
Reads transposon conversion v2.zip
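For readers without Matlab, the idea of fitting a binomial to simulated read data can be sketched in Python. This is not a port of the attached script: it uses the plain method of moments (matching mean and variance) rather than the generalized method of moments, and it assumes every region shares the same number of cells and the same read probability. All names and numbers are hypothetical.

```python
import random


def simulate_read_counts(n_cells, p_read, n_regions, rng):
    """Draw one Binomial(n_cells, p_read) read count per region."""
    return [
        sum(rng.random() < p_read for _ in range(n_cells))
        for _ in range(n_regions)
    ]


def fit_binomial_moments(counts):
    """Estimate (n, p) of a binomial from its sample mean and variance.

    Uses mean = n * p and var = n * p * (1 - p), so p = 1 - var / mean.
    """
    mean = sum(counts) / len(counts)
    if mean <= 0:
        raise ValueError("all counts are zero; nothing to fit")
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    p_hat = 1 - var / mean
    if not 0 < p_hat <= 1:
        # Variance >= mean means the data are not binomial-like (overdispersed).
        raise ValueError("variance is not below the mean; binomial fit fails")
    return mean / p_hat, p_hat


# Hypothetical ground truth, used only to check that the fit recovers it.
rng = random.Random(1)
counts = simulate_read_counts(n_cells=50, p_read=0.3, n_regions=20000, rng=rng)
n_hat, p_hat = fit_binomial_moments(counts)
print(f"n_hat = {n_hat:.1f}, p_hat = {p_hat:.3f}")
```

The estimate of n is sensitive to small errors in the variance, especially when p is low, so a fit like this needs many regions to be stable.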

Two caveats:
1) We do not know whether regions without reads have none because they were unlucky in read-out or because they are very unfit. This gives a bias. (Updated to v2 to resolve this first caveat.)
2) In constructing a read distribution across the DNA, including Wessel's normalization, we have not corrected for fitness bias. So an idea could be to first do this only for non-coding regions, get the probability parameter estimate, and then use that estimate on the real genes to invert reads to transposons there using the negative binomial distribution.

@Wteunisse
Collaborator Author

Very interesting, I will look into it! One thought I had about the probability is that we might be able to estimate the total number of cells during the SATAY experiment. I think Benoît also mentions a number in his paper; from this we know how many transpositions have taken place. So maybe we can get a good estimate of the probability of actually reading a transposition.

@wdaalman
Collaborator

That sounds good, it would be reassuring to see if there is a reasonable match with the fitted estimate.
Should you find out that the probability is rather low, this implies that noise will be high (intuitively, if almost every transposon becomes a read, there is almost no noise). In that case, to distinguish noise from fitness effects of the transposon, you could think of increasing the duration of the growth phase to accentuate fitness effects.
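The intuition about noise can be made concrete under the binomial read model from earlier in the thread: for Binomial(n, p) reads, the relative noise (standard deviation divided by mean) is sqrt((1 - p) / (n p)), which grows as p gets small and vanishes as p approaches 1. The sketch below just evaluates that formula; the function name and the example values are hypothetical.

```python
import math


def read_count_cv(n_cells, p_read):
    """Coefficient of variation (std / mean) of Binomial(n_cells, p_read) read counts."""
    return math.sqrt((1 - p_read) / (n_cells * p_read))


# Illustrative values only: the same number of cells, different read probabilities.
for p in (0.01, 0.1, 0.5, 0.99):
    print(f"p = {p:<4}  relative noise = {read_count_cv(100, p):.3f}")
```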

@Gregory94
Collaborator

Gregory94 commented Oct 27, 2020

I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.

@leilaicruz
Member

> I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.

Were you able to download the paper? I could not ...

@Gregory94
Collaborator

> > I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.
>
> Were you able to download the paper? I could not ...

Dejesus2016_NORMALIZATION OF TRANSPOSON-MUTANT LIBRARY SEQUENCING DATASETS TO IMPROVE IDENTIFICATION OF CONDITIONALLY ESSENTIAL GENES.pdf

@leilaicruz
Member

Interesting that those papers, "NORMALIZATION OF TRANSPOSON-MUTANT LIBRARY SEQUENCING DATASETS TO IMPROVE IDENTIFICATION OF CONDITIONALLY ESSENTIAL GENES" and "Statistical analysis of genetic interactions in Tn-Seq data", are from the same author, Michael A. DeJesus, from the Department of Computer Science, Texas A&M University.

@leilaicruz
Member

@Gregory94 you should watch and take a look at the repo from the same author (Michael A. DeJesus): https://github.com/mad-lab/tools
It seems very useful ....

@Gregory94
Collaborator

> @Gregory94 you should watch and take a look at the repo from the same author (Michael A. DeJesus): https://github.com/mad-lab/tools
> It seems very useful ....

Yes, indeed. But I think many of the tools they created are optimized for their experimental setup, which is different from ours. We should consider whether we want to adopt an experimental approach similar to theirs, or adapt their tools to our approach.

@leilaicruz
Member

leilaicruz commented Nov 6, 2020

Yes, they are optimized for the type of data they get and for the way they want to analyze those datasets. However, their tools can still be useful in terms of how they implemented them, and some parts of the statistical analyses could be abstracted from their use case to ours. The repo looks very organized at first glance, and in general it is always of great benefit to have good examples of well-organized and well-structured code from which we can learn, build and collaborate.


5 participants