tfinfo and GRN files #191

Open
momur opened this issue Mar 27, 2024 · 4 comments

Comments

@momur
momur commented Mar 27, 2024

Hello dev team,

Thanks for the amazing tool. I would like to understand the TFinfo and Base GRN files a little bit better.

The TF info file looks like the screenshot below. What exactly are the factors_direct and factors_indirect columns? Are these the motifs found in the distal elements from the co-accessibility analysis?
[Screenshot: TF info dataframe]

The GRN looks like the screenshot below. The gene_short_name is the annotation of the peak_id in the TF info file, right?
[Screenshot: base GRN dataframe]

Thanks!

@KenjiKamimoto-ac
Member

Hi @momur ,

Thank you for trying celloracle.
In short, factors_direct and factors_indirect differ in their source of evidence.
factors_direct lists TFs whose link to the motif is experimentally confirmed, while factors_indirect lists TFs assigned to the motif from more indirect evidence or computational inference.
For more information, please see the explanations provided with the motif databases: http://cisbp.ccbr.utoronto.ca/faq.html
and https://gimmemotifs.readthedocs.io/en/master/index.html

The binding site is shown in the seqname or peak_id column. Some of the elements are distal, and some are proximal.

As you pointed out, the gene_short_name is an annotation of the peak_id. For example, peak "chr10_100009210_100010306" is a cis-regulatory element of the gene DNMBP.
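To make the relationship between peaks, genes, and factors concrete, here is a minimal pandas sketch. It uses toy data, not real celloracle output. It assumes that factors_direct and factors_indirect hold comma-separated TF names and that a separate peak annotation table maps peak_id to gene_short_name. The motif IDs, the second peak, `GENE_X`, and the helper names `split_factors` and `tfs_per_gene` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy stand-in for the TF info table (assumed layout, not real output).
tfinfo = pd.DataFrame({
    "seqname": [
        "chr10_100009210_100010306",
        "chr10_100009210_100010306",
        "chr5_1000_2000",  # hypothetical peak
    ],
    "motif_id": ["M_toy_1", "M_toy_2", "M_toy_3"],  # hypothetical motif IDs
    "factors_direct": ["ATF2, CREB1", "", "EBF1"],
    "factors_indirect": ["ATF1", "SP1, SP2", ""],
})

# Toy peak annotation: which gene each peak is a cis-regulatory element of.
peak_annotation = pd.DataFrame({
    "peak_id": ["chr10_100009210_100010306", "chr5_1000_2000"],
    "gene_short_name": ["DNMBP", "GENE_X"],  # GENE_X is hypothetical
})


def split_factors(cell):
    """Turn a comma-separated string of TF names into a clean list."""
    if not isinstance(cell, str):
        return []
    return [name.strip() for name in cell.split(",") if name.strip()]


def tfs_per_gene(tfinfo, peak_annotation, column="factors_direct"):
    """Collect, per annotated gene, the TFs whose motifs hit its peaks."""
    long = tfinfo[["seqname", column]].copy()
    long["tf"] = long[column].map(split_factors)
    # One row per (peak, TF); peaks with no TF in this column are dropped.
    long = long.explode("tf").dropna(subset=["tf"])
    merged = long.merge(peak_annotation, left_on="seqname", right_on="peak_id")
    grouped = merged.groupby("gene_short_name")["tf"]
    return grouped.apply(lambda s: sorted(set(s))).to_dict()


direct = tfs_per_gene(tfinfo, peak_annotation, "factors_direct")
indirect = tfs_per_gene(tfinfo, peak_annotation, "factors_indirect")
print(direct)
print(indirect)
```

Read this way, each TF listed for a gene is a candidate regulator whose motif was found in a peak annotated to that gene. The direct and indirect dictionaries differ only in how strongly the TF-to-motif assignment is supported.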

@momur
Author

momur commented Mar 29, 2024

Hi,

Thank you for your reply. It helps.

My motivation here is to know which motifs are found in the promoter and enhancer regions. We provide co-accessible peaks, and celloracle performs TF motif scanning in the co-accessible sites. TF info is created after the TF motif scanning step. I thought that direct factors are the motifs found in the co-accessible sites. Is there a way to retrieve the information that I am looking for from any celloracle outputs?

Thanks!

@KenjiKamimoto-ac
Member

@momur

You can distinguish promoter peaks and other distal regulatory element peaks as follows.

In the peak data preprocessing step, the peaks are already annotated. https://morris-lab.github.io/CellOracle.documentation/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.html#3.-Integrate-TSS-info-and-cicero-connections

[Screenshot: the integrated dataframe]

In this dataframe, integrated, promoter peaks have a co-accessibility score of 1. Peaks with a score below 1 do not contain a TSS. So you can distinguish promoters from enhancers by looking at the score. I think this is the information you are looking for.
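The score rule can be sketched in pandas as follows. This is a toy example, not real output. It assumes the integrated dataframe has columns named `peak_id`, `gene_short_name`, and `coaccess`; check the column names in your own dataframe. The second and fourth peaks, their scores, and the helper name `label_peak_type` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `integrated` dataframe (assumed column names).
integrated = pd.DataFrame({
    "peak_id": [
        "chr10_100009210_100010306",
        "chr10_100020000_100021000",  # hypothetical peak
        "chr5_1000_2000",             # hypothetical peak
        "chr5_9000_9800",             # hypothetical peak
    ],
    "gene_short_name": ["DNMBP", "DNMBP", "GENE_X", "GENE_X"],
    "coaccess": [1.0, 0.83, 1.0, 0.91],
})


def label_peak_type(df, score_col="coaccess"):
    """Label peaks as promoter (score of 1) or distal (score below 1)."""
    out = df.copy()
    out["peak_type"] = out[score_col].map(
        lambda score: "promoter" if score >= 1 else "distal"
    )
    return out


labeled = label_peak_type(integrated)
promoter_peaks = labeled.loc[labeled["peak_type"] == "promoter", "peak_id"].tolist()
distal_peaks = labeled.loc[labeled["peak_type"] == "distal", "peak_id"].tolist()
print(labeled)
```

The resulting peak lists can then be used to subset the TF info table by its seqname or peak_id column, which separates motifs found in promoter peaks from those found in distal peaks.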

@momur
Author

momur commented Apr 5, 2024

Hi @KenjiKamimoto-wustl122 ,

Thanks for the explanation. It helps, but I would like to rephrase my question to make it simpler.

Based on this file, I subset tfinfo for the gene EBF1 (the seqname, as shown in the picture). I would like to know the relationship between the seqname (in this case, the EBF1 gene) and the motifs in the factors_direct column (e.g., ATF2, CREB1). Are these motifs found in the regulatory region of EBF1? How should I interpret this?
[Screenshot: tfinfo subset for EBF1]

I hope that makes it clear now. Thanks!
