Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating several samples, vanishing genes :) #30

Closed
Dalhte opened this issue Aug 11, 2023 · 4 comments
Closed

Integrating several samples, vanishing genes :) #30

Dalhte opened this issue Aug 11, 2023 · 4 comments

Comments

@Dalhte
Copy link

Dalhte commented Aug 11, 2023

Hi there.
First, thanks for this very nice tool :)
I have a small probleme (it's probably trivial but I'm very new to bioinformatic analysis): I have 4 differents 10X multiom ATAC+RNA samples : PG2, PG6 PG24 and PG13. I integrated those using seurat/signac first then I tried to run Multivelo
I treated the different samples separately for the preprocessing steps :
For example for PG2 :

adata_atacPG2 = sc.read_10x_mtx('/media/david/F/yard/apps/cellranger-arc-2.0.2/PG2/outs/filtered_feature_bc_matrix/', var_names='gene_symbols', cache=True, gex_only=False)
adata_atacPG2 = adata_atacPG2[:,adata_atacPG2.var['feature_types'] == "Peaks"]
adata_atacPG2 = mv.aggregate_peaks_10x(adata_atacPG2,'/media/david/F/yard/apps/cellranger-arc-2.0.2/PG2/outs/atac_peak_annotation.tsv', '/media/david/F/yard/apps/cellranger-arc-2.0.2/PG2/outs/analysis/feature_linkage/feature_linkage.bedpe',verbose=True)
mv.tfidf_norm(adata_atacPG2)

I renamed the cells with unique barcodes:

barcodes = adata_atacPG2.obs.index
barcodesnew = ['PG2_' + bc[0:len(bc)-2] for bc in barcodes]
adata_atacPG2.obs.index = barcodesnew

Having done that on the four samples, I generated a single object by concatenation :

adata_atacPG = adata_atacPG2.concatenate([adata_atacPG6, adata_atacPG24, adata_atacPG13])

Then I processed the RNA and so one. Everything seems to work very nicely BUT I lose some of the genes (and some important ones that is). After investigation, I realized that these genes were lost during the concatenation step, probably because these are specifically present in some of the adata_atacPGXX objects but not in every ones.

How may I tackle this problem ?

Best
David

@Dalhte
Copy link
Author

Dalhte commented Aug 11, 2023

Hello again
I solved part of my problem modifying the concat parameters to :

adata_atacPG = adata_atacPG2.concatenate([adata_atacPG6, adata_atacPG24, adata_atacPG13], join='outer')
And retrieved 3000 more genes. But, I'm quite greedy and I would love to have some more (There are some interesting ones still missing). I realized those last few genes are absent from the atac_peak_annotation.tsv files. It's quite surprising as signac LinkPeaks function manage to find linkeage between these gene and ATAC open regions. So my question: Do you know if it might be possible to extract the
-atac_peak_annotation.tsv
and prehaps also the feature_linkage.bedpe files
from Seurat/signac analyis ?
Best
David

@Dalhte
Copy link
Author

Dalhte commented Aug 11, 2023

me again, realizing my last question was discussed in previous #29 and #1 post !

@Dalhte Dalhte closed this as completed Aug 11, 2023
@Dalhte
Copy link
Author

Dalhte commented Aug 12, 2023

Hi there
I still have some problems I cannot solve this time :
In my RNA anndata object (adatam1), I have 13873 genes
In my ATAC anndata object ( adata_atacPGc), I also have 13873 genes, I checked that they are exactly the same
I just runned the mv.recover_dynamics_chrom on these two objects :

adata_result1 = mv.recover_dynamics_chrom(adatam1, adata_atacPGc,max_iter=5, init_mode="invert",verbose=False,parallel=True,save_plot=False,rna_only=False,fit=True,n_anchors=500,extra_color_key='celltype')
I got :
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 13873/13873 [19:44:34<00:00, 5.12s/it]
With the 13873 genes
But :
adata_result1
AnnData object with n_obs × n_vars = 28298 × 8630
I'm loosing more than 5000 genes, some of them with RNA velocity in the RNA objects, and some of them really relevant for what I'm analyzing.
How can I recover these genes ?
Thanks for your help
Best
David

@Dalhte Dalhte reopened this Aug 12, 2023
@danielee0707
Copy link
Collaborator

Hi David,
We recommend first using scVelo's preprocessing steps to prepare the input data. The "filter_and_normalize" function is able to select genes with minimum counts in the unspliced and spliced matrices. Multivelo also does some filtering steps internally to discard low-quality genes and make sure the velocity inference algorithm will run and converge successfully. These steps are crucial to ensure that low-quality genes won't obstruct downstream analyses such as lineage prediction. Unfortunately, there's no way to turn these steps off currently.

@Dalhte Dalhte closed this as completed Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants