-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missing CNV allele calls are reported as CNV present #660
Comments
Thanks @EricRLucas . @cclarkson @alimanfoo @ahernank I suspect this might be a data issue, rather than an API issue, e.g. # Imports
import pandas as pd
import numpy as np
import zarr # Settings
bucket = 'vo_agam_release_master_us_central1'
release = 'v3.1'
cohorts_analysis = 'cohorts_20240924'
sample_set = '1177-VO-ML-LEHMANN-VMF00004'
drc_analysis = 'discordant_read_calls_20230911'
contig = '2L' # Get the taxa for the sample set.
cohorts_analysis_path = f'gs://{bucket}/{release}/metadata/{cohorts_analysis}/{sample_set}/samples.cohorts.csv'
cohorts_df = pd.read_csv(cohorts_analysis_path) # See the taxa for the sample set.
cohorts_df['taxon'].value_counts()
# Get the unassigned sample ids.
unassigned_samples_df = cohorts_df[cohorts_df['taxon'] == 'unassigned']
unassigned_sample_ids = unassigned_samples_df['sample_id'].tolist()
unassigned_sample_ids
# Get the DRCs for the sample set.
drc_analysis_path = f'gs://{bucket}/{release}/cnv/{sample_set}/{drc_analysis}/zarr'
drc_zarr = zarr.open(drc_analysis_path, mode="r") # Get the genotype (GT) data for the contig, for the sample set.
drc_gt_data = drc_zarr[contig]['calldata/GT'][:]
drc_gt_data[:3]
# Get the corresponding sample ids.
drc_sample_ids = drc_zarr[contig]['samples'][:]
drc_sample_ids[:3]
# Get the indices of the DRC samples that match the unassigned sample ids.
drc_unassigned_sample_indices = np.where(np.isin(drc_sample_ids, unassigned_sample_ids))[0]
drc_unassigned_sample_indices[:3]
# Get the genotype (GT) values for the unassigned samples.
drc_unassigned_gt_data = drc_gt_data[:, drc_unassigned_sample_indices]
drc_unassigned_gt_data[:3]
|
In contrast, some other sample sets that contain samples that did not have CNV called appear to have the expected # Settings
bucket = 'vo_agam_release_master_us_central1'
release = 'v3.9'
cohorts_analysis = 'cohorts_20240924'
sample_set = '1270-VO-MULTI-PAMGEN-VMF00232'
drc_analysis = 'discordant_read_calls_20230911'
contig = '2L' # See the taxa for the sample set.
cohorts_df['taxon'].value_counts()
# Get the non-gamb/colu/arabiensis sample ids.
non_cnv_samples_df = cohorts_df[
(cohorts_df['taxon'] != 'gambiae') & (cohorts_df['taxon'] != 'coluzzii') & (cohorts_df['taxon'] != 'arabiensis')
]
non_cnv_sample_ids = non_cnv_samples_df['sample_id'].tolist()
non_cnv_sample_ids
# Get the genotype (GT) values for the non-CNV samples.
drc_non_cnv_gt_data = drc_gt_data[:, drc_non_cnv_sample_indices]
drc_non_cnv_gt_data[:]
|
Just to clarify, the data_vars["call_genotype"] = (
[DIM_VARIANT, DIM_SAMPLE],
da_from_zarr(
root[f"{contig}/calldata/GT"], inline_array=inline_array, chunks=chunks
),
)
[...]
ds = xr.Dataset(data_vars=data_vars, coords=coords, attrs=attrs) BTW, the meaning of
I haven't yet found any source of potential change for those data, on their way from the Zarr to the Xarray, although I had suspected something might be going awry somewhere, e.g. within the core utils |
It looks like we should be looking at the AIM-based species calls, rather than the cohorts analysis based "taxon" calls, in order to determine which samples shouldn't have non-zero values. I'll check to see whether the CNV DRC GT data is properly aligned in that respect, and then I'll report back. |
It looks like we still see CNV DRC GT non-zeros for samples with non-gambiae/coluzzii/arabiensis species calls based on AIMs. |
The function cnv_discordant_read_calls returns 0 or 1 for absence / presence of CNV alleles. But in samples where CNV alleles were not called (eg: taxa that are not gambcolu or arabiensis), it returns 1, giving the impression that the allele is present.
The text was updated successfully, but these errors were encountered: