Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing CNV allele calls are reported as CNV present #660

Open
EricRLucas opened this issue Nov 21, 2024 · 5 comments
Open

Missing CNV allele calls are reported as CNV present #660

EricRLucas opened this issue Nov 21, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@EricRLucas
Copy link

EricRLucas commented Nov 21, 2024

The function cnv_discordant_read_calls returns 0 or 1 for absence / presence of CNV alleles. But in samples where CNV alleles were not called (eg: taxa that are not gambcolu or arabiensis), it returns 1, giving the impression that the allele is present.

@leehart leehart added the bug Something isn't working label Nov 26, 2024
@leehart leehart self-assigned this Nov 26, 2024
@leehart
Copy link
Collaborator

leehart commented Dec 5, 2024

Thanks @EricRLucas .

@cclarkson @alimanfoo @ahernank I suspect this might be a data issue, rather than an API issue, e.g.

# Imports
import pandas as pd
import numpy as np
import zarr
# Settings
bucket = 'vo_agam_release_master_us_central1'
release = 'v3.1'
cohorts_analysis = 'cohorts_20240924'
sample_set = '1177-VO-ML-LEHMANN-VMF00004'
drc_analysis = 'discordant_read_calls_20230911'
contig = '2L'
# Get the taxa for the sample set.
cohorts_analysis_path = f'gs://{bucket}/{release}/metadata/{cohorts_analysis}/{sample_set}/samples.cohorts.csv'
cohorts_df = pd.read_csv(cohorts_analysis_path)
# See the taxa for the sample set.
cohorts_df['taxon'].value_counts()
taxon
coluzzii      573
gambiae        44
arabiensis     29
unassigned      1
Name: count, dtype: int64
# Get the unassigned sample ids.
unassigned_samples_df = cohorts_df[cohorts_df['taxon'] == 'unassigned']
unassigned_sample_ids = unassigned_samples_df['sample_id'].tolist()
unassigned_sample_ids
['VBS01513-4651STDY7017567']
# Get the DRCs for the sample set.
drc_analysis_path = f'gs://{bucket}/{release}/cnv/{sample_set}/{drc_analysis}/zarr'
drc_zarr = zarr.open(drc_analysis_path, mode="r")
# Get the genotype (GT) data for the contig, for the sample set.
drc_gt_data = drc_zarr[contig]['calldata/GT'][:]
drc_gt_data[:3]
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)
# Get the corresponding sample ids.
drc_sample_ids = drc_zarr[contig]['samples'][:]
drc_sample_ids[:3]
array(['VBS00256-4651STDY7017184', 'VBS00257-4651STDY7017185',
       'VBS00259-4651STDY7017186'], dtype=object)
# Get the indices of the DRC samples that match the unassigned sample ids.
drc_unassigned_sample_indices = np.where(np.isin(drc_sample_ids, unassigned_sample_ids))[0]
drc_unassigned_sample_indices[:3]
array([348])
# Get the genotype (GT) values for the unassigned samples.
drc_unassigned_gt_data = drc_gt_data[:, drc_unassigned_sample_indices]
drc_unassigned_gt_data[:3]
array([[1],
       [1],
       [1]], dtype=int8)

@leehart
Copy link
Collaborator

leehart commented Dec 5, 2024

In contrast, some other sample sets that contain samples that did not have CNV called appear to have the expected 0 GT values, e.g.

# Settings
bucket = 'vo_agam_release_master_us_central1'
release = 'v3.9'
cohorts_analysis = 'cohorts_20240924'
sample_set = '1270-VO-MULTI-PAMGEN-VMF00232'
drc_analysis = 'discordant_read_calls_20230911'
contig = '2L'
# See the taxa for the sample set.
cohorts_df['taxon'].value_counts()
taxon
arabiensis    166
coluzzii       32
gambiae        10
unassigned      3
bissau          1
Name: count, dtype: int64
# Get the non-gamb/colu/arabiensis sample ids.
non_cnv_samples_df = cohorts_df[
  (cohorts_df['taxon'] != 'gambiae') &  (cohorts_df['taxon'] != 'coluzzii') &  (cohorts_df['taxon'] != 'arabiensis')
]
non_cnv_sample_ids = non_cnv_samples_df['sample_id'].tolist()
non_cnv_sample_ids
['VBS68624-6411STDY13142645',
 'VBS68663-6411STDY13142684',
 'VBS68696-6411STDY13142719',
 'VBS68701-6411STDY13142724']
# Get the genotype (GT) values for the non-CNV samples.
drc_non_cnv_gt_data = drc_gt_data[:, drc_non_cnv_sample_indices]
drc_non_cnv_gt_data[:]
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

@leehart
Copy link
Collaborator

leehart commented Dec 5, 2024

Just to clarify, the cnv_discordant_read_calls function currently uses the _cnv_discordant_read_calls_dataset function to simply extract the DRC data from the Zarr on GCS, which is then returned in an Xarray Dataset, i.e.

        data_vars["call_genotype"] = (
            [DIM_VARIANT, DIM_SAMPLE],
            da_from_zarr(
                root[f"{contig}/calldata/GT"], inline_array=inline_array, chunks=chunks
            ),
        )
[...]
        ds = xr.Dataset(data_vars=data_vars, coords=coords, attrs=attrs)

BTW, the meaning of call_genotype is currently documented as:

call_genotype, it has (variants, samples) values and contains the coverage call for each sample and each CNV region,

I haven't yet found any source of potential change for those data, on their way from the Zarr to the Xarray, although I had suspected something might be going awry somewhere, e.g. within the core utils simple_xarray_concat or da_from_zarr.

@leehart
Copy link
Collaborator

leehart commented Dec 10, 2024

It looks like we should be looking at the AIM-based species calls, rather than the cohorts analysis based "taxon" calls, in order to determine which samples shouldn't have non-zero values. I'll check to see whether the CNV DRC GT data is properly aligned in that respect, and then I'll report back.

@leehart
Copy link
Collaborator

leehart commented Dec 13, 2024

It looks like we still see CNV DRC GT non-zeros for samples with non-gambiae/coluzzii/arabiensis species calls based on AIMs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants