Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imputations_accuracy() lists all markers as not in common and drops all from analysis #9

Open
pbpearman opened this issue Dec 20, 2018 · 2 comments
Labels

Comments

@pbpearman
Copy link

pbpearman commented Dec 20, 2018

Hi Thierry,
I have filtered a .vcf dataset so to create a version with low missingness of sites and samples. I then run the following lines to impute genotypes and output .rad and .tped files, with and without rf imputation:

test2 <- genomic_converter(data = "UMBELLA_Erumb1_samples_gt_90pct_snps_gt_40pct_covered.recode.vcf",output="plink",filename = "test_2",strata="strata_eu_40.tsv", imputation.method = "rf")

The run finishes with the following results section:

############################### RESULTS ###############################
Data format of input: vcf.file
Biallelic data
Number of common markers: 156
Number of chromosome/contig/scaffold: 113
Number of individuals 551
Number of populations 56

I then check the imputation accuracy, but all 156 markers are dropped, evidently for not being shared:

> imp_acc <- imputations_accuracy("test_2.rad","test_2_imputed.rad")

Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 156 markers not in common between datasets
$Information dropped from the analysis
$Information dropped from the analysis$markers.dropped
[1] "Erumb1_s01095803__85__621" "Erumb1_s00047668__13__4936"
[3] "Erumb1_s00648045__68__686" "Erumb1_s01308175__88__428"
[5] "Erumb1_s02281967__111__51" "Erumb1_s02289709__112__65" .....

Misclassification Error: by populations
A tibble: 0 x 2
... with 2 variables: POP_ID , ME

Misclassification Error: by individuals
A tibble: 0 x 2
... with 2 variables: INDIVIDUALS , ME

Misclassification Error: by markers
A tibble: 0 x 2
... with 2 variables: MARKERS , ME

Misclassification Error: overall
[1] NaN

I think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know.
Best,
Peter

@pbpearman
Copy link
Author

Hi again Thierry,
I ran the same procedure on the stickleback data from the tutorial. I imputed the missing genotypes again in genomic converter, obtaining the saved imputed and unimputed datasets. I then used them both with imputation_accuracy().

stickle <- genomic_converter("example_vcf2dadi_ferchaud_2015.vcf",
filename = "stickle1", 
strata="strata.stickleback.tsv", 
imputation.method = "rf")

I got:

#######################################################################
###################### grur: imputations_accuracy #########################
#######################################################################
Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 31790 markers not in common between datasets
Information dropped from the analysis
Information dropped from the analysis$markers.dropped
[1] "groupI__28362__23825290" "groupI__28366__24653314" "groupI__28366__24653352"
[4] "groupI__28366__24653353" "groupI__28382__25273727" "groupI__28450__28141141"
[7] "groupI__28450__28141145" "groupI__34525__6537935" "groupI__34525__6537939"
[10] "groupI__34525__6537953" "groupI__34525__6537980" "groupI__5674__12268369"
[13] "groupI__5674__12268370" "groupI__6034__17836877" "groupII__1632__17623945" ...

The original number of markers was 31802. I checked the values in the resulting object, 'stickle', and all classification errors for individuals, populations etc, were all listed as zero. I also tried using the imputed and unimputed datasets that are in the object produced by genomic_converter(). Again, about 99% of the markers were dropped. I examined the marker names of the two datasets. They are identical, as far as I can see.

Is this right? Why are almost all the markers showing as not in common among the two datasets? This is not the behavior I expected.
Best, Peter

@thierrygosselin
Copy link
Owner

Hi Peter, definitely a bug. This function was not tested with lots of datasets and it's in the treatment of markers names before and after imputations the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants