TESS3 fails with large datasets #35

renhamm · 2024-02-28T23:13:33Z

When running tess3r through the ALGATR package, I am hitting an error. I am able to run everything very successfully on a single chromosome, but it fails when I try to run identical code on input files from a multi-chromosome vcf. After speaking with developers, this seems to be larger error innate to the internally called tess3 command. I am hitting the following error:

Error in if (rmse[r] < rmse.max) { :
missing value where TRUE/FALSE needed

I checked, and there are no NAs in any of my input data. Do you know what is causing this?

francoio · 2024-03-03T10:47:35Z

The format used by tess3r to encode genotypes (R class "matrix") does not distinguish between data for a single chromosome or multiple chromosomes. It's possible that the message may stem from an error in the data rather than an error in the program.

I would recommend using the tess3r package directly, after encoding the genotypes as matrix.

One way to obtain the matrix is to convert the VCF format to the PED format - which is close to "matrix" - using vcftools, read it into R (data.table::fread), and then remove the first columns of information.

Getting the matrix with vcfR is also possible but it may be more tricky. Something like this should work (perhaps slowly)

obj.vcf = vcfR::read.vcfR("my_genome.vcf")
library(adegenet)
x = vcfR2genlight(obj.vcf)
df = data.frame(lapply(x@gen, as.integer))
genotype_matrix = t(as.matrix(df))
row.names(genotype_matrix) = NULL

renhamm · 2024-03-09T01:52:06Z

Upon completing the above code, I am still hitting the same rmse error. Could it be that my data is too large? I have ~3 million SNPs.

francoio · 2024-03-09T14:56:42Z

There is not enough information to understand the origin of the error. However, for population structure analyses, it is recommended to reduce the SNP dataset using an LD pruning/clumping algorithm. The number of SNPs to retain depends on the extent of LD in the genomes. It can be assumed that a few hundred (250-500K) SNPs should be more than sufficient for detailed results.

francoio · 2024-03-09T15:11:22Z

hum, Your email editor did not read the message correctly. See the code here [ #35 | #35 ] "@" is similar to "$" for lists, and it means "attribute" for R objects of class S4, which are used by " adegenet" De: "jdeboer-lizard-wizard" ***@***.***> À: "bcm-uga" ***@***.***> Cc: "Olivier Francois" ***@***.***>, "Comment" ***@***.***> Envoyé: Samedi 9 Mars 2024 01:25:17 Objet: Re: [bcm-uga/TESS3_encho_sen] TESS3 fails with large datasets (Issue #35) When I run this I am getting the following error: Error in ***@***.*** : no applicable method for `@` applied to an object of class "matrix" Calls: vcf_to_dosage -> <Anonymous> -> is.biallelic Execution halted I can't seem to find any information about what "@" does in R, so I'm not sure how to debug. It's also confusing that the error says the problem occurs during ***@***.***" when the code says ***@***.***" Anything helps! On Sun, Mar 3, 2024 at 2:47 AM Olivier Francois ***@***.***> wrote:

The format used by tess3r to encode genotypes (R class "matrix") does not distinguish between data for a single chromosome or multiple chromosomes. It's possible that the message may stem from an error in the data rather than an error in the program. I would recommend using the tess3r package directly, after encoding the genotypes as matrix. One way to obtain the matrix is to convert the VCF format to the PED format - which is close to "matrix" - using vcftools, read it into R (data.table::fread), and then remove the first columns of information. Getting the matrix with vcfR is also possible but it may be more tricky. Something like this should work (perhaps slowly) obj.vcf = vcfR::read.vcfR("my_genome.vcf") library(adegenet) x = vcfR2genlight(obj.vcf) df = ***@***.***, as.integer)) genotype_matrix = t(as.matrix(df)) row.names(genotype_matrix) = NULL — Reply to this email directly, view it on GitHub <#35 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AO2GFKUERNX6JEZFINWFZZLYWL5VHAVCNFSM6AAAAABD65GEKWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZVGEYTSMJYGA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

— Reply to this email directly, [ #35 (comment) | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/AEWYAM6KX7DXWFQGG2ZIHDDYXJJG3AVCNFSM6AAAAABD65GEKWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBWGU4TMMBZG4 | unsubscribe ] . You are receiving this because you commented. Message ID: ***@***.***>

leonardslog · 2024-08-18T16:37:02Z

Hello,

I have encountered the same error with two genotype matrices converted from datasets processed by ipyrad and figured I'd contribute here in case this additional info helps diagnose the problem. I am using the tess3r directly (installed per github instructions), and have generated genotype matrices for using two approaches:

# process sample coordinates
sample_coords <- read.csv("coords.csv")
coords <- as.matrix(sample_coords)
row.names(coords) <- NULL
colnames(coords) <- NULL


# genotype matrix approach 1:
vcf <- read.vcfR("data.vcf") # requires vcfR
genind <- vcfR2genind(x = vcf)  # requires adagenet
mat <- as.matrix(genind@tab)
row.names(mat) <- NULL
colnames(mat) <- NULL

# genotype matrix approach 2 with .geno file
ugeno <- LEA::read.geno("data.u.geno") # requires LEA
mat.ugeno <- as.matrix(ugeno)
mat.ugeno[mat.ugeno == 9] = NA

I've then attempted to run tess on both matrices (just including the .geno code):

tess3.obj <- tess3(X = mat.ugeno, coord = coords, K = 1:8, 
                   ploidy = 2,method = "projected.ls", 
                   openMP.core.num = 4)

And get the following:

== Computing spectral decomposition of graph laplacian matrix: done
==Main loop with 4 threads: done
Error in if (rmse[r] < rmse.max) { : 
  missing value where TRUE/FALSE needed

I've double checked my matrix dimensions, the presence of NAs, and verified that all the matrix row lengths are the same regardless of differences in missing data. I've also removed samples that were mostly (>90%) missing data. The datasets are not particularly large, for example:

> genind
/// GENIND OBJECT /////////

 // 18 individuals; 96,610 loci; 194,368 alleles; size: 66.9 Mb

 // Basic content
   @tab:  18 x 194368 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-4)
   @loc.fac: locus factor for the 194368 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: adegenet::df2genind(X = t(x), sep = sep)

 // Optional content
   - empty -
>

Rerunning the above with the tess3Main function that's nested within tess3 (with a single K value) works fine but I notice the rmse and crossentropy slots in the resulting object are both NaN, leading to the error.

tess3Main.obj <- tess3Main(X = mat, coord = coords, K = 2, 
                   ploidy = 2, method = "projected.ls", 
                   openMP.core.num = 4)

I've replicated this result with two different datasets of similar size (~20 samples, ~100k variants) and am a little stumped. Perhaps the sample size is too small (I've sucessfully used tess3r in the past with ddradseq data of similar missingness, but with a much higher N of 195 and a lower snp count). I know both resulting matrices are not square matrices according to matrixcalc::is.positive.definite, but neither is the Arabidopsis dataset in tess3r's tutorial, so I'm doubting that's the issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TESS3 fails with large datasets #35

TESS3 fails with large datasets #35

renhamm commented Feb 28, 2024

francoio commented Mar 3, 2024

renhamm commented Mar 9, 2024

francoio commented Mar 9, 2024

francoio commented Mar 9, 2024 via email

leonardslog commented Aug 18, 2024

TESS3 fails with large datasets #35

TESS3 fails with large datasets #35

Comments

renhamm commented Feb 28, 2024

francoio commented Mar 3, 2024

renhamm commented Mar 9, 2024

francoio commented Mar 9, 2024

francoio commented Mar 9, 2024 via email

leonardslog commented Aug 18, 2024