Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TESS3 fails with large datasets #35

Open
renhamm opened this issue Feb 28, 2024 · 5 comments
Open

TESS3 fails with large datasets #35

renhamm opened this issue Feb 28, 2024 · 5 comments

Comments

@renhamm
Copy link

renhamm commented Feb 28, 2024

When running tess3r through the ALGATR package, I am hitting an error. I am able to run everything very successfully on a single chromosome, but it fails when I try to run identical code on input files from a multi-chromosome vcf. After speaking with developers, this seems to be larger error innate to the internally called tess3 command. I am hitting the following error:

Error in if (rmse[r] < rmse.max) { :
missing value where TRUE/FALSE needed

I checked, and there are no NAs in any of my input data. Do you know what is causing this?

@francoio
Copy link
Collaborator

francoio commented Mar 3, 2024

The format used by tess3r to encode genotypes (R class "matrix") does not distinguish between data for a single chromosome or multiple chromosomes. It's possible that the message may stem from an error in the data rather than an error in the program.

I would recommend using the tess3r package directly, after encoding the genotypes as matrix.

One way to obtain the matrix is to convert the VCF format to the PED format - which is close to "matrix" - using vcftools, read it into R (data.table::fread), and then remove the first columns of information.

Getting the matrix with vcfR is also possible but it may be more tricky. Something like this should work (perhaps slowly)

obj.vcf = vcfR::read.vcfR("my_genome.vcf")
library(adegenet)
x = vcfR2genlight(obj.vcf)
df = data.frame(lapply(x@gen, as.integer))
genotype_matrix = t(as.matrix(df))
row.names(genotype_matrix) = NULL

@renhamm
Copy link
Author

renhamm commented Mar 9, 2024

Upon completing the above code, I am still hitting the same rmse error. Could it be that my data is too large? I have ~3 million SNPs.

@francoio
Copy link
Collaborator

francoio commented Mar 9, 2024

There is not enough information to understand the origin of the error. However, for population structure analyses, it is recommended to reduce the SNP dataset using an LD pruning/clumping algorithm. The number of SNPs to retain depends on the extent of LD in the genomes. It can be assumed that a few hundred (250-500K) SNPs should be more than sufficient for detailed results.

@francoio
Copy link
Collaborator

francoio commented Mar 9, 2024 via email

@leonardslog
Copy link

Hello,

I have encountered the same error with two genotype matrices converted from datasets processed by ipyrad and figured I'd contribute here in case this additional info helps diagnose the problem. I am using the tess3r directly (installed per github instructions), and have generated genotype matrices for using two approaches:

# process sample coordinates
sample_coords <- read.csv("coords.csv")
coords <- as.matrix(sample_coords)
row.names(coords) <- NULL
colnames(coords) <- NULL

# genotype matrix approach 1:
vcf <- read.vcfR("data.vcf") # requires vcfR
genind <- vcfR2genind(x = vcf)  # requires adagenet
mat <- as.matrix(genind@tab)
row.names(mat) <- NULL
colnames(mat) <- NULL
# genotype matrix approach 2 with .geno file
ugeno <- LEA::read.geno("data.u.geno") # requires LEA
mat.ugeno <- as.matrix(ugeno)
mat.ugeno[mat.ugeno == 9] = NA

I've then attempted to run tess on both matrices (just including the .geno code):

tess3.obj <- tess3(X = mat.ugeno, coord = coords, K = 1:8, 
                   ploidy = 2,method = "projected.ls", 
                   openMP.core.num = 4) 

And get the following:

== Computing spectral decomposition of graph laplacian matrix: done
==Main loop with 4 threads: done
Error in if (rmse[r] < rmse.max) { : 
  missing value where TRUE/FALSE needed

I've double checked my matrix dimensions, the presence of NAs, and verified that all the matrix row lengths are the same regardless of differences in missing data. I've also removed samples that were mostly (>90%) missing data. The datasets are not particularly large, for example:

> genind
/// GENIND OBJECT /////////

 // 18 individuals; 96,610 loci; 194,368 alleles; size: 66.9 Mb

 // Basic content
   @tab:  18 x 194368 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-4)
   @loc.fac: locus factor for the 194368 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: adegenet::df2genind(X = t(x), sep = sep)

 // Optional content
   - empty -
> 

Rerunning the above with the tess3Main function that's nested within tess3 (with a single K value) works fine but I notice the rmse and crossentropy slots in the resulting object are both NaN, leading to the error.

tess3Main.obj <- tess3Main(X = mat, coord = coords, K = 2, 
                   ploidy = 2, method = "projected.ls", 
                   openMP.core.num = 4) 

I've replicated this result with two different datasets of similar size (~20 samples, ~100k variants) and am a little stumped. Perhaps the sample size is too small (I've sucessfully used tess3r in the past with ddradseq data of similar missingness, but with a much higher N of 195 and a lower snp count). I know both resulting matrices are not square matrices according to matrixcalc::is.positive.definite, but neither is the Arabidopsis dataset in tess3r's tutorial, so I'm doubting that's the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants