Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update HUNePi models #539

Open
ArtPoon opened this issue Sep 24, 2024 · 55 comments
Open

Update HUNePi models #539

ArtPoon opened this issue Sep 24, 2024 · 55 comments

Comments

@ArtPoon
Copy link
Contributor

ArtPoon commented Sep 24, 2024

  • We are using two models, one to predict whether infections are increasing or decreasing, and a second to estimate the number of infections given increasing infections.
  • I recently found that two terms in the regression model were not implemented correctly, so we need to upload new models to the server.
  • Fortunately this does not seem to change predictions substantially - however the models become slightly more accurate.
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Sep 24, 2024

I increased the number of models to include all combinations of four summary statistics but the structure of the RDS object seems to be the same:

# Current version
> mods <- readRDS("infections_increasing_model_comparisons.rds")
> length(mods)
[1] 8
> sapply(mods, function(x) class(x))
     H     U     Ne    pi    HUNe  HU    HUNePi HUPi 
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm"  "glm"
[2,] "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"   "lm" 
# Revised version
> length(increasing_mods)
[1] 15
> sapply(increasing_mods, function(x) class(x))
     H     U     Ne    pi    HU    HNe   HPi   UNe   UPi   NePi  HUNe  HUPi  HNePi
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm"
[2,] "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm" 
     UNePi HUNePi
[1,] "glm" "glm" 
[2,] "lm"  "lm"  

ArtPoon added a commit that referenced this issue Sep 24, 2024
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 15, 2024

@GopiGugan reports that $N_e$ estimates are NaN for all lineages when running LambdaSkyline on the trees. What's changed between then and now? Is it because the trees are too big? This doesn't seem very likely because at least some of the trees won't be very large..

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 15, 2024

  • need to get a modest number of trees from @GopiGugan to try to reproduce the issue
  • get version number of R and LambdaSkyline on Paphlagon and BEVi

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 21, 2024

Got trees and label CSV files from @GopiGugan, attempted to run but got a KeyError exception:

(venv) art@aziraphale:~/git/HUNePI$ python scripts/simulation/get_summary_stats.py data/BA.5.2.6.nwk data/labels.BA.5.2.6.csv 
Traceback (most recent call last):
  File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 194, in <module>
    label_dict = parse_labels(args.labels)
  File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 74, in parse_labels
    if row['index'] not in results:
KeyError: 'index'
(venv) art@aziraphale:~/git/HUNePI$ head data/labels.BA.5.2.6.csv 
0,['hCoV-19/Ireland/D-BH-070014231/2023|Europe / Ireland / Dublin|EPI_ISL_18377214|2023-10-06']
1,['hCoV-19/Denmark/DCGC-660914/2023|Europe / Denmark|EPI_ISL_18360507|2023-09-18']
2,['hCoV-19/Canada/SK-RRPL-641967/2023|North America / Canada / Saskatchewan|EPI_ISL_18235362|2023-08-21']
3,['hCoV-19/Ireland/D-CS0380/2023|Europe / Ireland / Dublin|EPI_ISL_18146885|2023-08-04']
4,['hCoV-19/USA/NY-Wadsworth-23046175-01/2023|North America / USA / New York / Albany County|EPI_ISL_18134984|2023-08-03']
5,['hCoV-19/Canada/SK-RRPL-639662/2023|North America / Canada / Saskatchewan|EPI_ISL_18111086|2023-07-28']
6,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48166/2023|Europe / Italy / Campania|EPI_ISL_18281259|2023-07-20']
7,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48202/2023|Europe / Italy / Campania|EPI_ISL_18281287|2023-07-17']
8,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48203/2023|Europe / Italy / Campania|EPI_ISL_18281288|2023-07-07']
9,['hCoV-19/Ireland/D-SJH-CS0365/2023|Europe / Ireland / Dublin|EPI_ISL_18037744|2023-07-04']

Looks like CSV files are not in the correct format, doing a quick patch now:

Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> handle = open("data/labels.BA.5.2.6.csv")
>>> outfile = open("data/fixed.BA.5.2.6.csv", 'w')
>>> outfile.write("name,index\n")
11
>>> for line in handle:
... 
KeyboardInterrupt
>>> import csv
>>> rows = csv.reader(handle)
>>> for row in rows:
...     idx, names = row
...     names = eval(names)
...     for nm in names:
...         _ = outfile.write(f"{nm},{idx}\n")
... 
>>> outfile.close()
>>> 

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 21, 2024

Working through one of the test files, this part of skyline_est.R is extremely slow:

> for (tip_place_in_table in 1:nrow(add_tip_count)){
+   tip_name = add_tip_count[tip_place_in_table,1]
+   freq = add_tip_count[tip_place_in_table,2]
+   if(freq != 0){
+     for (counter in 1:freq){
+       tree <- phytools::bind.tip(tree, paste0(tip_name,"_", counter), edge.length = 0, 
+                            where=which(tree$tip.label == tip_name))
+     }
+   }
+ }

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 21, 2024

I wrote a Python script that is much faster than the above R code at adding tips to the tree based on the labels CSV file.

I tried running the resulting tree through est_skyline.R, but I obtained the following error:

> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- read.tree("test.nwk")
> phy

Phylogenetic tree with 7573 tips and 5912 internal nodes.

Tip labels:
  4273, 3941, 4341_0, 4341_1, 4949, 4343, ...

Rooted; includes branch lengths.

> alpha = betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced

> alpha
            p1    value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.86258 29798.75     26     NA    NA        0 TRUE   NA 154.697
> skyline <- (skyline.multi.phylo(phy, alpha$pi))
Error in lbeta(2 - alpha, alpha) : 
  non-numeric argument to mathematical function

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 24, 2024

Tried this with another tree of similar size and got a different error message:

> require(LambdaSkyline)
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk")
> phy

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 24, 2024

I wrote a Python script to randomly downsample the tree to 100 tips, and I still got warnings from LambdaSkyline:

> phy <- read.tree("~/git/HUNePI/data/EG.5.1.n100.nwk")
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 24, 2024

Even a random coalescent tree fails:

> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 24, 2024

Talked to @pmagbor. Using renv to restore on my machine the same package environment in which she had reproduced Erin's work.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

Now when I launch R from my ~/git/HUNePI directory, I get this:

renv 1.0.11 was loaded from project library, but this project is configured to use renv 1.0.7.
- Use `renv::record("[email protected]")` to record renv 1.0.11 in the lockfile.
- Use `renv::restore(packages = "renv")` to install renv 1.0.7 into the project library.Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.11]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

> renv::restore(packages='renv')
The following package(s) will be updated:

# CRAN -----------------------------------------------------------------------
- renv   [1.0.11 -> 1.0.7]

Do you want to proceed? [Y/n]: y

# Downloading packages -------------------------------------------------------
- Downloading renv from CRAN ...                	ERROR [error code 22]
- Downloading renv from CRAN ...                	ERROR [error code 22]
- Downloading renv from CRAN ...                	ERROR [error code 22]
- Downloading renv from CRAN ...                	ERROR [error code 22]
- Downloading renv from CRAN ...                	ERROR [error code 22]
- Downloading renv from CRAN ...                OK [file is up to date]
Successfully downloaded 1 package in 4.1 seconds.

# Installing packages --------------------------------------------------------
- Installing renv ...                           OK [built from source and cached in 8.2s]

The following loaded package(s) have been updated:
- renv
Restart your R session to use the new versions.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
...
The following package(s) are out of sync [lockfile != library]:

# CRAN -----------------------------------------------------------------------
- BiocManager   [1.30.23 != 1.30.25]
- boot          [1.3-28 != 1.3-28.1]
- class         [7.3-20 != 7.3-22]
- codetools     [0.2-18 != 0.2-19]
- foreign       [0.8-83 != 0.8-85]
- KernSmooth    [2.23-20 != 2.23-22]
- lattice       [0.20-45 != 0.21-9]
- MASS          [7.3-59 != 7.3-60]
- Matrix        [1.5-1 != 1.6-1.1]
- mgcv          [1.8-41 != 1.9-0]
- nlme          [3.1-160 != 3.1-163]
- nnet          [7.3-18 != 7.3-19]
- rpart         [4.1.19 != 4.1.21]
- spatial       [7.3-15 != 7.3-17]
- survival      [3.4-0 != 3.5-7]

The lockfile was generated with R 4.2.2, but you're using R 4.3.2.
 

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

I'll do a local install of 4.2.2 binaries and see if I can restore the environment completely.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

I compiled R-4.2.2 from source into a local directory and ran it:

# Bootstrapping renv 1.0.7 ---------------------------------------------------
- Downloading renv ... OK
- Installing renv  ... OK

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
A large number of files (9799 in total) have been discovered.
It may take renv a long time to crawl these files for dependencies.
Consider using .renvignore to ignore irrelevant files.
See `?renv::dependencies` for more information.
Set `options(renv.config.dependencies.limit = Inf)` to disable this warning.

The following package(s) are in an inconsistent state:

 package              installed recorded used
 abind                n         y        ?   
 ade4                 n         y        ? 

...

The following package(s) are out of sync [lockfile != library]:

# CRAN -----------------------------------------------------------------------
- BiocManager   [1.30.23 != 1.30.25]
- MASS          [7.3-59 != 7.3-58.1]

See ?renv::status() for advice on resolving these issues.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

> renv::restore()
The following package(s) will be updated:

# Bioconductor ---------------------------------------------------------------
- BiocVersion            [* -> 3.16.0]

# CRAN -----------------------------------------------------------------------
- BiocManager            [1.30.25 -> 1.30.23]
- MASS                   [7.3-58.1 -> 7.3-59]
- abind                  [* -> 1.4-5]
- ade4                   [* -> 1.7-22]
- alphavantager          [* -> 0.1.3]

- zip                    [* -> 2.3.1]
- zoo                    [* -> 1.8-12]

# GitHub ---------------------------------------------------------------------
- LambdaSkyline          [* -> phoscheit/LambdaSkyline@HEAD]

Do you want to proceed? [Y/n]: Y
# Downloading packages -------------------------------------------------------
- Downloading MASS from CRAN ...                OK [file is up to date]
- Downloading BiocManager from CRAN ...         OK [file is up to date]
- Downloading BH from CRAN ...                  OK [13.4 Mb in 1.3s]
- Downloading BiocVersion from Bioconductor ... OK [file is up to date]
- Downloading DBI from CRAN ...                 OK [file is up to date]
- Downloading LambdaSkyline from GitHub ...     - GitHub authentication credentials are not available.
- Please set GITHUB_PAT, or ensure the 'gitcreds' package is installed.
- See https://usethis.r-lib.org/articles/git-credentials.html for more details.
OK [7.5 Kb in 0.21s]
- Downloading ape from CRAN ...                 OK [1.5 Mb in 0.25s]

...

...

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

The restore process terminated cleanly when I got back from break. However, when the renv process prompted me to restart the R session, I naively quit and reran R, thinking that the restored properties would be a persistent state. I was wrong:

The following loaded package(s) have been updated:
- BiocManager
Restart your R session to use the new versions.

> 
Save workspace image? [y/n/c]: n
(venv) art@aziraphale:~/git/HUNePI$ R

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> 

As a result, I had to rerun renv::restore(), which repeated the download and install of all packages associated with the lockfile.

- Installing class ...                          FAILED
Error: Error installing package 'class':
=================================

* installing *source* packageclass...
** packageclasssuccessfully unpacked and MD5 sums checked
** using staged installation
** libs
using C compiler: ‘gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0gcc -I"/opt/R/4.3.2/lib/R/include" -DNDEBUG   -I/usr/local/include    -fpic  -g -O2  -c class.c -o class.o
class.c:31:9: error: unknown type nameSint’; did you meanuint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |         ^~~~
      |         uint
class.c:31:21: error: unknown type nameSint’; did you meanuint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                     ^~~~
      |                     uint
class.c:31:33: error: unknown type nameSint’; did you meanuint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                                 ^~~~
      |                                 uint
class.c:31:57: error: unknown type nameSint’; did you meanuint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                                                         ^~~~
      |                                                         uint
class.c:32:23: error: unknown type nameSint’; did you meanuint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                       ^~~~
      |                       uint
class.c:32:34: error: unknown type nameSint’; did you meanuint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                                  ^~~~
      |                                  uint
class.c:32:47: error: unknown type nameSint’; did you meanuint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                                               ^~~~
      |                                               uint
class.c:95:8: error: unknown type nameSint’; did you meanuint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |        ^~~~
      |        uint
class.c:95:19: error: unknown type nameSint’; did you meanuint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                   ^~~~
      |                   uint
class.c:95:30: error: unknown type nameSint’; did you meanuint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                              ^~~~
      |                              uint
class.c:95:42: error: unknown type nameSint’; did you meanuint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                                          ^~~~
      |                                          uint
class.c:95:54: error: unknown type nameSint’; did you meanuint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                                                      ^~~~
      |                                                      uint
class.c:96:23: error: unknown type nameSint’; did you meanuint’?
   96 |        double *train, Sint *class, double *test, Sint *res, double *pr,
      |                       ^~~~
      |                       uint
class.c:96:50: error: unknown type nameSint’; did you meanuint’?
   96 |        double *train, Sint *class, double *test, Sint *res, double *pr,
      |                                                  ^~~~
      |                                                  uint
class.c:97:8: error: unknown type nameSint’; did you meanuint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |        ^~~~
      |        uint
class.c:97:21: error: unknown type nameSint’; did you meanuint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                     ^~~~
      |                     uint
class.c:97:31: error: unknown type nameSint’; did you meanuint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                               ^~~~
      |                               uint
class.c:97:41: error: unknown type nameSint’; did you meanuint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                                         ^~~~
      |                                         uint
class.c:210:24: error: unknown type nameSint’; did you meanuint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                        ^~~~
      |                        uint
class.c:210:34: error: unknown type nameSint’; did you meanuint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                  ^~~~
      |                                  uint
class.c:210:54: error: unknown type nameSint’; did you meanuint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                                      ^~~~
      |                                                      uint
class.c:211:9: error: unknown type nameSint’; did you meanuint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |         ^~~~
      |         uint
class.c:211:36: error: unknown type nameSint’; did you meanuint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                    ^~~~
      |                                    uint
class.c:211:47: error: unknown type nameSint’; did you meanuint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                               ^~~~
      |                                               uint
class.c:212:9: error: unknown type nameSint’; did you meanuint’?
  212 |         Sint *iters)
      |         ^~~~
      |         uint
class.c:245:24: error: unknown type nameSint’; did you meanuint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                        ^~~~
      |                        uint
class.c:245:34: error: unknown type nameSint’; did you meanuint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                  ^~~~
      |                                  uint
class.c:245:54: error: unknown type nameSint’; did you meanuint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                                      ^~~~
      |                                                      uint
class.c:246:9: error: unknown type nameSint’; did you meanuint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |         ^~~~
      |         uint
class.c:246:36: error: unknown type nameSint’; did you meanuint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                    ^~~~
      |                                    uint
class.c:246:47: error: unknown type nameSint’; did you meanuint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                               ^~~~
      |                                               uint
class.c:247:9: error: unknown type nameSint’; did you meanuint’?
  247 |         Sint *iters)
      |         ^~~~
      |         uint
class.c:276:37: error: unknown type nameSint’; did you meanuint’?
  276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
      |                                     ^~~~
      |                                     uint
class.c:276:47: error: unknown type nameSint’; did you meanuint’?
  276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
      |                                               ^~~~
      |                                               uint
class.c:277:9: error: unknown type nameSint’; did you meanuint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |         ^~~~
      |         uint
class.c:277:19: error: unknown type nameSint’; did you meanuint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                   ^~~~
      |                   uint
class.c:277:46: error: unknown type nameSint’; did you meanuint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                                              ^~~~
      |                                              uint
class.c:278:9: error: unknown type nameSint’; did you meanuint’?
  278 |         Sint *niter, Sint *iters)
      |         ^~~~
      |         uint
class.c:278:22: error: unknown type nameSint’; did you meanuint’?
  278 |         Sint *niter, Sint *iters)
      |                      ^~~~
      |                      uint
class.c:327:54: error: unknown type nameSint’; did you meanuint’?
  327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
      |                                                      ^~~~
      |                                                      uint
class.c:327:64: error: unknown type nameSint’; did you meanuint’?
  327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
      |                                                                ^~~~
      |                                                                uint
class.c:328:20: error: unknown type nameSint’; did you meanuint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                    ^~~~
      |                    uint
class.c:328:30: error: unknown type nameSint’; did you meanuint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                              ^~~~
      |                              uint
class.c:328:57: error: unknown type nameSint’; did you meanuint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                                                         ^~~~
      |                                                         uint
class.c:329:9: error: unknown type nameSint’; did you meanuint’?
  329 |         Sint *niter, Sint *iters)
      |         ^~~~
      |         uint
class.c:329:22: error: unknown type nameSint’; did you meanuint’?
  329 |         Sint *niter, Sint *iters)
      |                      ^~~~
      |                      uint
class.c:386:14: error: unknown type nameSint’; did you meanuint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |              ^~~~
      |              uint
class.c:386:24: error: unknown type nameSint’; did you meanuint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                        ^~~~
      |                        uint
class.c:386:34: error: unknown type nameSint’; did you meanuint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                                  ^~~~
      |                                  uint
class.c:386:49: error: unknown type nameSint’; did you meanuint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                                                 ^~~~
      |                                                 uint
class.c:431:27: error:VR_knnundeclared here (not in a function)
  431 |     {"VR_knn", (DL_FUNC) &VR_knn, 14},
      |                           ^~~~~~
class.c:432:28: error:VR_knn1undeclared here (not in a function)
  432 |     {"VR_knn1", (DL_FUNC) &VR_knn1, 10},
      |                            ^~~~~~~
class.c:433:28: error:VR_lvq1undeclared here (not in a function)
  433 |     {"VR_lvq1", (DL_FUNC) &VR_lvq1, 10},
      |                            ^~~~~~~
class.c:434:28: error:VR_lvq2undeclared here (not in a function)
  434 |     {"VR_lvq2", (DL_FUNC) &VR_lvq2, 11},
      |                            ^~~~~~~
class.c:435:28: error:VR_lvq3undeclared here (not in a function)
  435 |     {"VR_lvq3", (DL_FUNC) &VR_lvq3, 12},
      |                            ^~~~~~~
class.c:436:28: error:VR_olvqundeclared here (not in a function)
  436 |     {"VR_olvq", (DL_FUNC) &VR_olvq, 10},
      |                            ^~~~~~~
class.c:437:33: error:VR_onlineSOMundeclared here (not in a function)
  437 |     {"VR_onlineSOM", (DL_FUNC) &VR_onlineSOM, 9},
      |                                 ^~~~~~~~~~~~
make: *** [/opt/R/4.3.2/lib/R/etc/Makeconf:191: class.o] Error 1
ERROR: compilation failed for packageclass* removing/home/art/git/HUNePI/renv/staging/1/classinstall of package 'class' failed [error code 1]
Traceback (most recent calls last):
13: renv::restore()
12: records <- renv_restore_run_actions(project, diff, current, lockfile, rebuild) at restore.R#158
11: renv_install_impl(records) at restore.R#186
10: if (staged)
      renv_install_staged(records)
    else
      renv_install_default(records) at install.R#275
 9: renv_install_default(records) at install.R#295
 8: handler(package, renv_install_package(record)) at install.R#398
 7: handler(package, renv_install_package(record)) at install.R#398
 6: withCallingHandlers(
      renv_install_package_impl(record),
      error = function(e) writef("FAILED")
    ) at install.R#431
 5: withCallingHandlers(
      renv_install_package_impl(record),
      error = function(e) writef("FAILED")
    ) at install.R#431
 4: if (copyable)
      renv_file_copy(path, installpath, overwrite = TRUE)
    else
      r_cmd_install(package, path) at install.R#641
 3: if (!identical(status, 0L))
      r_exec_error(package, output, "install", status) at r.R#234
 2: abort(all) at r.R#52
 1: stop(fallback) at abort.R#44

I went back through the console messages associated with my previous attempt to run restore and determined that it did not attempt to install class that time, so this is a new problem.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

I'm dumb and forgot to run my local install of R version 4.2.2 instead of my default R (version 4.3.2). Launching the former displayed the following:

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- The project is out-of-sync -- use `renv::status()` for details.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

Still not working:

> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced

> alpha
          p1    value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 33.81239     13     NA    NA        0 FALSE   NA 0.119
> sky <- skyline.multi.phylo(phy, alpha$pi)
Error in lbeta(2 - alpha, alpha) : 
  non-numeric argument to mathematical function

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

Ooof. I'm dumb. I've been calling alpha$pi instead of alpha$p1.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

Ok this works:

> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
          p1   value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 37.0464     13     NA    NA        0 FALSE   NA 0.109
> str(sky)
List of 6
 $ time           : num [1:102] 0.00 2.22e-16 4.44e-16 6.70e-04 8.08e-04 ...
 $ interval.length: num [1:102] 0.00 2.22e-16 2.22e-16 6.70e-04 1.38e-04 ...
 $ population.size: num [1:102] 3.304 3.304 3.304 3.304 0.668 ...
 $ parameter.count: int 102
 $ epsilon        : num 0
 $ logL           : num -37
 - attr(*, "class")= chr "skyline"

But this is still running into problems:

> tree <- read.tree("data/EG.5.1.expand.nwk")
> tree

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
> alpha
Error: object 'alpha' not found

Looks like it has something to do with tree size:

> tree <- read.tree("data/EG.5.1.n100.nwk")
> tree

Phylogenetic tree with 100 tips and 99 internal nodes.

Tip labels:
  2280, 2688, 1668_2, 2383_6, 778, 2749, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
          p1    value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 365.5645     13     NA    NA        0 FALSE   NA  0.19
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:197] 0 0.0662 2.7766 3.0424 5.1165 ...
 $ interval.length: num [1:197] 0 0.0662 2.7104 0.2658 2.0741 ...
 $ population.size: num [1:197] 21.9 21.9 21.9 21.9 21.9 ...
 $ parameter.count: int 197
 $ epsilon        : num 0
 $ logL           : num -366
 - attr(*, "class")= chr "skyline"

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 25, 2024

@GopiGugan can you please try to reproduce the NaN problem using my new branch iss539?

@GopiGugan
Copy link
Contributor

GopiGugan commented Oct 29, 2024

Getting an error when trying to find Ne:

> str(tree)
List of 6
 $ edge       : int [1:6909, 1:2] 4915 4916 4917 4917 4915 4918 4919 4919 4919 4915 ...
 $ edge.length: num [1:6909] 2.847 0.891 0 0 1.16 ...
 $ Nnode      : int 1996
 $ node.label : chr [1:1996] "3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|123"| __truncated__ "14330.96" "" "29480.64" ...
 $ tip.label  : chr [1:4914] "709_0" "709_1" "634_0" "634_1" ...
 $ root.edge  : num 0
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions

ctree = clustering.consensus(
trees, cutoff=args.boot_cutoff, callback=callback)
outfile.close() # done with Phylo.parse generator
# calculate HUNePI statistics
# estimate Ne before collapsing polytomies
clabel_dict = manage_collapsed_nodes(label_dict, ctree)
cne = find_ne(ctree, clabel_dict)

temp_tree.nwk.zip

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

Confirmed that I can reproduce this error in my R environment (customized for HUNePI):

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree

Phylogenetic tree with 4914 tips and 1996 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

@GopiGugan can you please send me a bunch of trees, for which some run and others fail?

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

I am stepping through the betacoal.maxlik code to diagnose the problem. This function is quite simple:

function(phylo,epsilon=0.0) {
  Inter <- coalescent.intervals.multi(phylo)
  optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter,x,epsilon)$logL,
                 lower=0.001,upper=1.999, method="bobyqa")
}

so I'm stepping through coalescent.intervals.multi with the tree that @GopiGugan sent:

> x <- read.tree("~/Downloads/temp_tree.nwk")
> class(x)
[1] "phylo"
> t <- sort(branching.times.samp(x))
> summary(t)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   38.81   41.06   40.37   42.78   45.76
> t <- signif(t,digits=10)
> Dist <- unique(t)
> lt <- length(Dist) 
> lt
[1] 5649
>   # interval widths
  w <- numeric(lt)
  w[1] <- Dist[1]
  for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1]  # why not use diff()?
> summary(w)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000510 0.001280 0.008101 0.002850 5.075050 
>   # Logs if a coalescence happens or not
  Coal <- logical(lt)
> table(Coal)
Coal
FALSE 
 5649

This last result is weird because it indicates that there are no coalescent events in the tree.. Maybe we are just initializing a vector of the required length before populating it?

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

Continuing on:

> # Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
  NumInvolved <- integer(lt)
> str(NumInvolved)
 int [1:5649] 0 0 0 0 0 0 0 0 0 0 ...
l<-numeric(lt)
  l[1] <- 0 # Number of lineages at present time

I know that the following loop crashes at i=8:

for(i in 2:lt){
    Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]
    
    CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
    Involved <- 0 # Will record the total number of lineages coalescing at that time
    
    if(length(CoalNodes) >0){
      # if(length(CoalNodes)>1) print("Simultaneous Coalescences detected! Results meaningless!")
      Coal[i-1] <- TRUE
      for(j in 1:length(CoalNodes)){
        Edges <- x$edge[x$edge[,1]==CoalNodes[j],] # Finds how many lineages coalesce at node CoalNodes[j]
        Involved <- Involved + length(Edges[,1]) # Updates the count of coalescing lineages
        NumInvolved[i-1] <- length(Edges[,1])
      }
    }
    l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
  }

Stepping through i=2 to i=8:

> i <- 2
> Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]
    
    CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
    Involved <- 0 # Will record the total number of lineages coalescing at that time
> length(CoalNodes)
[1] 0
> l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
> l[2]
[1] 1
> i <- 3
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
SampNodes <- Nodes[Nodes<=length(x$tip.label)] 
Involved <- 0
> length(CoalNodes)
[1] 0

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

> i <- 8
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
SampNodes <- Nodes[Nodes<=length(x$tip.label)] 
Involved <- 0

length(CoalNodes)
[1] 1
> Coal[i-1] <- TRUE
> Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
> Involved <- Involved + length(Edges[,1])
Error in Edges[, 1] : incorrect number of dimensions
> Edges
[1] 5406 1159

R is expecting Edges to be a matrix. Looks like the problem is that there is only one node that descends from the ancestral node, so it is not really a coalescent event.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

I modified the function to handle cases where only a single branch descends from an internal node:

coalescent.intervals.multi <- function(x) 
{
  if (class(x) != "phylo") stop("object \"x\" is not of class \"phylo\"")
  
  # ordered branching times
  t <- sort(branching.times.samp(x))
  
  t <- signif(t,digits=10) # Required to eliminate numerical instabilities
  Dist <- unique(t) # We group all identical times together
  lt <- length(Dist) # Total number of intervals
  
  # interval widths
  w <- numeric(lt)
  w[1] <- Dist[1]
  for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1]
  
  # Logs if a coalescence happens or not
  Coal <- logical(lt)
  
  # Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
  NumInvolved <- integer(lt)
  
  l<-numeric(lt)
  l[1] <- 0 # Number of lineages at present time
  for(i in 2:lt) {
    # nodes (leaves or internal) with timestamp Dist[i-1]
    Nodes <- as.integer(names(t[t==Dist[i-1]]))
    
    # all internal nodes, corresponding to a coalescence event
    CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
    # all leaves, corresponding to a sampling event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)]
    Involved <- 0 # total number of lineages coalescing at that time
    
    if(length(CoalNodes) >0){
      # if(length(CoalNodes)>1) 
      #  print("Simultaneous Coalescences detected! Results meaningless!")
      Coal[i-1] <- TRUE
      for(j in 1:length(CoalNodes)) {
        # Finds how many lineages coalesce at node CoalNodes[j]
        Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
        if (is.null(dim(Edges))) {
          next
        }

        # Updates the count of coalescing lineages
        Involved <- Involved + length(Edges[,1])
        NumInvolved[i-1] <- length(Edges[,1])  # this seems irrelevant
      }
    }
    if (Involved==0) {
      Coal[i-1] <- FALSE
    }
    l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
  }
  NumInvolved[lt] <- l[lt]
  Coal[lt] <- TRUE
  obj <- list(
    lineages=l,
    interval.length=w,
    interval.count=lt,
    total.depth =sum(w),
    coalescences = Coal,
    NumInvolved=NumInvolved)
  class(obj) <- "coalescentIntervals"
  return(obj)
}

This seems to be running without error, although it is taking a long time:

> Inter <- coalescent.intervals.multi(tree)
> res <- optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter, x, 0.0)$logL, lower=0.001, upper=1.999, method="bobyqa")

I think a faster method would be to simply remove such nodes from the input tree:

> phy <- read.tree("~/Downloads/temp_tree.nwk")
> phy

Phylogenetic tree with 4914 tips and 1996 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...

Rooted; includes branch lengths.
> has.singles(phy)
[1] TRUE
> phy2 <- collapse.singles(phy)
> phy2

Phylogenetic tree with 4914 tips and 1479 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, , , , 0.78, 36080.88, ...

Rooted; includes branch lengths.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

The optimization finished but there was an issue at a later step:

> res
             p1    value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.596305 19503.13     18     NA    NA        0 TRUE TRUE 226.902
> sky <- skyline.multi.phylo(tree, res$pi)
Error in Edges[, 1] : incorrect number of dimensions

At this point I think the easier route is to remove single nodes.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

Okay that works:

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
             p1   value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.596361 19038.5     18     NA    NA        0 TRUE TRUE 181.989
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
 $ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
 $ population.size: num [1:5134] 49.1 49.1 49.1 49.1 49.1 ...
 $ parameter.count: int 5134
 $ epsilon        : num 0
 $ logL           : num -19039
 - attr(*, "class")= chr "skyline"

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

@GopiGugan I pushed this fix to branch iss539, can you please pull and re-run?

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 29, 2024

Skyline for test tree:
image

@GopiGugan
Copy link
Contributor

@ArtPoon are you able to share the versions of ape and LambdaSkyline you're using?

The error message is gone now, but we're getting NA values for the population size:

> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree('temp_tree.nwk')
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
       p1         value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa NA 8.988466e+307     NA     NA    NA     9999   NA   NA 0.001
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
 $ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
 $ population.size: num [1:5134] NA NA NA NA NA NA NA NA NA NA ...
 $ parameter.count: int 5134
 $ epsilon        : num 0
 $ logL           : num NA
 - attr(*, "class")= chr "skyline"

session info:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 37 (Server Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.3

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1

loaded via a namespace (and not attached):
[1] compiler_4.2.2      parallel_4.2.2      Rcpp_1.0.10
[4] nlme_3.1-160        grid_4.2.2          digest_0.6.31
[7] optimx_2022-4.30    numDeriv_2016.8-1.1 lattice_0.20-45

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Oct 30, 2024

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /home/art/R-4.2.2/lib/R/lib/libRblas.so
LAPACK: /home/art/R-4.2.2/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1        

loaded via a namespace (and not attached):
 [1] minqa_1.2.5         compiler_4.2.2      BiocManager_1.30.23
 [4] parallel_4.2.2      tools_4.2.2         Rcpp_1.0.10        
 [7] nlme_3.1-160        grid_4.2.2          digest_0.6.35      
[10] nloptr_2.0.3        optimx_2023-10.21   numDeriv_2016.8-1.1
[13] renv_1.0.7          pracma_2.4.4        lattice_0.20-45 

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 5, 2024

On BEVi:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.2

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 5, 2024

  • @GopiGugan is running the lineages on BEVi right now - it's been a bit over a day so far
  • this implies that the HUNePi estimates that we've produced to date have skipped Ne computation
  • @GopiGugan noted that this step is trivial to parallelize
  • we might also want to think about randomly downsampling the trees

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 5, 2024

We are using MPI within the same function in batch_utils.py to generate clusters (trees):

for lineage, features in by_lineage.items():
if lineage in minor or (
updated_lineages is not None and lineage not in updated_lineages):
continue
if callback:
callback(f'start {lineage}, {len(features)} entries')
cmd = [
"mpirun", "--machinefile", args.machine_file, "python3", "covizu/clustering.py",
recode_file, lineage, # positional arguments <JSON file>, <str>
"--mode", "deep",
"--max-variants", str(args.max_variants),
"--nboot", str(args.nboot),
"--outdir", args.outdir,
"--binpath", args.binpath
]
if initial_time:
cmd.extend(["--timestamp", str(initial_time)])
subprocess.check_call(cmd)

We could use the same approach of using subprocess to call mpirun on an R script that uses the parallel library, e.g., mclapply.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 5, 2024

This script will randomly downsample a tree - we could incorporate this into the part of the batch_utils.py function that expands tips with duplicate sequences.

import argparse
import random
from Bio import Phylo
import sys


description = """
Remove tips at random from a tree until a target size is reached.
Writes resulting Newick string to stdout.
"""

parser = argparse.ArgumentParser(description=description)
parser.add_argument("nwkfile", type=argparse.FileType('r'),
                    help="Path to file containing Newick tree.")
parser.add_argument("target", type=int, help="Target number of tips.")
args = parser.parse_args()

phy = Phylo.read(args.nwkfile, 'newick')
tips = phy.get_terminals()  # returns a list of Clade objects
if args.target >= len(tips):
    sys.stderr.write(f"Error: requirement already met ({len(tips)}<={target})\n")
    sys.exit()

# random permutation
random.shuffle(tips)
while len(tips) > args.target:
    tip = tips[0]
    _ = phy.prune(tip)
    tips = tips[1:]

Phylo.write(phy, sys.stdout, 'newick')

The make_beadplots function is getting too big, we should try to split it into two or more functions if possible...

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 5, 2024

Here is a simple example of mclapply in use:
https://github.com/PoonLab/tragula/blob/9a59fd123a13fad2aa3a7f2a5a8e0fc6aeec2626/scripts/topicspace.R#L130-L141

  # calculate the pairwise Wasserstein distance matrix
  n <- length(wpps)
  if (require(parallel, quietly=TRUE)) {
    res <- mclapply(0:(n*n-1),  # some iterable to distribute among threads as `k`
      function(k) {  # anonymous function
      i <- k %/% n + 1  # do stuff here
      j <- k %% n + 1
      if (i < j) { 
        transport::wasserstein(wpps[[i]], wpps[[j]], p=p, prob=TRUE)   # return value
      } else {
        0  # other return value
      }
    }, mc.cores = mc.cores)  # the number of threads
    # results are assigned to `res` as a list
    wdist <- matrix(unlist(res), nrow=n, ncol=n, byrow=T)

@GopiGugan
Copy link
Contributor

GopiGugan commented Nov 12, 2024

I wrote a python script with rpy2 to use mclapply :

import rpy2.robjects as robjects
from rpy2.robjects import r, ListVector
from rpy2.robjects.packages import importr
import rpy2.robjects.vectors as rvectors
from rpy2.rinterface_lib.sexp import NALogicalType
import glob
import os
import re
import csv

nwk_files = glob.glob('ctree/*.nwk')
nwk_files_abs = [os.path.abspath(file) for file in nwk_files]
lineages = [os.path.basename(file).split('.nwk')[0] for file in nwk_files]

parallel = importr("parallel")

robjects.r('set.seed(123456)')

r_lineages = rvectors.StrVector(lineages)
r_data = rvectors.StrVector(nwk_files_abs)

r('''
    find_ne <- function(file) {
        library(ape)
        library(LambdaSkyline)

        tryCatch({
            tree <- read.tree(file)
            tree <- collapse.singles(tree)
            alpha <- betacoal.maxlik(tree)
            sky <- skyline.multi.phylo(tree, alpha$p1)
            pop_sizes <- head(sky$population.size, n = 5)
            return(mean(pop_sizes, na.rm = TRUE))
        }, error = function(e) {
            return(NA)
        })
    }
''')

r_results = parallel.mclapply(r_data, r('find_ne'), mc_cores=20)
r_named_results = robjects.r['setNames'](r_results, r_lineages)

results_dict = dict(zip(lineages, r_named_results))
res = {k: 'NA' if isinstance(v[0], NALogicalType) else float(v[0]) for k, v in results_dict.items()}

with open("hunepi_ne.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Lineage', 'Ne'])
    for lineage, ne in res.items():
        writer.writerow([lineage, ne])

After downsampling to 500 tips max, we still get NA Ne values for 695 lineages

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 12, 2024

@GopiGugan reports that downsampling and running the resulting trees on 20 cores took about ten minutes

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 12, 2024

@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (NA for $N_e$)

@GopiGugan
Copy link
Contributor

@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (NA for N e )

I tried a random subset of trees. They all fail at the following step:

> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced

This seems to be the same error you were getting before:

Tried this with another tree of similar size and got a different error message:

> require(LambdaSkyline)
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk")
> phy

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 13, 2024

The $N_e$ estimates by lineage seem to be reasonably correlated between the original and pruned trees:

setwd("~/git/covizu/iss539/")
full <- read.csv("full-stats.csv")
pruned <- read.csv("pruned-Ne.csv")
idx <- match(full$lineage, pruned$Lineage)
full$pruned.ne <- pruned$Ne[idx]

par(mar=c(5,5,1,1))
plot(full$Ne, full$pruned.ne, log='xy', xlab="Original trees",
     ylab="Pruned trees (500 tips)", bty='n', type='n')
abline(a=0, b=1, lty=2, col='red')
points(full$Ne, full$pruned.ne, pch=19, cex=0.3, 
       col=rgb(0,0,0,0.2))
> cor.test(log(full$Ne), log(full$pruned.ne), method='spearman')

	Spearman's rank correlation rho

data:  log(full$Ne) and log(full$pruned.ne)
S = 907043858, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.810618 

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 13, 2024

The same trees tend to fail:

> tab <- table(is.na(full$Ne), is.na(full$pruned.ne))
> tab
       
        FALSE TRUE
  FALSE  3063  321
  TRUE     30  392
> fisher.test(tab)

	Fisher's Exact Test for Count Data

data:  tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  83.83524 189.96729
sample estimates:
odds ratio 
  124.5724 

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 13, 2024

It looks like part of the problem is that you are not expanding the trees into binary (dichotomous) versions:

> phy <- read.tree("~/git/covizu/iss539/ctree/AY.17.nwk")
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
> phy2 <- multi2di(phy)
> phy2

Phylogenetic tree with 62 tips and 61 internal nodes.

Tip labels:
  28, 29, 20, 4, 32_0, 32_1, ...
Node labels:
  1.00, , , 0.85, 0.53, 0.82, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy2)
> alpha
             p1    value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.797876 62.88628     16     NA    NA        0 TRUE TRUE 0.022
> sky <- skyline.multi.phylo(phy2, alpha$p1)
> sky
$time
 [1] 0.00000 2.50874 2.62527 3.07890 3.32507 3.40158 3.41762 3.61494 3.84003
[10] 3.84527 3.86048 3.86503 3.99503 4.74003 4.85762 4.87048 4.95503 5.25344
[19] 5.49606 5.57126 5.64003 5.96745 6.24890 6.39282 6.41757 6.71126 6.76671
[28] 6.89565 7.21962 7.29040 7.65332 8.01444 8.03655 8.06745 8.15524 8.25016
[37] 8.70444 8.75444 8.75888 8.80009 9.88444

$interval.length
 [1] 0.00000 2.50874 0.11653 0.45363 0.24617 0.07651 0.01604 0.19732 0.22509
[10] 0.00524 0.01521 0.00455 0.13000 0.74500 0.11759 0.01286 0.08455 0.29841
[19] 0.24262 0.07520 0.06877 0.32742 0.28145 0.14392 0.02475 0.29369 0.05545
[28] 0.12894 0.32397 0.07078 0.36292 0.36112 0.02211 0.03090 0.08779 0.09492
[37] 0.45428 0.05000 0.00444 0.04121 1.08435

$population.size
 [1]  0.1165300  0.1165300  0.1165300  3.2272074  3.2272074  3.2272074
 [7]  4.6218444  4.6218444  4.6218444  4.6218444 34.2015934 34.2015934
[13] 34.2015934 34.2015934 34.2015934 34.2015934  3.3980938 22.5725486
[19] 22.5725486 22.5725486 22.5725486 26.5043770 26.5043770  8.1894774
[25]  8.1894774 13.9256263  9.7423137  9.7423137 72.9813053 72.9813053
[31] 72.9813053 72.9813053 72.9813053 72.9813053  8.0100758  7.7336799
[37] 36.8743607 36.8743607  0.4506327  4.1825619 23.5979175

$parameter.count
[1] 41

$epsilon
[1] 0

$logL
[1] -62.88628

attr(,"class")
[1] "skyline"

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 13, 2024

@GopiGugan what branch are you working on? Can I jump in?

@GopiGugan
Copy link
Contributor

GopiGugan commented Nov 25, 2024

There seem to be tips without a label (None) after pruning the tree.

@ArtPoon - I tried using Bioplus however, I am still getting an None label in the tip:

>>> from Bio import Phylo
>>> from covizu import clustering
>>> from Bioplus import prunetree
>>> trees = Phylo.parse('2024-08-13/BA.5.5.nwk', 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> [i.name for i in tree.get_terminals() if i.name is None]
[]
>>> tree = prunetree.prune_tips(tree,500)
>>> len(tree.get_terminals())
501
>>> [i.name for i in tree.get_terminals() if i.name is None]
[None]

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 26, 2024

Reproduced the problem on my machine:

(venv) art@aziraphale:~/git/covizu/iss539/2024-08-13$ python
Python 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> from covizu import clustering
>>> trees = Phylo.parse("BA.5.5.nwk", 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> tree
Clade(branch_length=0.0, confidence=1.0)
>>> Phylo.write(tree, "consensus.nwk", 'newick')
1
>>> from Bioplus import prunetree
>>> tree = prunetree.prune_tips(tree, 500)
>>> len(tree.get_terminals())
501
>>> tips = tree.get_terminals()
>>> tips[0]
Clade(branch_length=6.917200357142859, name='4179')
>>> bad = [tip for tip in tips if tip.name is None]
>>> bad
[Clade(branch_length=7.270655824742268, confidence=0.5)]
>>> bad[0].clades
[]

Looks like there is a problem with Biopython.Phylo's prune method, because that is what is being called by Bioplus prunetree. This method was recently patched in a commit on Biopython, so we should try pulling the latest version (might not be a release) and trying to reproduce the problem.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 26, 2024

Something strange happened here:

>>> phy = Phylo.read("consensus.nwk", 'newick')
>>> tips = phy.get_terminals()
>>> target = 500
>>> while len(tips) > target:
...     tips = sorted(tips, key=lambda x: x.branch_length)
...     tip = tips[0]
...     parent = phy.prune(tip)
...     tips = tips[1:]
...     bad = [tip for tip in tips if tip.name is None]
...     if bad:
...         break
... 
>>> bad
[]
>>> len(tips)
500
>>> tips = phy.get_terminals()
>>> len(tips)
501
>>> bad = [tip for tip in tips if tip.name is None]
>>> bad
[Clade(branch_length=7.27065, confidence=0.5)]
>>> parent
Clade(branch_length=0.0, confidence=1.0)
>>> tip
Clade(branch_length=6.09403, name='1483')

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 26, 2024

This seems to work:

>>> phy = Phylo.read("consensus.nwk", 'newick')
>>> tips = phy.get_terminals()
>>> while len(tips) > target:
...     tips = sorted(tips, key=lambda x: x.branch_length)
...     tip = tips[0]
...     parent = phy.prune(tip)
...     tips = phy.get_terminals()
... 
>>> tip
Clade(branch_length=6.09498, name='3592')
>>> parent
Clade(branch_length=0.0, confidence=1.0)
>>> len(tips)
500
>>> bad = [tip for tip in tips if tip.name is None]
>>> bad
[]
>>> bad = [tip for tip in phy.get_terminals() if tip.name is None]
>>> bad
[]

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Nov 27, 2024

I've updated the Bioplus repo, @GopiGugan can you please re-run with the new prunetree code?

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Dec 3, 2024

For downsampled trees (to 500 tips), we should only run these for estimating Ne and retain the original trees for the other three statistics

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Dec 11, 2024

@GopiGugan to submit PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants