Memory Leak and Output Stalling when Processing Large Datasets with TreeDistance() #123
I'm encountering an issue while using the TreeDistance() function to process large datasets. After the computations are completed, the process appears to freeze without returning any output. Concurrently, I observe that the memory usage continues to increase indefinitely. This behavior suggests a possible memory leak within the function when dealing with substantial amounts of data.
Comments
Thanks for the report, and sorry to hear you've come up against this issue. Could you give more details of the nature of your large datasets? At a minimum, it would be helpful to know how many trees of how many leaves you are processing. Better still would be if you could share a problematic dataset so I could attempt to reproduce the issue myself. Thanks!
Thank you very much for your response. Below are the input files and the R code I have been using. The issue seems primarily to occur when the result of TreeDistance(tree) is assigned to the variable distance: the process either gets stuck or terminates. I later attempted to calculate the distances in parallel, which succeeded in some instances but failed in others, and I am unsure of the specific reason behind this.

```r
setwd('R:/Rstudio workplace/wjj_tree_filter/fna_RF')
tree <- tryCatch({
distance <- tryCatch({
distance_matrix <- as.matrix(distance)
```
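The snippet above appears truncated as posted; the tryCatch() bodies are not shown. For context, a minimal runnable sketch of this kind of workflow might look like the following, assuming the trees are read from a Newick file (the file name "trees.nwk" and the error handlers are hypothetical, not part of the original post):

```r
library(ape)       # read.tree
library(TreeDist)  # TreeDistance

# "trees.nwk" is a hypothetical stand-in for the input files mentioned above
trees <- tryCatch(
  read.tree("trees.nwk"),
  error = function(e) {
    message("Failed to read trees: ", conditionMessage(e))
    NULL
  }
)

# The step reported to stall or terminate on large inputs
distance <- tryCatch(
  TreeDistance(trees),
  error = function(e) {
    message("TreeDistance() failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(distance)) {
  distance_matrix <- as.matrix(distance)  # convert 'dist' object to a square matrix
}
```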
Thanks; I'll try to take a look later this week.
Thanks for bearing with me whilst I look into this.

Whilst the calculation of the information shared between the trees is reasonably quick, converting these values into distances requires calculating the maximum distance between trees with non-overlapping leaf labels – and, as you have observed, this post-processing takes much longer, as I've not invested much time in optimizing it.

One delay arises because the trees are presented with node labels. I recently updated the code that reorders trees for analysis and normalization so that it preserves node labels, but this additional code is not optimized for speed. I'll update the code to remove this information automatically when comparing trees; in the meantime, you can run:

```r
trees <- tree
trees[] <- lapply(trees, "[[<-", "node.label", NULL)
trees[] <- lapply(trees, "[[<-", "edge.length", NULL)
trees <- TreeTools::Preorder(trees)
# Then calculate distance with
TreeDistance(trees)
```

I've also updated TreeDist to display a progress bar during the post-processing phase, which should give some indication of how the calculation is proceeding. Install this version using:

```r
devtools::install_github("ms609/TreeTools")
devtools::install_github("ms609/TreeDist")
```

On my machine, I can now calculate the distances for the trees you provided in around a minute. There's more that could be done to speed this up – but I can't spare the time for this at present. I'll leave the issue open for when I (or other contributors) have the chance to return to this.
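In case it's useful, here is a minimal end-to-end sketch combining the steps above; the input file name "large_trees.nwk" is hypothetical, and system.time() is included only to gauge the post-processing cost:

```r
library(TreeDist)

# Hypothetical input file standing in for the trees attached above
trees <- ape::read.tree("large_trees.nwk")

# Strip node labels and edge lengths, which slow the reordering step
trees[] <- lapply(trees, "[[<-", "node.label", NULL)
trees[] <- lapply(trees, "[[<-", "edge.length", NULL)
trees <- TreeTools::Preorder(trees)

# The updated TreeDist shows a progress bar during post-processing
elapsed <- system.time(distance <- TreeDistance(trees))
print(elapsed)

distance_matrix <- as.matrix(distance)
```

Dropping the node labels and edge lengths should be safe here because, as the workaround above implies, the distance calculation uses only the trees' topology.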
Thank you for taking the time to resolve this issue. I will try the method you provided and update the R package.