Memory Leak and Output Stalling when Processing Large Datasets with TreeDistance() #123

Open
qwer62667771 opened this issue Jul 20, 2024 · 5 comments

Comments

@qwer62667771

I'm encountering an issue while using the TreeDistance() function to process large datasets. After the computations are completed, the process appears to freeze without returning any output. Concurrently, I observe that the memory usage continues to increase indefinitely. This behavior suggests a possible memory leak within the function when dealing with substantial amounts of data.

@ms609
Owner

ms609 commented Jul 22, 2024

Thanks for the report, and sorry to hear you've come up against this issue. Could you give more details of the nature of your large datasets? At a minimum, it would be helpful to know how many trees of how many leaves you are processing. Better still would be if you could share a problematic dataset so I could attempt to reproduce the issue myself. Thanks!
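The counts requested above (number of trees, leaves per tree) could be gathered with a short sketch like the following. `SummarizeTrees` is a hypothetical helper, not part of ape or TreeDist, and the random trees only stand in for the real dataset.

```r
library(ape)

# Hypothetical helper (not part of ape or TreeDist): report the size of a
# set of trees and how many of them carry node labels.
SummarizeTrees <- function(trees) {
  nTip <- Ntip(trees)  # one leaf count per tree for a multiPhylo object
  hasNodeLabels <- vapply(
    unclass(trees),
    function(tr) !is.null(tr$node.label),
    logical(1)
  )
  c(
    nTrees = length(trees),
    minLeaves = min(nTip),
    maxLeaves = max(nTip),
    withNodeLabels = sum(hasNodeLabels)
  )
}

# Stand-in data: 5 random trees of 8 leaves each.
# Replace with ape::read.tree("1-all_genetrees.txt") for the real file.
set.seed(1)
exampleTrees <- rmtree(5, 8)
treeSummary <- SummarizeTrees(exampleTrees)
print(treeSummary)
```

Posting this summary alongside the input file should make the report easier to reproduce.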

@qwer62667771
Author

Thank you very much for your response. Below are the input files and R code I have been using. The issue seems to primarily occur during the assignment of the result of TreeDistance(tree) to the variable distance, where the process either gets stuck or terminates. I later attempted to calculate distances in parallel, which was successful in some instances but failed in others, and I am unsure of the specific reason behind this.

1-all_genetrees.txt

```r
library('TreeDist')

setwd('R:/Rstudio workplace/wjj_tree_filter/fna_RF')

# Read all gene trees; exit with a non-zero status if the file cannot be read
tree <- tryCatch({
  ape::read.tree('1-all_genetrees.txt')
}, error = function(e) {
  print(paste("Error reading tree file:", conditionMessage(e)))
  quit(save = "no", status = 1)
})

# Compute pairwise distances between all trees (this is the step that stalls)
distance <- tryCatch({
  TreeDistance(tree)
}, error = function(e) {
  print(paste("Error calculating tree distance:", conditionMessage(e)))
  quit(save = "no", status = 1)
})

# Write the full distance matrix to disk
distance_matrix <- as.matrix(distance)
write.csv(distance_matrix, "3-distance_matrix.csv", row.names = TRUE)
```

@ms609
Owner

ms609 commented Jul 22, 2024

Thanks; I'll try to take a look later this week.

@ms609
Owner

ms609 commented Jul 25, 2024

Thanks for bearing with me whilst I look into this.

Whilst the calculation of the information shared between the trees is reasonably quick, as you have observed, converting these into distances requires calculating the maximum distance between trees with non-overlapping leaf labels – and this post-processing takes much longer, as I've not invested much time in optimizing this.

One delay arises because the trees are presented with node labels. I recently updated the code that reorders trees for analysis and normalization to preserve node labels, but this additional code is not optimized for speed. I'll update the code to automatically remove this information when comparing trees, but in the meantime you can run

```r
trees <- tree
trees[] <- lapply(trees, "[[<-", "node.label", NULL)
trees[] <- lapply(trees, "[[<-", "edge.length", NULL)
trees <- TreeTools::Preorder(trees)
# Then calculate distance with
TreeDistance(trees)
```
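For reuse, the workaround can be wrapped in a small function and checked on a toy dataset before running it on the full file. This is only a sketch: `StripForDistance` is a hypothetical name, and the random trees with made-up node labels are stand-ins for the real gene trees.

```r
library(ape)        # rmtree()
library(TreeTools)  # Preorder()
library(TreeDist)   # TreeDistance()

# Hypothetical helper: drop node labels and edge lengths, then reorder,
# using the same steps as the workaround above.
StripForDistance <- function(trees) {
  trees[] <- lapply(trees, "[[<-", "node.label", NULL)
  trees[] <- lapply(trees, "[[<-", "edge.length", NULL)
  Preorder(trees)
}

# Stand-in data: 6 random trees of 10 leaves, each given dummy node labels.
set.seed(1)
labelled <- lapply(unclass(rmtree(6, 10)), function(tr) {
  tr$node.label <- paste0("n", seq_len(tr$Nnode))
  tr
})
class(labelled) <- "multiPhylo"

stripped <- StripForDistance(labelled)

# Confirm that no tree still carries node labels.
stillLabelled <- vapply(
  unclass(stripped),
  function(tr) !is.null(tr$node.label),
  logical(1)
)

# Time the distance calculation on the stripped trees.
elapsed <- system.time(distances <- TreeDistance(stripped))[["elapsed"]]
```

Running the same timing on a subset of the real trees, with and without stripping, would show how much of the delay the node labels account for.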

I've also updated TreeDist to display a progress bar for the post-processing phase, which should give some indication of how far the calculation has progressed. Install this version using

```r
devtools::install_github("ms609/TreeTools")
devtools::install_github("ms609/TreeDist")
```

On my machine, I can now calculate the distances for the trees you provided in around a minute.

There's more that could be done to speed this up – but I can't spare the time for this at present. I'll leave the issue open for when I (or other contributors) have the chance to return to this.

@qwer62667771
Author

Thank you for taking the time to resolve this issue. I will try the method you provided and update the R package.
