Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split up epichains classes #107

Merged
merged 40 commits into from
Dec 4, 2023
Merged

Split up epichains classes #107

merged 40 commits into from
Dec 4, 2023

Conversation

jamesmbaazam
Copy link
Member

@jamesmbaazam jamesmbaazam commented Nov 8, 2023

This PR closes #66, closes #78, and closes #79.

It does this by:

  • splitting up the <epichains> class into <epichains_tree> and epichains_summary, which inherit from <data.frame> and <vector> respectively.
  • adding a constructor, helper, and validator for each class.
  • removing the <aggregate_epichains_df> class as it is no longer deemed necessary.
  • providing each new class with its own print() and summary() methods.

@codecov-commenter
Copy link

codecov-commenter commented Nov 8, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (0a28674) 98.63% compared to head (ca5eec9) 98.90%.

❗ Current head ca5eec9 differs from pull request most recent head 0c89a08. Consider uploading reports for the commit 0c89a08 to get more accurate results

Files Patch % Lines
R/epichains.R 98.36% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
+ Coverage   98.63%   98.90%   +0.27%     
==========================================
  Files           8        8              
  Lines         511      549      +38     
==========================================
+ Hits          504      543      +39     
+ Misses          7        6       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jamesmbaazam jamesmbaazam force-pushed the split-epichains-classes branch from 1e568e2 to f037500 Compare November 28, 2023 17:53
@jamesmbaazam jamesmbaazam marked this pull request as ready for review November 28, 2023 17:58
@jamesmbaazam jamesmbaazam requested a review from sbfnk November 28, 2023 18:22
Copy link
Contributor

@sbfnk sbfnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'm still slightly concerned if it's confusing to have both simulate_tree and simulate_summary. The main reason for simulate_summary to exist is that it's faster than simulate_tree and thus it's useful to have internally especially when approximating likelihoods where it may be run ~1,000,000s of times within e.g. an MCMC. But for the user I wonder if there is ever a situation where this would matter, and where we could otherwise have just a simulate function that returns the tree.

But I'm not sure, and I'm not sure what the best solution is. Perhaps something to come out when others review the package.

R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
R/epichains.R Outdated Show resolved Hide resolved
@sbfnk
Copy link
Contributor

sbfnk commented Nov 30, 2023

If sticking with the current setup (which may well be best) then I think it would be a good idea for summary.epichains_tree() to return an <epichains_summary> as suggested by @pratikunterwegs in #79 (comment)

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented Nov 30, 2023

Looks good to me. I'm still slightly concerned if it's confusing to have both simulate_tree and simulate_summary. The main reason for simulate_summary to exist is that it's faster than simulate_tree and thus it's useful to have internally especially when approximating likelihoods where it may be run ~1,000,000s of times within e.g. an MCMC. But for the user I wonder if there is ever a situation where this would matter, and where we could otherwise have just a simulate function that returns the tree. But I'm not sure, and I'm not sure what the best solution is. Perhaps something to come out when others review the package.

I have a few thoughts on alternatives.

  1. I know having these two functions leads to the issue in Code duplication as a result of splitting chain_sim() #44 and the solution could be having simulate_summary() as just a summarising helper for the results of simulate/simulate_tree(). This might also affect our decision on whether to take on data.table as a dependency to fix the benchmarking issue in chain_sim uses inefficient rbinding of data frames #8 that is being fixed in not for merging: benchmark options for binding data frames in simulation #114. We could use <data.table> for the wrangling in the summary helper that I am proposing. Downside here is users will always have to simulate things they don't need using simulate() before obtaining the summaries.

  2. We could leave simulate_summary() but add the clarification, "The main reason for simulate_summary to exist is that it's faster than simulate_tree and thus it's useful to have internally especially when approximating likelihoods where it may be run ~1,000,000s of times within e.g. an MCMC". Downside is Code duplication as a result of splitting chain_sim() #44.

I'm inclined to go with this option 1.

@sbfnk
Copy link
Contributor

sbfnk commented Nov 30, 2023

I agree. One thing I've been wondering is how much of a speed difference there actually is. Perhaps we could add simulate_summary to the benchmark in #114?

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented Nov 30, 2023

Definitely much faster. See #114 (comment).

@sbfnk
Copy link
Contributor

sbfnk commented Dec 1, 2023

Interesting! Another option would be

  1. A kind of half-way house between the two: rename simulate_tree to simulate and make simulate_summary internal only and used in
    dist <- simulate_summary(
    which has the benefit that the user interface is simpler but the downside that it may make some use cases slower so the question is how much we believe that others will want to simulate summaries only e.g. for inference.

Still doesn't help with the code duplication unless there is some of it which could be turned into a function.

@sbfnk
Copy link
Contributor

sbfnk commented Dec 1, 2023

How about

  1. We keep things as they are but rename simulate_tree to simulate and then make summary(simulate) return the same type of output as simulate_summary? This would simplify the hierarchy and hopefully make it clear that simulate_summary is a faster shortcut to simulating the summary directly if that's all we care about. It doesn't help with Code duplication as a result of splitting chain_sim() #44 but perhaps we can just accept that this (bit of) duplication is unavoidable.

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented Dec 1, 2023

Am I wrong for interpreting #107 (comment) to mean option 4? 🤔

As in, "simulate()" returns an <epichains_tree> object and summary.epichains_tree returns an <epichains_summary> object, which is the same as simulate_summary() returns <epichains_summary>.

Also, will we be masking stats::simulate()?

@sbfnk
Copy link
Contributor

sbfnk commented Dec 1, 2023

Am I wrong for interpreting #107 (comment) to mean option 4? 🤔

Yes, they're basically the same (+rename)

Also, will we be masking stats::simulate()?

Ah yes. We could extend it (as it's a generic) but we'd have to construct an object to simulate from first.

Given all of this my inclination would be to just go ahead with #107 (comment) and keep simulate_tree as is after all. But the decision is not clear-cut. What do you think?

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented Dec 1, 2023

Given all of this my inclination would be to just go ahead with #107 (comment) and keep simulate_tree as is after all. But the decision is not clear-cut. What do you think?

Agreed. Looking at the stats::simulate() generic, it looks more useful as a "scenario" simulation function.

I will go ahead with #107 (comment).

Thanks for helping to brainstorm.

Copy link
Collaborator

@pratikunterwegs pratikunterwegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could leave simulate_summary() but add the clarification, "The main reason for simulate_summary to exist is that it's faster than simulate_tree and thus it's useful to have internally especially when approximating likelihoods where it may be run ~1,000,000s of times within e.g. an MCMC". Downside is #44.

Just looking into this PR as the instigator of #79 --- as an external developer, I haven't been able to understand the difference between simulate_tree() and simulate_summary(). Perhaps I'm confused because they both return objects with some shared inheritance, and I'm conditioned by R syntax to assume that a 'summary' is a condensed version of another object.

Would it help to rename simulate_summary() to say, sample_tree_metrics()? This would then reserve 'simulate' for code that gives the tree structure. However, running summary(simulate_tree()) should also give the same-ish output as sample_tree_metrics() for the same ntrees and offspring_dist.

Also, would it help to pick between 'tree' and 'chain' and use one, or do they mean different things?

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented Dec 4, 2023

Would it help to rename simulate_summary() to say, sample_tree_metrics()? This would then reserve 'simulate' for code that gives the tree structure. However, running summary(simulate_tree()) should also give the same-ish output as sample_tree_metrics() for the same ntrees and offspring_dist.

It's not really sampling. The current simulate_summary() function is running a stripped down version of simulate_tree() that only returns the chain_statistic of interest without tracking related information. Speaking out loud now, I think it's probably better to rename it to simulate_statistic(). I mentioned in the second bullet of #79 that the name simulate_summary() would need some reconsideration.

@jamesmbaazam jamesmbaazam merged commit e2c27a2 into main Dec 4, 2023
8 checks passed
@jamesmbaazam jamesmbaazam deleted the split-epichains-classes branch December 4, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants