Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Refactoring / improvements of datasets #38

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

romainx
Copy link

@romainx romainx commented Apr 12, 2020

Hello,

This version is refactoring of some existing datasets.
Previous datasets have been kept as is for compatibility reason.
New datasets hold the same name with a suffix _td for tidy.

Global changes / rules

  • Import from tibble in order to get consistent behaviour regardless of whether or not tibble is attached.
  • Turning on markdown support for Roxygen, see Turning on markdown support
  • Include a Makefile for common tasks.
  • Run test in Travis CI (to be validated)
  • Try to enforce a standardization on column names for example date_of_onset everywhere, gender instead of sex, etc.
  • Try to keep data tidy.

Dengue & Zika datasets Funk et al. (2016)

  • Only kept original columns (removed nr column that can be computed).
  • Rename columns to try to stick to a standard.
  • Include the code used to generate datasets in data-raw/ as stated in R Packages book.
  • Include datasets tests through testthat.
  • Add licence in the documentation of the datasets.
  • Small refactoring of the Markdown documentation.

I've made this work in order to try to improve this package -- and also to learn a little bit more about R Data Packages. Does it make sense? Are you interested? If so I can continue with some of the other datasets in order to apply the same principles. 😊

Best

@romainx romainx closed this Apr 23, 2020
@finlaycampbell
Copy link
Collaborator

Hi Romain,
I'm so sorry for missing this pull request when you made it, I don't know how this happened!
Thank you very much for volunteering to improve this package - I think your suggestions are great, especially around standardising variable names. Really we should've enforced this from the beginning, having two copies of each dataset for the sake of backward compatibility is a bit annoying. If you're still happy to go through the other datasets (after being so rudely ignored), that would be much appreciated. Then we can also get a sense for how many variables names are actually changed before we merge this branch, and whether we want to consider breaking backwards compatibility for the sake of avoiding duplicated datasets. Thanks again for helping with this, it's much appreciated!

@romainx
Copy link
Author

romainx commented Apr 24, 2020

Hello @finlaycampbell,

No problem 😄 we have all the same trouble, we have many things to do and not much time.
I've closed this PR since I though it was finally useless.
But thanks for your feedback, really appreciate.

I will have a look on other datasets. The tradeoff between compatibility and duplication is not so easy. I was not very confortable with this topic, as far as I know there is no clear way to do that i R see this question on SO.

Conservative approach

The one I've drafted in this PR.

  • Pros: Backward compatible
  • Cons: Duplication, maintenance

Radical approach

The one described in SO.

  • Pros: Lean
  • Cons: Backward compatible

Intermediate approach

Keep the original dataset but do not export it anymore
Export the new dataset
Write a function to emulate the previous data interface and flag it as deprecated
Properly comment both

  • Pros: Lean
  • Cons: Additional work and still break the backward compatibility since package's user will have to turn dataset lazy loading into function call.

What is your opinion?

* Implemented another way of deprecating datasets
* Added some information in NEWS.md about changes
@romainx
Copy link
Author

romainx commented Apr 25, 2020

Hello,

I've implemented a more sustainable approach: deprecating datasets by moving them to data-raw/_dep<dataset_name>.
I've also written in NEWS.md the main change proposal.

Please give me you opinion.

Best

@romainx
Copy link
Author

romainx commented Apr 25, 2020

I've drafted what I think is a far better way to deal with that.

# Default version
> head(dengue_fais_2011)

# A tibble: 6 x 2
#  date_of_onset incidence
#  <date>            <dbl>
# 1 2011-09-15            0
# 2 2011-09-22            0
# 3 2011-09-29            0

# Activate the compatibility mode with a proper deprecation warning
> legacy_mode()

# Loading objects:
#  dengue_fais_2011
# Warning message:
# This function replaces datasets with the previous version for compatibility reason 

# Back to previous version that have been simply moved to /data-raw
> head(dengue_fais_2011)

# A tibble: 6 x 3
#  onset_date    nr value
#  <date>     <dbl> <dbl>
# 1 2011-09-15     7     0
# 2 2011-09-22    14     0
# 3 2011-09-29    21     0

@romainx
Copy link
Author

romainx commented Apr 27, 2020

Here is an example of the Time Series feature 😄

ebola_sierraleone_2014_as_ts() %>%
    group_by_key() %>%
    index_by(date = ~ yearweek(.)) %>%
    summarise(incidence = sum(incidence)) %>%
    ggplot(aes(x = date, y = incidence, color = status)) +
    geom_line() +
    geom_point()

Rplot

@finlaycampbell
Copy link
Collaborator

Hi Romain,

Many thanks for all the work put into this! I'm a big fan of the legacy_mode() approach, I'll just have to discuss with others first if they're happy breaking backwards compatibility with the default outbreaks load - I'll get back to you. We should at least post a message when the package is loaded to indicate that some of the data has been refactored.

Regarding to as_ts() functions - they're very nice for analysis, but this package is really intended as a pure data repository, with the analysis lying in the hands of the user. Not all the data will necessarily be convertible to a timeseries and users might want to use other packages (including the RECON package incidence. I think it would be best to remove these functions, though I think it would be nice to include the timeseries code you've written in one of the examples (perhaps in ebola_sim, which we use as an example quite a lot). It's a very good use-case of the outbreaks package (I wasn't even aware of the tsibble package) so it's good to include, just probably not as a standalone set of functions!

Thanks again, let me know if you have any comments questions.

@romainx
Copy link
Author

romainx commented Apr 27, 2020

Hello,

Understood regarding the function related to the tsibble package, you're right. I will remove them (maybe put in on another branch waiting to add a page with the example). For information the tidyverts initiative is led (or at least sponsored) by Rob J Hyndman who has work a lot on time series and forecasting.

For the rest, just tell me if it's worth continuing the refactoring because it's a lot of work and I'm still not sure it's useful. Maybe people are happy with the datasets as they are and I will move to something else. In this case I will keep my branch as is and I will let you decide if you want to pick something.

Please let me know.

Many thanks.
Best.

* Removed everything related to TS
* Status up to date in `NEWS.md`
@finlaycampbell
Copy link
Collaborator

Hey,

I'll check out the tidyverts initiative now, it looks nice!

Regarding the refactoring work, I definitely think this is definitely a project worth completing and improves the package functionality overall - but please only continue with it if you are happy to do so, especially if it's a lot of work.

After discussion we've decided the best approach is to default to legacy_mode() for the next 1-2 months, with a message upon package load stating that users can use the updated datasets by entering modern_mode() (placeholder function name for now) and that this will become the default soon. This will give developers and users some time to update their code. Thoughts?

@romainx
Copy link
Author

romainx commented Apr 27, 2020

Ok understood seems reasonable. I will take a break and then check how to enable the modern_mode versus legacy_mode.

Best.

@romainx romainx marked this pull request as draft May 8, 2020 04:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants