[WIP] Refactoring / improvements of datasets #38

romainx · 2020-04-12T19:35:45Z

Hello,

This version is refactoring of some existing datasets.
Previous datasets have been kept as is for compatibility reason.
New datasets hold the same name with a suffix _td for tidy.

Global changes / rules

Import from tibble in order to get consistent behaviour regardless of whether or not tibble is attached.
Turning on markdown support for Roxygen, see Turning on markdown support
Include a Makefile for common tasks.
Run test in Travis CI (to be validated)
Try to enforce a standardization on column names for example date_of_onset everywhere, gender instead of sex, etc.
Try to keep data tidy.

Dengue & Zika datasets Funk et al. (2016)

Only kept original columns (removed nr column that can be computed).
Rename columns to try to stick to a standard.
Include the code used to generate datasets in data-raw/ as stated in R Packages book.
Include datasets tests through testthat.
Add licence in the documentation of the datasets.
Small refactoring of the Markdown documentation.

I've made this work in order to try to improve this package -- and also to learn a little bit more about R Data Packages. Does it make sense? Are you interested? If so I can continue with some of the other datasets in order to apply the same principles. 😊

Best

finlaycampbell · 2020-04-24T12:29:38Z

Hi Romain,
I'm so sorry for missing this pull request when you made it, I don't know how this happened!
Thank you very much for volunteering to improve this package - I think your suggestions are great, especially around standardising variable names. Really we should've enforced this from the beginning, having two copies of each dataset for the sake of backward compatibility is a bit annoying. If you're still happy to go through the other datasets (after being so rudely ignored), that would be much appreciated. Then we can also get a sense for how many variables names are actually changed before we merge this branch, and whether we want to consider breaking backwards compatibility for the sake of avoiding duplicated datasets. Thanks again for helping with this, it's much appreciated!

romainx · 2020-04-24T13:46:17Z

Hello @finlaycampbell,

No problem 😄 we have all the same trouble, we have many things to do and not much time.
I've closed this PR since I though it was finally useless.
But thanks for your feedback, really appreciate.

I will have a look on other datasets. The tradeoff between compatibility and duplication is not so easy. I was not very confortable with this topic, as far as I know there is no clear way to do that i R see this question on SO.

Conservative approach

The one I've drafted in this PR.

Pros: Backward compatible
Cons: Duplication, maintenance

Radical approach

The one described in SO.

Pros: Lean
Cons: Backward compatible

Intermediate approach

Keep the original dataset but do not export it anymore
Export the new dataset
Write a function to emulate the previous data interface and flag it as deprecated
Properly comment both

Pros: Lean
Cons: Additional work and still break the backward compatibility since package's user will have to turn dataset lazy loading into function call.

What is your opinion?

* Implemented another way of deprecating datasets * Added some information in NEWS.md about changes

romainx · 2020-04-25T09:52:50Z

Hello,

I've implemented a more sustainable approach: deprecating datasets by moving them to data-raw/_dep<dataset_name>.
I've also written in NEWS.md the main change proposal.

Please give me you opinion.

Best

romainx · 2020-04-25T10:50:25Z

I've drafted what I think is a far better way to deal with that.

# Default version
> head(dengue_fais_2011)

# A tibble: 6 x 2
#  date_of_onset incidence
#  <date>            <dbl>
# 1 2011-09-15            0
# 2 2011-09-22            0
# 3 2011-09-29            0

# Activate the compatibility mode with a proper deprecation warning
> legacy_mode()

# Loading objects:
#  dengue_fais_2011
# Warning message:
# This function replaces datasets with the previous version for compatibility reason 

# Back to previous version that have been simply moved to /data-raw
> head(dengue_fais_2011)

# A tibble: 6 x 3
#  onset_date    nr value
#  <date>     <dbl> <dbl>
# 1 2011-09-15     7     0
# 2 2011-09-22    14     0
# 3 2011-09-29    21     0

* Added tsibble capabilities * Fixed ebola_kikwit by turning not reported data into `NA`

romainx · 2020-04-27T07:36:28Z

Here is an example of the Time Series feature 😄

ebola_sierraleone_2014_as_ts() %>%
    group_by_key() %>%
    index_by(date = ~ yearweek(.)) %>%
    summarise(incidence = sum(incidence)) %>%
    ggplot(aes(x = date, y = incidence, color = status)) +
    geom_line() +
    geom_point()

finlaycampbell · 2020-04-27T08:55:22Z

Hi Romain,

Many thanks for all the work put into this! I'm a big fan of the legacy_mode() approach, I'll just have to discuss with others first if they're happy breaking backwards compatibility with the default outbreaks load - I'll get back to you. We should at least post a message when the package is loaded to indicate that some of the data has been refactored.

Regarding to as_ts() functions - they're very nice for analysis, but this package is really intended as a pure data repository, with the analysis lying in the hands of the user. Not all the data will necessarily be convertible to a timeseries and users might want to use other packages (including the RECON package incidence. I think it would be best to remove these functions, though I think it would be nice to include the timeseries code you've written in one of the examples (perhaps in ebola_sim, which we use as an example quite a lot). It's a very good use-case of the outbreaks package (I wasn't even aware of the tsibble package) so it's good to include, just probably not as a standalone set of functions!

Thanks again, let me know if you have any comments questions.

romainx · 2020-04-27T09:20:59Z

Hello,

Understood regarding the function related to the tsibble package, you're right. I will remove them (maybe put in on another branch waiting to add a page with the example). For information the tidyverts initiative is led (or at least sponsored) by Rob J Hyndman who has work a lot on time series and forecasting.

For the rest, just tell me if it's worth continuing the refactoring because it's a lot of work and I'm still not sure it's useful. Maybe people are happy with the datasets as they are and I will move to something else. In this case I will keep my branch as is and I will let you decide if you want to pick something.

Please let me know.

Many thanks.
Best.

* Removed everything related to TS * Status up to date in `NEWS.md`

finlaycampbell · 2020-04-27T10:28:09Z

Hey,

I'll check out the tidyverts initiative now, it looks nice!

Regarding the refactoring work, I definitely think this is definitely a project worth completing and improves the package functionality overall - but please only continue with it if you are happy to do so, especially if it's a lot of work.

After discussion we've decided the best approach is to default to legacy_mode() for the next 1-2 months, with a message upon package load stating that users can use the updated datasets by entering modern_mode() (placeholder function name for now) and that this will become the default soon. This will give developers and users some time to update their code. Thoughts?

romainx · 2020-04-27T12:35:09Z

Ok understood seems reasonable. I will take a break and then check how to enable the modern_mode versus legacy_mode.

Best.

romainx added 2 commits April 12, 2020 18:25

[WIP] First version of dengue / zika refactoring

61fe2e2

[WIP] Refactoring: Dengue & Zika datasets Funk et al. (2016)

14f75ad

romainx closed this Apr 23, 2020

finlaycampbell reopened this Apr 24, 2020

Lean deprecation

e34e53f

* Implemented another way of deprecating datasets * Added some information in NEWS.md about changes

romainx added 5 commits April 25, 2020 16:06

implementation of legacy mode

70efe6d

ebola_kikwit_1995

ee7d2ac

Timeseries implementation + fix ebola_kikwit

af82714

* Added tsibble capabilities * Fixed ebola_kikwit by turning not reported data into `NA`

ebola_sierraleone_2014

271886a

fluH7N9_china_2013

27088ee

Refactoring status

4ab86ed

* Removed everything related to TS * Status up to date in `NEWS.md`

romainx marked this pull request as draft May 8, 2020 04:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Refactoring / improvements of datasets #38

[WIP] Refactoring / improvements of datasets #38

romainx commented Apr 12, 2020

finlaycampbell commented Apr 24, 2020

romainx commented Apr 24, 2020

romainx commented Apr 25, 2020

romainx commented Apr 25, 2020 •

edited

Loading

romainx commented Apr 27, 2020

finlaycampbell commented Apr 27, 2020

romainx commented Apr 27, 2020

finlaycampbell commented Apr 27, 2020

romainx commented Apr 27, 2020

[WIP] Refactoring / improvements of datasets #38

Are you sure you want to change the base?

[WIP] Refactoring / improvements of datasets #38

Conversation

romainx commented Apr 12, 2020

Global changes / rules

Dengue & Zika datasets Funk et al. (2016)

finlaycampbell commented Apr 24, 2020

romainx commented Apr 24, 2020

Conservative approach

Radical approach

Intermediate approach

romainx commented Apr 25, 2020

romainx commented Apr 25, 2020 • edited Loading

romainx commented Apr 27, 2020

finlaycampbell commented Apr 27, 2020

romainx commented Apr 27, 2020

finlaycampbell commented Apr 27, 2020

romainx commented Apr 27, 2020

romainx commented Apr 25, 2020 •

edited

Loading