[WIP] Refactoring / improvements of datasets #38

Draft
wants to merge 9 commits into
base: master
4 changes: 3 additions & 1 deletion .Rbuildignore
@@ -5,4 +5,6 @@ README.Rmd
docs/
raw/
^appveyor\.yml$
logo/*
logo/*
^Makefile$
^data-raw$
5 changes: 4 additions & 1 deletion .travis.yml
@@ -1,6 +1,9 @@
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r

language: R
language: r
sudo: false
cache: packages
warnings_are_errors: false

after_success:
- make test
6 changes: 4 additions & 2 deletions DESCRIPTION
@@ -18,7 +18,9 @@ Depends:
License: GPL (>=2)
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
Suggests: testthat, covr, ape, incidence
RoxygenNote: 7.1.0
Roxygen: list(markdown = TRUE)
Suggests: testthat, covr, ape, incidence, dplyr, lubridate, devtools, tidyr, magrittr, forcats
URL: https://github.com/reconhub/outbreaks
BugReports: https://github.com/reconhub/outbreaks/issues
Imports: tibble
32 changes: 32 additions & 0 deletions Makefile
@@ -0,0 +1,32 @@
# h/t to @jimhester and @yihui for this parse block:
# https://github.com/yihui/knitr/blob/dc5ead7bcfc0ebd2789fe99c527c7d91afb3de4a/Makefile#L1-L4
# Note the portability change as suggested in the manual:
# https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages
PKGNAME = `sed -n "s/Package: *\([^ ]*\)/\1/p" DESCRIPTION`
PKGVERS = `sed -n "s/Version: *\([^ ]*\)/\1/p" DESCRIPTION`


all: check

build:
	R CMD build .

check: build
	R CMD check --no-manual $(PKGNAME)_$(PKGVERS).tar.gz

install_deps:
	Rscript \
	-e 'if (!requireNamespace("remotes")) install.packages("remotes")' \
	-e 'remotes::install_deps(dependencies = TRUE)'

install: install_deps build
	R CMD INSTALL $(PKGNAME)_$(PKGVERS).tar.gz

clean:
	@rm -rf $(PKGNAME)_$(PKGVERS).tar.gz $(PKGNAME).Rcheck

test:
	Rscript -e "devtools::test()"

document:
	Rscript -e "devtools::document(roclets = c('rd', 'collate', 'namespace'))"
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -1,2 +1,4 @@
# Generated by roxygen2: do not edit by hand

export(legacy_mode)
importFrom(tibble,tibble)
123 changes: 123 additions & 0 deletions NEWS.md
@@ -1,3 +1,126 @@
outbreaks 2.0.0 (to be released)
================================

This version is a **refactoring of existing datasets**.
Previous datasets have been moved to `/data-raw` and prefixed with `leg_` (for legacy).

Note: If you need the previous version of the datasets for compatibility reasons,
you can load them by running `legacy_mode()`.

### Global changes

* **Technical changes**
  * tibble
    * Convert each dataset to `tibble`.
    * Import from `tibble` in order to get consistent behaviour regardless of whether or not `tibble` is attached.
  * Tests
    * Include a test for each dataset through [testthat](https://testthat.r-lib.org/).
    * Run tests in Travis CI.
    * Structure tests: Test if the dataset structure (format) is correct
    * Data tests: Test if the dataset data (content) is correct
    * TODO: Some tests could be factored out; see http://r-pkgs.had.co.nz/tests.html
  * Include a `Makefile` for common tasks.
* **Documentation**
  * Turn on markdown support for Roxygen; see [Turning on markdown support](https://roxygen2.r-lib.org/articles/rd-formatting.html#turning-on-markdown-support).
  * The detailed description of each dataset is included in the "Details" section (instead of "Format").
  * A dedicated "Licence" section states the licence clearly when it is known -> [#7](https://github.com/reconhub/outbreaks/issues/7).
  * Define `@family` with the disease name in order to find similar datasets easily.
* **Dataset structure**
  * Variables are in lower case.
  * Stick to **tidy data** principles.
  * Some common variables
    * `id`: Unique identification
    * `age`: Age of individual
    * `date_of_onset` (`Date`): Date of symptom onset
    * `date_of_report` (`Date`): Date of reporting
    * `gender`: Gender of individual as a factor with two values ("male", "female")
    * `incidence` (`integer`): Incidence is given as the number of new cases reported
    * `outcome`:
    * `age_group`: Age grouping
    * `geo`: Geographical coordinates (must be two columns)
* **Process**
  * Write a "contributing guide" -> #TODO
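
A structure test along the lines described above could be sketched as follows. This is a minimal base R illustration, not the tests shipped in this PR: `check_structure()` is a hypothetical helper, the two data frames are toy data, and in the package the same checks would presumably be written as testthat expectations.

```R
# Hypothetical helper (not part of the package): returns a character vector
# describing every violation of the common conventions listed above.
check_structure <- function(x) {
  nms <- names(x)
  problems <- character(0)
  # Variables are in lower case.
  if (!identical(nms, tolower(nms))) {
    problems <- c(problems, "variable names must be lower case")
  }
  # Date variables must be stored with class Date.
  for (col in intersect(nms, c("date_of_onset", "date_of_report"))) {
    if (!inherits(x[[col]], "Date")) {
      problems <- c(problems, paste0("`", col, "` must be a Date"))
    }
  }
  # `incidence` must be an integer count.
  if ("incidence" %in% nms && !is.integer(x[["incidence"]])) {
    problems <- c(problems, "`incidence` must be an integer")
  }
  problems
}

# Toy data, not taken from the package.
good <- data.frame(
  date_of_onset = as.Date(c("2011-09-15", "2011-09-22")),
  incidence = c(1L, 4L)
)
bad <- data.frame(
  Onset_Date = c("2011-09-15", "2011-09-22"),
  incidence = c(1, 4)
)

check_structure(good)  # no problems reported
check_structure(bad)   # mixed-case name and non-integer incidence
```

Returning the list of problems instead of stopping at the first one makes it easier to report everything that is wrong with a dataset in a single test run.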

### Dengue & Zika datasets, Funk et al. (2016)

Datasets: `dengue_fais_2011`, `dengue_yap_2011`, `zika_yap_2007`.

* **Technical changes**
  * Include the code used to download the source file and generate the datasets in `data-raw/`, as recommended in the *[R Packages](http://r-pkgs.had.co.nz/data.html)* book.
* **Format**
  * `onset_date` -> `date_of_onset` (standardization)
  * `nr` -> Removed since it can be computed
  * `value` -> `incidence` (`integer`)
* **Documentation**
  * Apply general documentation rules
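
The format changes above can be illustrated with a small base R sketch. The data frame `leg` is toy data in the legacy layout with made-up values, and the code is an assumption about how such a conversion could look, not the script stored in `data-raw/`.

```R
# Toy data in the legacy layout (values are made up).
leg <- data.frame(
  onset_date = as.Date("2011-09-15") + 7 * 0:2,  # weekly onset dates
  nr = c(0, 7, 14),                              # days after starting date
  value = c(1, 4, 9)                             # number of cases
)

# New layout: standardized names, integer incidence, `nr` dropped.
new <- data.frame(
  date_of_onset = leg$onset_date,
  incidence = as.integer(leg$value)
)

# `nr` can be recomputed whenever it is needed:
# the number of days since the first onset date.
nr <- as.numeric(new$date_of_onset - min(new$date_of_onset))
```

Because `nr` is fully determined by `date_of_onset`, dropping it loses no information.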

### Ebola in Kikwit, Democratic Republic of the Congo, 1995

Datasets: `ebola_kikwit_1995`

*Source data not available.*

The dataset was sparse: no events are recorded when `reporting` is `FALSE`, as the summary below shows, so:

* Replace data with `NA` when `reporting` is `FALSE`

```R
library(dplyr)  # provides %>%, group_by() and summarise_if()
leg_ebola_kikwit_1995 %>% group_by(reporting) %>% summarise_if(is.numeric, sum)

# A tibble: 2 x 3
# reporting onset death
# <lgl> <int> <int>
# 1 FALSE 0 0
# 2 TRUE 292 236
```

* **Format**
  * Replace values where `reporting == FALSE` with `NA`
  * `date` -> `date_of_onset` (standardization)
  * `onset` -> `incidence` (standardization)
  * `death` (no change)
  * `reporting` -> removed

* **Documentation**
  * Apply general documentation rules
  * Fix the reported number of cases: 292, not 291
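
As a rough illustration of the format changes above, the following base R sketch converts a few toy rows in the legacy layout. The counts are made up and the code is an assumption about how the conversion could be done, not the script used for this PR.

```R
# Toy rows in the legacy layout (made-up counts).
leg <- data.frame(
  date = as.Date("1995-01-06") + 0:3,
  onset = c(0L, 0L, 2L, 1L),
  death = c(0L, 0L, 1L, 0L),
  reporting = c(FALSE, FALSE, TRUE, TRUE)
)

# New layout: counts become NA on days without reporting,
# columns are renamed and `reporting` is dropped.
new <- data.frame(
  date_of_onset = leg$date,
  incidence = ifelse(leg$reporting, leg$onset, NA_integer_),
  death = ifelse(leg$reporting, leg$death, NA_integer_)
)
```

Since the legacy counts are all zero on non-reporting days, the totals are unchanged when summing with `na.rm = TRUE`.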

### Ebola in Sierra Leone, 2014

Datasets: `ebola_sierraleone_2014`

* **Format**
  * `id` (no change)
  * `age` (no change)
  * `sex` -> `gender`, with renamed factor levels
  * `status` (no change)
  * `date_of_onset` (no change)
  * `date_of_sample` (no change)
  * `district` (no change)
  * `chiefdom` (no change)
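
The `sex` -> `gender` change could look like the sketch below. The legacy level names `"F"` and `"M"` are an assumption made for illustration; the target levels `"male"` and `"female"` follow the common variables listed under "Global changes".

```R
# Toy legacy column; the level names "F" and "M" are an assumption.
leg <- data.frame(id = 1:3, sex = factor(c("F", "M", "F")))

# Rename the column and relabel the factor levels in one step.
new <- data.frame(
  id = leg$id,
  gender = factor(leg$sex, levels = c("M", "F"), labels = c("male", "female"))
)

levels(new$gender)
```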

### Influenza A H7N9 in China, 2013

Datasets: `fluH7N9_china_2013`

* **Format**

  * `case_id` -> `id`
  * `date_of_onset` (no change)
  * `date_of_hospitalisation` (no change)
  * `date_of_outcome` (no change)
  * `outcome` (no change)
  * `gender` (renamed factor levels)
  * `age` (no change)
  * `province` (no change)

### References

* [Tidy Data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
* [Data chapter in the book *R packages*](http://r-pkgs.had.co.nz/data.html)
* [Taking your data to go with R packages](https://www.davekleinschmidt.com/r-packages/)

outbreaks 1.8.0 (2020-02-13)
==================

34 changes: 18 additions & 16 deletions R/dengue_fais_2011.R
@@ -1,39 +1,41 @@
#' Dengue on the island of Fais, Micronesia, 2011
#' Dengue on the island of Fais, Micronesia, 2011 (new version)
#'
#' These data describe incidence of clincal cases of Dengue
#' These data describe incidence of clinical cases of Dengue
#' on the island of Fais, Micronesia.
#'
#' @docType data
#'
#' @format {
#' A data frame with 57 rows and 3 columns
#' \describe{
#' \item{onset_date}{Date}
#' \item{nr}{Days after starting date}
#' \item{value}{Number of cases}
#' }
#'
#' The data on Dengue incidence reported by Funk et al. (2016) cover the period
#' from 2011-09-15 to 2012-02-14, over which time a total of 157 clinical cases
#' were reported among 294 residents. The first reported case is thought to be
#' the index case. The population of Fais is concentrated in a single population
#' centre, and is thought to have been immunologically naive at the time of
#' infection.
#'
#' # Licence
#' [CC BY](https://creativecommons.org/licenses/by/4.0/)
#'
#' @docType data
#'
#' @format A data frame with 57 rows and 2 columns
#' \describe{
#' \item{date_of_onset}{First day of the onset week}
#' \item{incidence}{Incidence is given as the number of new cases reported in the week beginning at \code{date_of_onset}}
#' }
#'
#' @rdname dengue_fais_2011
#'
#' @author Data from Funk et al. (2016), provided by Sebastian Funk (github.com/sbnfunk).
#' Transfer to R and documentation by Finlay Campbell (\email{finlaycampbell93@@gmail.com}).
#' @author
#' * Data from Funk et al. (2016), provided by [Sebastian Funk](https://github.com/sbnfunk).
#' * Transfer to R and documentation by Finlay Campbell (\email{finlaycampbell93@@gmail.com}).
#' * Refactoring by [Romain](https://github.com/romainx).
#'
#' @source Funk et al. (2016)
#'
#' @references
#'
#' S. Funk, et al. 2016. Comparative Analysis of Dengue and Zika Outbreaks Reveals
#' Differences by Setting and Virus. PLOS Neglected Tropical Diseases, 10(12),
#' e0005173. http://doi.org/10.1371/journal.pntd.0005173
#' e0005173. <http://doi.org/10.1371/journal.pntd.0005173>
#'
#' @family dengue
#'
#' @examples
#' ## show first few weeks of Dengue incidence
39 changes: 21 additions & 18 deletions R/dengue_yap_2011.R
@@ -1,18 +1,8 @@
#' Dengue on the Yap Main Islands, Micronesia, 2011
#' Dengue on the Yap Main Islands, Micronesia, 2011 (new version)
#'
#' These data describe incidence of clincal cases of Dengue
#' These data describe incidence of clinical cases of Dengue
#' on the Yap Main Islands, Micronesia.
#'
#' @docType data
#'
#' @format {
#' A data frame with 185 rows and 3 columns
#' \describe{
#' \item{onset_date}{Date}
#' \item{nr}{Days after starting date}
#' \item{value}{Number of cases}
#' }
#'
#' The data on Dengue incidence reported by Funk et al. (2016) cover the period
#' from 2011-07-07 to 2012-04-12, over which time a total of 978 cases were
#' reported among 7391 residents. Suspected Dengue cases were identified by the
@@ -22,23 +12,36 @@
#' reverse transcriptase polymerase chain reaction by the CDC Dengue Branch,
#' Puerto Rico.
#'
#' # Licence
#' [CC BY](https://creativecommons.org/licenses/by/4.0/)
#'
#' @docType data
#'
#' @format A data frame with 185 rows and 2 columns
#' \describe{
#' \item{date_of_onset}{First day of the onset week}
#' \item{incidence}{Incidence is given as the number of new cases reported in the week beginning at \code{date_of_onset}}
#' }
#'
#' @rdname dengue_yap_2011
#'
#' @author Data from Funk et al. (2016), provided by Sebastian Funk (github.com/sbnfunk).
#' Transfer to R and documentation by Finlay Campbell (\email{finlaycampbell93@@gmail.com}).
#' @author
#' * Data from Funk et al. (2016), provided by [Sebastian Funk](https://github.com/sbnfunk).
#' * Transfer to R and documentation by Finlay Campbell (\email{finlaycampbell93@@gmail.com}).
#' * Refactoring by [Romain](https://github.com/romainx).
#'
#' @source Funk et al. (2016)
#' @source
#' Funk et al. (2016)
#'
#' @references
#'
#' S. Funk, et al. 2016. Comparative Analysis of Dengue and Zika Outbreaks Reveals
#' Differences by Setting and Virus. PLOS Neglected Tropical Diseases, 10(12),
#' e0005173. http://doi.org/10.1371/journal.pntd.0005173
#' e0005173. <http://doi.org/10.1371/journal.pntd.0005173>
#'
#' @family dengue
#'
#' @examples
#' ## show first few weeks of Dengue incidence
#' # show first few weeks of Dengue incidence
#' head(dengue_yap_2011)
#'
"dengue_yap_2011"
36 changes: 20 additions & 16 deletions R/ebola_kikwit_1995.R
@@ -3,36 +3,40 @@
#' These data comprise new cases of Ebola haemorrhagic fever in Kikwit,
#' Democratic Republic of the Congo.
#'
#' @docType data
#'
#' @format {
#' A data frame with 192 rows and 4 columns
#' \describe{
#' \item{date}{Date}
#' \item{onset}{Number of new cases}
#' \item{deaths}{Number of deaths per day}
#' \item{reporting}{Whether data were reported on a daily basis}
#' }
#'
#' The data on daily cases reported by Khan et al. (1999) cover the period 1995-03-01 to 1995-07-16,
#' over which time there were 291 cases and 236 deaths. The first case became ill on
#' over which time there were 292 cases and 236 deaths. The first case became ill on
#' 1995-01-06, which is taken as the first timepoint in this version of the data. Over the entire period,
#' there were 316 cases i.e. the onset times are not reported for 24 individuals, and the recovery times
#' for the individuals who did not die are not reported.
#'
#' # Licence
#' **Unknown**
#'
#' @docType data
#'
#' @format A data frame with 192 rows and 3 columns
#' \describe{
#' \item{date_of_onset}{Date of onset}
#' \item{incidence}{Number of new cases}
#' \item{death}{Number of deaths per day}
#' }
#'
#' @rdname ebola_kikwit_1995
#'
#' @author Data from Khan et al. (1999), provided by T.J. McKinley.
#' Transfer to R and documentation by Simon Frost (\email{sdwfrost@@gmail.com}).
#' @author
#' * Data from Khan et al. (1999), provided by T.J. McKinley.
#' * Transfer to R and documentation by Simon Frost (\email{sdwfrost@@gmail.com}).
#' * Refactoring by [Romain](https://github.com/romainx).
#'
#' @source Khan et al. (1999)
#' @source
#' Khan et al. (1999)
#'
#' @references
#'
#' A.S. Khan, et al. 1999. The reemergence of Ebola hemorrhagic fever,
#' Democratic Republic of the Congo, 1995. J Infect Dis 179:S76-S86.
#'
#' @family ebola
#'
#' @examples
#' ## show first few cases
#' head(ebola_kikwit_1995)