---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

library(ingestr)
```
<img src="man/figures/logo_ingestr.svg" align="right" width=150px>
# ingestr
__An R package for reading environmental data from raw formats into dataframes.__
[![Travis-CI Build Status](https://travis-ci.org/jpshanno/ingestr.svg?branch=master)](https://travis-ci.org/jpshanno/ingestr)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/jpshanno/ingestr?branch=master&svg=true)](https://ci.appveyor.com/project/jpshanno/ingestr)
[![Coverage Status](https://img.shields.io/codecov/c/github/jpshanno/ingestr/master.svg)](https://codecov.io/github/jpshanno/ingestr?branch=master)
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
This project was initiated at the inaugural [IMCR Hackathon](https://github.com/IMCR-Hackathon/HackathonCentral) hosted by the [Environmental Data Institute](https://environmentaldatainitiative.org/). The end product of this effort will be an R package on CRAN. The package will primarily deal with reading data from files, though there will be some utilities for initial cleanup of files such as removing blank rows and columns at the end of a CSV file. Our work at the hackathon focused on package infrastructure, standardization, and template construction.
The guiding principles of ingestr are that
1. All sources of environmental-related data should be easy to read directly
2. Reading in data should provide a standard output
3. Header information contained within sensor data files should be stored in a standard, easily readable format
4. Associating imported data with its original source is the first step towards good data provenance records and reproducibility
5. We don't know about every common sensor, and we welcome contributions of code or example files for sensors that need support. See [issues](https://github.com/jpshanno/ingestr/issues) to submit an example data file, and see our [contributing guide](https://jpshanno.github.io/ingestr/CONTRIBUTING) to contribute code.
## Installation
You can install ingestr from GitHub with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("jpshanno/ingestr")

# or from source
ingestr_source <- file.path(tempdir(), "ingestr-master.zip")

download.file("https://github.com/jpshanno/ingestr/archive/master.zip",
              ingestr_source)

unzip(ingestr_source,
      exdir = dirname(ingestr_source))

install.packages(sub(".zip$", "", ingestr_source),
                 repos = NULL,
                 type = "source")
```
## Ingesting Data
Each ingestr function that reads in data starts with `ingest_` to make autocomplete easier. We are targeting any source of environmental data that returns data in a standard format: native sensor files, delimited outputs, HTML tables, PDF tables, Excel sheets, API returns, and more.
Running any ingest function will read in the data and format it into a clean R data.frame. Column names are taken directly from the data file, and users have the option to read the header information into a temporary file that can then be loaded using `ingest_header()`.

All data and header data that are read in will have the data source appended as a column called `input_source`.
### Sensor and Instrument Data
Many sensors provide their output as delimited files with header information contained above the recorded data.
```{r example, eval = FALSE}
campbell_file <-
  system.file("example_data",
              "campbell_scientific_tao5.dat",
              package = "ingestr")

campbell_data <-
  ingest_campbell(input.source = campbell_file,
                  export.header = TRUE,
                  add.units = TRUE,
                  add.measurements = TRUE)

str(campbell_data)

campbell_header <-
  ingest_header(input.source = campbell_file)

str(campbell_header)
```
### Formatted Non-Sensor Data Sources
Some environmental data are published online as HTML tables. These data can be difficult to read into R directly from the websites where they are hosted. To facilitate access, we have created functions that parse the HTML so that the data can be downloaded directly in R. To track the provenance of these data, the column `input_source` is populated with the URL from which the data were downloaded.
```{r example2}
PDO_data <-
  ingest_PDO(input.source = "http://jisao.washington.edu/pdo/PDO.latest",
             end.year = NULL,
             export.header = TRUE)

str(PDO_data)

PDO_header <-
  ingest_header(input.source = "http://jisao.washington.edu/pdo/PDO.latest")

str(PDO_header)
```
### Batch Ingests
Sensor data stored in folders can be batch imported using `ingest_` functions, or any other function that reads in data (e.g. `read.csv` or `readr::read_csv`). When a directory is read in, file names are checked for duplicates, and imported data are checked for duplicate file contents. The user is warned and can choose to suppress the warning or remove the duplicates.
```{r eval = FALSE}
temperature_data <-
  ingest_directory(directory = "campbell_loggers",
                   ingest.function = ingest_campbell,
                   pattern = ".dat")

temperature_data
```
### Incorporate File Naming Conventions as Data
Filenames generally include information about the data collected: site, sensor, measurement type, date collected, etc. We are working on a generalized approach (probably just a function or two) that would split the filename into data columns using a template.
For example, if a set of file names reads as "site-variable-year" (152-soil_moisture-2017.csv, 152-soil_temperature-2017.csv, 140-soil_moisture-2017.csv, etc.), then the function would take an argument supplying the template as column headers, "site-variable-year", along with either delimiters or the length of each variable to enable splitting. These functions will likely build on the great work done in `tidyr::separate()`, and we suggest using that until we have incorporated a solution; a sketch follows below.
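As an interim approach, here is a minimal sketch using `tidyr::separate()`. The file names are hypothetical, and `filename_df` is just an illustrative name, not part of the ingestr API:

```{r filename-template, eval = FALSE}
library(tidyr)

# Hypothetical files following the "site-variable-year" template
filename_df <-
  data.frame(input_source = c("152-soil_moisture-2017.csv",
                              "152-soil_temperature-2017.csv",
                              "140-soil_moisture-2017.csv"))

# Drop the file extension, then split on the "-" delimiter
filename_df$input_source <-
  sub("\\.csv$", "", filename_df$input_source)

separate(filename_df,
         col = input_source,
         into = c("site", "variable", "year"),
         sep = "-")
```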
## Preliminary Clean-up Utilities
Basic data cleaning utilities will be included in ingestr. These will include identifying duplicate rows, empty rows, empty columns, and columns that contain suspicious entries (e.g. "."). These utilities will be able to flag or correct the problems depending upon user preference. In keeping with our commitment to data provenance and reproducibility, all cleaning utilities will provide a record of identified and corrected issues which can be saved by the user and stored with the final dataset.
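None of these utilities exist in ingestr yet; a minimal base-R sketch of the kinds of checks we have in mind (the file name is hypothetical):

```{r cleanup-sketch, eval = FALSE}
# Illustrative only; these checks are not yet implemented in ingestr
raw_data <- read.csv("raw_data.csv", stringsAsFactors = FALSE)

duplicate_rows  <- which(duplicated(raw_data))                         # exact row-level duplicates
empty_rows      <- which(rowSums(!is.na(raw_data) & raw_data != "") == 0)
empty_cols      <- which(colSums(!is.na(raw_data) & raw_data != "") == 0)
suspect_entries <- which(raw_data == ".", arr.ind = TRUE)              # placeholder entries like "."
```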
## QAQCR (quacker)
While ingestr is focused on getting data into R and running preliminary checks, another group at the IMCR Hackathon focused on quality assurance checks for environmental data. [qaqcr](https://github.com/IMCR-Hackathon/qaqc_tools) provides a simple, standard way to apply the quality control checks that are applicable to your data.
These packages are the start of a larger ecosystem, including [EMLassemblyline](https://github.com/EDIorg/EMLassemblyline) for environmental data management, aimed at a convenient, reproducible workflow moving from raw data to archived datasets with rich EML metadata.
```{r eval=FALSE, include=FALSE}
We're just getting started, so expect things to break!
# Reading in Files
Scientific data files are produced in many formats by many means. Here's what's on our radar.
* Sensors
* Solinst
* iButton
* EGM4 - todo
* Hobo - todo
* YSI - todo
* others?
* Instrument Reports
* Shimadzu
* Horiba
* Plate reader
* Non-sensor-originated data, organized by data source
* HTML
* https://www.esrl.noaa.gov/psd/enso/mei/table.html
* http://research.jisao.washington.edu/pdo/PDO.latest
* PDF
* NetCDF
* CF-compliant
* Non-CF-compliant
* Excel/Data notebook
* Text/CSV/ASCII
* Databases
# Helper Utilities
The package should be able to parse a single file or all files in a folder or zip file. If batch reading files then files should be checked for duplicate contents.
Assuming the user has set up several scripts in a folder for batch processing files, the package should support batch running all scripts in that folder.
The package should be able to extract information from filenames, such as station, date, and variable, into columns within the data frame. For example, if a set of file names reads as "site-variable-year" (152-soil_moisture-2017, 152-soil_temperature-2017, 140-soil_moisture-2017, etc.), then the function would take an argument supplying the template as column headers: "site-variable-year".
The package should be able to split a single column in the original data into multiple columns in the data frame, a la [tidyr](http://tidyr.tidyverse.org/).
# Cleanup Utilities
These are cleanup utilities that make sense to include in the data ingestion step.
* Remove blank rows and columns
* Find exact duplicates at the row level and flag or delete them
* Put datetimes in standard format.
* ISO example datetime: 2018-06-12T16:33-06
# Provenance
Any function we make should record the source file as part of the data.
If data cleaning is performed, a separate data frame is output with three columns: the original filename, the line of text or data from the original file that was cleaned or removed, and the reason.
```