
# Preparing data {#prep-data}

The purpose of Chapter \@ref(prep-data) is to introduce you to the basic workflow for preparing data for OHI. This is a 2-hour hands-on training: you will be following along on your own computer and working with a copy of the demonstration repository (_toolbox-demo_) that is used throughout this chapter.


## Overview

Preparing data takes the biggest chunk of time when you're using the OHI Toolbox, even more than the final scores calculation itself. That's when you explore raw data, see whether it fits with your ideal spatial boundaries and whether it makes sense to include it in your final calculations. If it does, you can format it further and save it as data layers, or Toolbox inputs for scores calculations.

The _Starter Repo_ (see \@ref(toolbox-ecosystem)) aims to help you work through these important first steps of the assessment. Treat the preparatory, or "prep", files as your notebook, calculator, and presentation of your work. Here, the process of data exploration is recorded and can be easily shared with anyone.

<!-- MOVE THESE TO ECOSYSTEM SECTIONS? **Why prepare data in the Toolbox?** This is for transparency, reproducibility, and credibility. If someone wants to understand what data you're using, how you have processed it, and your rationale for including it or excluding other options, they can find the answers here. With this reproducible record, you will be able to repeat these methods with additional years of data using the same code you have already written. It also makes your technical team more stable. If there are personale changes, your team memory is preserved. It's easier for new members to pick up where it was left when your data preparation has been documented clearly and kept in one place.  -->

<!-- **Why R Markdown?** R Markdown produces a report with your comments, `R` code, and figures/tables embedded in it. It can be rendered as a nice HTML webpage that is easy to update and share with other people. We highly recommend it to anyone who is interested in open science, especially OHI+ users. -->

In this chapter, we will go through the basic requirements and steps of preparing a data layer: 

- understand the formatting requirements
- try a hands-on tutorial in which you can use sample data to format and save
- register a data layer

### Prerequisites

Before the training starts, please make sure you have done the following: 

1. Have up-to-date versions of `R` and RStudio and have RStudio configured with Git/GitHub 
    - https://cloud.r-project.org
    - http://www.rstudio.com/download 
    - http://happygitwithr.com/rstudio-git-github.html 
    
1. Fork the **toolbox-demo** repository into your own GitHub account by going to https://github.com/OHI-Science/toolbox-demo, clicking "Fork" in the upper right corner, and selecting your account. Note: if you already have an OHI+ repository of your own, you can use it instead. All of the file names and the architecture are the same.

1. Clone the **toolbox-demo** repo from your GitHub account into RStudio, into a folder called "github" in your home directory (filepath "~/github"). **Note**: If you've already forked and cloned the demo, you can instead pull from GitHub to make sure you have the most recent versions of all the files. 

1. Get comfortable: be set up with two screens if possible. You will be following along in RStudio on your own computer while also watching an instructor's screen or following this tutorial.

After your `toolbox-demo` repository is set up, open the `prep` folder. It should look like this: 

![](https://docs.google.com/drawings/d/1JOeYc-4Izf3h9FHPKQhkTou2uZgOB8dh0melYhV5s44/pub?w=450)

Each goal and sub-goal has its own sub-folder, so you can store raw and intermediate data as you work in R Markdown (or `R`). It is highly recommended that data preparation occurs in the `prep` folder as much as possible, as it will also be archived by GitHub for future reference. 

But before we start preparing a data layer, let's first go over the data formatting requirements. 

## Data formatting requirements

A data layer is a data file used to calculate scores. The global analysis included over 100 data layer files, and there will probably be as many in your own assessments. Each data layer (data input) **has its own _.csv_ file**. These data layers are combined to calculate goal scores, meaning that they are inputs for status, trend, pressures, and resilience.

The OHI Toolbox expects each _.csv_ file to have:

- data for every region within the study area
- a unique region identifier (*rgn_id*) associated with a single score or value.
- data organized in 'long' format - as few columns as possible. (_You can read more about long formatting [here](http://ohi-science.org/manual/#long-formatting)_) 

**OHI goal scores are calculated at the scale of the reporting unit**, which is called a ‘**region**’ and then combined to produce the score for the overall area assessed, called a ‘**study area**’. For example, the _U.S._ is a _study area_, and each _coastal state_ is a _region_. 

In addition, to calculate trend, input data should be available as a time series for **at least 5 recent years** - and the longer the better, as this can be used in setting temporal reference points.
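
The requirements above can be checked in a few lines of `R` before a layer is saved. The sketch below is only an illustration: `check_layer()` is a hypothetical helper (it is not part of the Toolbox), the data are invented, and "at least 5 years" is simplified to "at least 5 distinct years per region".

```r
# Hypothetical helper: check a long-format layer against the requirements above.
# `rgn_ids` is the full set of region identifiers in the study area.
check_layer <- function(layer, rgn_ids, min_years = 5) {
  problems <- character(0)

  # Every region in the study area must appear in the layer
  missing_rgns <- setdiff(rgn_ids, layer$rgn_id)
  if (length(missing_rgns) > 0) {
    problems <- c(problems,
                  paste("missing regions:", paste(missing_rgns, collapse = ", ")))
  }

  # For trend, each region needs at least `min_years` distinct years of data
  if ("year" %in% names(layer)) {
    n_years <- tapply(layer$year, layer$rgn_id, function(y) length(unique(y)))
    short <- names(n_years)[n_years < min_years]
    if (length(short) > 0) {
      problems <- c(problems,
                    paste("fewer than", min_years, "years for regions:",
                          paste(short, collapse = ", ")))
    }
  }

  problems
}

# Invented example: 4 regions with 5 years of data each
layer <- expand.grid(rgn_id = 1:4, year = 2010:2014)
layer$count <- seq_len(nrow(layer))

check_layer(layer, rgn_ids = 1:4)                        # nothing to report
check_layer(layer[layer$rgn_id != 3, ], rgn_ids = 1:4)   # region 3 is flagged as missing
```

An empty result means the layer passed these particular checks; it does not replace looking at the data yourself.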

>Finalized data layers have at least two columns: the `rgn_id` column and a column with data identified by its units (e.g. _km2_ or _score_). There is often also a `year` column or a `category` column (for natural product categories or habitat types).

Below are examples of two different data layer files: tourism count (`tr_total.csv`) and natural products harvested (`np_harvest_tonnes.csv`). They show information for a study area with 4 regions, each with multiple years of data. The second data layer has an additional category column for the different types of natural products that were harvested. 

![](https://docs.google.com/drawings/d/1Hydt117b7mFAcq4wkmdGKfPWq2Kn_1EVpfyL-zkzx-o/pub?w=500)
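
Raw data often arrive in 'wide' format, with one column per year, and need to be reshaped before they look like the layers above. Here is a minimal sketch using base `R`'s `reshape()`; the numbers and the `count` column name are invented for illustration, and `tidyr::pivot_longer()` is a common alternative if you work with the tidyverse.

```r
# Invented 'wide' data: one row per region, one column per year
wide <- data.frame(
  rgn_id = 1:4,
  "2012" = c(10, 20, 30, 40),
  "2013" = c(11, 21, 31, 41),
  "2014" = c(12, 22, 32, 42),
  check.names = FALSE   # keep the year column names exactly as written
)

year_cols <- setdiff(names(wide), "rgn_id")

# Stack the year columns into a single `count` column plus a `year` column
long <- reshape(
  wide,
  direction = "long",
  idvar     = "rgn_id",
  varying   = year_cols,
  v.names   = "count",
  timevar   = "year",
  times     = as.integer(year_cols)
)

# Tidy up the ordering and the row names that reshape() generates
long <- long[order(long$rgn_id, long$year), ]
rownames(long) <- NULL

long
```

The result has one row per region and year, with the columns `rgn_id`, `year`, and `count`.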

In this example, the two data layers are appropriate for _status calculations_ with the Toolbox because:

1. At least five years of data are available
2. There are no data gaps
3. Data are presented in 'long' or 'narrow' format 

_When data gaps, temporal or spatial, are inevitable, various gap-filling techniques can be used. We won't go through the details here. For more information and examples, visit the [OHI Manual](http://ohi-science.org/manual/#gapfilling)._ 
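
To give a flavor of what temporal gap-filling can look like, here is a small sketch that fills one missing year by linear interpolation with base `R`'s `approx()`. This is just one simple option with invented data, not necessarily the approach the OHI Manual recommends for your case.

```r
# Invented time series for one region, with a missing value in 2012
region <- data.frame(
  rgn_id = 1,
  year   = 2010:2014,
  value  = c(4.0, 4.4, NA, 5.2, 5.6)
)

known <- !is.na(region$value)

# Linear interpolation between the surrounding known years
filled <- approx(
  x    = region$year[known],
  y    = region$value[known],
  xout = region$year
)$y

# Keep a record of which values were gap-filled, for transparency
region$gapfilled <- !known
region$value     <- filled

region
```

Keeping the `gapfilled` column (a made-up name) alongside the values makes it easy to report later which numbers were observed and which were estimated.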

## Data wrangling

So how do we get from raw data to a data layer that looks like the ones shown above? When we first access a data set, we often don't know whether it is suitable for our purpose, or if it provides adequate information. Raw data can be in a different format than the desired long format and have extra or incomplete information. The main task of data preparation is to comb through the raw data, combine different sources, and sometimes fill in information gaps. We call this process "data wrangling." 

For reproducibility and transparency, it is also good practice to record and share the decision-making process - trials and errors, why you decide to include, or more importantly, exclude a certain data source. Fortunately, we can take notes on the data exploration process and code in one place!

Here is an [example](https://github.com/OHI-Science/bhi-1.0-archive/blob/draft/baltic2015/prep/CW/eutrophication/eutrophication_prep.md) of a data prep document. Let's try making one just like it. 

Now let's switch to the `toolbox-demo` repo. Today we will use the _Clean Water (CW)_ goal as a hands-on example. We will work with sample Secchi depth data, an indicator of water clarity. 

Click on the `CW` folder and open `CW_data_prep.Rmd`. We will follow the tutorial from there.

## Save and register data layers 

> Note: if you are preparing data from your Starter/Prep repository, you won't have the layers folder and `layers.csv`. You won't be able to do the following steps until you have the Full repository.

After you have successfully processed a data layer, and it can be used by the Toolbox for calculations, it needs to be saved and registered here: 

![](https://docs.google.com/drawings/d/1IwHf8pMBlpzOIYKL0hoEF5rETGA765Qtwkz7es7B3Rg/pub?h=500)

The `layers` folder is where all data layers live, and where the Toolbox pulls a layer from when needed. 

`layers.csv` is a giant spreadsheet that contains information about each layer - which goal it is used for, filename, column names, etc., and it will direct ohicore to appropriate data layers during calculations. 

Let's first save the layer. You can do that in the prep script. Let's switch back to the Demo repository, and follow the last section. 
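
For reference, saving a layer from a prep script can look roughly like the sketch below. The layer contents and the filename `cw_secchi_demo.csv` are made up, and a temporary directory stands in for your repository's `layers` folder so that the example runs anywhere.

```r
# Hypothetical prepared layer (values are invented)
cw_layer <- data.frame(
  rgn_id = rep(1:2, each = 2),
  year   = rep(2013:2014, times = 2),
  score  = c(0.71, 0.74, 0.62, 0.65)
)

# Assumed location of the `layers` folder; a temporary directory is used here
dir_layers <- file.path(tempdir(), "layers")
dir.create(dir_layers, showWarnings = FALSE, recursive = TRUE)

# Made-up filename for the new layer
layer_file <- file.path(dir_layers, "cw_secchi_demo.csv")

# row.names = FALSE keeps R's row numbers out of the file
write.csv(cw_layer, layer_file, row.names = FALSE)

# Read the file back to confirm the header is what you expect
names(read.csv(layer_file))
```

In your own prep script, point `dir_layers` at the real `layers` folder of your repository instead.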

## Register data layers in `layers.csv`  

After we have saved a data layer in the `layers` folder, we will catalogue the layer in `layers.csv`.

![](https://docs.google.com/drawings/d/1juueRiA0gevOoHH_8owzLNanyQ-H1mAOEl7L_l5YSJ0/pub?h=600) 

If a layer simply has a new filename, only the *filename* column needs to be updated.
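
A filename update like this can be done in a spreadsheet, or in `R` as sketched here. The registry below is a tiny stand-in for `layers.csv` with only three of its columns, and the layer name `cw_secchi` and the filenames for it are invented.

```r
# Stand-in for layers.csv with only a few of its columns (values invented)
layers_csv <- file.path(tempdir(), "layers.csv")
write.csv(
  data.frame(
    targets  = c("CW", "TR"),
    layer    = c("cw_secchi", "tr_total"),
    filename = c("cw_secchi_2015.csv", "tr_total.csv"),
    stringsAsFactors = FALSE
  ),
  layers_csv,
  row.names = FALSE
)

registry <- read.csv(layers_csv, stringsAsFactors = FALSE)

# Point the existing `cw_secchi` row at the new file
registry$filename[registry$layer == "cw_secchi"] <- "cw_secchi_2016.csv"

# na = "" writes empty cells rather than the text "NA"
write.csv(registry, layers_csv, row.names = FALSE, na = "")
```

Only the matching row changes; every other row of the registry is written back as it was.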

However, **if a new layer has been added (for example when a new goal model is developed)**, you will open `layers.csv` in spreadsheet software (e.g. Microsoft Excel), add a new row in the registry for the new data layer, and fill in the first eight columns (columns A-H):

 + **targets:** Add the goal/dimension that the new data layer relates to. Goals are indicated with two-letter codes and sub-goals are indicated with three-letter codes, with pressures, resilience, and spatial layers indicated separately.
 + **layer:** Add an identifying name for the new data layer, which will be referenced in R scripts like `functions.R` and *.csv* files like `pressures_matrix.csv` and `resilience_matrix.csv`.
 + **name:** Add a longer title for the data layer.
 + **description:** Add a longer description of the new data layer.
 + **fld_value:** Add the appropriate units for the new data layer. It is _the same as the column name in the data file_, which will be referenced in R scripts in subsequent calculations. (example: area_km2)
 + **units:** Add a description about the *units* chosen in the *fld_value* column above. Think about what units you would like to be displayed online when filling out "units." (example: km^2)
 + **filename:** Add a filename for the new data layer that matches the name of the *.csv* file that was created previously in the `layers` folder.
 + **fld_id_num:** Area designation that applies to the newly created data layer, such as: *rgn_id* and *fao_id*.

It is important to check that you have filled out the fields correctly. For instance, if *fld_value* does not match the header of the source data layer, you will see an error message when you try to calculate scores. Other columns are generated later by the Toolbox as it confirms data formatting and content.
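
One way to catch a mismatch early is to compare the registered *fld_value* with the header of the layer file yourself. The helper `fld_value_matches()` below is hypothetical (it is not a Toolbox function), and the layer file is invented for the example.

```r
# Hypothetical helper: TRUE when the registered fld_value is a column of the layer file
fld_value_matches <- function(fld_value, layer_file) {
  header <- names(read.csv(layer_file, nrows = 1))
  fld_value %in% header
}

# Made-up layer file with the columns rgn_id and area_km2
layer_file <- file.path(tempdir(), "hab_extent_demo.csv")
write.csv(
  data.frame(rgn_id = 1:3, area_km2 = c(12.5, 3.1, 40.2)),
  layer_file,
  row.names = FALSE
)

fld_value_matches("area_km2", layer_file)   # registered name matches the header
fld_value_matches("km2", layer_file)        # a mismatch that would cause trouble later
```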

Let's open your `layers.csv` from your file browser (for example, Finder on a Mac), and we will fill it out together from there: 

![](https://docs.google.com/drawings/d/1AYmvy3ZVSW_4jFvuV6FcFCmefEvLDd4XHg9rAuExtD0/pub?w=960&h=720)

This is what the new line should look like: 

![](https://docs.google.com/drawings/d/1qjraupQuobcxVolubwqtIzOoqv_ovqO6NYi4O_sT3IE/pub?w=900)

## Organizing metadata

Keeping track of metadata is important as you create these layers. Metadata is information about the data, most importantly its source. It's best to provide not only a reference like Feely et al. 2009, but also a URL where other people can access the data. Other useful information includes the years of data you use and the goal models in which the data are used. 

You can keep track of metadata in whatever way makes most sense to you. Since `layers.csv` is a file that you will keep updated with every data layer, it can make sense to add information there. You can add as many columns to layers.csv as you'd like. However, this will create a really huge spreadsheet, so maybe you want to keep this information in another place. 

If you've been using the Data Planner from Chapter \@ref(planner-guide), that can be a good place to continue keeping track of all the metadata. Or, you could start another spreadsheet. 

It doesn't matter where or how you keep this information, but be sure that you keep it somewhere, and keep it organized and consistent. You'll need information for every single layer you include, and it's easier to record it as you go along than all at the end. If you're not using `layers.csv` for metadata, we recommend copying its structure and having, at a minimum, columns for the data name, Toolbox data layer name, description, reference, and URL. 
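
If you decide on a separate metadata file, a starting point could look like the sketch below. The column names are one possible spelling of the minimum set listed above, and every value, including the filename `layers_metadata.csv`, is a placeholder.

```r
# One possible metadata table; column names and all values are placeholders
metadata <- data.frame(
  data_name   = "Example water clarity data",
  layer       = "example_layer",
  description = "Placeholder description of the data layer",
  reference   = "Author et al. YEAR",
  url         = "https://example.org/data",
  stringsAsFactors = FALSE
)

# Written to a temporary directory here; keep yours with the rest of your assessment
metadata_file <- file.path(tempdir(), "layers_metadata.csv")
write.csv(metadata, metadata_file, row.names = FALSE)
```

Adding one row per layer as you go keeps this file in step with `layers.csv`.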

Later on in your assessment, we'll be able to take information from the columns of your csv and use them for reporting and on your website. An example of this is from the Global assessment: see http://ohi-science.org/ohi-global/layers. 

## Data prep vs. `functions.R`
Where does one end and the other begin?

## Chapter recap

Hooray! You have just learned how to process, save, and register a data layer for the OHI Toolbox! 

- OHI data layer formatting requirements
  - data for each region
  - long format
  - best to have at least five years of continuous data
- how to process a messy and large raw data layer (secchi depth data for CW goal)
- how to save and register the prepared data layer

Now the layer is ready to be used by the Toolbox to calculate status and trend scores. We'll see you in Chapter \@ref(calcs-basic) to learn that!