
update formatting for episode 3
kristi-sara committed Aug 8, 2024
1 parent 2845930 commit 457fe7e
26 changes: 13 additions & 13 deletions _episodes/03-ixodes-whole-dataset.md

### Goals
-----
The entire interaction dataset on GloBI consists of over 6 million interaction records. There are many ways to approach a large dataset, and this exercise demonstrates one example using the shell and R. We are not going to follow along with shell and R introductory tutorials in this workshop, but The Carpentries has a few nice ones to get you started, including [Introduction to shell](https://swcarpentry.github.io/shell-novice/) and [Introduction to R](https://datacarpentry.org/R-ecology-lesson/01-intro-to-r.html).

These exercises can be followed along using R and the shell, but it is not necessary. If you would like to follow along, please go ahead and open RStudio and a shell window.

### Getting started
---------------------------------
At the end of this time we will regroup and report back to the other workshop participants about what we did in this breakout group. Who would like to be the person(s) who reports back for the breakout group?

Let's collaboratively take notes in the Google Document. The link to the document is in the chat.

### Find all of the records in the dataset based on a taxon name
---------------------------------
We are interested in finding all of the records in the interactions.csv dataset that deal with *Ixodes*, and we are interested in reducing the size of the data so it is easier to manage. One quick way to do this is via the shell.

How many records are in the GloBI dataset? It is a lot!

~~~
wc -l interactions.csv
~~~
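Keep in mind that `wc -l` counts every line, including the CSV header row, so the record count is one less than the line count. A minimal sketch using a made-up sample file (the column names here are illustrative, not the full GloBI header):

~~~
# Build a tiny stand-in for interactions.csv: one header row plus two records.
printf 'sourceTaxonName,interactionTypeName,targetTaxonName\n' > sample.csv
printf 'Ixodes scapularis,parasiteOf,Odocoileus virginianus\n' >> sample.csv
printf 'Ixodes ricinus,parasiteOf,Apodemus sylvaticus\n' >> sample.csv

wc -l sample.csv                       # 3 lines, including the header row
echo $(( $(wc -l < sample.csv) - 1 ))  # 2 actual data records
~~~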

One of the first things we might want to do is trim the dataset to only those taxa we are interested in analysing. In this case, we will look for all *Ixodes* records. To do so, we will use a simple [shell script](https://github.com/seltmann/interaction-data-workshop) to extract all of the rows that contain the word *Ixodes* and create a new file. This process will help reduce the size of the dataset so we can use R for our analysis. The shell script will take about 4 minutes and 12 seconds to complete!

~~~
sh Globi_Ixodes_data.sh
~~~

When we examine the code in the script, we see that it uses grep, which is "a Unix command used to search files for the occurrence of a string of characters that matches a specified pattern". Grep matches on the whole row and does not specify in which column *Ixodes* is found. We then sort the records to keep only exact, unique versions of the records.

~~~
echo Creating headers
sort -r ../data/Ixodes_data.csv | uniq > ../data/Ixodes_data_unique.csv
wc -l ../data/Ixodes_data_unique.csv
~~~
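The grep-then-deduplicate steps above can be sketched end to end on a toy file (the file names and rows below are hypothetical stand-ins for interactions.csv, not the real data):

~~~
# Toy input standing in for interactions.csv; note the duplicated Ixodes row.
printf 'sourceTaxonName,interactionTypeName,targetTaxonName\n' > interactions_sample.csv
printf 'Ixodes ricinus,parasiteOf,Apodemus sylvaticus\n' >> interactions_sample.csv
printf 'Ixodes ricinus,parasiteOf,Apodemus sylvaticus\n' >> interactions_sample.csv
printf 'Amblyomma americanum,parasiteOf,Homo sapiens\n' >> interactions_sample.csv

head -1 interactions_sample.csv > Ixodes_sample.csv          # keep the header row
grep Ixodes interactions_sample.csv >> Ixodes_sample.csv     # any row mentioning Ixodes, in any column
sort -r Ixodes_sample.csv | uniq > Ixodes_sample_unique.csv  # collapse exact duplicate rows
wc -l Ixodes_sample_unique.csv                               # 2 lines: header + one unique record
~~~

Note that `uniq` only removes *adjacent* duplicates, which is why the rows are sorted first.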

If you want to find several taxa and combine the datasets, you could create files from multiple taxa and combine the output together into a single dataset using **cat**. An example of this can be found [here](https://github.com/lee-michellej/globi_tritrophic_networks/blob/master/Code/Globi_bee_data.sh). This example takes all of the files in the Data folder that contain the pattern **unique.tsv** and creates a new file called **all_data.txt**.

~~~
cat ../Data/*unique.tsv >> ../Data/all_data.txt
~~~
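A runnable sketch of the same idea (the Data folder and per-taxon file names here are invented for illustration):

~~~
# Two small per-taxon files, standing in for real grep output.
mkdir -p Data
printf 'Ixodes ricinus\tparasiteOf\tApodemus sylvaticus\n' > Data/Ixodes_unique.tsv
printf 'Amblyomma americanum\tparasiteOf\tHomo sapiens\n' > Data/Amblyomma_unique.tsv

# Because >> appends, re-running this without removing the old output
# would duplicate rows, so start from a clean file.
rm -f Data/all_data.txt
cat Data/*unique.tsv >> Data/all_data.txt
wc -l Data/all_data.txt   # 2 lines, one from each input file
~~~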

Now let's compare the new datasets. How many records are in the trimmed GloBI datasets? Is there a difference between the unique and non-unique versions?

~~~
wc -l Ixodes_data.csv
wc -l Ixodes_data_unique.csv

### Let's do something in R
---------------------------------
Load the trimmed dataset into R using RStudio. We will start by stepping through some R code and discussing the results. The [R code](https://github.com/seltmann/interaction-data-workshop) we are using can be downloaded to follow along, or you can see an [HTML preview](https://htmlpreview.github.io/?https://github.com/seltmann/globi-workshop-2021/blob/main/code/globi-example.html) of the code.

We will start by finding the columns and creating a subset of the data to import into Google Sheets. Time permitting, we will talk about some of the interesting data issues we are finding in the dataset.
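Before switching to R, you can also get a quick look at the column names from the shell by splitting the header row onto separate lines. A small sketch, using an invented five-column header standing in for the real 88-column file:

~~~
# Stand-in header row; the real file has 88 column names.
printf 'sourceTaxonId,sourceTaxonName,interactionTypeName,targetTaxonId,targetTaxonName\n' > header_sample.csv

# Turn the comma-separated header into one column name per line.
# (Assumes the header contains no quoted, comma-embedded names.)
head -1 header_sample.csv | tr ',' '\n'
~~~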

> ## Exercise 1: What do the columns mean?
-----
> There are 88 columns in the interactions data file. In this exercise, we will find the columns and pick out which ones are commonly useful in research data. You can create your own list or use this Google Sheet with the first [100 rows of the Ixodes_data_unique.csv](https://docs.google.com/spreadsheets/d/10C4VnpPZnq5LbaMorVcU8EWTXlI7-dbpg7az-_GKIVI/edit?usp=sharing) file.
> 1. Obtain a list of all of the column names.

