Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Mar 12, 2024
1 parent 2fd57c5 commit 1122e44
Showing 4 changed files with 12 additions and 12 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
62e0040d
1ada4d36
14 changes: 7 additions & 7 deletions hands-on.html
@@ -560,13 +560,13 @@ <h4 class="anchored" data-anchor-id="how-can-you-see-the-sql-query">How can you
# Database: DuckDB v0.9.2 [unknown@Linux 6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]
Relevance num_species
&lt;chr&gt; &lt;dbl&gt;
1 Potential predator (eggs; mammal) 2
2 Microtine (alternate prey for predators) 5
3 Study species 41
4 Incidental monitoring 18
5 Potential predator (avian) 25
6 Potential predator (mammal) 6
7 Study species; potential predator (eggs) 2</code></pre>
1 Incidental monitoring 18
2 Study species 41
3 Potential predator (avian) 25
4 Potential predator (mammal) 6
5 Study species; potential predator (eggs) 2
6 Potential predator (eggs; mammal) 2
7 Microtine (alternate prey for predators) 5</code></pre>
</div>
</div>
<p>Does that code looks familiar? But this time, here is really the query that was used to retrieve this information:</p>
2 changes: 1 addition & 1 deletion search.json
@@ -53,7 +53,7 @@
"href": "hands-on.html#lets-connect-to-our-first-database",
"title": "Hands-on DuckDB & dplyr",
"section": "Let’s connect to our first database",
"text": "Let’s connect to our first database\n\nlibrary(dbplyr) # to query databases in a tidyverse style manner\nlibrary(DBI) # to connect to databases\n# install.packages(\"duckdb\") # install this package to get duckDB API\nlibrary(duckdb) # Specific to duckDB\n\n\nLoad the bird database\nThis database has been built from the csv files we just analyzed, so the data should be very similar - note we did not say identical more on this in the last section:\n\nconn &lt;- dbConnect(duckdb::duckdb(), dbdir = \"./data/bird_database.duckdb\")\n\nList all the tables present in the database:\n\ndbListTables(conn)\n\n[1] \"Bird_eggs\" \"Bird_nests\" \"Camp_assignment\" \"Personnel\" \n[5] \"Site\" \"Species\" \n\n\nLet’s have a look at the Species table\n\nspecies_db &lt;- tbl(conn, \"Species\")\nspecies_db\n\n# Source: table&lt;Species&gt; [?? x 4]\n# Database: DuckDB v0.9.2 [unknown@Linux 6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n Code Common_name Scientific_name Relevance \n &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; \n 1 agsq Arctic ground squirrel Spermophilus parryii Potential predator (egg…\n 2 amcr American Crow Corvus brachyrhynchos Potential predator (avi…\n 3 amgp American Golden-Plover Pluvialis dominica Study species \n 4 arfo Arctic fox Alopex lagopus Potential predator (mam…\n 5 arte Arctic Tern Sterna paradisaea Incidental monitoring \n 6 basa Baird's Sandpiper Calidris bairdii Study species \n 7 bbis Broad-billed Sandpiper Calidris falcinellus Study species \n 8 bbpl Black-bellied Plover Pluvialis squatarola Study species \n 9 bbsa Buff-breasted Sandpiper Calidris subruficollis Study species \n10 besw Bewick's Swan Cygnus columbianus Incidental monitoring \n# ℹ more rows\n\n\nYou can filter the data and select columns:\n\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3)\n\n# Source: SQL [3 x 1]\n# Database: DuckDB v0.9.2 [unknown@Linux 
6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n# Ordered by: Scientific_name\n Scientific_name \n &lt;chr&gt; \n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba \n\n\n\n\n\n\n\n\nNote\n\n\n\nNote that those are not data frames but tables. What dbplyr is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL code to query the database, retrieving results, etc.\n\n\n\nHow can I get a “real” data frame?\nYou add collect() to your query.\n\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3) %&gt;% \n collect()\n\n# A tibble: 3 × 1\n Scientific_name \n &lt;chr&gt; \n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba \n\n\nNote it means the full query is going to be ran and save in your R environment. This might slow things down, so you generally want to collect on the smallest data frame you can.\n\n\nHow can you see the SQL query?\nAdding show_query() at the end of your code block will let you see the SQL code that has been used to query the database.\n\n# Add show_query() to the end to see what SQL it is sending!\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3) %&gt;% \n show_query()\n\n&lt;SQL&gt;\nSELECT Scientific_name\nFROM Species\nWHERE (Relevance = 'Study species')\nORDER BY Scientific_name\nLIMIT 3\n\n\nThis is a great way to start getting familiar with the SQL syntax, because although you can do a lot with dbplyr you can not do everything that SQL can do. 
So at some point you might want to start using SQL directly.\nHere is how you could run the query using the SQL code directly:\n\n# query the database using SQL\ndbGetQuery(conn, \"SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3\")\n\n Scientific_name\n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba\n\n\nYou can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc.\nLet’s count how many species there are per Relevance categories:\n\nspecies_db %&gt;%\n group_by(Relevance) %&gt;%\n summarize(num_species = n())\n\n# Source: SQL [7 x 2]\n# Database: DuckDB v0.9.2 [unknown@Linux 6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n Relevance num_species\n &lt;chr&gt; &lt;dbl&gt;\n1 Potential predator (eggs; mammal) 2\n2 Microtine (alternate prey for predators) 5\n3 Study species 41\n4 Incidental monitoring 18\n5 Potential predator (avian) 25\n6 Potential predator (mammal) 6\n7 Study species; potential predator (eggs) 2\n\n\nDoes that code looks familiar? But this time, here is really the query that was used to retrieve this information:\n\nspecies_db %&gt;%\n group_by(Relevance) %&gt;%\n summarize(num_species = n()) %&gt;%\n show_query()\n\n&lt;SQL&gt;\nSELECT Relevance, COUNT(*) AS num_species\nFROM Species\nGROUP BY Relevance\n\n\n\n\n\nAverage egg volume analysis\nLet’s reproduce the egg volume analysis we just did. We can calculate the average bird eggs volume per species directly on the database:\n\n# loading all the necessary tables\neggs_db &lt;- tbl(conn, \"Bird_eggs\")\nnests_db &lt;- tbl(conn, \"Bird_nests\")\n\nCompute the volume using the same code as previously!! Yes, you can use mutate to create new columns on the tables object\n\n# Compute the egg volume\neggs_volume_db &lt;- eggs_db %&gt;%\n mutate(egg_volume = pi/6*Width^2*Length)\n\n\n\n\n\n\n\nCaution\n\n\n\nLimitation: no way to add or update data in the database, dbplyr is view only. 
If you want to add or update data, you’ll need to use the DBI package functions.\n\n\nNow let’s join this information to the nest table, and average by species\n\n# Join the egg and nest tables to compute average\nspecies_egg_volume_avg_db &lt;- left_join(nests_db, eggs_volume_db, by=\"Nest_ID\") %&gt;%\n group_by(Species) %&gt;%\n summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %&gt;%\n arrange(desc(egg_volume_avg)) %&gt;% \n collect() %&gt;%\n drop_na()\n\nspecies_egg_volume_avg_db\n\n# A tibble: 7 × 2\n Species egg_volume_avg\n &lt;chr&gt; &lt;dbl&gt;\n1 bbpl 33975.\n2 amgp 28545.\n3 rutu 18094.\n4 dunl 11777.\n5 wrsa 10111.\n6 sepl 9903.\n7 reph 8444.\n\n\nWhat does this SQL query looks like?\n\nspecies_egg_volume_avg_db &lt;- left_join(eggs_volume_db, nests_db, by=\"Nest_ID\") %&gt;%\n group_by(Species) %&gt;%\n summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %&gt;%\n arrange(desc(egg_volume_avg)) %&gt;% \n show_query()\n\n&lt;SQL&gt;\nSELECT Species, AVG(egg_volume) AS egg_volume_avg\nFROM (\n SELECT\n LHS.Book_page AS \"Book_page.x\",\n LHS.\"Year\" AS \"Year.x\",\n LHS.Site AS \"Site.x\",\n LHS.Nest_ID AS Nest_ID,\n Egg_num,\n Length,\n Width,\n egg_volume,\n Bird_nests.Book_page AS \"Book_page.y\",\n Bird_nests.\"Year\" AS \"Year.y\",\n Bird_nests.Site AS \"Site.y\",\n Species,\n Observer,\n Date_found,\n how_found,\n Clutch_max,\n floatAge,\n ageMethod\n FROM (\n SELECT\n Bird_eggs.*,\n ((3.14159265358979 / 6.0) * (POW(Width, 2.0))) * Length AS egg_volume\n FROM Bird_eggs\n ) LHS\n LEFT JOIN Bird_nests\n ON (LHS.Nest_ID = Bird_nests.Nest_ID)\n) q01\nGROUP BY Species\nORDER BY egg_volume_avg DESC\n\n\n\n\n\n\n\n\nQuestion\n\n\n\nWhy does the SQL query include the volume computation?\n\n\n\n\nDisconnecting from the database\nBefore we close our session, it is good practice to disconnect from the database first\n\nDBI::dbDisconnect(conn, shutdown = TRUE)"
"text": "Let’s connect to our first database\n\nlibrary(dbplyr) # to query databases in a tidyverse style manner\nlibrary(DBI) # to connect to databases\n# install.packages(\"duckdb\") # install this package to get duckDB API\nlibrary(duckdb) # Specific to duckDB\n\n\nLoad the bird database\nThis database has been built from the csv files we just analyzed, so the data should be very similar - note we did not say identical more on this in the last section:\n\nconn &lt;- dbConnect(duckdb::duckdb(), dbdir = \"./data/bird_database.duckdb\")\n\nList all the tables present in the database:\n\ndbListTables(conn)\n\n[1] \"Bird_eggs\" \"Bird_nests\" \"Camp_assignment\" \"Personnel\" \n[5] \"Site\" \"Species\" \n\n\nLet’s have a look at the Species table\n\nspecies_db &lt;- tbl(conn, \"Species\")\nspecies_db\n\n# Source: table&lt;Species&gt; [?? x 4]\n# Database: DuckDB v0.9.2 [unknown@Linux 6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n Code Common_name Scientific_name Relevance \n &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; \n 1 agsq Arctic ground squirrel Spermophilus parryii Potential predator (egg…\n 2 amcr American Crow Corvus brachyrhynchos Potential predator (avi…\n 3 amgp American Golden-Plover Pluvialis dominica Study species \n 4 arfo Arctic fox Alopex lagopus Potential predator (mam…\n 5 arte Arctic Tern Sterna paradisaea Incidental monitoring \n 6 basa Baird's Sandpiper Calidris bairdii Study species \n 7 bbis Broad-billed Sandpiper Calidris falcinellus Study species \n 8 bbpl Black-bellied Plover Pluvialis squatarola Study species \n 9 bbsa Buff-breasted Sandpiper Calidris subruficollis Study species \n10 besw Bewick's Swan Cygnus columbianus Incidental monitoring \n# ℹ more rows\n\n\nYou can filter the data and select columns:\n\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3)\n\n# Source: SQL [3 x 1]\n# Database: DuckDB v0.9.2 [unknown@Linux 
6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n# Ordered by: Scientific_name\n Scientific_name \n &lt;chr&gt; \n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba \n\n\n\n\n\n\n\n\nNote\n\n\n\nNote that those are not data frames but tables. What dbplyr is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL code to query the database, retrieving results, etc.\n\n\n\nHow can I get a “real” data frame?\nYou add collect() to your query.\n\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3) %&gt;% \n collect()\n\n# A tibble: 3 × 1\n Scientific_name \n &lt;chr&gt; \n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba \n\n\nNote it means the full query is going to be ran and save in your R environment. This might slow things down, so you generally want to collect on the smallest data frame you can.\n\n\nHow can you see the SQL query?\nAdding show_query() at the end of your code block will let you see the SQL code that has been used to query the database.\n\n# Add show_query() to the end to see what SQL it is sending!\nspecies_db %&gt;%\n filter(Relevance==\"Study species\") %&gt;%\n select(Scientific_name) %&gt;%\n arrange(Scientific_name) %&gt;%\n head(3) %&gt;% \n show_query()\n\n&lt;SQL&gt;\nSELECT Scientific_name\nFROM Species\nWHERE (Relevance = 'Study species')\nORDER BY Scientific_name\nLIMIT 3\n\n\nThis is a great way to start getting familiar with the SQL syntax, because although you can do a lot with dbplyr you can not do everything that SQL can do. 
So at some point you might want to start using SQL directly.\nHere is how you could run the query using the SQL code directly:\n\n# query the database using SQL\ndbGetQuery(conn, \"SELECT Scientific_name FROM Species WHERE (Relevance = 'Study species') ORDER BY Scientific_name LIMIT 3\")\n\n Scientific_name\n1 Actitis macularius\n2 Calidris acuminata\n3 Calidris alba\n\n\nYou can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc.\nLet’s count how many species there are per Relevance categories:\n\nspecies_db %&gt;%\n group_by(Relevance) %&gt;%\n summarize(num_species = n())\n\n# Source: SQL [7 x 2]\n# Database: DuckDB v0.9.2 [unknown@Linux 6.5.0-1015-azure:R 4.3.3/./data/bird_database.duckdb]\n Relevance num_species\n &lt;chr&gt; &lt;dbl&gt;\n1 Incidental monitoring 18\n2 Study species 41\n3 Potential predator (avian) 25\n4 Potential predator (mammal) 6\n5 Study species; potential predator (eggs) 2\n6 Potential predator (eggs; mammal) 2\n7 Microtine (alternate prey for predators) 5\n\n\nDoes that code looks familiar? But this time, here is really the query that was used to retrieve this information:\n\nspecies_db %&gt;%\n group_by(Relevance) %&gt;%\n summarize(num_species = n()) %&gt;%\n show_query()\n\n&lt;SQL&gt;\nSELECT Relevance, COUNT(*) AS num_species\nFROM Species\nGROUP BY Relevance\n\n\n\n\n\nAverage egg volume analysis\nLet’s reproduce the egg volume analysis we just did. We can calculate the average bird eggs volume per species directly on the database:\n\n# loading all the necessary tables\neggs_db &lt;- tbl(conn, \"Bird_eggs\")\nnests_db &lt;- tbl(conn, \"Bird_nests\")\n\nCompute the volume using the same code as previously!! Yes, you can use mutate to create new columns on the tables object\n\n# Compute the egg volume\neggs_volume_db &lt;- eggs_db %&gt;%\n mutate(egg_volume = pi/6*Width^2*Length)\n\n\n\n\n\n\n\nCaution\n\n\n\nLimitation: no way to add or update data in the database, dbplyr is view only. 
If you want to add or update data, you’ll need to use the DBI package functions.\n\n\nNow let’s join this information to the nest table, and average by species\n\n# Join the egg and nest tables to compute average\nspecies_egg_volume_avg_db &lt;- left_join(nests_db, eggs_volume_db, by=\"Nest_ID\") %&gt;%\n group_by(Species) %&gt;%\n summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %&gt;%\n arrange(desc(egg_volume_avg)) %&gt;% \n collect() %&gt;%\n drop_na()\n\nspecies_egg_volume_avg_db\n\n# A tibble: 7 × 2\n Species egg_volume_avg\n &lt;chr&gt; &lt;dbl&gt;\n1 bbpl 33975.\n2 amgp 28545.\n3 rutu 18094.\n4 dunl 11777.\n5 wrsa 10111.\n6 sepl 9903.\n7 reph 8444.\n\n\nWhat does this SQL query looks like?\n\nspecies_egg_volume_avg_db &lt;- left_join(eggs_volume_db, nests_db, by=\"Nest_ID\") %&gt;%\n group_by(Species) %&gt;%\n summarise(egg_volume_avg = mean(egg_volume, na.rm = TRUE)) %&gt;%\n arrange(desc(egg_volume_avg)) %&gt;% \n show_query()\n\n&lt;SQL&gt;\nSELECT Species, AVG(egg_volume) AS egg_volume_avg\nFROM (\n SELECT\n LHS.Book_page AS \"Book_page.x\",\n LHS.\"Year\" AS \"Year.x\",\n LHS.Site AS \"Site.x\",\n LHS.Nest_ID AS Nest_ID,\n Egg_num,\n Length,\n Width,\n egg_volume,\n Bird_nests.Book_page AS \"Book_page.y\",\n Bird_nests.\"Year\" AS \"Year.y\",\n Bird_nests.Site AS \"Site.y\",\n Species,\n Observer,\n Date_found,\n how_found,\n Clutch_max,\n floatAge,\n ageMethod\n FROM (\n SELECT\n Bird_eggs.*,\n ((3.14159265358979 / 6.0) * (POW(Width, 2.0))) * Length AS egg_volume\n FROM Bird_eggs\n ) LHS\n LEFT JOIN Bird_nests\n ON (LHS.Nest_ID = Bird_nests.Nest_ID)\n) q01\nGROUP BY Species\nORDER BY egg_volume_avg DESC\n\n\n\n\n\n\n\n\nQuestion\n\n\n\nWhy does the SQL query include the volume computation?\n\n\n\n\nDisconnecting from the database\nBefore we close our session, it is good practice to disconnect from the database first\n\nDBI::dbDisconnect(conn, shutdown = TRUE)"
},
{
"objectID": "hands-on.html#how-did-we-create-this-database",
6 changes: 3 additions & 3 deletions sitemap.xml
@@ -2,14 +2,14 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://UCSB-Library-Research-Data-Services.github.io/intro-database-r/about.html</loc>
<lastmod>2024-03-11T20:02:37.873Z</lastmod>
<lastmod>2024-03-12T16:35:15.466Z</lastmod>
</url>
<url>
<loc>https://UCSB-Library-Research-Data-Services.github.io/intro-database-r/index.html</loc>
<lastmod>2024-03-11T20:02:37.885Z</lastmod>
<lastmod>2024-03-12T16:35:15.474Z</lastmod>
</url>
<url>
<loc>https://UCSB-Library-Research-Data-Services.github.io/intro-database-r/hands-on.html</loc>
<lastmod>2024-03-11T20:02:37.881Z</lastmod>
<lastmod>2024-03-12T16:35:15.474Z</lastmod>
</url>
</urlset>
