
Commit

Merge pull request #25 from lmu-osc/17-feedback-on-current-state-of-tutorial-malikas-version

17 feedback on current state of tutorial malikas version
NeuroShepherd authored Aug 28, 2024
2 parents d182634 + 475efb0 commit fbc60c2
Showing 9 changed files with 96 additions and 101 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/publish.yaml
@@ -1,5 +1,7 @@
on:
workflow_dispatch:
schedule:
- cron: "0 23 * * 0"
push:
branches: main

8 changes: 4 additions & 4 deletions _quarto.yml
@@ -53,11 +53,9 @@ website:
- href: renv_getting_started.qmd
text: "1. Quick Start"
- href: starting_details.qmd
text: "2. Starting Details"
- href: caching.qmd
text: "3. Caching"
text: "2. Understanding {renv}"
- href: restoring_a_project.qmd
text: "4. Restoring Projects"
text: "3. Restoring Projects"
- section: "Exercises"
contents:
- href: ex_init_snapshot.qmd
@@ -68,6 +66,8 @@ website:
text: "Explicitly Record"
- section: "Optional Content"
contents:
- href: caching.qmd
text: "Caching"
- href: embed_and_use.qmd
text: "`embed()` and `use()`"
- href: advanced_topics.qmd
22 changes: 7 additions & 15 deletions caching.qmd
@@ -23,7 +23,7 @@ In the context of software development, caching is used to store data that is fr

In the context of {renv}, the package cache is a shared library that contains the packages downloaded for and then used in your projects. The cache will, when needed, contain multiple different versions of the same package and your project will link to the correct version, only downloading the version specified in the `renv.lock` if you don't already have it somewhere in the renv cache. This shared library is a huge space saver, especially if you have many projects using the same packages.

-A cache is built per each minor version of R you use. For example, if have used {renv} with R versions 4.3 and 4.4 on your computer, then you will end up with a cache matching each of these R versions. This is an important detail to note for the [Restoring Projects](restoring_a_project.html) section of this tutorial, as you may need to rebuild the cache if you upgrade your R version.
+A cache is built per each minor version of R you use. For example, if you have used {renv} with R versions 4.3 and 4.4 on your computer, then you will end up with a cache matching each of these R versions. This is an important detail to note for the [Restoring Projects](restoring_a_project.html) section of this tutorial, as you may need to rebuild the cache if you upgrade your R version.
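
As a rough illustration of the per-minor-version naming, the folder name matching the running R session can be derived with base R alone. This is only a sketch, not {renv}'s own code: the variable name `cache_dir_name` is made up for illustration, and the `R-<major>.<minor>` pattern is assumed from the `R-4.3` and `R-4.4` folder names used in this chapter.

``` r
# Keep only the major and minor components of the running R version,
# e.g. 4.4.1 becomes 4.4 (the patch level is dropped).
minor_version <- getRversion()[1, 1:2]

# Assumed naming pattern, based on the folder names shown in this chapter.
cache_dir_name <- paste0("R-", as.character(minor_version))
cache_dir_name
```

Because the patch level is dropped, upgrading from, say, one 4.4.x release to another would map to the same folder name, while moving to a new minor version would not.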

## Cache Locations

@@ -48,15 +48,14 @@ list.files("~/Library/Caches/org.R-project.R/R/renv/cache/v5", full.names = T)
#> [4] "~/Library/Caches/org.R-project.R/R/renv/cache/v5/R-4.4"
```

-The {renv} package also provides a function to access the exact path to the cache used in your current project. This cache location will be slightly more specific than the paths listed above because it is a reference to **one** specific cache, but not **all** of the caches on your system. You can access the path to the cache with the following code:
+The {renv} package also provides a function to access the exact path to the cache used in your current project. This cache location will be slightly more specific than the paths listed above because it is a reference to the **one** cache specific to the version of R used in your project. You can access the path to the cache with the following code:

``` r
# Run on a MacOS, and <USER> removed.
renv::paths$cache()
#> [1] "/Users/<USER>/Library/Caches/org.R-project.R/R/renv/cache/v5/macos/R-4.4/aarch64-apple-darwin20"
```


### Packages in the Cache

Each of the caches will contain the packages that you downloaded for use in your projects as you were using that version of R. So for example, if you downloaded version 1.0.9 of {dplyr} while using R 4.3, you would find the package in the `R-4.3` folder of the cache. If you then started a project using R 4.4 and downloaded version 1.1.2 of {dplyr}, you would find that version in the `R-4.4` folder of the cache. Moreover, if another project using R 4.4 also needed version 1.1.4 of {dplyr}, it would be found in the `R-4.4` folder of the cache.
@@ -85,26 +84,19 @@ In this case, I have versions 3.4.3, 3.4.4, 3.5.0, and 3.5.1 of {ggplot2} in the

The discussion of caching so far has covered just the shared libraries that {renv} uses to store packages. But how does {renv} use these caches, how does this relate back to your project libraries, and what is the role of the `renv/library` folder in your projects?

-Each {renv} project has its own library, located in the `renv/library` folder of the project. When you install a package in a project with {renv}, however, the package is not *technically* installed in the `renv/library` library. In fact, none of the packages used in your project are *actually* stored in this folder. Instead, the contents of these folders are "symlinks" to the packages in the shared library.
+Each {renv} project has its own library, located in the `renv/library` folder of the project. When you install a package in a project with {renv}, however, the package is not *technically* installed in the `renv/library` library. In fact, none of the packages used in your project are *actually* stored in this folder. Instead, the contents of these folders are "symlinks" to the packages in the shared library.

### Symlinks

A symlink or "symbolic link" is a file that points to another file or directory. It is a reference to the original file or directory, and it can be used to access the original file or directory from a different location. Symlinks are important because they allow you to create shortcuts to files or directories, which can be useful for organizing files, accessing files from different locations, or creating symbolic links to files or directories that are located on different drives or partitions.

In the context of {renv}, the apparent packages in the `renv/library` folder of your project are actually symlinks that point to the packages in the shared library!
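
The idea can be tried out with base R in a throwaway directory. The sketch below is an assumption-laden stand-in, not how {renv} itself creates its links: the folder names (`cache_root`, `project_root`) and the fake `DESCRIPTION` file are invented for illustration, and creating symlinks may fail on Windows without the required privileges.

``` r
# Stand-in for the shared cache: one fake package folder with a DESCRIPTION file.
cache_root <- tempfile("demo_cache_")
cached_pkg <- file.path(cache_root, "ggplot2")
dir.create(cached_pkg, recursive = TRUE)
writeLines("Package: ggplot2", file.path(cached_pkg, "DESCRIPTION"))

# Stand-in for a project library, mirroring the renv/library layout.
project_root <- tempfile("demo_project_")
project_lib <- file.path(project_root, "renv", "library")
dir.create(project_lib, recursive = TRUE)

# Create a symlink inside the project library that points at the cached package.
link_path <- file.path(project_lib, "ggplot2")
linked <- file.symlink(from = cached_pkg, to = link_path)

# Sys.readlink() returns the link target for a symlink and "" for a regular file.
link_target <- Sys.readlink(link_path)

# Reading through the link reaches the file stored in the "cache".
description_via_link <- readLines(file.path(link_path, "DESCRIPTION"))
```

The project library ends up holding only a small link, while the package files live once in the cache folder, which is where the space saving comes from.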


# Summary and Key Points

This chapter contained quite a bit of technical detail and may have been a bit overwhelming. Here are the key points to remember:

-1. Caching is used to store data that is frequently accessed, such as packages, to speed up the execution of a program.
-2. In the context of {renv}, the package cache is a shared library that contains the packages downloaded for and then used in your projects.
-3. The cache is built per each minor version of R you use, and the cache locations are specific to your operating system. **You will almost never need to directly access these cache locations.**
-4. Each project has its own library, located in the `renv/library` folder of the project, but the packages in this folder are actually symlinks to the packages in the shared library.
+1. Caching is used to store data that is frequently accessed, such as packages, to speed up the execution of a program.
+2. In the context of {renv}, the package cache is a shared library that contains the packages downloaded for and then used in your projects.
+3. The cache is built per each minor version of R you use, and the cache locations are specific to your operating system. **You will almost never need to directly access these cache locations.**
+4. Each project has its own library, located in the `renv/library` folder of the project, but the packages in this folder are actually symlinks to the packages in the shared library.
26 changes: 13 additions & 13 deletions comp_reproducible.qmd
@@ -4,6 +4,8 @@ fig-cap-location: bottom
bibliography: references.bib
---

+In this tutorial, we focus on the practical aspects of making sure that the code we write can be run on other machines, by other people, and in the future. In other words, we want to make sure that our code is portable and future-proof by ensuring the software originally used in creating our code is the same software used by others. Let’s look at this in more detail.

# What is Reproducibility?

## A Brief Definition
@@ -34,13 +36,14 @@ In the context of this tutorial, we will focus on the practical aspects of makin

Getting code to work on your own machine is, in principle, not too difficult. You can install the necessary software specific to your hardware and software, set up your environment, and run your code! Simple, right? (Writing code that actually works is another story, of course 😉.)

-However, this is only half the battle. When you run your code on your machine, you are running it in an environment that you have set up and configured to your liking. You have installed the software *you* need, perhaps have set up specific paths, and configured various settings to your preferences.
+However, this is only half the battle. When you run your code on your machine, you are running it in an environment that you have set up and configured to your liking. You have installed the software *you* need, perhaps have set up specific paths, and configured various settings to your preferences.

-::: {.center}
+::: center
```{r, out.width="85%", echo=FALSE}
knitr::include_graphics("assets/img/works_on_my_machine.jpg")
```
:::

## Someone Else's Computer

When you share your code with others, you are effectively asking them to run *your* code on *their* machine, and it is unlikely that their computer is set up exactly like yours given how many degrees of freedom there are in operating systems, programming languages, software, and the different versions of these. A non-exhaustive list of examples where there might be software discrepancies are detailed below.
@@ -56,7 +59,7 @@ It's already well-known that different operating systems can have different soft

However, it is also important to consider that different versions of the same operating system can have different software requirements. For example, some software might only be compatible with Windows 10 and not Windows 11.

-::: {.center}
+::: center
![](assets/img/mac_linux_windows.jpg)
:::

@@ -68,39 +71,36 @@ Less obvious, however, and more common as a pain-point, are the differences in v

Fortunately, most modern programming languages are cross-compatible across recent, major operating systems without issue.

-::: {.center}
+::: center
![](assets/img/r_python_julia.jpg)
:::

## Packages/Libraries (Add-Ons)

The most likely pain point for reproducibility is the software add-ons, or packages/libraries, that are used in a project. For example, in R, there are over 20,000 packages available on CRAN, and in Python, there are over 200,000 packages available on PyPI.

-Keeping tracking of which packages are used and the specific versions of those packages is a major challenge in reproducibility.
+Keeping track of which packages are used and the specific versions of those packages is a major challenge in reproducibility.

In the context of the R programming language, most packages are likewise compatible across operating systems.
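
To make "which packages and which versions" concrete, base R can report both for the current session. This is a minimal sketch for inspection only, not a dependency-management solution: the packages queried here (`stats`, `utils`) are chosen simply because they ship with every R installation, and the variable names are illustrative.

``` r
# Version of R running this session.
r_version <- getRversion()

# Versions of two packages that are bundled with R itself.
pkg_versions <- vapply(
  c("stats", "utils"),
  function(pkg) as.character(packageVersion(pkg)),
  character(1)
)

r_version
pkg_versions
```

Doing this by hand for every package in a project quickly becomes impractical, which is the bookkeeping problem described above.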

-::: {.center}
+::: center
![](assets/img/r_packages.png)
:::

:::

<br>


## All of the Machines

::: callout-important
## Expect No One to Already Have the Required Software

-1. Don't expect others to have the software you rely on.
-2. Even if others have the software, don't expect them to have the **same version**.
+1. Don't expect others to have the software you rely on.
+2. Even if others have the software, don't expect them to have the **same version**.
:::

-In summary, the software environment of a project can be incredibly complex, with many degrees of freedom. If you have ever tried to run someone else's code and it didn't work, it was likely due to one of these reasons. Moreover, it's not practical to manage most of these differences manually. For example, requesting that someone install Python v3.8.2, R v4.0.3, and a specific version of a package is technically possible, but exceptionally tedious and a poor use of time time. So this brings us to our core question: how do we set up a project to work on everybody's machines?[^3] **By managing our software dependencies**, described in the next chapter.
+In summary, the software environment of a project can be incredibly complex, with many degrees of freedom. If you have ever tried to run someone else's code and it didn't work, it was likely due to one of these reasons. Moreover, it's not practical to manage most of these differences manually. For example, requesting that someone install Python v3.8.2, R v4.0.3, and a specific version of a package is technically possible, but exceptionally tedious and a poor use of time. So this brings us to our core question: how do we set up a project to work on everybody's machines?[^2] **By managing our software dependencies**, described in the next chapter.

-[^3]: Within reason; many software and hardware configurations just simply were not meant to be, but most modern programming languages are cross-compatible across recent, major operating systems without issue.
+[^2]: Within reason; many software and hardware configurations just simply were not meant to be, but most modern programming languages are cross-compatible across recent, major operating systems without issue.

<!-- footnotes -->