This repository (branch) contains the materials that will be used for the Pelican Origin Tutorial as part of the NDC-C Pathfinders Hackathon, on Nov. 13, 2024.
Hands-on Schedule
- "Deploy a Pelican Origin" (README.md) <- current document
- [Optional] "Using Data From the OSDF" (pelican-plugin/README.md)
Accompanying slides: https://go.wisc.edu/bh9u4g
The following instructions assume the computer you are using has Docker and the web host certificates for using https, and that you have permission to use the OSDF ITB data federation (osdf-itb.osg-htc.org
).
Participants of the Hackathon will be given access to a virtual machine that satisfies these requirements.
This tutorial uses Pelican v7.10.12.
Jump to:
- Setup
- Using the Pelican Client to Transfer Data
- Configuring Your Pelican Origin
- Initialize and Serve Your Pelican Origin
- Transferring Data with Your Pelican Origin
These instructions are for NDC-C Hackathon participants only and will not work for other users.
You must have registered prior to the tutorial to be given access to a virtual machine for use during the tutorial.
-
If you registered in advance, you can log in to your virtual machine with
ssh <comanage-username>@pelican-train.chtc.wisc.edu
where the
<comanage-username>
can be found in your registration information. -
You'll first be prompted to accept the host key for accessing the machine; enter "Yes".
-
You'll be prompted to login via a browser link - open the link to login through the institution you used when you registered.
-
After logging in in the browser, hit enter in your terminal.
-
You may again be prompted to accept a host key for accessing another machine; enter yes.
-
You will now be signed into a dedicated virtual machine for your use only.
You should see something like
[trainee@pelicantrain20## ~]$
for your command prompt.
-
Repeat steps 1-6 in another terminal window.
(You probably won't be prompted to accept the host keys this time.)
Later in the tutorial, you will be using one window to run the Pelican server and using the other window to transfer data from that server.
The following commands should be run in your terminal with the [trainee@pelicantrain20## ]$
prompt unless specified otherwise.
In your $HOME
directory on the virtual machine, download the materials of this repository by running the following command:
git clone https://github.com/PelicanPlatform/training-origin
We have provided some initial files/organization to make the following tutorial smoother. The rest of this document contains the commands that will be executed during the course of the tutorial. Explanations of what is being done and why are provided in the accompanying presentation.
These materials are in the
ndcc24-tutorial
branch of the repository, which is currently the default branch of the repository. In the future, the default branch will update to that of the latest tutorial, in which case you'll need to run the following command to obtain the materials of the current branch:git clone -b ndcc24-tutorial https://github.com/PelicanPlatform/training-origin
Introductory information on the Pelican Client can be found here: https://docs.pelicanplatform.org/getting-started/accessing-data.
Adapted from the installation instructions here: https://docs.pelicanplatform.org/install/linux-binary.
-
Move into the provided
pelican-client
directorycd $HOME/training-origin/pelican-client
-
Download the client tarball from the Pelican Platform. Use the command below to ensure you are using the correct version for this tutorial!
wget https://github.com/PelicanPlatform/pelican/releases/download/v7.10.12/pelican_Linux_x86_64.tar.gz
This link comes from the "Install" page of the documentation for Pelican (https://docs.pelicanplatform.org/install) for the file
pelican_Linux_x86_64.tar.gz
. -
Decompress the client
tar -xzf pelican_Linux_x86_64.tar.gz
This will create a directory
pelican-7.10.12
, which in turn contains the client binary.[trainee@pelicantrain20## pelican-client]$ ls pelican-7.10.12 pelican_Linux_x86_64.tar.gz [trainee@pelicantrain20## pelican-client]$ ls pelican-7.10.12 LICENSE pelican README.md
To make it easier to use, we will copy the pelican
binary into a folder that is a part of the default $PATH
.
-
Move into the folder with the
pelican
binary.cd pelican-7.10.12
or if you're not in the right directory, the following command should get you to the right place:
cd $HOME/training-origin/pelican-client/pelican-7.10.12
-
Create the directory for storing the
pelican
binary:mkdir -p $HOME/.local/bin
-
Copy the
pelican
binary into this new folder.cp ./pelican $HOME/.local/bin/
If you lost track of which directoy the
pelican
binary is located, running the following command should get you to the right place:cd $HOME/training-origin/pelican-client/pelican-7.10.12
-
Check that the
pelican
binary is in your PATH.Move to a different directory.
cd $HOME
Use
which
to find where thepelican
command is located. This should match the location you placed it, above.which pelican
Check the version of Pelican.
pelican --version
Should see something like this:
Version: 7.10.12 Build Date: 2024-11-07T18:18:26Z Build Commit: 8629adaa052f13e9f8a4291e1bf68e493b85deca Built By: goreleaser
You'll be downloading objects into your home directory.
-
Move into your home directory
cd ~
or
cd $HOME
-
"Get" the test object from the OSDF
pelican object get pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt downloaded-test.txt
The current directory should now contain the object
downloaded-test.txt
.[trainee@pelicantrain20## ~]$ ls downloaded-test.txt training-origin
-
Print out the file contents
cat downloaded-test.txt
You should see something like this:
[trainee@pelicantrain20## ~]$ cat downloaded-test.txt Hello, World!
We have pre-generated some of the files and directories for organizing and configuring the Pelican Origin for this tutorial. This structure is not required in order to run your own Pelican Origin.
-
Move to the provided
pelican-origin
directorycd $HOME/training-origin/pelican-origin
-
Explore the pre-generated files
[trainee@pelicantrain20## pelican-origin]$ ls config data start-origin.sh [trainee@pelicantrain20## pelican-origin]$ ls config/ pelican.yaml [trainee@pelicantrain20## pelican-origin]$ ls data/ test.txt [trainee@pelicantrain20## pelican-origin]$ cat data/test.txt Hello World, from the NDC-C Pathfinder Hackathon 2024!
The files are organized as such:
$HOME/training-origin/pelican-origin
├── config
│ └── pelican.yaml
├── data
│ └── test.txt
└── start-origin.sh
To make it easier to restart the Origin and serve a specific namespace, we are going to pre-generate the Issuer keys in the pelican-origin/config
directory.
-
Move to the origin config directory
cd $HOME/training-origin/pelican-origin/config
-
Generate the Issuer JWK files using the Pelican binary
pelican generate keygen
Future note: The developers are considering changing the syntax of this command to something like
pelican key generate
to properly follow the<command> <noun> <verb>
syntax, so this specific command may be different in the future. -
Confirm the Issuer JWK files were created
[trainee@pelicantrain20## config]$ ls issuer.jwk issuer-pub.jwks pelican.yaml
Pelican uses a YAML file to provide the configuration for its services, typically located at /etc/pelican/pelican.yaml
.
Each Origin has at least three unique entries in the configuration file: (i) the data federation URL it is joining, (ii) the "federation prefix" or "namespace" that it will serve in that data federation, and (iii) the hostname of the web host that is running the Origin.
For this tutorial, your Origin will be joining the test instance of the OSDF and serving the namespace /NDCC24-<vm-name>
from the web host <vm-name>.chtc.wisc.edu
.
We've provided most of the necessary configuration in the pelican.yaml
file you copied above, including the federation URL for the OSDF test instance, but you will need to update the config with the name of your specific virtual machine.
-
Update the config with the name of your virtual machine
For example, if you are logged into
pelicantrain2001
, you would run the following command:sed -i 's/<vm-name>/pelicantrain2001/g' pelican.yaml
or else use a command-line text editor to make the changes manually in the configuration file.
a. Namespace
Before:
- FederationPrefix: "/NDCC24-<vm-name>"
After:
- FederationPrefix: "/NDCC24-pelicantrain2001"
b. Hostname
Before:
Hostname: "<vm-name>.chtc.wisc.edu"
After:
Hostname: "pelicantrain2001.chtc.wisc.edu"
-
Explore the contents of the
pelican.yaml
fileUse
cat
or your favorite CLI tool to explore the contents of thepelican.yaml
file. Comments about the various sections are included.
Pelican Platform recommends using their provided Docker containers to run any Pelican Services, such as an Origin.
-
Move to the
pelican-origin
directorycd $HOME/training-origin/pelican-origin
You should see the following contents:
[trainee@pelicantrain20## pelican-origin]$ ls -R .: config data start-origin.sh ./config: issuer.jwk issuer-pub.jwks pelican.yaml ./data: test.txt
-
Start the Docker container by executing the
start-origin.sh
script, which contains the Docker run command (and some checks that you are in the right location)../start-origin.sh
-
Leave the Docker container running in this window!
When you run the above docker command, there will be a bunch of start-up information printed out (including the Pelican version information).
Then, there will be message saying "Pelican admin interface is not initialized" and provide a URL and a one-time code.
The URL will use the web host described in the config, i.e., the pelicantrain20##.chtc.wisc.edu
address you provided as the Hostname
in the pelican.yaml
file.
For example,
Pelican admin interface is not initialized
To initialize, login at https://pelicantrain2001.chtc.wisc.edu:8444/view/initialization/code/ with the following code:
185819
-
In your browser, go to the URL that your command printed out
-
Enter the one-time code to activate the interface
-
Provide an admin password. (If you close your browser, you will need to enter this password to login to the interface again.)
Some notes:
-
If you take too long to activate the interface, a new one-time code may be printed out in the terminal running your Docker container.
-
The password you provide will be saved in the
/etc/pelican
directory, but since that is located in a non-bound directory in the Docker container, it will not exist once the container exits. -
Unless you've mounted the
/etc/pelican
directory into the container, you will have to repeat the admin initialization step if you restart the Origin container.
Pelican Platform has web interfaces for all of the services that it can run, providing a more convenient way of interacting with and managing Pelican services.
For the Origin interface, there are status boxes for each component/connection that the Origin has (Directory, Federation, Registry, Web UI, XRootD). You can also see information about the data source and the corresponding namespace, and the permissions associated with the data. There is even a graph for monitoring the data transfer rates.
Furthermore, the web interfaces allow you to see and change the configuration for the service. The configuration can be accessed by clicking the gray wrench in the top left corner.
If there aren't any red boxes web interface for your Origin, then you are ready to transfer data from your Origin!
For the following commands, log in to your virtual machine from a new terminal window. Do not stop or close out of the terminal where you ran the Docker command to launch your Origin!!
If you accidentally stop the Origin service for some reason, return to the previous section (Initialize and Serve Your Pelican Origin) and repeat the steps.
By default, Pelican pulls objects through a caching system. When a new object is fetched, the object will first be copied to a nearby cache before being downloaded to the client. Subsequent pulls of an object with the same Pelican URL will fetch the copy from the nearby cache. This can greatly reduce the load on the origin serving the object.
After your Origin has been online for a few minutes, you can use the caching system to transfer the file, which is default mechanism of transfer.
-
Move to your home directory
cd $HOME
-
Get your object
pelican object get pelican://osdf-itb.osg-htc.org/NDCC24-pelicantrain20##/test.txt ./download.txt
You will need to change
pelicantrain20##
to the number used by your specific virtual machine, e.g.,pelicantrain2001
.If you do not change
pelicantrain20##
to the appropriate number for your virtual machine, you will get an error like this:ERROR[2024-07-02T13:29:40-05:00] Error while querying the Director: 404: No namespace found for path. Either it doesn't exist, or the Director is experiencing problems
-
Check the contents of the downloaded object
cat download.txt
The contents should match those of the file
$HOME/pelican-origin/data/test.txt
.
Now you will explore an important implication about how the caching system works.
-
Remove the previously downloaded objects
rm download.txt
-
Move to the
data
folder and rename thetest.txt
filecd $HOME/training-origin/pelican-origin/data mv test.txt renamed-test.txt
-
Move back to your home directory
cd $HOME
-
Try to download the object using caching system
pelican object get pelican://osdf-itb.osg-htc.org/NDCC24-pelicantrain20##/test.txt ./cacheread-test.txt
This will SUCCEED, even though an object with that name no longer exists at the origin!!
Because the object /NDCC24-pelicantrain20##/test.txt
had been previously transferred using the caching system, a copy of that object was kept in a cache.
By default, Pelican attempts to transfer objects from the nearest cache.
So the default transfer attempt succeeded for /NDCC24-pelicantrain20##/test.txt
because the Client downloaded the object from a cache.
If you were to attempt to download the object directly from the Origin, it would fail because that object no longer existed at the Origin.
This behavior applies generally to any change to objects stored in the Origin. If you change the name or contents of an object in the Origin, that change is not propagated to the caches where a copy of that object may be stored. And since by default the Client will download objects from a cache (if a copy of that object exists in a cache), that means the Client will download an older version of the object.
Worse still, the way that data is transferred through Pelican means that the object stored in the cache may be updated with the newer version, while in the middle of downloading the object from the cache. The result is that the first X bytes downloaded from the cache correspond to the old version, while the last Y bytes correspond to the new version - almost certainly resulting in a corrupted object at the client.
For this reason, any change to the content of an object at an origin should ALWAYS have a corresponding change in the object's name. (There is a development path under consideration for mitigating this issue.)