From 1f8d1a117383067f10b1305da349032c8d08785d Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 19 Jul 2024 14:07:13 -0500 Subject: [PATCH 01/25] Revise file transfer draft Added TOC, tarballs, minor formatting --- _uw-research-computing/osdf-fileXfer-draft.md | 57 ++++++++++++++++--- 1 file changed, 49 insertions(+), 8 deletions(-) diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md index a8025ee6..04241f39 100644 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ b/_uw-research-computing/osdf-fileXfer-draft.md @@ -9,7 +9,16 @@ guide: - htc --- -[toc] +{% capture content %} +1. [Data Storage Locations](#data-storage-locations) +2. [Understand your file sizes](#understand-your-file-sizes) + - [Use `ls` with `-lh` flags](#use-ls-with--lh-flags) + - [Use `du -h`](#use-ls-with--lh-flags) +3. [Using tarballs to consolidate many files](#using-tarballs-to-consolidate-many-files) +4. [Transferring Data to Jobs](#transferring-data-to-jobs) +5. [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) +{% endcapture %} +{% include /components/directory.html title="Table of Contents" %} # Data Storage Locations The HTC system has two primary locations where users can store files: `/home` and `/staging`. @@ -24,36 +33,68 @@ To know whether a file should be placed in `/home` or in `/staging`, you will ne The command `ls` stands for "list" and, by default, lists the files in your current directory. The flag `-l` stands for "long" and `-h` stands for "human-readable". When the flags are combined and passed to the `ls` command, it prints out the long metadata associated with the files and converts values such as file sizes into human-readable formats (instead of a computer readable format). ``` -NetID@submit$ ls -lh +[user@ap2002] $ ls -lh ``` +{:.term} ## Use `du -h` Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a human-readable format. ``` -NetID@submit$ du -h +[user@ap2002] $ du -h ``` +{:.term} +# Using tarballs to consolidate many files +Some computations require many smaller files. It is more efficient to transfer a single object that consolidates many smaller files than to transfer each file individually. One option to consolidate these files is to use a tarball, which can also compress your files. + +To create a tarball, use: +``` +[user@ap2002] $ tar -czf tarball.tar.gz files/to/be/compressed +``` +{:.term} + +When a directory is listed, the entire directory is compressed into the tarball. A list of objects may also be given. See the [`tar` manual page](https://www.gnu.org/software/tar/manual/html_node/index.html) for more options. The tarball object (i.e. `tarball.tar.gz`) can then be transferred using the protocols listed in the below section. + +Before running your computation, you may need to untar your tarball. To untar: +``` +[user@ap2002] $ tar -xzf tarball.tar.gz +``` +{:.term} # Transferring Data to Jobs -The HTCondor submit file `transfer_input_files =` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file is origionating from your `/home` or `/staging` directory. 
However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: +The HTCondor submit file `transfer_input_files =` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: | Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | | ----------- | ----------- | ----------- | ----------- | | 0-500 MB | /home | transfer_input_files = input.txt | -| 500-10GB | /staging | transfer_input_files = **osdf:///chtc/staging/NetID/input.txt | -| 10GB + | /staging | transfer_input_files = **file:///staging/NetID/input.txt | +| 500-10GB | /staging | transfer_input_files = osdf:///chtc/staging/NetID/input.txt | +| 10GB + | /staging | transfer_input_files = file:///staging/NetID/input.txt | +***What's the situation for osdf:/// or file:///? If we are going to leave this as-is, we will probably need to explain why there's a difference. -## Transfer Data Back from Jobs to `/home` or `/staging` + +# Transfer Data Back from Jobs to `/home` or `/staging` When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. To transfer files or folders back to `/staging`, in your HTCondor submit file, use -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt", where `output1.txt` is the name of the output file or folder you would like transfered back to a /staging directory. +``` +transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt" +``` +{:.sub} +where `output1.txt` is the name of the output file or folder you would like transfered back to a `/staging` directory. If you have more than one file or folder to transfer back to `/staging`, use a semicolon (;) to seperate multiple files for HTCondor to transfer back like so: +``` transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; output2.txt = file:///staging/NetID/output2.txt" +``` +{:.sub} Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps =`. + +# Related pages +- [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) +- [Transfer files between CHTC and your computer](/uw-research-computing/transfer-files-computer) +- [Transfer files between CHTC and ResearchDrive](/uw-research-computing/transfer-data-researchdrive) \ No newline at end of file From 1726f2667061add58b1f02072be42df466109181 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 19 Jul 2024 14:30:29 -0500 Subject: [PATCH 02/25] Expand transfer_output_files, more formatting --- _uw-research-computing/osdf-fileXfer-draft.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md index 04241f39..b43b0701 100644 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ b/_uw-research-computing/osdf-fileXfer-draft.md @@ -63,21 +63,27 @@ Before running your computation, you may need to untar your tarball. 
To untar: {:.term} # Transferring Data to Jobs -The HTCondor submit file `transfer_input_files =` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: +The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: | Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | | ----------- | ----------- | ----------- | ----------- | -| 0-500 MB | /home | transfer_input_files = input.txt | -| 500-10GB | /staging | transfer_input_files = osdf:///chtc/staging/NetID/input.txt | -| 10GB + | /staging | transfer_input_files = file:///staging/NetID/input.txt | +| 0-500 MB | /home | `transfer_input_files = input.txt` | +| 500-10GB | /staging | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | +| 10GB + | /staging | `transfer_input_files = file:///staging/NetID/input.txt` | ***What's the situation for osdf:/// or file:///? If we are going to leave this as-is, we will probably need to explain why there's a difference. # Transfer Data Back from Jobs to `/home` or `/staging` -When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. +When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. Files in subdirectories are *not* transferred. Ensure that the files you want are in the top level directory by moving them or creating tarballs. + +If you don't want to transfer all files but only *specific files*, in your HTCondor submit file, use +``` +transfer_output_files = file1.txt, file2.txt +``` +{:.sub} To transfer files or folders back to `/staging`, in your HTCondor submit file, use ``` @@ -92,7 +98,7 @@ transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; outpu ``` {:.sub} -Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps =`. +Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps`. # Related pages - [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) From 5ecba19108e9e35c96c2b5388df2d8f57be27d99 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 13 Sep 2024 13:44:27 -0500 Subject: [PATCH 03/25] language update --- _uw-research-computing/osdf-fileXfer-draft.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md index b43b0701..abb1c739 100644 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ b/_uw-research-computing/osdf-fileXfer-draft.md @@ -3,8 +3,8 @@ highlighter: none layout: guide title: HTC Data Storage Locations guide: - order: 6 - category: FILL IN THIS BLANK + order: 4 + category: Handling Data in Jobs tag: - htc --- @@ -23,7 +23,7 @@ guide: # Data Storage Locations The HTC system has two primary locations where users can store files: `/home` and `/staging`. 
-The mechanisms behind `/home` and `/staging` that manage data are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. +The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. # Understand your file sizes @@ -34,6 +34,10 @@ The command `ls` stands for "list" and, by default, lists the files in your curr ``` [user@ap2002] $ ls -lh +-rw-r--r-- 1 user user 0 Sep 13 13:34 data.csv +-rw-r--r-- 1 user user 0 Sep 13 13:34 job.sub +drwxr-xr-x 2 user user 4.0K Sep 13 13:36 sample_dir +-rwxr-xr-x 1 user user 0 Sep 13 13:34 script.sh ``` {:.term} @@ -42,6 +46,8 @@ Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a hum ``` [user@ap2002] $ du -h +166M ./sample_dir +166M . ``` {:.term} From 4a76d52b655aa102ce94fc439f775058a50caefa Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 20 Sep 2024 12:39:48 -0500 Subject: [PATCH 04/25] Reformat header level --- _uw-research-computing/osdf-fileXfer-draft.md | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md index abb1c739..d3a7a5ab 100644 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ b/_uw-research-computing/osdf-fileXfer-draft.md @@ -15,33 +15,33 @@ guide: - [Use `ls` with `-lh` flags](#use-ls-with--lh-flags) - [Use `du -h`](#use-ls-with--lh-flags) 3. [Using tarballs to consolidate many files](#using-tarballs-to-consolidate-many-files) -4. [Transferring Data to Jobs](#transferring-data-to-jobs) +4. [Transfer Data to Jobs](#transfer-data-to-jobs) 5. [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) {% endcapture %} {% include /components/directory.html title="Table of Contents" %} -# Data Storage Locations +## Data Storage Locations The HTC system has two primary locations where users can store files: `/home` and `/staging`. The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. -# Understand your file sizes +## Understand your file sizes To know whether a file should be placed in `/home` or in `/staging`, you will need to know it's file size (also known as the amount of "disk space" a file uses). There are many commands to print out your file sizes, but here are a few of our favorite: -## Use `ls` with `-lh` flags +### Use `ls` with `-lh` flags The command `ls` stands for "list" and, by default, lists the files in your current directory. 
The flag `-l` stands for "long" and `-h` stands for "human-readable". When the flags are combined and passed to the `ls` command, it prints out the long metadata associated with the files and converts values such as file sizes into human-readable formats (instead of a computer readable format). ``` [user@ap2002] $ ls -lh --rw-r--r-- 1 user user 0 Sep 13 13:34 data.csv --rw-r--r-- 1 user user 0 Sep 13 13:34 job.sub -drwxr-xr-x 2 user user 4.0K Sep 13 13:36 sample_dir --rwxr-xr-x 1 user user 0 Sep 13 13:34 script.sh +-rw-r--r-- 1 user user 237K Jul 17 11:25 data.csv +-rw-r--r-- 1 user user 723 Jul 17 11:56 job.sub +drwxr-xr-x 2 user user 4.0K Jul 17 13:36 sample_dir +-rw-r--r-- 1 user user 450 Jul 17 11:42 script.sh ``` {:.term} -## Use `du -h` +### Use `du -h` Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a human-readable format. ``` @@ -51,7 +51,7 @@ Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a hum ``` {:.term} -# Using tarballs to consolidate many files +## Using tarballs to consolidate many files Some computations require many smaller files. It is more efficient to transfer a single object that consolidates many smaller files than to transfer each file individually. One option to consolidate these files is to use a tarball, which can also compress your files. To create a tarball, use: @@ -68,7 +68,7 @@ Before running your computation, you may need to untar your tarball. To untar: ``` {:.term} -# Transferring Data to Jobs +## Transfer Data to Jobs The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: @@ -81,7 +81,7 @@ The HTCondor submit file `transfer_input_files` line should always be used to te ***What's the situation for osdf:/// or file:///? If we are going to leave this as-is, we will probably need to explain why there's a difference. -# Transfer Data Back from Jobs to `/home` or `/staging` +## Transfer Data Back from Jobs to `/home` or `/staging` When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. Files in subdirectories are *not* transferred. Ensure that the files you want are in the top level directory by moving them or creating tarballs. @@ -106,7 +106,7 @@ transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; outpu Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps`. 
-# Related pages +## Related pages - [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) - [Transfer files between CHTC and your computer](/uw-research-computing/transfer-files-computer) - [Transfer files between CHTC and ResearchDrive](/uw-research-computing/transfer-data-researchdrive) \ No newline at end of file From acdd6d4c2003ba2b69ff6dabc0c4f06ac2660403 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 30 Sep 2024 13:54:17 -0500 Subject: [PATCH 05/25] Remove disk usage/tar info --- _uw-research-computing/osdf-fileXfer-draft.md | 53 ++----------------- 1 file changed, 4 insertions(+), 49 deletions(-) diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md index d3a7a5ab..e5572087 100644 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ b/_uw-research-computing/osdf-fileXfer-draft.md @@ -10,13 +10,10 @@ guide: --- {% capture content %} -1. [Data Storage Locations](#data-storage-locations) -2. [Understand your file sizes](#understand-your-file-sizes) - - [Use `ls` with `-lh` flags](#use-ls-with--lh-flags) - - [Use `du -h`](#use-ls-with--lh-flags) -3. [Using tarballs to consolidate many files](#using-tarballs-to-consolidate-many-files) -4. [Transfer Data to Jobs](#transfer-data-to-jobs) -5. [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) +- [Data Storage Locations](#data-storage-locations) +- [Transfer Data to Jobs](#transfer-data-to-jobs) +- [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) +- [Related pages](#related-pages) {% endcapture %} {% include /components/directory.html title="Table of Contents" %} @@ -26,48 +23,6 @@ The HTC system has two primary locations where users can store files: `/home` an The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. -## Understand your file sizes -To know whether a file should be placed in `/home` or in `/staging`, you will need to know it's file size (also known as the amount of "disk space" a file uses). There are many commands to print out your file sizes, but here are a few of our favorite: - -### Use `ls` with `-lh` flags -The command `ls` stands for "list" and, by default, lists the files in your current directory. The flag `-l` stands for "long" and `-h` stands for "human-readable". When the flags are combined and passed to the `ls` command, it prints out the long metadata associated with the files and converts values such as file sizes into human-readable formats (instead of a computer readable format). - -``` -[user@ap2002] $ ls -lh --rw-r--r-- 1 user user 237K Jul 17 11:25 data.csv --rw-r--r-- 1 user user 723 Jul 17 11:56 job.sub -drwxr-xr-x 2 user user 4.0K Jul 17 13:36 sample_dir --rw-r--r-- 1 user user 450 Jul 17 11:42 script.sh -``` -{:.term} - -### Use `du -h` -Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a human-readable format. - -``` -[user@ap2002] $ du -h -166M ./sample_dir -166M . -``` -{:.term} - -## Using tarballs to consolidate many files -Some computations require many smaller files. 
It is more efficient to transfer a single object that consolidates many smaller files than to transfer each file individually. One option to consolidate these files is to use a tarball, which can also compress your files. - -To create a tarball, use: -``` -[user@ap2002] $ tar -czf tarball.tar.gz files/to/be/compressed -``` -{:.term} - -When a directory is listed, the entire directory is compressed into the tarball. A list of objects may also be given. See the [`tar` manual page](https://www.gnu.org/software/tar/manual/html_node/index.html) for more options. The tarball object (i.e. `tarball.tar.gz`) can then be transferred using the protocols listed in the below section. - -Before running your computation, you may need to untar your tarball. To untar: -``` -[user@ap2002] $ tar -xzf tarball.tar.gz -``` -{:.term} - ## Transfer Data to Jobs The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: From 651e74bde488b51d8311ede46af7d164cde75f65 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 4 Oct 2024 14:32:19 -0500 Subject: [PATCH 06/25] Rename page, add file sizes to table, text changes --- _uw-research-computing/osdf-fileXfer-draft.md | 67 ------------------- 1 file changed, 67 deletions(-) delete mode 100644 _uw-research-computing/osdf-fileXfer-draft.md diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md deleted file mode 100644 index e5572087..00000000 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -highlighter: none -layout: guide -title: HTC Data Storage Locations -guide: - order: 4 - category: Handling Data in Jobs - tag: - - htc ---- - -{% capture content %} -- [Data Storage Locations](#data-storage-locations) -- [Transfer Data to Jobs](#transfer-data-to-jobs) -- [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) -- [Related pages](#related-pages) -{% endcapture %} -{% include /components/directory.html title="Table of Contents" %} - -## Data Storage Locations -The HTC system has two primary locations where users can store files: `/home` and `/staging`. - -The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. - - -## Transfer Data to Jobs -The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. 
However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: - - -| Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | -| ----------- | ----------- | ----------- | ----------- | -| 0-500 MB | /home | `transfer_input_files = input.txt` | -| 500-10GB | /staging | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | -| 10GB + | /staging | `transfer_input_files = file:///staging/NetID/input.txt` | - -***What's the situation for osdf:/// or file:///? If we are going to leave this as-is, we will probably need to explain why there's a difference. - - -## Transfer Data Back from Jobs to `/home` or `/staging` - -When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. Files in subdirectories are *not* transferred. Ensure that the files you want are in the top level directory by moving them or creating tarballs. - -If you don't want to transfer all files but only *specific files*, in your HTCondor submit file, use -``` -transfer_output_files = file1.txt, file2.txt -``` -{:.sub} - -To transfer files or folders back to `/staging`, in your HTCondor submit file, use -``` -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt" -``` -{:.sub} -where `output1.txt` is the name of the output file or folder you would like transfered back to a `/staging` directory. - -If you have more than one file or folder to transfer back to `/staging`, use a semicolon (;) to seperate multiple files for HTCondor to transfer back like so: -``` -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; output2.txt = file:///staging/NetID/output2.txt" -``` -{:.sub} - -Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps`. 
- -## Related pages -- [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) -- [Transfer files between CHTC and your computer](/uw-research-computing/transfer-files-computer) -- [Transfer files between CHTC and ResearchDrive](/uw-research-computing/transfer-data-researchdrive) \ No newline at end of file From 0707294b58024a2f7ad1caf9823a133d318b8573 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 4 Oct 2024 14:39:31 -0500 Subject: [PATCH 07/25] Clarify language for transferring outputs --- .../htc-job-file-transfer.md | 70 +++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 _uw-research-computing/htc-job-file-transfer.md diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md new file mode 100644 index 00000000..4ae28937 --- /dev/null +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -0,0 +1,70 @@ +--- +highlighter: none +layout: guide +title: HTC Data Storage Locations +guide: + order: 4 + category: Handling Data in Jobs + tag: + - htc +--- + +{% capture content %} +- [Data Storage Locations](#data-storage-locations) +- [Transfer Data to Jobs](#transfer-data-to-jobs) +- [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) +- [Related pages](#related-pages) +{% endcapture %} +{% include /components/directory.html title="Table of Contents" %} + +## Data Storage Locations +The HTC system has two primary locations where users can place their files: +* `/home`: default location, good for smaller files to be transferred (<1 GB) +* `/staging`: for larger files to be transferred (>1 GB) + +The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes during file transfer in an HTCondor job. `/home` is more efficient for transferring smaller files, while `/staging` is more efficient at transferring larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. + +> If you need a `/staging` directory, [request one here](quota-request). + + +## Transfer Data to Jobs +The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: + +| Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | +| ----------- | ----------- | ----------- | ----------- | +| 0 - 1 GB | /home | `transfer_input_files = input.txt` | +| 1 GB - 30 GB | /staging | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | +| > 30 GB | /staging | `transfer_input_files = file:///staging/NetID/input.txt` | + +> Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system. + +## Transfer Data Back from Jobs to `/home` or `/staging` + +When a job completes, by default, HTCondor will return **newly created or edited files only in top-level directory** back to your `/home` directory. 
**Files in subdirectories are *not* transferred.** Ensure that the files you want are in the top-level directory by moving them or [creating tarballs](transfer-files-computer#c-transferring-multiple-files). + +If you don't want to transfer all files but only *specific files*, in your HTCondor submit file, use +``` +transfer_output_files = file1.txt, file2.txt +``` +{:.sub} + +To transfer a file or folder back to `/staging`, you will need an additional line in your HTCondor submit file: +``` +transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt" +``` +{:.sub} + +where `output1.txt` is the name of the output file or folder you would like transferred back to a `/staging` directory. Ensure you have the right file transfer syntax (`osdf://` or `file:///` depending on the anticipated file size). + +If you have multiple files or folders to transfer back to `/staging`, use a semicolon (;) to separate each object: +``` +transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; output2.txt = file:///staging/NetID/output2.txt" +``` +{:.sub} + +Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps`. + +## Related pages +- [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) +- [Transfer files between CHTC and your computer](/uw-research-computing/transfer-files-computer) +- [Transfer files between CHTC and ResearchDrive](/uw-research-computing/transfer-data-researchdrive) \ No newline at end of file From f003ab39ba6fc7698b101303f98450bba577c696 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Wed, 16 Oct 2024 15:29:07 -0500 Subject: [PATCH 08/25] typo --- _uw-research-computing/htc-job-file-transfer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 4ae28937..609a5c11 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -33,7 +33,7 @@ The HTCondor submit file `transfer_input_files` line should always be used to te | Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | | ----------- | ----------- | ----------- | ----------- | | 0 - 1 GB | /home | `transfer_input_files = input.txt` | -| 1 GB - 30 GB | /staging | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | +| 1 GB - 30 GB | /staging | `transfer_input_files = osdf://chtc/staging/NetID/input.txt` | | > 30 GB | /staging | `transfer_input_files = file:///staging/NetID/input.txt` | > Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system. 
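Taken together, the table and note above translate into a submit file along the following lines. This is a minimal sketch only: the executable name, `NetID`, and the input/output file names are placeholders to adapt to your own job.

```
# Sketch of a submit file mixing /home and /staging transfers (placeholder names)
executable = run_analysis.sh

# small input from /home, larger input from /staging (osdf:// syntax for a 1-30 GB file)
transfer_input_files = params.txt, osdf://chtc/staging/NetID/big_input.tar.gz

# send a large result back to /staging instead of /home
transfer_output_files = big_output.tar.gz
transfer_output_remaps = "big_output.tar.gz = file:///staging/NetID/big_output.tar.gz"

# require execute servers that can access /staging
Requirements = (Target.HasCHTCStaging == true)

log = job.log
output = job.out
error = job.err

request_cpus = 1
request_memory = 4GB
request_disk = 40GB

queue
```
{:.sub}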
From 83e43677ac0811d1a270689016f5b5eba7786322 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 18 Oct 2024 14:59:11 -0500 Subject: [PATCH 09/25] Moved old guides; updated new guides Small file transfer and SQUID are archived --- _layouts/file_avail.html | 19 +--- .../{ => archived}/file-avail-squid.md | 2 - .../{ => archived}/file-availability.md | 2 - .../file-avail-largedata.md | 101 ++++++------------ .../htc-job-file-transfer.md | 49 ++++++--- 5 files changed, 70 insertions(+), 103 deletions(-) rename _uw-research-computing/{ => archived}/file-avail-squid.md (99%) rename _uw-research-computing/{ => archived}/file-availability.md (99%) diff --git a/_layouts/file_avail.html b/_layouts/file_avail.html index 4817ab8f..9dadfede 100644 --- a/_layouts/file_avail.html +++ b/_layouts/file_avail.html @@ -9,9 +9,8 @@

Which Option is the Best for Your Files?

Input Sizes Output Sizes - Link to Guide File Location - How to Transfer + Syntax for transfer_input_files Availability, Security @@ -19,29 +18,17 @@

Which Option is the Best for Your Files?

0 - 100 MB per file, up to 500 MB per job 0 - 5 GB per job - Small Input/Output File Transfer via HTCondor /home - submit file; filename in transfer_input_files + No special syntax CHTC, UW Grid, and OSG; works for your jobs - - 100 MB - 1 GB per repeatedly-used file - Not available for output - Large Input File Availability Via Squid - /squid - submit file; http link in transfer_input_files - CHTC, UW Grid, and OSG; files are made *publicly-readable* via an HTTP address - - - 100 MB - TBs per job-specific file; repeatedly-used files > 1GB 4 GB - TBs per job - Large Input and Output File Availability Via Staging /staging - job executable; copy or move within the job + osdf:// or file:/// a portion of CHTC; accessible only to your jobs diff --git a/_uw-research-computing/file-avail-squid.md b/_uw-research-computing/archived/file-avail-squid.md similarity index 99% rename from _uw-research-computing/file-avail-squid.md rename to _uw-research-computing/archived/file-avail-squid.md index 4a99c994..91a560bb 100644 --- a/_uw-research-computing/file-avail-squid.md +++ b/_uw-research-computing/archived/file-avail-squid.md @@ -3,8 +3,6 @@ highlighter: none layout: file_avail title: Transfer Large Input Files Via Squid guide: - order: 1 - category: Handling Data in Jobs tag: - htc --- diff --git a/_uw-research-computing/file-availability.md b/_uw-research-computing/archived/file-availability.md similarity index 99% rename from _uw-research-computing/file-availability.md rename to _uw-research-computing/archived/file-availability.md index 1cf8d3ea..3b8c83ae 100644 --- a/_uw-research-computing/file-availability.md +++ b/_uw-research-computing/archived/file-availability.md @@ -4,8 +4,6 @@ layout: file_avail title: Small Input and Output File Availability Via HTCondor alt_title: Transfer Small Input and Output guide: - order: 0 - category: Handling Data in Jobs tag: - htc --- diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index be6466cf..b8e58022 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -48,13 +48,11 @@ familiar with:** Our large data staging location is only for input and output files that are individually too large to be managed by our other data movement -methods, HTCondor file transfer or SQUID. This includes individual input files +methods, HTCondor file transfer. This includes individual input files greater than 100MB and individual output files greater than 3-4GB. Users are expected to abide by this intended use expectation and follow the -instructions for using `/staging` written in this guide (e.g. files placed -in `/staging `should NEVER be listed in the submit file, but rather accessed -via the job's executable (aka .sh) script). +instructions for using `/staging` written in this guide. ## B. Access to Large Data Staging @@ -83,11 +81,11 @@ location (or any CHTC file system) at any time. ## D. Data Access Within Jobs - Staged large data will +Staged large data will be available only within the the CHTC pool, on a subset of our total capacity. -Staged data are owned by the user, and only the user's own +Staged data are owned by the user and only the user's own jobs can access these files (unless the user specifically modifies unix file permissions to make certain files available for other users). 
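For instance, a single staged file could be opened up for other users with a standard `chmod` adjustment; the path below is a placeholder, and the exact permission bits depend on how broadly you want to share:

```
[user@ap2002] $ chmod o+r /staging/NetID/shared_input.tar.gz
```
{:.term}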
@@ -141,7 +139,7 @@ files directly into this user directory from your own computer: ``` $ scp large.file username@transfer.chtc.wisc.edu:/staging/username/ ``` -{.term} +{:.term} - If using a Windows computer: - Using a file transfer application, like WinSCP, directly drag the large @@ -157,66 +155,32 @@ back at a later date. Files can be taken off of `/staging` using similar mechanisms as uploaded files (as above). # 3. Using Staged Files in a Job +## A. Transferring Large Input Files +Staged files should be specified in the job submit file using the `osdf://` or `file:///` syntax, +depending on the size of the files to be transferred. [See this table for more information](htc-job-file-transfer#transferring-data-to-jobs-with-transfer_input_files). -As shown above, the staging directory for large data is `/staging/username`. -All interaction with files in this location should occur within your job's -main executable. - -## A. Accessing Large Input Files - -To use large data placed in the `/staging` location, add commands to your -job executable that copy input -from `/staging` into the working directory of the job. Program should then use -files from the working directory, being careful to remove the coiped -files from the working -directory before the completion of the job (so that they're not copied -back to the submit server as perceived output). - -Example, if executable is a shell script: +``` +transfer_input_files = osdf://chtc/staging/username/file +``` +{:.sub} ``` -#!/bin/bash -# -# First, copy the compressed tar file from /staging into the working directory, -# and un-tar it to reveal your large input file(s) or directories: -cp /staging/username/large_input.tar.gz ./ -tar -xzvf large_input.tar.gz -# -# Command for myprogram, which will use files from the working directory -./myprogram large_input.txt myoutput.txt -# -# Before the script exits, make sure to remove the file(s) from the working directory -rm large_input.tar.gz large_input.txt -# -# END +transfer_input_files = file:///staging/username/file ``` -{: .file} +{:.sub} -## B. Moving Large Output Files +## B. Transferring Large Output Files -If jobs produce large (more than 3-4GB) output files, have -your executable write the output file(s) to a location within -the working directory, and then make sure to move this large file to -the `/staging` folder, so that it's not transferred back to the home directory, as -all other "new" files in the working directory will be. +By default, any new or changed files in the top-level directory are transferred to your working directory on `/home`. If you have large output files, this is undesirable. -Example, if executable is a shell script: +Large outputs should be transferred to staging using the same file transfer protocols in conjunction with `transfer_output_remaps`. ``` -#!/bin/bash -# -# Command to save output to the working directory: -./myprogram myinput.txt output_dir/ -# -# Tar and mv output to staging, then delete from the job working directory: -tar -czvf large_output.tar.gz output_dir/ other_large_files.txt -mv large_output.tar.gz /staging/username/ -rm other_large_files.txt -# -# END +transfer_output_files = file1, file2 +transfer_output_remaps = "file1 = osdf://chtc/staging/username/file1; file2 = file:///staging/username/file2" ``` -{: .file} +{:.sub} ## C. 
Handling Standard Output (if needed) @@ -246,12 +210,17 @@ run from a script (bash) executable: # # tar and move large files to staging so they're not copied to the submit server: tar -czvf large_stdout.tar.gz large_std.out -cp large_stdout.tar.gz /staging/username/subdirectory -rm large_std.out large_stdout.tar.gz # END ``` {: .file} +We also need to tell HTCondor to transfer the large standard output using the file transfer protocols above. +``` +transfer_output_files = file1, large_stdout.tar.gz +transfer_output_remaps = "large_stdout.tar.gz = osdf://chtc/staging/username/large_stdout.tar.gz;" +``` +{:.sub} + # 4. Submit Jobs Using Staged Data In order to properly submit jobs using staged large data, always do the following: @@ -261,13 +230,9 @@ In order to properly submit jobs using staged large data, always do the followin In your submit file: -- **No large data in the submit file**: Do NOT list any `/staging` files in any of the submit file - lines, including: `executable, log, output, error, transfer_input_files`. Rather, your - job's ENTIRE interaction with files in `/staging` needs to occur - WITHIN each job's executable, when it runs within the job (as shown [above](#3-using-staged-files-in-a-job)) - **Request sufficient disk space**: Using `request_disk`, request an amount of disk -space that reflects the total of a) input data that each job will copy into - the job working directory from `/staging,` and b) any output that +space that reflects the total of (a) input data that each job will copy into + the job working directory from `/staging,` and (b) any output that will be created in the job working directory. - **Require access to `/staging`**: Include the CHTC specific attribute that requires servers with access to `/staging` @@ -285,8 +250,7 @@ log = myprogram.log output = $(Cluster).out error = $(Cluster).err -## Do NOT list the large data files here -transfer_input_files = myprogram +transfer_input_files = osdf://chtc/staging/username/myprogram, file:///staging/username/largedata.tar.gz # IMPORTANT! Require execute servers that can access /staging Requirements = (Target.HasCHTCStaging == true) @@ -296,11 +260,6 @@ Requirements = (Target.HasCHTCStaging == true) queue ``` -> **Note: in no way should files on `/staging` be specified in the submit file, -> directly or indirectly!** For example, do not use the `initialdir` option ( -> [Submitting Multiple Jobs in Individual Directories](multiple-job-dirs.html)) -> to specify a directory on `/staging`. - # 5. 
Checking your Quota, Data Use, and File Counts You can use the command `get_quotas` to see what disk diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 609a5c11..2f1180fb 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -1,9 +1,9 @@ --- highlighter: none layout: guide -title: HTC Data Storage Locations +title: Data Storage Locations on the HTC guide: - order: 4 + order: 1 category: Handling Data in Jobs tag: - htc @@ -19,29 +19,54 @@ guide: ## Data Storage Locations The HTC system has two primary locations where users can place their files: -* `/home`: default location, good for smaller files to be transferred (<1 GB) -* `/staging`: for larger files to be transferred (>1 GB) +### /home +* The default location for files and job submission +* Efficiently handles many files +* Smaller input files (<100 MB) should be placed here -The data management mechanisms behind `/home` and `/staging` that are different and are optimized to handle different file sizes during file transfer in an HTCondor job. `/home` is more efficient for transferring smaller files, while `/staging` is more efficient at transferring larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. +### /staging +* Expandable storage system but cannot efficiently handle many files +* Larger input files (>100 MB) should be placed here, including container images (.sif) + +The data management mechanisms behind `/home` and `/staging` are different and are optimized to handle different file sizes and numbers of files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. > If you need a `/staging` directory, [request one here](quota-request). -## Transfer Data to Jobs -The HTCondor submit file `transfer_input_files` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: +## Transferring Data to Jobs with `transfer_input_files` + +In the HTCondor submit file, `transfer_input_files` should always be used to tell HTCondor what files to transfer to each job, regardless of if that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your job will change depending on the file size. 
| Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | | ----------- | ----------- | ----------- | ----------- | -| 0 - 1 GB | /home | `transfer_input_files = input.txt` | -| 1 GB - 30 GB | /staging | `transfer_input_files = osdf://chtc/staging/NetID/input.txt` | -| > 30 GB | /staging | `transfer_input_files = file:///staging/NetID/input.txt` | +| 0 - 100 MB | `/home` | `transfer_input_files = input.txt` | +| 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf://chtc/staging/NetID/input.txt` | +| > 30 GB | `/staging` | `transfer_input_files = file:///staging/NetID/input.txt` | + +Multiple input files and file transfer protocols can be specified and delimited by commas, as shown below: + +``` +# My job submit file + +transfer_input_files = file1, osdf://chtc/staging/username/file2, file:///staging/username/file3, dir1, dir2/ + +... other submit file details ... +``` +{:.sub} + +Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system. + +### Important Note: File Transfers and Caching with `osdf://` +The `osdf://` file transfer protocol uses a [caching](https://en.wikipedia.org/wiki/Cache_(computing)) mechanism for input files to reduce file transfers over the network. This can affect users who refer to input files that are frequently modified. -> Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system. +*If you are changing the contents of the input files frequently, you should rename the file or change its path to ensure the new version is transferred.* -## Transfer Data Back from Jobs to `/home` or `/staging` +## Transferring Data Back from Jobs to `/home` or `/staging` +### Default Behavior for Transferring Output Files When a job completes, by default, HTCondor will return **newly created or edited files only in top-level directory** back to your `/home` directory. **Files in subdirectories are *not* transferred.** Ensure that the files you want are in the top-level directory by moving them or [creating tarballs](transfer-files-computer#c-transferring-multiple-files). 
+### Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps` If you don't want to transfer all files but only *specific files*, in your HTCondor submit file, use ``` transfer_output_files = file1.txt, file2.txt From 3d121a1b90e9f19d3fafc431c898fb6983841275 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 18 Oct 2024 15:01:32 -0500 Subject: [PATCH 10/25] fix table of contents --- _uw-research-computing/htc-job-file-transfer.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 2f1180fb..260f7544 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -11,8 +11,13 @@ guide: {% capture content %} - [Data Storage Locations](#data-storage-locations) -- [Transfer Data to Jobs](#transfer-data-to-jobs) -- [Transfer Data Back from Jobs to `/home` or `/staging`](#transfer-data-back-from-jobs-to-home-or-staging) + * [/home](#home) + * [/staging](#staging) +- [Transferring Data to Jobs with `transfer_input_files`](#transferring-data-to-jobs-with-transfer_input_files) + * [Important Note: File Transfers and Caching with `osdf://`](#important-note-file-transfers-and-caching-with-osdf) +- [Transferring Data Back from Jobs to `/home` or `/staging`](#transferring-data-back-from-jobs-to-home-or-staging) + * [Default Behavior for Transferring Output Files](#default-behavior-for-transferring-output-files) + * [Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps`](#specify-which-output-files-to-transfer-with-transfer_output_files-and-transfer_output_remaps) - [Related pages](#related-pages) {% endcapture %} {% include /components/directory.html title="Table of Contents" %} From b31a53e4f9135adaffd229df3999715265bda8e3 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 18 Oct 2024 15:11:55 -0500 Subject: [PATCH 11/25] Change layout to file_avail --- _uw-research-computing/htc-job-file-transfer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 260f7544..7e3821d0 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -1,6 +1,6 @@ --- highlighter: none -layout: guide +layout: file_avail title: Data Storage Locations on the HTC guide: order: 1 From b90dbce19f44ca592466cb00cdb7784f928da106 Mon Sep 17 00:00:00 2001 From: Amber Lim <59936462+xamberl@users.noreply.github.com> Date: Thu, 24 Oct 2024 14:26:49 -0500 Subject: [PATCH 12/25] Update _layouts/file_avail.html Co-authored-by: Christina K. --- _layouts/file_avail.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_layouts/file_avail.html b/_layouts/file_avail.html index 9dadfede..d1e8b37b 100644 --- a/_layouts/file_avail.html +++ b/_layouts/file_avail.html @@ -29,7 +29,7 @@

Which Option is the Best for Your Files?

4 GB - TBs per job /staging osdf:// or file:/// - a portion of CHTC; accessible only to your jobs + all of CHTC/external pools or a subset of CHTC From 66e0ded6fc369a1ff3253fe80cc8be53bd51b799 Mon Sep 17 00:00:00 2001 From: Amber Lim <59936462+xamberl@users.noreply.github.com> Date: Thu, 24 Oct 2024 14:27:16 -0500 Subject: [PATCH 13/25] Update _layouts/file_avail.html Co-authored-by: Christina K. --- _layouts/file_avail.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_layouts/file_avail.html b/_layouts/file_avail.html index d1e8b37b..ff187358 100644 --- a/_layouts/file_avail.html +++ b/_layouts/file_avail.html @@ -20,7 +20,7 @@

Which Option is the Best for Your Files?

0 - 5 GB per job /home No special syntax - CHTC, UW Grid, and OSG; works for your jobs + CHTC and external pools From 9e15f73857e3586d3b093de4885f2142fe000969 Mon Sep 17 00:00:00 2001 From: Amber Lim <59936462+xamberl@users.noreply.github.com> Date: Thu, 24 Oct 2024 14:28:28 -0500 Subject: [PATCH 14/25] Update _uw-research-computing/file-avail-largedata.md Co-authored-by: Christina K. --- _uw-research-computing/file-avail-largedata.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index b8e58022..9c50a68c 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -174,7 +174,7 @@ transfer_input_files = file:///staging/username/file By default, any new or changed files in the top-level directory are transferred to your working directory on `/home`. If you have large output files, this is undesirable. -Large outputs should be transferred to staging using the same file transfer protocols in conjunction with `transfer_output_remaps`. +Large outputs should be transferred to staging using the same file transfer protocols in HTCondor's `transfer_output_remaps` option: ``` transfer_output_files = file1, file2 From b967d0e4b7c35c147e7cf57ceb66942e87d61f3d Mon Sep 17 00:00:00 2001 From: Amber Lim <59936462+xamberl@users.noreply.github.com> Date: Thu, 24 Oct 2024 14:28:50 -0500 Subject: [PATCH 15/25] Update _uw-research-computing/htc-job-file-transfer.md Co-authored-by: Christina K. --- _uw-research-computing/htc-job-file-transfer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 7e3821d0..aad1fe6f 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -30,7 +30,7 @@ The HTC system has two primary locations where users can place their files: * Smaller input files (<100 MB) should be placed here ### /staging -* Expandable storage system but cannot efficiently handle many files +* Expandable storage system but cannot efficiently handle many small (few MB or less) files * Larger input files (>100 MB) should be placed here, including container images (.sif) The data management mechanisms behind `/home` and `/staging` are different and are optimized to handle different file sizes and numbers of files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. From 7912001ba12060dc641c2ff219b68be228de754b Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 24 Oct 2024 14:44:46 -0500 Subject: [PATCH 16/25] Update example for `transfer_input_files` --- _uw-research-computing/file-avail-largedata.md | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index 9c50a68c..43c9d3b7 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -160,12 +160,7 @@ Staged files should be specified in the job submit file using the `osdf://` or ` depending on the size of the files to be transferred. [See this table for more information](htc-job-file-transfer#transferring-data-to-jobs-with-transfer_input_files). 
``` -transfer_input_files = osdf://chtc/staging/username/file -``` -{:.sub} - -``` -transfer_input_files = file:///staging/username/file +transfer_input_files = osdf://chtc/staging/username/file1, file:///staging/username/file2, file3 ``` {:.sub} From 7352fc0b36836237c11c3c04be832cc861ee45f0 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 24 Oct 2024 14:45:11 -0500 Subject: [PATCH 17/25] unpublish the archived pages --- _uw-research-computing/archived/file-avail-squid.md | 1 + _uw-research-computing/archived/file-availability.md | 1 + 2 files changed, 2 insertions(+) diff --git a/_uw-research-computing/archived/file-avail-squid.md b/_uw-research-computing/archived/file-avail-squid.md index 91a560bb..3dca9099 100644 --- a/_uw-research-computing/archived/file-avail-squid.md +++ b/_uw-research-computing/archived/file-avail-squid.md @@ -2,6 +2,7 @@ highlighter: none layout: file_avail title: Transfer Large Input Files Via Squid +published: false guide: tag: - htc diff --git a/_uw-research-computing/archived/file-availability.md b/_uw-research-computing/archived/file-availability.md index 3b8c83ae..8d870ce3 100644 --- a/_uw-research-computing/archived/file-availability.md +++ b/_uw-research-computing/archived/file-availability.md @@ -3,6 +3,7 @@ highlighter: none layout: file_avail title: Small Input and Output File Availability Via HTCondor alt_title: Transfer Small Input and Output +published: false guide: tag: - htc From 98b11a547b9838e6e5082c38544cae054a5f4f18 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 24 Oct 2024 14:46:45 -0500 Subject: [PATCH 18/25] Remove section for large std output --- .../file-avail-largedata.md | 39 ------------------- 1 file changed, 39 deletions(-) diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index 43c9d3b7..261224c6 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -177,45 +177,6 @@ transfer_output_remaps = "file1 = osdf://chtc/staging/username/file1; file2 = fi ``` {:.sub} -## C. Handling Standard Output (if needed) - -In some instances, your software may produce very large standard output -(what would typically be output to the command screen, if you ran the -command for yourself, instead of having HTCondor do it). Because such -standard output from your software will usually be captured by HTCondor -in the submit file "output" file, this "output" file WILL still be -transferred by HTCondor back to your home directory on the submit -server, which may be very bad for you and others, if that captured -standard output is very large. - -In these cases, it is useful to redirect the standard output of commands -in your executable to a file in the working directory, and then move it -into `/staging` at the end of the job. - -Example, if "`myprogram`" produces very large standard output, and is -run from a script (bash) executable: - -``` -#!/bin/bash -# -# script to run myprogram, -# -# redirecting large standard output to a file in the working directory: -./myprogram myinput.txt myoutput.txt > large_std.out -# -# tar and move large files to staging so they're not copied to the submit server: -tar -czvf large_stdout.tar.gz large_std.out -# END -``` -{: .file} - -We also need to tell HTCondor to transfer the large standard output using the file transfer protocols above. 
-``` -transfer_output_files = file1, large_stdout.tar.gz -transfer_output_remaps = "large_stdout.tar.gz = osdf://chtc/staging/username/large_stdout.tar.gz;" -``` -{:.sub} - # 4. Submit Jobs Using Staged Data In order to properly submit jobs using staged large data, always do the following: From 97c14aa9ae327bc7048088f9d145958ee9ee14b2 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 24 Oct 2024 14:51:12 -0500 Subject: [PATCH 19/25] Update links and toc --- .../file-avail-largedata.md | 26 ++++++++++++------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index 261224c6..abe42dd4 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -22,17 +22,25 @@ familiar with:** 1. Using the command-line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables"). -2. CHTC's [Intro to Running HTCondor Jobs](helloworld.html) -3. CHTC's guide for [Typical File Transfer](file-availability.html) +2. CHTC's [Intro to Running HTCondor Jobs](htcondor-job-submission) +3. CHTC's guide for [Typical File Transfer](htc-job-file-transfer) {% capture content %} -1. [Policies and Intended Use](#1-policies-and-intended-use) -2. [Staging Large Data](#2-staging-large-data) -3. [Using Staged Files in a Job](#3-using-staged-files-in-a-job) - * [Accessing Large Input Files](#a-accessing-large-input-files) - * [Moving Large Output Files](#b-moving-large-output-files) -4. [Submit Jobs Using Staged Data](#4-submit-jobs-using-staged-data) -5. [Checking your Quota, Data Use, and File Counts](#5-checking-your-quota-data-use-and-file-counts) +- [1. Policies and Intended Use](#1-policies-and-intended-use) + * [A. Intended Use](#a-intended-use) + * [B. Access to Large Data Staging](#b-access-to-large-data-staging) + * [C. User Data Management Responsibilities](#c-user-data-management-responsibilities) + * [D. Data Access Within Jobs](#d-data-access-within-jobs) +- [2. Staging Large Data](#2-staging-large-data) + * [A. Get a Directory](#a-get-a-directory) + * [B. Reduce File Counts](#b-reduce-file-counts) + * [C. Use the Transfer Server](#c-use-the-transfer-server) + * [D. Remove Files After Jobs Complete](#d-remove-files-after-jobs-complete) +- [3. Using Staged Files in a Job](#3-using-staged-files-in-a-job) + * [A. Transferring Large Input Files](#a-transferring-large-input-files) + * [B. Transferring Large Output Files](#b-transferring-large-output-files) +- [4. Submit Jobs Using Staged Data](#4-submit-jobs-using-staged-data) +- [5. 
Checking your Quota, Data Use, and File Counts](#5-checking-your-quota-data-use-and-file-counts) {% endcapture %} {% include /components/directory.html title="Table of Contents" %} From b8a8fcb5481501437ce6747e4eb5f78023fc4b75 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Thu, 24 Oct 2024 15:15:52 -0500 Subject: [PATCH 20/25] Add redirects; add info for datasets > 100GB --- _redirects/file-avail-squid.md | 5 +++++ _redirects/file-availability.md | 5 +++++ _uw-research-computing/htc-job-file-transfer.md | 1 + 3 files changed, 11 insertions(+) create mode 100644 _redirects/file-avail-squid.md create mode 100644 _redirects/file-availability.md diff --git a/_redirects/file-avail-squid.md b/_redirects/file-avail-squid.md new file mode 100644 index 00000000..c4d9dfbb --- /dev/null +++ b/_redirects/file-avail-squid.md @@ -0,0 +1,5 @@ +--- +layout: redirect +redirect_url: /uw-research-computing/htc-job-file-transfer +permalink: /uw-research-computing/file-avail-squid +--- diff --git a/_redirects/file-availability.md b/_redirects/file-availability.md new file mode 100644 index 00000000..8456db53 --- /dev/null +++ b/_redirects/file-availability.md @@ -0,0 +1,5 @@ +--- +layout: redirect +redirect_url: /uw-research-computing/htc-job-file-transfer +permalink: /uw-research-computing/file-availability +--- diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index aad1fe6f..64738e06 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -47,6 +47,7 @@ In the HTCondor submit file, `transfer_input_files` should always be used to tel | 0 - 100 MB | `/home` | `transfer_input_files = input.txt` | | 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf://chtc/staging/NetID/input.txt` | | > 30 GB | `/staging` | `transfer_input_files = file:///staging/NetID/input.txt` | +| > 100 GB | | For larger datasets (100GB+ per job), contact the facilitation team about the best strategy to stage your data | Multiple input files and file transfer protocols can be specified and delimited by commas, as shown below: From b1466770b8888bb6c9d5fc9adb0ab784838026eb Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Fri, 25 Oct 2024 11:02:49 -0500 Subject: [PATCH 21/25] Move checking quota to one page --- _uw-research-computing/check-quota.md | 63 ++++++++++++------- .../file-avail-largedata.md | 29 ++------- 2 files changed, 45 insertions(+), 47 deletions(-) diff --git a/_uw-research-computing/check-quota.md b/_uw-research-computing/check-quota.md index cfc83a59..ee39f0ec 100644 --- a/_uw-research-computing/check-quota.md +++ b/_uw-research-computing/check-quota.md @@ -10,29 +10,25 @@ guide: --- The following commands will allow you to monitor the amount of disk -space you are using in your home directory on our (or another) submit node and to determine the -amount of disk space you have been allotted (your quota). - -If you also have a `/staging` directory on the HTC system, see our -[staging guide](file-avail-largedata.html#5-checking-your-quota-data-use-and-file-counts) for -details on how to check your quota and usage. -\ -The default quota allotment on CHTC submit nodes is 20 GB with a hard -limit of 30 GB (at which point you cannot write more files).\ -\ -**Note: The CHTC submit nodes are not backed up, so you will want to +space you are using in your home directory on the access point and to determine the +amount of disk space you have been allotted (your quota). 
+ +The default quota allotment in your `/home` directory is 20 GB with a hard +limit of 30 GB (at which point you cannot write more files). + +**Note: The CHTC access points are not backed up, so you should copy completed jobs to a secure location as soon as a batch completes, and then delete them on the submit node in order to make room for future -jobs.** If you need more disk space to run a single batch or concurrent -batches of jobs, please contact us ([Get Help!](get-help.html)). We have multiple ways of dealing with large disk space -requirements to make things easier for you. +jobs.** Disk space provided is intended for *active* calculations only, not permanent storage. +If you need more disk space to run a single batch or concurrent +batches of jobs, please contact us ([Get Help!](get-help.html)). We have multiple ways of dealing with large disk space requirements to make things easier for you. If you wish to change your quotas, please see [Request a Quota Change](quota-request). -**1. Checking Your User Quota and Usage** +**1. Checking Your `/home` Quota and Usage** ------------------------------------- -From any directory location within your home directory, type +From any directory location within your `/home` directory, use the command `quota -vs`. See the example below: ``` @@ -43,18 +39,39 @@ Disk quotas for user alice (uid 20384): ``` {:.term} -The output will list your total data usage under `blocks`, your soft +The output will list your total data usage under `space`, your soft `quota`, and your hard `limit` at which point your jobs will no longer -be allowed to save data. Each of the values given are in 1-kilobyte +be allowed to save data. Each value is given in 1-kilobyte blocks, so you can divide each number by 1024 to get megabytes (MB), and -again for gigabytes (GB). (It also lists information for ` files`, but -we don\'t typically allocate disk space by file count.) +again for gigabytes (GB). (It also lists information for number of `files`, but +we don't typically allocate disk space in `/home` by file count.) + +**2. Checking Your `/staging` Quota and Usage** +------------------------------------------------ +Users may have a `/staging` directory, meant for staging large files and data intended for +job submission. See our [Managing Large Data in HTC Jobs](file-avail-largedata) guide for +more information. + +To check your `/staging` quota, use the command `get_quotas /staging/username`. + +``` +[alice@submit]$ get_quotas /staging/alice +Path Quota(GB) Items Disk_Usage(GB) Items_Usage +/staging/alice 20 5 3.18969 5 +``` +{:.term} + +Your `/staging` directory has a disk and item quota. In the example above, the disk quota is +20 GB, and the items quota is 5 items. The current usage is printed in the following columns; +in the example, the user has used 3.19 GB and 5 items. + +To request a quota increase, [fill out our quota request form](quota-request). -**2. Checking the Size of Directories and Contents** +**3. Checking the Size of Directories and Contents** ------------------------------------------------ -Move to the directory you\'d like to check and type `du` . After several -moments (longer if you\'re directory contents are large), the command +Move to the directory you'd like to check and type `du` . After several +moments (longer if the contents of your directory are large), the command will add up the sizes of directory contents and output the total size of each contained directory in units of kilobytes with the total size of that directory listed last. 
See the example below: diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index abe42dd4..81a01f67 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -40,7 +40,7 @@ familiar with:** * [A. Transferring Large Input Files](#a-transferring-large-input-files) * [B. Transferring Large Output Files](#b-transferring-large-output-files) - [4. Submit Jobs Using Staged Data](#4-submit-jobs-using-staged-data) -- [5. Checking your Quota, Data Use, and File Counts](#5-checking-your-quota-data-use-and-file-counts) +- [5. Related Pages](#5-related-pages) {% endcapture %} {% include /components/directory.html title="Table of Contents" %} @@ -224,27 +224,8 @@ Requirements = (Target.HasCHTCStaging == true) queue ``` -# 5. Checking your Quota, Data Use, and File Counts +# 5. Related Pages -You can use the command `get_quotas` to see what disk -and items quotas are currently set for a given directory path. -This command will also let you see how much disk is in use and how many -items are present in a directory: - -``` -[username@transfer ~]$ get_quotas /staging/username -``` -{:.term} - -Alternatively, the `ncdu` command can also be used to see how many -files and directories are contained in a given path: - -``` -[username@transfer ~]$ ncdu /staging/username -``` -{:.term} - -When `ncdu` has finished running, the output will give you a total file -count and allow you to navigate between subdirectories for even more -details. Type `q` when you\'re ready to exit the output viewer. More -info here: +* [Data Storage Locations on the HTC](htc-job-file-transfer) +* [Check Disk Quota and Usage](check-quota) +* [Request a /staging directory or quota change](quota-request) \ No newline at end of file From 7182b2a84a5981fd4043c03e8312231e2908d425 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 28 Oct 2024 11:10:44 -0500 Subject: [PATCH 22/25] Update title; add more to `transfer_output_remaps` --- _uw-research-computing/htc-job-file-transfer.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 64738e06..743b755f 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -1,7 +1,7 @@ --- highlighter: none layout: file_avail -title: Data Storage Locations on the HTC +title: Using and transferring data in jobs on the HTC system guide: order: 1 category: Handling Data in Jobs @@ -75,17 +75,17 @@ When a job completes, by default, HTCondor will return **newly created or edited ### Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps` If you don't want to transfer all files but only *specific files*, in your HTCondor submit file, use ``` -transfer_output_files = file1.txt, file2.txt +transfer_output_files = file1.txt, file2.txt, file3.txt ``` {:.sub} To transfer a file or folder back to `/staging`, you will need an additional line in your HTCondor submit file: ``` -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt" +transfer_output_remaps = "file1.txt = file:///staging/NetID/output1.txt; file2.txt = /home/NetId/outputs/output2.txt" ``` {:.sub} -where `output1.txt` is the name of the output file or folder you would like transferred back to a `/staging` directory. 
Ensure you have the right file transfer syntax (`osdf://` or `file:///` depending on the anticipated file size). +In this example above, `file1.txt` is remapped to the staging directory using the `file:///` transfer protocol and simultaneously renamed `output1.txt`. In addition, `file2.txt` is renamed to `output2.txt`and will be transferred to a different directory on `/home`. Ensure you have the right file transfer syntax (`osdf://` or `file:///` depending on the anticipated file size). If you have multiple files or folders to transfer back to `/staging`, use a semicolon (;) to separate each object: ``` From 0452a7dfc61cc67c18cb49c95feedf4107f7223b Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 9 Dec 2024 16:58:00 -0600 Subject: [PATCH 23/25] Find and replace osdf with 2 slashes to 3 --- _layouts/file_avail.html | 2 +- _uw-research-computing/file-avail-largedata.md | 8 ++++---- _uw-research-computing/htc-job-file-transfer.md | 12 ++++++------ 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/_layouts/file_avail.html b/_layouts/file_avail.html index ff187358..be922b4e 100644 --- a/_layouts/file_avail.html +++ b/_layouts/file_avail.html @@ -28,7 +28,7 @@

Which Option is the Best for Your Files?

100 MB - TBs per job-specific file; repeatedly-used files > 1GB 4 GB - TBs per job /staging - osdf:// or file:/// + osdf:/// or file:/// all of CHTC/external pools or a subset of CHTC diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index 81a01f67..fd2c7582 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -164,11 +164,11 @@ mechanisms as uploaded files (as above). # 3. Using Staged Files in a Job ## A. Transferring Large Input Files -Staged files should be specified in the job submit file using the `osdf://` or `file:///` syntax, +Staged files should be specified in the job submit file using the `osdf:///` or `file:///` syntax, depending on the size of the files to be transferred. [See this table for more information](htc-job-file-transfer#transferring-data-to-jobs-with-transfer_input_files). ``` -transfer_input_files = osdf://chtc/staging/username/file1, file:///staging/username/file2, file3 +transfer_input_files = osdf:///chtc/staging/username/file1, file:///staging/username/file2, file3 ``` {:.sub} @@ -181,7 +181,7 @@ Large outputs should be transferred to staging using the same file transfer prot ``` transfer_output_files = file1, file2 -transfer_output_remaps = "file1 = osdf://chtc/staging/username/file1; file2 = file:///staging/username/file2" +transfer_output_remaps = "file1 = osdf:///chtc/staging/username/file1; file2 = file:///staging/username/file2" ``` {:.sub} @@ -214,7 +214,7 @@ log = myprogram.log output = $(Cluster).out error = $(Cluster).err -transfer_input_files = osdf://chtc/staging/username/myprogram, file:///staging/username/largedata.tar.gz +transfer_input_files = osdf:///chtc/staging/username/myprogram, file:///staging/username/largedata.tar.gz # IMPORTANT! 
Require execute servers that can access /staging Requirements = (Target.HasCHTCStaging == true) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index 743b755f..fd24808a 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -14,7 +14,7 @@ guide: * [/home](#home) * [/staging](#staging) - [Transferring Data to Jobs with `transfer_input_files`](#transferring-data-to-jobs-with-transfer_input_files) - * [Important Note: File Transfers and Caching with `osdf://`](#important-note-file-transfers-and-caching-with-osdf) + * [Important Note: File Transfers and Caching with `osdf:///`](#important-note-file-transfers-and-caching-with-osdf) - [Transferring Data Back from Jobs to `/home` or `/staging`](#transferring-data-back-from-jobs-to-home-or-staging) * [Default Behavior for Transferring Output Files](#default-behavior-for-transferring-output-files) * [Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps`](#specify-which-output-files-to-transfer-with-transfer_output_files-and-transfer_output_remaps) @@ -45,7 +45,7 @@ In the HTCondor submit file, `transfer_input_files` should always be used to tel | Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | | ----------- | ----------- | ----------- | ----------- | | 0 - 100 MB | `/home` | `transfer_input_files = input.txt` | -| 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf://chtc/staging/NetID/input.txt` | +| 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | | > 30 GB | `/staging` | `transfer_input_files = file:///staging/NetID/input.txt` | | > 100 GB | | For larger datasets (100GB+ per job), contact the facilitation team about the best strategy to stage your data | @@ -54,7 +54,7 @@ Multiple input files and file transfer protocols can be specified and delimited ``` # My job submit file -transfer_input_files = file1, osdf://chtc/staging/username/file2, file:///staging/username/file3, dir1, dir2/ +transfer_input_files = file1, osdf:///chtc/staging/username/file2, file:///staging/username/file3, dir1, dir2/ ... other submit file details ... ``` @@ -62,8 +62,8 @@ transfer_input_files = file1, osdf://chtc/staging/username/file2, file:///stagin Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system. -### Important Note: File Transfers and Caching with `osdf://` -The `osdf://` file transfer protocol uses a [caching](https://en.wikipedia.org/wiki/Cache_(computing)) mechanism for input files to reduce file transfers over the network. This can affect users who refer to input files that are frequently modified. +### Important Note: File Transfers and Caching with `osdf:///` +The `osdf:///` file transfer protocol uses a [caching](https://en.wikipedia.org/wiki/Cache_(computing)) mechanism for input files to reduce file transfers over the network. This can affect users who refer to input files that are frequently modified. 
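One possible way to handle this, sketched with placeholder names (`alice`, `my_inputs_v2.tar.gz`), is to stage the revised archive under a new, versioned name so that a previously cached copy can never be matched:

```
# run on the transfer server: keep the old object, add the new version alongside it
cp my_inputs_v2.tar.gz /staging/alice/my_inputs_v2.tar.gz
```
{:.term}

The submit file would then reference `osdf:///chtc/staging/alice/my_inputs_v2.tar.gz` rather than the original object name.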
*If you are changing the contents of the input files frequently, you should rename the file or change its path to ensure the new version is transferred.* @@ -85,7 +85,7 @@ transfer_output_remaps = "file1.txt = file:///staging/NetID/output1.txt; file2.t ``` {:.sub} -In this example above, `file1.txt` is remapped to the staging directory using the `file:///` transfer protocol and simultaneously renamed `output1.txt`. In addition, `file2.txt` is renamed to `output2.txt`and will be transferred to a different directory on `/home`. Ensure you have the right file transfer syntax (`osdf://` or `file:///` depending on the anticipated file size). +In this example above, `file1.txt` is remapped to the staging directory using the `file:///` transfer protocol and simultaneously renamed `output1.txt`. In addition, `file2.txt` is renamed to `output2.txt`and will be transferred to a different directory on `/home`. Ensure you have the right file transfer syntax (`osdf:///` or `file:///` depending on the anticipated file size). If you have multiple files or folders to transfer back to `/staging`, use a semicolon (;) to separate each object: ``` From b3831f986683ccadb39021f9e61c6456ec2c40cb Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 9 Dec 2024 17:01:23 -0600 Subject: [PATCH 24/25] Remove unused draft --- .../file-avail-largedata-test.md | 651 ------------------ 1 file changed, 651 deletions(-) delete mode 100644 _uw-research-computing/file-avail-largedata-test.md diff --git a/_uw-research-computing/file-avail-largedata-test.md b/_uw-research-computing/file-avail-largedata-test.md deleted file mode 100644 index 0522644d..00000000 --- a/_uw-research-computing/file-avail-largedata-test.md +++ /dev/null @@ -1,651 +0,0 @@ ---- -highlighter: none -layout: guide -title: Managing Large Data in HTC Jobs -published: false ---- - -# Data Transfer Solutions By File Size - -Due to the distributed nature of CHTC's High Throughput Computing (HTC) system, -your jobs will run on a server (aka an execute server) that is separate and -distinct from the server that your jobs are submitted from (aka the submit server). -This means that a copy of all the files needed to start your jobs must be -made available on the execute server. Likewise, any output files created -during the execution of your jobs, which are written to the execute server, -will also need to be transferred to a location that is accessible to you after your jobs complete. -**How input files are copied to the execute server and how output files are -copied back will depend on the size of these files.** This is illustrated via -the diagram below: - -![CHTC File Management Solutions](images/chtc-file-transfer.png) - -CHTC's data filesystem called "Staging" is a distinct location for -temporarily hosting files that are too large to be handled in a -high-throughput fashion via the default HTCondor file transfer -mechanism which is otherwise used for small files hosted in your `/home` -directory on your submit server. - -CHTC's `/staging` location is specifically intended for: - -- any individual input files >100MB -- input files totaling >500MB per job -- individual output files >4GB -- output files totaling >4GB per job - -This guide covers when and how to use `/staging` for jobs run in CHTC. 
- -# Table of Contents - -- [Who Should Use Staging](#use) -- [Policies and User Responsibilities](#policies-and-user-responsibilities) -- [Quickstart Instructions](#quickstart-instructions) -- [Get Access To Staging](#access) -- [Use The Transfer Server To Move Files To/From Staging](#transfer) -- [Submit Jobs With Input Files in Staging](#input) -- [Submit Jobs That Transfer Output Files To Staging](#output) -- [Tips For Success When Using Staging](#tips) -- [Managing Staging Data and Quotas](#quota) - - -# Who Should Use `/staging` - -`/staging` is a location specifically for hosting singularly larger input (>100MB) -and/or larger ouput (>4GB) files or when a job needs 500MB or more of total input -or will produce 4GB or more of total output. Job input and outupt of these -sizes are too large to be managed by CHTC's other data movement methods. - -**Default CHTC account creation does not include access to `/staging`.** -Access to `/staging` is provided as needed for supporting your data management -needs. If you think you need access to `/staging`, or would -like to know more about managing your data needs, please contact us at -. - -Files hosted in `/staging` are only excessible to jobs running in the CHTC pool. -About 50% of CHTC execute servers have access to `/staging`. Users will get -better job throughput if they are able to break up their work into smaller jobs -that each use or produce input and output files that do not require `/staging`. - -# Policies and User Responsibilities - -**USERS VIOLATING ANY OF THE POLICIES IN THIS GUIDE WILL -HAVE THEIR DATA STAGING ACCESS AND/OR CHTC ACCOUNT REVOKED UNTIL CORRECTIVE -MEASURES ARE TAKEN. CHTC STAFF RESERVE THE RIGHT TO REMOVE ANY -PROBLEMATIC USER DATA AT ANY TIME IN ORDER TO PRESERVE PERFORMANCE** - -

-Jobs should NEVER be submitted from
-/staging. All HTCondor job submissions must be performed from your
-/home directory on the submit server and job log,
-error, and output files should never be
-written to /staging.
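A minimal sketch of what that looks like in a submit file (placeholder names): HTCondor's log, output, and error files stay on relative paths under the `/home` submission directory, and nothing points at `/staging`:

```
# submitted from a directory under /home, never from /staging
log    = job_$(Cluster).$(Process).log
output = job_$(Cluster).$(Process).out
error  = job_$(Cluster).$(Process).err
```
{:.sub}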

- -- **Backup your files**: As with all CHTC file spaces, CHTC does not back -up your files in `/staging`. - -- **Use bash script commands to access files in `/staging`**: Files placed in `/staging` -should **NEVER** be listed in the submit file, but rather accessed -via the job's executable (aka .sh) script. More details provided -in [Submit Jobs With Input Files in Staging](#input) -and [Submit Jobs That Transfer Output Files To Staging](#output). - -- **Use the transfer server**: We expect that users will only use our dedicated -transfer server, transfer.chtc.wisc.edu, instead of the submit server, -to upload and download appropriate files to and from `/staging`. Transferring -files to `/staging` with the submit server can negatively impact job submission for -you and other users. For more details, please see -[Use The Transfer Server To Move Files To/From Staging](#transfer) - -- **Quota control**:`/staging` directories include disk space and -items (i.e. directories and files) quotas. Quotas are necessary for -maintaning the stability and reliability of `/staging`. Quota changes can -be requested by emailing and -users can monitor quota settings and usage as described in -[Managing Staging Data and Quotas](#quota) - -- **Reduce file size and count**: We expect that users will use `tar` and -compression to reduce data size and file counts such that a single tarball -is needed and/or produced per job. More details provided in [Submit Jobs With Input Files in Staging](#input) -and [Submit Jobs That Transfer Output Files To Staging](#output). - -- **Shared group data**: `/staging` directories are owned by the user, -and only the user's own jobs can access these files. We can create shared group -`/staging` directories for sharing larger input and output files as needed. -[Contact us](mailto:chtc@cs.wisc.edu) to learn more. - -- **Remove data**: We expect that users will remove data from `/staging` as -soon as it is no longer needed for actively-running jobs. - -- CHTC staff reserve the right to remove data from `/staging` -(or any CHTC file system) at any time. - -# Quickstart Instructions - -1. Request access to `/staging`. - - * For more details, see [Get Access To Staging](#access) - -1. Review `/staging` [Policies and User Responsibilities](#policies-and-user-responsibilities) - -1. Prepare input files for hosting in `/staging`. - - * Compress files to reduce file size and speed up -file transfer. - - * If your jobs need multiple large input files, -use `tar` and `zip` to combine files so that only a single `tar` or `zip` -archive is needed per job. - -1. Use the transfer server, `transfer.chtc.wisc.edu`, to upload input -files to your `/staging` directory. - - * For more details, see [Use The Transfer Server To Move Files To/From Staging](#transfer). - - * For details, see [Submit Jobs With Input Files in Staging](#input). - -1. Create your HTCondor submit file. - - * Include the following submit detail to ensure that -your jobs will have access to your files in `/staging`: - - ``` {.sub} - requirements = (HasCHTCStaging =?= true) - ``` - -1. Create your executable bash script. - - * Use `cp` or `rsync` to copy large input -from `/staging` that is needed for the job. For example: - - ``` - cp /staging/username/my-large-input.tar.gz ./ - tar -xzf my-large-input.tar.gz - ``` - {:.file} - - * If the job will produce output >4GB this output should be -be compressed moved to `/staging` before job terminates. 
If multiple large output -files are created, use `tar` and `zip` to reduce file counts. For -example: - - ``` - tar -czf large_output.tar.gz output-file-1 output-file-2 output_dir/ - mv large_output.tar.gz /staging/username - ``` - {:.file} - - * Before the job completes, delete input copied from `/staging`, the -extracted large input file(s), and the uncompressed or untarred large output files. For example: - - ``` - rm my-large-input.tar.gz - rm my-large-input-file - rm output-file-1 output-file-2 - ``` - {:.file} - - * For more details about job submission using input from `/staging` or for hosting -output in `/staging`, please see [Submit Jobs With Input Files in Staging](#input) and -[Submit Jobs That Transfer Output Files To Staging](#output). - -1. Remove large input and output files `/staging` after jobs complete using -`transfer.chtc.wisc.edu`. - - - -# Get Access To `/staging` - -
Click to learn more -

- -CHTC accounts do not automatically include access to `/staging`. If you think -you need a `/staging` directory, please contact us at . So -we can process your request more quickly, please include details regarding -the number and size of the input and/or output files you plan to host in -`/staging`. You will also be granted access to out dedicated transfer -server upon creation of your `/staging` directory. - -*What is the path to my `/staging` directory?* -- Individual directories will be created at `/staging/username` -- Group directories will be created at `/staging/groups/group_name` - -*How much space will I have?* - -Your quota will be set based on your specific data needs. To see more -information about checking your quota and usage in staging, see the -end of this guide: [Managing Staging Data and Quotas](#quota) - -[Return to top of page](#data-transfer-solutions-by-file-size) - -

-
- - -# Use The Transfer Server To Move Files To/From `/staging` - -
Click to learn more -

- -![Use Transfer Server](images/use-transfer-staging.png) - -Our dedicated transfer server, `transfer.chtc.wisc.edu`, should be used to -upload and/or download your files to/from `/staging`. - -The transfer server is a separate server that is independent of the submit -server you otherwise use for job submission. By using the transfer server -for `/staging` data upload and download, you are helping to reduce network -bottlenecks on the submit server that could otherwise negatively impact -the submit server's performance and ability to manage and submit jobs. - -**Users should not use their submit server to upload or download files -to/from `staging` unless otherwise directed by CHTC staff.** - -*How do I connect to the transfer server?* -Users can access the transfer server the same way they access their -submit server (i.e. via Terminal app or PuTTY) using the following -hostname: `transfer.chtc.wisc.edu`. - -*How do I upload/download files to/from `staging`?* -Several options exist for moving data to/from `staging` including: - -- `scp` and `rsync` can be used from the terminal to move data -to/from your own computer or *another server*. For example: - - ``` - $ scp large.file username@transfer.chtc.wisc.edu:/staging/username/ - $ scp username@serverhostname:/path/to/large.file username@transfer.chtc.wisc.edu:/staging/username/ - ``` - {:.term} - - **Be sure to use the username assigned to you on the other submit server for the first - argument in the above example for uploading a large file from another server.** - -- GUI-based file transfer clients like WinSCP, FileZilla, and Cyberduck -can be used to move files to/from your personal computer. Be -sure to use `transfer.chtc.wisc.edu` when setting up the connection. - -- Globus file transfer can be used to transfer files to/from a Globus Endpoint. -See our guide [Using Globus To Transfer Files To and From CHTC](globus.html) -for more details. - -- `smbclient` is available for managing file transfers to/from file -servers that have `smbclient` installed, like DoIT's ResearchDrive. See our guide -[Transferring Files Between CHTC and ResearchDrive](transfer-data-researchdrive.html) -for more details. - -[Return to top of page](#data-transfer-solutions-by-file-size) - -
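For comparison with the `scp` commands shown above, an `rsync` transfer through the same transfer server might look like the following; the username `alice` and the file names are placeholders:

```
# upload a tarball from your own computer to your staging directory
rsync -av --progress large_input.tar.gz alice@transfer.chtc.wisc.edu:/staging/alice/

# later, pull a results tarball back down
rsync -av --progress alice@transfer.chtc.wisc.edu:/staging/alice/large_output.tar.gz .
```
{:.term}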

-
- - -# Submit Jobs With Input Files in `/staging` - -
Click to learn more -

- -![Staging File Transfer](images/staging-file-transfer.png) - -`/staging` is a distinct location for temporarily hosting your -individually larger input files >100MB in size or in cases when jobs -will need >500MB of total input. First, a copy of -the appropriate input files must be uploaded to your `/staging` directory -before your jobs can be submitted. As a reminder, individual input files <100MB -in size should be hosted in your `/home` directory. - -Both your submit file and bash script -must include the necessary information to ensure successful completion of -jobs that will use input files from `/staging`. The sections below will -provide details for the following steps: - -1. Prepare your input before uploading to `/staging` -2. Prepare your submit files for jobs that will use large input -files hosted in `/staging` -3. Prepare your executable bash script to access and use large input -files hosted in `/staging`, delete large input from job - -## Prepare Large Input Files For `\staging` - -**Organize and prepare your large input such that each job will use a single, -or as few as possible, large input files.** - -As described in our policies, data placed in `/staging` should be -stored in as few files as possible. Before uploading input files -to `/staging`, use file compression (`zip`, `gzip`, `bzip`) and `tar` to reduce -file sizes and total file counts such that only a single, or as few as -possible, input file(s) will be needed per job. - -If your large input will be uploaded from your personal computer -Mac and Linux users can create input tarballs by using the command `tar -czf` -within the Terminal. Windows users may also use a terminal if installed, -else several GUI-based `tar` applications are available, or ZIP can be used -in place of `tar`. - -The following examples demonstrate how to make a compressed tarball -from the terminal for two large input files named `file1.lrg` and -`file2.lrg` which will be used for a single job: - -``` -my-computer username$ tar -czf large_input.tar.gz file1.lrg file2.lrg -my-computer username$ ls -file1.lrg -file2.lrg -large_input.tar.gz -``` -{: .term} - -Alternatively, files can first be moved to a directory which can then -be compressed: - -``` -my-computer username$ mkdir large_input -my-computer username$ mv file1.lrg file2.lrg large_input/ -my-computer username$ tar -czf large_input.tar.gz large_input -my-computer username$ ls -F -large_input/ -large_input.tar.gz -``` -{: .term} - -After preparing your input, -use the transfer server to upload the tarballs to `/staging`. Instructions for -using the transfer server are provide in the above section -[Use The Transfer Server To Move Large Files To/From Staging](#transfer). - -## Prepare Submit File For Jobs With Input in `/staging` - -Not all CHTC execute servers have access to `/staging`. If your job needs access -to files in `/staging`, you must tell HTCondor to run your jobs on the approprite servers -via the `requirements` submit file attribute. **Be sure to request sufficient disk -space for your jobs in order to accomodate all job input and output files.** - -An example submit file for submitting a job that requires access to `/staging` -and which will transfer a smaller, <100MB, input file from `/home`: - -```{.sub} -# job with files in staging and input in home example - -log = my_job.$(Cluster).$(Process).log -error = my_job.$(Cluster).$(Process).err -output = my_job.$(Cluster).$(Process).out - -...other submit file details... 
- -# transfer small files from home -transfer_input_files = my_smaller_file - -requirements = (HasCHTCStaging =?= true) - -queue -``` - -**Remember:** If your job has any other requirments defined in -the submit file, you should combine them into a single `requirements` statement: - -```{.sub} -requirements = (HasCHTCStaging =?= true) && other requirements -``` - -## Use Job Bash Script To Access Input In `/staging` - -Unlike smaller, <100MB, files that are transferred from your home directory -using `transfer_input_files`, files placed in `/staging` should **NEVER** -be listed in the submit file. Instead, you must include additional -commands in the job's executable bash script that will copy (via `cp` or `rsync`) -your input in `/staging` to the job's working directory and extract ("untar") and -uncompress the data. - -**Additional commands should be included in your bash script to remove -any input files copied from `/staging` before the job terminates.** -HTCondor will think the files copied from `/staging` are newly generated -output files and thus, HTCondor will likely transfer these files back -to your home directory with other, real output. This can cause your `/home` -directory to quickly exceed its disk quota causing your jobs to -go on hold with all progress lost. - -Continuing our example, a bash script to copy and extract -`large_input.tar.gz` from `/staging`: - -``` -#!/bin/bash - -# copy tarball from staging to current working dir -cp /staging/username/large_input.tar.gz ./ - -# extract tarball -tar -xzf large_input.tar.gz - -...additional commands to be executed by job... - -# delete large input to prevent -# HTCondor from transferring back to submit server -rm large_input.tar.gz file1.lrg file2.lrg - -# END -``` -{:.file} - -As shown in the exmaple above, \*both\* the original tarball, `large_input.tar.gz`, and -the extracted files are deleted as a final step in the script. If untarring -`large_input.tar.gz` insteads creates a new subdirectory, then only the original tarball -needs to be deleted. - -
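A short sketch of that subdirectory case, assuming the archive was originally built from a directory named `large_input/`:

```
# the archive unpacks into its own directory, so the extracted files
# never sit loose in the job's top-level working directory
tar -xzf large_input.tar.gz    # creates large_input/
rm large_input.tar.gz          # only the tarball itself needs to be removed
```
{:.file}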

Want to speed up jobs with larger input data? -

- -If your your job will transfer >20GB worth of input file, then using `rm` to remove these -files before the job terminates can take a while to complete which will add -unnecessary runtime to your job. In this case, you can create a -subdirectory and move (`mv`) the large input to it - this will complete almost -instantaneously. Because these files will be in a subdirectory, HTCondor will -ignore them when determining with output files to transfer back to the submit server. - -For example: - -``` -# prevent HTCondor from transferring input file(s) back to submit server -mkdir ignore/ -mv large_input.tar.gz file1.lrg file2.lrg ignore/ -``` -{:.file} - -

-
- -## Remove Files From `/staging` After Jobs Complete - -Files in `/staging` are not backed up and `/staging` should not -be used as a general purpose file storage service. As with all -CHTC file spaces, data should be removed from `/staging` as -soon as it is no longer needed for actively-running jobs. Even if it -will be used in the future, your data should be deleted and copied -back at a later date. Files can be taken off of `/staging` using similar -mechanisms as uploaded files (as above). - -[Return to top of page](#data-transfer-solutions-by-file-size) - -
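One possible cleanup sketch, run from your own computer with `alice` as a placeholder username: copy the results off through the transfer server, then delete them from `/staging`:

```
# retrieve the results tarball, then remove it from staging
scp alice@transfer.chtc.wisc.edu:/staging/alice/large_output.tar.gz .
ssh alice@transfer.chtc.wisc.edu "rm /staging/alice/large_output.tar.gz"
```
{:.term}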

-
- - -# Submit Jobs That Transfer Output Files To `/staging` - -
Click to learn more -

- -![Staging File Transfer](images/staging-file-transfer.png) - -`/staging` is a distinct location for temporarily hosting -individual output files >4GB in size or in cases when >4GB -of output is produced by a single job. - -Both your submit file and job bash script -must include the necessary information to ensure successful completion of -jobs that will host output in `/staging`. The sections below will -provide details for the following steps: - -1. Prepare your submit files for jobs that will host output in `/staging` -2. Prepare your executable bash script to tar output and move to `/staging` - -## Prepare Submit File For Jobs That Will Host Output In `/staging` - -Not all CHTC execute servers have access to `/staging`. If your -will host output files in `/staging`, you must tell HTCondor to run -your jobs on the approprite servers via the `requirements` submit -file attribute: - -```{.sub} -# job that needs access to staging - -log = my_job.$(Cluster).$(Process).log -error = my_job.$(Cluster).$(Process).err -output = my_job.$(Cluster).$(Process).out - -...other submit file details... - -requirements = (HasCHTCStaging =?= true) - -queue -``` - -**Remember:** If your job has any other requirments defined in -the submit file, you should combine them into a single `requirements` statement: - -```{.sub} -requirements = (HasCHTCStaging =?= true) && other requirements -``` - -## Use Job Bash Script To Move Output To `/staging` - -Output generated by your job is written to the execute server -where the run jobs. For output that is large enough (>4GB) to warrant use -of `/staging`, you must include steps in the executable bash script of -your job that will package the output into a tarball and relocate it -to your `/staging` directory before the job completes. **This can be -acheived with a single `tar` command that directly writes the tarball -to your staging directory!** It is IMPORTANT that no other files be written -directly to your `/staging` directory during job execution except for -the below `tar` example. - -For example, if a job writes a larger ammount of output to -a subdirectory `output_dir/` along with an additional -larger output file `output.lrg`, the following steps will -package the all of the output into a single tarball that is -then moved to `/staging`. **Note:** `output.lrg` will still exist -in the job's working directory after creating the tarball and thus -must be deleted before job completes. - -``` -#!/bin/bash - -# Commands to execute job - -... - -# create tarball located in staging containing >4GB output -tar -czf /staging/username/large_output.tar.gz output_dir/ output.lrg - -# delete an remaining large files -rm output.lrg - -# END -``` -{: .file} - -If a job generates a single large file that will not shrink much when -compressed, it can be moved directly to staging. If a job generates -multiple files in a directory, or files can be substantially made smaller -by zipping them, the above example should be followed. - -``` -#!/bin/bash - -# Commands to execute job - -... - -# move single large output file to staging -mv output.lrg /staging/username/ - -# END -``` -{: .file} - -## Managing Larger `stdout` From Jobs - -Does your software produce a large amount of output that gets -saved to the HTCondor `output` file? 
Some software are written to -"stream" output directly to the terminal screen during interactive execution, but -when the software is executed non-interactively via HTCondor, the output is -instead saved in the `output` file designated in the HTCondor submit file. - -Because HTCondor will transfer `output` back to your home directory, if your -jobs produce HTCondor `output` files >4GB it is important to move this -data to `/staging` by redirecting the output of your job commands to a -separate file that gets packaged into a compressed tarball and relocated -to `/staging`: - -``` -#!/bin/bash - -# redirect standard output to a file in the working directory -./myprogram myinput.txt > large.stdout - -# create tarball located in staging containing >4GB output -tar -czf /staging/username/large.stdout.tar.gz large.stdout - -# delete large.stdout file -rm large.stdout - -# END -``` -{: .file} - -[Return to top of page](#data-transfer-solutions-by-file-size) - -

-
- - -# Tips For Success When Using `/staging` - -In order to properly submit jobs use `/staging` for managing larger -input and output file, always do the following: - -- **Submit from `/home`**: ONLY submit jobs from within your home directory - (`/home/username`), and NEVER from within `/staging`. - -- **No large data in the submit file**: Do NOT list any files from `/staging` in -your submit file and do NOT use `/staging` as a path for any submit file attributes -such as `executable, log, output, error, transfer_input_files`. -As described in this guide, all interaction with `/staging` will occur via -command in the executable bash script. - -- **Request sufficient disk space**: Using `request_disk`, request an amount of disk -space that reflects the total of a) input data that each job will copy into -the job working directory from `/staging` including the size of the tarball and the -extracted files b) any input transferred via `transfer_input_files`, -and c) any output that will be created in the job working directory. - -- **Require access to `/staging`**: Tell HTCondor that your jobs need to run on -execute servers that can access `/staging` using the following submit file attribute: - - ```{.sub} - Requirements = (Target.HasCHTCStaging == true) - ``` - -[Return to top of page](#data-transfer-solutions-by-file-size) - - -# Managing `/staging` Data and Quotas - -Use the command `get_quotas` to see what disk -and items quotas are currently set for a given directory path. -This command will also let you see how much disk is in use and how many -items are present in a directory: - -``` -[username@transfer ~]$ get_quotas /staging/username -``` -{:.term} - -[Return to top of page](#data-transfer-solutions-by-file-size) From 391b846729b2a9be534791b3724236fdfba31359 Mon Sep 17 00:00:00 2001 From: Amber Lim Date: Mon, 3 Mar 2025 17:15:15 -0600 Subject: [PATCH 25/25] Add specify `file:///` for group directories --- _uw-research-computing/htc-job-file-transfer.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md index fd24808a..b490cdcd 100644 --- a/_uw-research-computing/htc-job-file-transfer.md +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -46,6 +46,7 @@ In the HTCondor submit file, `transfer_input_files` should always be used to tel | ----------- | ----------- | ----------- | ----------- | | 0 - 100 MB | `/home` | `transfer_input_files = input.txt` | | 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` | +| > 100 MB - 100 GB | `/staging/groups` | `transfer_input_files = file:///staging/NetID/input.txt` | | > 30 GB | `/staging` | `transfer_input_files = file:///staging/NetID/input.txt` | | > 100 GB | | For larger datasets (100GB+ per job), contact the facilitation team about the best strategy to stage your data |
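To illustrate the group-directory row added above, a submit file could combine a shared reference from a group space with per-user data; the group name `myLab`, the file names, and the sizes are placeholders:

```
# shared ~60 GB reference owned by the group, plus a ~2 GB per-user archive
transfer_input_files = file:///staging/groups/myLab/reference.tar.gz, osdf:///chtc/staging/NetID/sample01.tar.gz
```
{:.sub}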