From 9c2786cffb5c77a5ced835413041e6742224aaf3 Mon Sep 17 00:00:00 2001
From: Christopher Hakkaart
Date: Tue, 19 Nov 2024 13:58:34 +0100
Subject: [PATCH 1/9] Add details about file download as per #5493

Signed-off-by: Christopher Hakkaart
---
 docs/working-with-files.md | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index aad769be96..ec9f76204d 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -228,29 +228,47 @@ In general, you should not need to manually copy files, because Nextflow will au

 ## Remote files

-Nextflow can work with many kinds of remote files and objects using the same interface as for local files. The following protocols are supported:
+Nextflow works with many types of remote files and objects using the same interface as for local files. The following protocols are supported:

-- HTTP(S) / FTP (`http://`, `https://`, `ftp://`)
+- HTTP(S)/FTP (`http://`, `https://`, `ftp://`)
 - Amazon S3 (`s3://`)
 - Azure Blob Storage (`az://`)
 - Google Cloud Storage (`gs://`)

-To reference a remote file, simple specify the URL when opening the file:
+Nextflow downloads remote files when tasks that reference them are created and they do not exist on the same filesystem as the work directory. When possible, standard libraries are used to download files. For example, HttpURLConnection is used for HTTP, and AWS Java SDK is used for S3. Implementations can be viewed under FileSystemProvider in the Nextflow codebase.
+
+To reference a remote file, simply specify the URL when opening the file:

 ```nextflow
 pdb = file('http://files.rcsb.org/header/5FID.pdb')
 ```

-You can then access it as a local file as described previously:
+It can then be accessed as a local file:

 ```nextflow
 println pdb.text
 ```

+By default, downloaded files are staged in a subdirectory of the work directory. The subdirectory is named using the prefix `stage-`, followed by a hash. For example, `stage-XXXXXXXX`.
+
+
+
+Remote files are cached using the aforementioned hash. If multiple tasks request the same remote file at the same time, Nextflow will likely download a separate copy to separate folders.
+
+
+
+:::{note}
+Not all operations are supported for all protocols. For example, writing and directory listing is not supported for HTTP(S) and FTP paths.
+:::
+
 :::{note}
-Not all operations are supported for all protocols. In particular, writing and directory listing are not supported for HTTP(S) and FTP paths.
+A custom process can be used to download a file into a task directory instead of using built-in remote file staging. To be staged by Nextflow, the file name must be provided to the process as a val input instead of a path input.
 :::

 :::{note}
-Additional configuration may be required to work with cloud object storage (e.g. to authenticate with a private bucket). Refer to the respective page for each cloud storage provider for more information.
+Additional configuration may be required to work with cloud object storage. For example, to authenticate with a private bucket. Refer to the respective page for each cloud storage provider for more information.
 :::

From bfbe94e25e0ced9a60f62080705d43df25df93f1 Mon Sep 17 00:00:00 2001
From: Ben Sherman
Date: Thu, 12 Dec 2024 10:32:41 -0600
Subject: [PATCH 2/9] Update docs

Signed-off-by: Ben Sherman
---
 docs/working-with-files.md | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index ec9f76204d..b0020e256c 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -235,40 +235,36 @@ Nextflow works with many types of remote files and objects using the same interf
 - Azure Blob Storage (`az://`)
 - Google Cloud Storage (`gs://`)

-Nextflow downloads remote files when tasks that reference them are created and they do not exist on the same filesystem as the work directory. When possible, standard libraries are used to download files. For example, HttpURLConnection is used for HTTP, and AWS Java SDK is used for S3. Implementations can be viewed under FileSystemProvider in the Nextflow codebase.
-
 To reference a remote file, simply specify the URL when opening the file:

 ```nextflow
 pdb = file('http://files.rcsb.org/header/5FID.pdb')
 ```

-It can then be accessed as a local file:
+It can then be used in the same way as a local file:

 ```nextflow
 println pdb.text
 ```

-By default, downloaded files are staged in a subdirectory of the work directory. The subdirectory is named using the prefix `stage-`, followed by a hash. For example, `stage-XXXXXXXX`.
-
-
-
-Remote files are cached using the aforementioned hash. If multiple tasks request the same remote file at the same time, Nextflow will likely download a separate copy to separate folders.
-
-
-
 :::{note}
 Not all operations are supported for all protocols. For example, writing and directory listing is not supported for HTTP(S) and FTP paths.
 :::

 :::{note}
-A custom process can be used to download a file into a task directory instead of using built-in remote file staging. To be staged by Nextflow, the file name must be provided to the process as a val input instead of a path input.
+Additional configuration may be required to work with cloud object storage. For example, to authenticate with a private bucket. Refer to the respective page for each cloud storage provider for more information.
 :::

+### Remote file staging
+
+In general, files do not need to be copied manually (e.g. using the `copyTo()` method). When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.
+
+Remote files are staged in a subdirectory of the work directory of the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be also reused by resumed runs with the same session ID.
+
 :::{note}
-Additional configuration may be required to work with cloud object storage. For example, to authenticate with a private bucket. Refer to the respective page for each cloud storage provider for more information.
+Remote file staging can become a bottleneck for large runs where inputs must be staged into the work directory, for example, when inputs are stored in object storage but the work directory is in a shared filesystem. This is because Nextflow handles all of the file transfers.
+
+You can get around this bottleneck with a custom process that downloads the file(s), allowing you to stage many files with multiple parallel jobs. The file should be given as a `val` input instead of a `path` input to bypass the built-in remote file staging.
+
+Alternatively, you can use {ref}`fusion-page` with the work directory in object storage, in which case the remote files will be used directly by the tasks without any prior staging.
 :::

From 207f76143e6d623c4e6ebacab8ade3d1af0375bb Mon Sep 17 00:00:00 2001
From: Ben Sherman
Date: Thu, 12 Dec 2024 10:35:31 -0600
Subject: [PATCH 3/9] Remove extraneous sentence

Signed-off-by: Ben Sherman
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index b0020e256c..a8fe2482be 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -257,7 +257,7 @@ Additional configuration may be required to work with cloud object storage. For

 ### Remote file staging

-In general, files do not need to be copied manually (e.g. using the `copyTo()` method). When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.
+When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.

 Remote files are staged in a subdirectory of the work directory of the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be also reused by resumed runs with the same session ID.

From 3e01f0d46f88854002ec16ff679836749fd018fb Mon Sep 17 00:00:00 2001
From: Christopher Hakkaart
Date: Fri, 13 Dec 2024 13:03:21 +0100
Subject: [PATCH 4/9] Suggest new language

Signed-off-by: Christopher Hakkaart
---
 docs/working-with-files.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index a8fe2482be..07fbde5919 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -252,19 +252,19 @@ Not all operations are supported for all protocols. For example, writing and dir
 :::

 :::{note}
-Additional configuration may be required to work with cloud object storage. For example, to authenticate with a private bucket. Refer to the respective page for each cloud storage provider for more information.
+Additional configuration may be necessary for cloud object storage, such as authenticating with a private bucket. See the documentation for each cloud storage provider for further details.
 :::

 ### Remote file staging

-When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.
+When a remote file is passed as an input to a process, Nextflow stages the file in the work directory using an appropriate Java SDK.

-Remote files are staged in a subdirectory of the work directory of the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be also reused by resumed runs with the same session ID.
+Remote files are staged in a subdirectory of the work directory with form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.

 :::{note}
-Remote file staging can become a bottleneck for large runs where inputs must be staged into the work directory, for example, when inputs are stored in object storage but the work directory is in a shared filesystem. This is because Nextflow handles all of the file transfers.
+Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all file transfers.

-You can get around this bottleneck with a custom process that downloads the file(s), allowing you to stage many files with multiple parallel jobs. The file should be given as a `val` input instead of a `path` input to bypass the built-in remote file staging.
+To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. File should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.

-Alternatively, you can use {ref}`fusion-page` with the work directory in object storage, in which case the remote files will be used directly by the tasks without any prior staging.
+Alternatively, use {ref}`fusion-page` with the work directory set to object storage. In this case, tasks can access remote files directly without any prior staging, eliminating the bottleneck.
 :::

From ea8db450e50faaeac183ef038f2ec9896799dd8d Mon Sep 17 00:00:00 2001
From: Chris Hakkaart
Date: Fri, 13 Dec 2024 13:51:12 +0100
Subject: [PATCH 5/9] Update docs/working-with-files.md

Co-authored-by: Ben Sherman
Signed-off-by: Chris Hakkaart
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index 07fbde5919..333193a386 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -257,7 +257,7 @@ Additional configuration may be necessary for cloud object storage, such as auth

 ### Remote file staging

-When a remote file is passed as an input to a process, Nextflow stages the file in the work directory using an appropriate Java SDK.
+When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.

 Remote files are staged in a subdirectory of the work directory with form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.

From 622033dcf2e01654a42bacabe64a98039919c008 Mon Sep 17 00:00:00 2001
From: Chris Hakkaart
Date: Fri, 13 Dec 2024 13:51:42 +0100
Subject: [PATCH 6/9] Update docs/working-with-files.md

Co-authored-by: Ben Sherman
Signed-off-by: Chris Hakkaart
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index 333193a386..25eec85328 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -259,7 +259,7 @@ Additional configuration may be necessary for cloud object storage, such as auth

 When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.

-Remote files are staged in a subdirectory of the work directory with form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.
+Remote files are staged in a subdirectory of the work directory with the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.

 :::{note}
 Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all file transfers.

From a30e9d8c8b4ba5a67f8220cd2067b4056cbda33b Mon Sep 17 00:00:00 2001
From: Chris Hakkaart
Date: Fri, 13 Dec 2024 13:52:00 +0100
Subject: [PATCH 7/9] Update docs/working-with-files.md

Co-authored-by: Ben Sherman
Signed-off-by: Chris Hakkaart
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index 25eec85328..3d98766546 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -262,7 +262,7 @@ When a remote file is passed as an input to a process, Nextflow stages the file
 Remote files are staged in a subdirectory of the work directory with the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.

 :::{note}
-Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all file transfers.
+Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all of these file transfers.

 To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. File should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.

From a943820266729eb90eb2931599ffd35262fa2445 Mon Sep 17 00:00:00 2001
From: Chris Hakkaart
Date: Fri, 13 Dec 2024 13:52:24 +0100
Subject: [PATCH 8/9] Update docs/working-with-files.md

Co-authored-by: Ben Sherman
Signed-off-by: Chris Hakkaart
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index 3d98766546..74d6258449 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -264,7 +264,7 @@ Remote files are staged in a subdirectory of the work directory with the form `s
 :::{note}
 Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all of these file transfers.

-To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. File should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.
+To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. Files should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.

 Alternatively, use {ref}`fusion-page` with the work directory set to object storage. In this case, tasks can access remote files directly without any prior staging, eliminating the bottleneck.
 :::

From e82417f2904cbd8e55c4a1fc117af4ca7039fa25 Mon Sep 17 00:00:00 2001
From: Ben Sherman
Date: Mon, 16 Dec 2024 10:12:49 -0600
Subject: [PATCH 9/9] Apply suggestions from review

Signed-off-by: Ben Sherman
---
 docs/working-with-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/working-with-files.md b/docs/working-with-files.md
index 74d6258449..392d9ae849 100644
--- a/docs/working-with-files.md
+++ b/docs/working-with-files.md
@@ -257,7 +257,7 @@ Additional configuration may be necessary for cloud object storage, such as auth

 ### Remote file staging

-When a process input file resides on a different file system than the work directory, Nextflow copies the file into the work directory using an appropriate Java SDK.
+When a process input file resides on a different file system than the work directory, Nextflow copies the file into the work directory using an appropriate Java SDK.

 Remote files are staged in a subdirectory of the work directory with the form `stage-//`, where `` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.
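
Review note (not part of the patch series): the `val`-input workaround described in the staging note could look roughly like the sketch below. The process name `DOWNLOAD`, the `download/` directory, and the use of `curl` are illustrative assumptions; they do not come from the patches, and `curl` only covers HTTP(S)/FTP URLs, not `s3://`, `az://`, or `gs://` paths.

```nextflow
// Hypothetical download process: the URL arrives as a plain value,
// so Nextflow does not stage the remote file itself.
process DOWNLOAD {
    input:
    val url

    output:
    path 'download/*'

    script:
    """
    mkdir download
    cd download
    # --remote-name saves the file under the last path segment of the URL
    curl --fail --silent --show-error --location --remote-name '${url}'
    """
}

workflow {
    // Pass URLs as strings, not as file(...) objects, so that each
    // download runs inside its own task and the tasks can run in parallel.
    urls = Channel.of('http://files.rcsb.org/header/5FID.pdb')
    DOWNLOAD(urls)
    DOWNLOAD.out.view()
}
```

The key design point is the `val url` input: with a `path` input, Nextflow would stage the file through its built-in mechanism before the task starts.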