Document remote file staging #5523

Merged: 10 commits, Dec 19, 2024
26 changes: 20 additions & 6 deletions docs/working-with-files.md
@@ -228,29 +228,43 @@ In general, you should not need to manually copy files, because Nextflow will au

## Remote files

Nextflow can work with many kinds of remote files and objects using the same interface as for local files. The following protocols are supported:
Nextflow works with many types of remote files and objects using the same interface as for local files. The following protocols are supported:

- HTTP(S) / FTP (`http://`, `https://`, `ftp://`)
- HTTP(S)/FTP (`http://`, `https://`, `ftp://`)
- Amazon S3 (`s3://`)
- Azure Blob Storage (`az://`)
- Google Cloud Storage (`gs://`)

To reference a remote file, simple specify the URL when opening the file:
To reference a remote file, simply specify the URL when opening the file:

```nextflow
pdb = file('http://files.rcsb.org/header/5FID.pdb')
```

You can then access it as a local file as described previously:
It can then be used in the same way as a local file:

```nextflow
println pdb.text
```

:::{note}
Not all operations are supported for all protocols. In particular, writing and directory listing are not supported for HTTP(S) and FTP paths.
Not all operations are supported for all protocols. For example, writing and directory listing are not supported for HTTP(S) and FTP paths.
:::

:::{note}
Additional configuration may be required to work with cloud object storage (e.g. to authenticate with a private bucket). Refer to the respective page for each cloud storage provider for more information.
Additional configuration may be required to work with cloud object storage, for example, to authenticate with a private bucket. Refer to the respective page for each cloud storage provider for more information.
:::

### Remote file staging

In general, files do not need to be copied manually (e.g. using the `copyTo()` method). When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.

Remote files are staged in a subdirectory of the work directory of the form `stage-<session-id>/<hash>/<filename>`, where `<hash>` is determined by the remote file path. If multiple tasks request the same remote file, the file is downloaded once and reused by each task. These files can also be reused by resumed runs with the same session ID.
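
As a minimal sketch of the behavior described above, the following script passes a remote file to a process as a `path` input, so Nextflow stages it before the task runs. The process name `COUNT_LINES` and the `wc` command are illustrative assumptions, not part of the documented interface:

```nextflow
// Hypothetical process: counts the lines of whatever file it receives.
process COUNT_LINES {
    input:
    path pdb_file   // a `path` input triggers the built-in staging

    output:
    stdout

    script:
    """
    wc -l < ${pdb_file}
    """
}

workflow {
    // The URL is resolved by file(); the task sees a staged local copy.
    pdb = file('http://files.rcsb.org/header/5FID.pdb')
    COUNT_LINES(pdb)
    COUNT_LINES.out.view()
}
```

Running the script a second time with `-resume` should reuse the staged copy instead of downloading it again, provided the session ID is the same.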

:::{note}
Remote file staging can become a bottleneck for large runs where inputs must be staged into the work directory, for example, when inputs are stored in object storage but the work directory is in a shared filesystem. This is because Nextflow handles all of the file transfers.

You can get around this bottleneck with a custom process that downloads the file(s), allowing you to stage many files with multiple parallel jobs. The file should be given as a `val` input instead of a `path` input to bypass the built-in remote file staging.

Alternatively, you can use {ref}`fusion-page` with the work directory in object storage, in which case the remote files will be used directly by the tasks without any prior staging.
:::
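
The custom download process mentioned in the note above could look roughly like the sketch below. The process name `DOWNLOAD`, the `downloaded` directory, the use of `curl`, and the second URL are assumptions made for illustration; any download tool available in the task environment would work the same way:

```nextflow
// Hypothetical process: each task downloads one URL on its own.
process DOWNLOAD {
    input:
    val url   // `val` keeps the URL a plain string, so Nextflow does not stage it

    output:
    path 'downloaded/*'

    script:
    """
    mkdir downloaded
    cd downloaded
    curl --fail --silent --show-error --location --remote-name '${url}'
    """
}

workflow {
    // One task is created per URL, so downloads can run in parallel.
    urls = Channel.of(
        'http://files.rcsb.org/header/5FID.pdb',
        'http://files.rcsb.org/header/4HHB.pdb'   // illustrative second entry
    )
    DOWNLOAD(urls)
    DOWNLOAD.out.view()
}
```

Because the downloads happen inside tasks, they are distributed by the executor rather than performed by Nextflow itself. The trade-off is that the shared `stage-<session-id>` directory is bypassed, so reuse of a download depends on the normal task cache.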