guide: cleanup some md links #2534

Merged
merged 14 commits into from
Jul 2, 2021
36 changes: 21 additions & 15 deletions content/docs/api-reference/get_url.md
@@ -31,22 +31,26 @@ specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.
The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the `dvc.yaml`
or `.dvc` file where the given `path` is found (`outs` field). The schema of the
URL returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).
URL returned depends on the [type][storage-types] of the `remote` used (see the
[Parameters](#parameters) section).

If the target is a directory, the returned URL will end in `.dir`. Refer to
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
and `dvc add` to learn more about how DVC handles data directories.
[Structure of cache directory] and `dvc add` to learn more about how DVC handles
data directories.

⚠️ This function does not check for the actual existence of the file or
directory in the remote storage.

💡 Having the resource's URL, it should be possible to download it directly with
an appropriate library, such as
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj)
or
[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get).
an appropriate library, such as [`boto3`] or [`paramiko`].

[storage-types]: /doc/command-reference/remote/add#supported-storage-types
[structure of cache directory]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
[`boto3`]:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj
[`paramiko`]:
https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get
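
As a rough illustration of the tip above, the sketch below splits an S3-style URL into the bucket and key that an S3 client would need before downloading. The helper name `split_s3_url` and the example URL are hypothetical and not part of DVC; only the standard library is used.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split an s3:// URL into (bucket, key). Hypothetical helper, not a DVC API."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URL, shaped like one that could be returned for an S3 remote.
bucket, key = split_s3_url("s3://my-bucket/cache/a3/04afb96060aad90176268345e10355")
print(bucket, key)
```

The resulting bucket and key could then be handed to a client such as the `boto3` download method linked above. Since the function does not check that the resource exists, a real download should be prepared to handle a missing object.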

## Parameters

@@ -88,21 +92,23 @@ The script above prints
`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355`

This URL represents the location where the data is stored, and is built by
reading the corresponding `.dvc` file
([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc))
where the `md5` file hash is stored,
reading the corresponding `.dvc` file ([`get-started/data.xml.dvc`]) where the
`md5` file hash is stored,

```yaml
outs:
- md5: a304afb96060aad90176268345e10355
path: get-started/data.xml
```

and the project configuration
([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config))
where the remote URL is saved:
and the project configuration ([`.dvc/config`]) where the remote URL is saved:

```ini
['remote "storage"']
url = https://remote.dvc.org/dataset-registry
```
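
The way the two pieces combine can be sketched in a few lines of Python. This is only an illustration inferred from the example URL above (first two characters of the hash as a directory, the rest as the file name); the helper name `cache_url` is hypothetical, and other DVC versions or remote types may lay out storage differently.

```python
def cache_url(remote_url: str, md5: str) -> str:
    """Join a remote base URL and an md5 hash as in the example above.

    Hypothetical helper for illustration; not part of the DVC API.
    """
    # The first two hex characters form a directory, the remainder the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


remote_url = "https://remote.dvc.org/dataset-registry"  # from .dvc/config
md5 = "a304afb96060aad90176268345e10355"  # from get-started/data.xml.dvc
print(cache_url(remote_url, md5))
# prints https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
```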

[`.dvc/config`]:
https://github.com/iterative/dataset-registry/blob/master/.dvc/config
[`get-started/data.xml.dvc`]:
https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc
3 changes: 3 additions & 0 deletions content/docs/command-reference/destroy.md
@@ -23,6 +23,9 @@ set to an
in your project, DVC will replace them with the latest versions of the actual
files and directories first, so that your data is intact after destruction.

[external cache]:
/doc/use-cases/shared-development-server#configure-the-external-shared-cache

> Refer to [Project Structure](/doc/user-guide/project-structure) for more
> details on the directories and files deleted by this command.

29 changes: 15 additions & 14 deletions content/docs/user-guide/managing-external-data.md
@@ -2,12 +2,13 @@

> ⚠️ This is an advanced feature for very specific situations and not
> recommended except if there's absolutely no other alternative. In most cases
> alternatives like the
> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
> that external outputs are not pushed or pulled from/to
> [remote storage](/doc/command-reference/remote).
> alternatives like the [to-cache] or [to-remote] strategies of `dvc add` and
> `dvc import-url` are more convenient. **Note** that external outputs are not
> pushed or pulled from/to [remote storage].

[to-cache]: /doc/command-reference/add#example-transfer-to-the-cache
[to-remote]: /doc/command-reference/add#example-transfer-to-remote-storage
[remote storage]: /doc/command-reference/remote

There are cases when data is so large, or its processing is organized in such a
way, that it's impossible to handle it in the local machine disk. For example
@@ -39,16 +40,17 @@ their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
> external cache, because it may cause data collisions: the hash of an external
> output could collide with that of a local file with different content.

> Note that [remote storage](/doc/command-reference/remote) is a different
> feature.
> Note that [remote storage] is a different feature.

## Setting up an external cache

DVC requires that the project's <abbr>cache</abbr> is configured in the same
external location as the data that will be tracked (external outputs). This
avoids transferring files to the local environment and enables
[file linking](/doc/user-guide/large-dataset-optimization) within the external
storage.
avoids transferring files to the local environment and enables [file links]
within the external storage.

[file links]:
/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache
Comment on lines -50 to +53 (contributor, author):

This one both has an #anchor AND repeats later (line 188).


As an example, let's create a directory external to the workspace and set it up
as cache:
@@ -183,9 +185,8 @@ custom cache location for local paths outside of your project.

> Except for external data on different storage devices or partitions mounted on
> the same file system (e.g. `/mnt/raid/data`). In that case please set up an
> external cache in that same drive to enable
> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> and avoid copying data.
> external cache in that same drive to enable [file links] and avoid copying
> data.

```dvc
$ dvc add --external /home/shared/existing-data