From 773246b1e536ad67742f5cb6d4bff68ac899e303 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sat, 24 Apr 2021 19:05:34 -0500
Subject: [PATCH 1/7] ref: -c option typos

---
 content/docs/command-reference/run.md       | 2 +-
 content/docs/command-reference/stage/add.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 7c63244089..82d3ec20a8 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -193,7 +193,7 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR'
 - `--outs-persist-no-cache ` - the same as `-outs-persist` except that
   outputs are not tracked by DVC (same as with `-O` above).

-- `-c ` - the same as `-o` but also marks the
+- `-c `, `--checkpoints ` - the same as `-o` but also marks the
   output as a [checkpoint](/doc/command-reference/exp/run#checkpoints). Implies
   `--no-exec`. This makes the stage incompatible with `dvc repro`.

diff --git a/content/docs/command-reference/stage/add.md b/content/docs/command-reference/stage/add.md
index e864ccdd5e..35d58a9d2f 100644
--- a/content/docs/command-reference/stage/add.md
+++ b/content/docs/command-reference/stage/add.md
@@ -191,7 +191,7 @@ data science experiments.
 - `--outs-persist-no-cache ` - the same as `-outs-persist` except that
   outputs are not tracked by DVC (same as with `-O` above).

-- `-c ` - the same as `-o` but also marks the
+- `-c `, `--checkpoints ` - the same as `-o` but also marks the
   output as a [checkpoint](/doc/command-reference/exp/run#checkpoints). Implies
   `--no-exec`. This makes the stage incompatible with `dvc repro`.

From 5d759a9cb1ff3a17991d7a7c0586efa7250bd8d8 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 2 Jun 2021 01:50:53 -0500
Subject: [PATCH 2/7] start: typo per https://github.com/iterative/dvc.org/pull/2507#discussion_r641972518

---
 content/docs/start/data-and-model-versioning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index 998b6ba46d..0753515b62 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -8,7 +8,7 @@ Git.'
 # Get Started: Data Versioning

 How cool would it be to make Git handle arbitrarily large files and directories
-with the same performance that you get with small code files? Imagine doing a
+with the same performance it has with small code files? Imagine doing a
 `git clone` and seeing data files and machine learning models in the workspace.
 Or switching to a different version of a 100Gb file in less than a second with a
 `git checkout`.

From a244b6b476fdd6d3270376637794617dbabd9651 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 3 Jun 2021 20:39:07 -0500
Subject: [PATCH 3/7] test: md link style (1)

---
 content/docs/user-guide/managing-external-data.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index eb4dae8b50..f36f128908 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -39,8 +39,9 @@ their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
 > external cache, because it may cause data collisions: the hash of an external
 > output could collide with that of a local file with different content.

-> Note that [remote storage](/doc/command-reference/remote) is a different
-> feature.
+> Note that [remote storage][remote storage] is a different feature.
+
+[remote storage]: /doc/command-reference/remoter

 ## Setting up an external cache


From 9721680c78216964e7b2353433d39b48b26e3c4f Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Tue, 8 Jun 2021 14:38:53 -0500
Subject: [PATCH 4/7] guide: refactor md links in external data page

---
 .../docs/user-guide/managing-external-data.md | 30 +++++++++----------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index f36f128908..cfe8bd5564 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -2,12 +2,13 @@

 > ⚠️ This is an advanced feature for very specific situations and not
 > recommended except if there's absolutely no other alternative. In most cases
-> alternatives like the
-> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
-> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
-> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
-> that external outputs are not pushed or pulled from/to
-> [remote storage](/doc/command-reference/remote).
+> alternatives like the [to-cache] or [to-remote] strategies of `dvc add` and
+> `dvc import-url` are more convenient. **Note** that external outputs are not
+> pushed or pulled from/to [remote storage].
+
+[to-cache]: /doc/command-reference/add#example-transfer-to-the-cache
+[to-remote]: /doc/command-reference/add#example-transfer-to-remote-storage
+[remote storage]: /doc/command-reference/remote

 There are cases when data is so large, or its processing is organized in such a
 way, that its impossible to handle it in the local machine disk. For example
@@ -39,17 +40,17 @@ their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
 > external cache, because it may cause data collisions: the hash of an external
 > output could collide with that of a local file with different content.

-> Note that [remote storage][remote storage] is a different feature.
-
-[remote storage]: /doc/command-reference/remoter
+> Note that [remote storage] is a different feature.

 ## Setting up an external cache

 DVC requires that the project's cache is configured in the same external
 location as the data that will be tracked (external outputs). This
-avoids transferring files to the local environment and enables
-[file linking](/doc/user-guide/large-dataset-optimization) within the external
-storage.
+avoids transferring files to the local environment and enables [file links]
+within the external storage.
+
+[file links]:
+  /doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache

 As an example, let's create a directory external to the workspace and set it up
 as cache:
@@ -184,9 +185,8 @@ custom cache location for local paths outside of your project.

 > Except for external data on different storage devices or partitions mounted on
 > the same file system (e.g. `/mnt/raid/data`). In that case please setup an
-> external cache in that same drive to enable
-> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
-> and avoid copying data.
+> external cache in that same drive to enable [file links] and avoid copying
+> data.

 ```dvc
 $ dvc add --external /home/shared/existing-data

From 931150e833097273860e2ae167ee00150b05c4bc Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 10 Jun 2021 09:22:18 -0500
Subject: [PATCH 5/7] start: undo typo fix

---
 content/docs/start/data-and-model-versioning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index 0753515b62..998b6ba46d 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -8,7 +8,7 @@ Git.'
 # Get Started: Data Versioning

 How cool would it be to make Git handle arbitrarily large files and directories
-with the same performance it has with small code files? Imagine doing a
+with the same performance that you get with small code files? Imagine doing a
 `git clone` and seeing data files and machine learning models in the workspace.
 Or switching to a different version of a 100Gb file in less than a second with a
 `git checkout`.

From 3bed034262ef367d5aca0cf763676c457076ac4a Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Mon, 28 Jun 2021 23:38:01 +0000
Subject: [PATCH 6/7] ref: md ref link

---
 content/docs/command-reference/destroy.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md
index 3f1721120d..7fde8267de 100644
--- a/content/docs/command-reference/destroy.md
+++ b/content/docs/command-reference/destroy.md
@@ -16,13 +16,15 @@ usage: dvc destroy [-h] [-q | -v] [-f]
 directory from the project.

 Note that the cache directory will be removed as well, unless it's
-set to an
-[external location](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
-(by default a local cache is located in `.dvc/cache`). If you were using
+set to an [external location][external cache] (by default a local cache is
+located in `.dvc/cache`). If you were using
 [symlinks for linking](/doc/user-guide/large-dataset-optimization) data from the
 cache, DVC will replace them with the latest versions of the actual files and
 directories first, so that your data is intact after destruction.

+[external cache]:
+  /doc/use-cases/shared-development-server#configure-the-external-shared-cache
+
 > Refer to [Project Structure](/doc/user-guide/project-structure) for more
 > details on the directories and files deleted by this command.


From d21be8ef98c8b9e897dd74467b872220ccd52d55 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Fri, 2 Jul 2021 05:29:29 +0000
Subject: [PATCH 7/7] api: use ref links

---
 content/docs/api-reference/get_url.md | 36 ++++++++++++++++-----------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/content/docs/api-reference/get_url.md b/content/docs/api-reference/get_url.md
index 2248b0a71c..ed6f80f4d5 100644
--- a/content/docs/api-reference/get_url.md
+++ b/content/docs/api-reference/get_url.md
@@ -31,22 +31,26 @@ specified by its `path` in a `repo` (DVC project), is stored.
 The URL is formed by reading the project's
 [remote configuration](/doc/command-reference/config#remote) and the `dvc.yaml`
 or `.dvc` file where the given `path` is found (`outs` field). The schema of the
-URL returned depends on the
-[type](/doc/command-reference/remote/add#supported-storage-types) of the
-`remote` used (see the [Parameters](#parameters) section).
+URL returned depends on the [type][storage-types] of the `remote` used (see the
+[Parameters](#parameters) section).

 If the target is a directory, the returned URL will end in `.dir`. Refer to
-[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
-and `dvc add` to learn more about how DVC handles data directories.
+[Structure of cache directory] and `dvc add` to learn more about how DVC handles
+data directories.

 ⚠️ This function does not check for the actual existence of the file or
 directory in the remote storage.

 💡 Having the resource's URL, it should be possible to download it directly with
-an appropriate library, such as
-[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj)
-or
-[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get).
+an appropriate library, such as [`boto3`] or [`paramiko`].
+
+[storage-types]: /doc/command-reference/remote/add#supported-storage-types
+[structure of cache directory]:
+  /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
+[`boto3`]:
+  https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj
+[`paramiko`]:
+  https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get

 ## Parameters

@@ -88,9 +92,8 @@ The script above prints
 `https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355`

 This URL represents the location where the data is stored, and is built by
-reading the corresponding `.dvc` file
-([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc))
-where the `md5` file hash is stored,
+reading the corresponding `.dvc` file ([`get-started/data.xml.dvc`]) where the
+`md5` file hash is stored,

 ```yaml
 outs:
@@ -98,11 +101,14 @@ outs:
     path: get-started/data.xml
 ```

-and the project configuration
-([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config))
-where the remote URL is saved:
+and the project configuration ([`.dvc/config`]) where the remote URL is saved:

 ```ini
 ['remote "storage"']
 url = https://remote.dvc.org/dataset-registry
 ```
+
+[`.dvc/config`]:
+  https://github.com/iterative/dataset-registry/blob/master/.dvc/config
+[`get-started/data.xml.dvc`]:
+  https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc