Commit

Merge branch 'master' into Contributing.md-&-README.md-chages
TC-MO authored Dec 4, 2023
2 parents 729be04 + 4d2c711 commit c85b404
Showing 15 changed files with 410 additions and 384 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -11,8 +11,8 @@ This repository is the home of Apify's documentation, which you can find at [doc
> [!IMPORTANT]
> Before you contribute to Apify documentation with your first pull request, please read the following 2 articles:
>
> - [Contributing guidelines](CONTRIBUTING.md), where you learn about the project structure, local development, and testing.
> - [Style guide](#style-guide), here below, where you learn how to keep the documentation style consistent.
> - [Contributing guidelines](CONTRIBUTING.md), where you learn about the project structure, local development, testing, and setting up the [redirects](./CONTRIBUTING.md#redirects) to make sure we keep our SEO juice 🍊.
> - [Style guide](#style-guide), here below 👇, where you learn how to keep the documentation style consistent.
## Style guide

4 changes: 4 additions & 0 deletions nginx.conf
@@ -220,6 +220,10 @@ server {
rewrite ^/sdk/python$ /sdk/python/ redirect;
rewrite ^/cli$ /cli/ redirect;

# legacy links in some Actor READMEs
rewrite ^/scraping/tutorial/introduction$ /academy/apify-scrapers/getting-started permanent;
rewrite ^/scraping/tutorial/web-scraper$ /academy/apify-scrapers/web-scraper permanent;

# Articles moved from the platform documentation to the Academy
# Web Scraping 101
rewrite ^/platform/web-scraping-101$ /academy/web-scraping-for-beginners redirect;
141 changes: 75 additions & 66 deletions sources/platform/storage/dataset.md

Large diffs are not rendered by default.

Binary file modified sources/platform/storage/images/datasets-app.png
Binary file modified sources/platform/storage/images/datasets-detail.png
Binary file modified sources/platform/storage/images/find-store-id.png
Binary file modified sources/platform/storage/images/key-value-stores-app.png
Binary file modified sources/platform/storage/images/key-value-stores-detail.png
Binary file modified sources/platform/storage/images/overview-api.png
Binary file modified sources/platform/storage/images/request-queue-app.png
Binary file modified sources/platform/storage/images/request-queue-detail.png
243 changes: 23 additions & 220 deletions sources/platform/storage/index.md

Large diffs are not rendered by default.

101 changes: 54 additions & 47 deletions sources/platform/storage/key_value_store.md

Large diffs are not rendered by default.

105 changes: 56 additions & 49 deletions sources/platform/storage/request_queue.md

Large diffs are not rendered by default.

196 changes: 196 additions & 0 deletions sources/platform/storage/usage.md
@@ -0,0 +1,196 @@
---
title: Usage
description: Learn how to effectively use Apify's storage options. Understand key aspects of data retention, rate limiting, and secure sharing.
sidebar_position: 9.1
category: platform
slug: /storage/usage
---

**Learn how to effectively use Apify's storage options. Understand key aspects of data retention, rate limiting, and secure sharing.**

---

## Dataset {#dataset}

[Dataset](./dataset.md) storage allows you to store a series of data objects, such as results from web scraping, crawling, or data processing jobs. You can export your datasets in JSON, CSV, XML, RSS, Excel, or HTML formats.

![Dataset graphic](../images/datasets-overview.png)

## Key-value store {#key-value-store}

The [key-value store](./key_value_store.md) is ideal for saving data records such as files, screenshots of web pages, and PDFs or for persisting your Actor's state. The records are accessible under a unique name and can be written and read quickly.

![Key-value store graphic](../images/key-value-overview.svg)


## Request queue {#request-queue}

[Request queues](./request_queue.md) allow you to dynamically maintain a queue of URLs of web pages. You can use this when recursively crawling websites: you start from initial URLs and add new links as they are found while skipping duplicates.

![Request queue graphic](../images/request-queue-overview.svg)

## Basic usage {#basic-usage}

There are several ways to access your storage:

* [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface.
* [JavaScript SDK](/sdk/js) - when building your own JavaScript Actor.
* [Python SDK](/sdk/python) - when building your own Python Actor.
* [JavaScript API client](/api/client/js) - to access your storages from any Node.js application.
* [Python API client](/api/client/python) - to access your storages from any Python application.
* [Apify API](/api/v2#/reference/key-value-stores) - to access your storages programmatically.

### Apify Console {#apify-console}

To access your storages via Apify Console, navigate to the [**Storage**](https://console.apify.com/storage) section in the left-side menu. From there, you can click through the tabs to view your key-value stores, datasets, and request queues, and you can click on the **API** button in the top right corner to view related API endpoints. To view a storage, click its **ID**.

![Storages in app](./images/datasets-app.png)

> Use the **Include unnamed storages** checkbox to either display or hide unnamed storages. By default, Apify Console displays them.

You can edit your store's name by clicking on the **Actions** menu and selecting **Rename**.

Additionally, you can quickly share the contents and details of your storage by selecting **Share** under the **Actions** menu and providing an email, username, or user ID.

![Storage API](./images/overview-api.png)

These URLs link to API _endpoints_—the places where your data is stored. Endpoints that allow you to _read_ stored information do not require an [authentication token](/api/v2#/introduction/authentication). Calls are authenticated using a hard-to-guess ID, allowing for secure sharing. However, operations such as _update_ or _delete_ require the authentication token.

> Never share a URL containing your authentication token, to avoid compromising your account's security. <br/>
> If the data you want to share requires a token, first download the data, then share it as a file.

### JavaScript SDK {#javascript-sdk}

The Apify [JavaScript SDK](https://github.com/apify/apify-sdk-js) is a JavaScript/Node.js library that provides tools for building your own Actors. It requires [Node.js](https://nodejs.org/en/) 16 or later.

### Python SDK {#python-sdk}

The Apify [Python SDK](https://github.com/apify/apify-sdk-python) is a Python library that provides tools for building your own Actors. It requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above.

### JavaScript API client {#javascript-api-client}

The Apify [JavaScript API client](https://github.com/apify/apify-client-js) (`apify-client`) allows you to access your storages from any Node.js application, whether it is running on the Apify platform or externally.

Go to the [client's documentation](/api/client/js/docs) for help with setup.

### Python API client {#python-api-client}

The Apify [Python API client](https://github.com/apify/apify-client-python) (`apify-client`) allows you to access your storages from any Python application, whether it is running on the Apify platform or externally. It requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above.

Go to the [client's documentation](/api/client/python/docs/quick-start) for help with setup.

### Apify API {#apify-api}

The [Apify API](/api/v2#/reference/key-value-stores) allows you to access your storages programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results.

In most cases, when accessing your storages via API, you will need to provide a `store ID`, which you can do in the following formats:

* `WkzbQMuFYuamGv3YF` - the store's alphanumerical ID if the store is unnamed.
* `~store-name` - the store's name prefixed with a tilde (`~`) character if the store is named (e.g. `~ecommerce-scraping-results`).
* `username~store-name` - username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. `janedoe~ecommerce-scraping-results`). Note that in this case, the store's owner needs to grant you access first.

For read (GET) requests, it is enough to use a store's alphanumerical ID, since the ID is hard to guess and effectively serves as an authentication key.

With other request types, and when using the `username~store-name` format, you will need to provide your secret API token in your request's [`Authorization`](/api/v2#/introduction/authentication) header or as a query parameter. You can find your token on the [Integrations](https://console.apify.com/account?tab=integrations) page of your Apify account.
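The three `store ID` formats and the `Authorization` header described above can be sketched in a few lines of Python. This is only an illustration: the helper names `store_identifier` and `auth_headers` are hypothetical, and the `Bearer` scheme is an assumption that should be checked against the authentication section of the API documentation.

```python
def store_identifier(store_id=None, name=None, username=None):
    """Return a store ID string in one of the three formats listed above.

    Hypothetical helper, not part of any Apify library.
    """
    if store_id:
        # Unnamed store: use the alphanumerical ID as-is.
        return store_id
    if not name:
        raise ValueError("Provide either store_id or name")
    if username:
        # Named store that belongs to a different account.
        return f"{username}~{name}"
    # Named store in your own account.
    return f"~{name}"


def auth_headers(token):
    """Build the Authorization header (assumes the Bearer scheme)."""
    return {"Authorization": f"Bearer {token}"}


print(store_identifier(store_id="WkzbQMuFYuamGv3YF"))
print(store_identifier(name="ecommerce-scraping-results"))
print(store_identifier(name="ecommerce-scraping-results", username="janedoe"))
```

The resulting string goes wherever an endpoint expects the store ID, and the headers are only needed for the cases described above where a token is required.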

For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/datasets).

## Rate limiting {#rate-limiting}

All API endpoints limit their rate of requests to protect Apify servers from overloading. The default rate limit for storage objects is _30 requests per second_. However, there are exceptions limited to _200 requests per second_ per storage object, including:

* [Push items](/api/v2#/reference/datasets/item-collection/put-items) to dataset.
* CRUD ([add](/api/v2#/reference/request-queues/request-collection/add-request),
[get](/api/v2#/reference/request-queues/request-collection/get-request),
[update](/api/v2#/reference/request-queues/request-collection/update-request),
[delete](/api/v2#/reference/request-queues/request-collection/delete-request))
operations of _request queue_ requests.

If a client exceeds this limit, the API endpoint responds with the HTTP status code `429 Too Many Requests` and the following body:

```json
{
    "error": {
        "type": "rate-limit-exceeded",
        "message": "You have exceeded the rate limit of ... requests per second"
    }
}
```

Go to the [API documentation](/api/v2#/introduction/rate-limiting) for details and to learn what to do if you exceed the rate limit.
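One common way to handle a `429 Too Many Requests` response is to retry with exponential backoff. The sketch below is a minimal illustration, not a definitive implementation: `call_with_backoff`, its parameters, and the `send` callable (which is assumed to return a `(status_code, body)` tuple) are all hypothetical names chosen for this example.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or retries run out.

    send: zero-argument callable returning (status_code, body) (assumed shape).
    sleep: injectable so the waiting can be replaced in tests.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Double the wait on every attempt and add jitter so that
        # parallel clients do not retry at the same moment.
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    # Retries exhausted: hand the last 429 response back to the caller.
    return status, body
```

In real code, `send` would wrap the HTTP call you are already making. The official API clients may already handle retries for you, so check their documentation before adding your own.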

## Data retention {#data-retention}

Named datasets are retained indefinitely.
Unnamed datasets expire after 7 days unless otherwise specified.
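The retention rule can be written down as a small function. This is a sketch under assumptions: `expires_at` is a hypothetical helper, the 7-day default comes from the text above, and which timestamp the retention period is counted from is not stated here, so the caller supplies a reference time.

```python
from datetime import datetime, timedelta, timezone


def expires_at(reference_time, name=None, retention_days=7):
    """Return the expiry time of a storage, or None if it never expires.

    reference_time: the moment the retention period is counted from
    (an assumption; this page does not specify it).
    """
    if name:
        # Named storages are retained indefinitely.
        return None
    return reference_time + timedelta(days=retention_days)


start = datetime(2023, 12, 4, tzinfo=timezone.utc)
print(expires_at(start))                     # unnamed: 7 days later
print(expires_at(start, name="my-results"))  # named: None
```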

### Preserving your storages {#preserving-storages}

To ensure indefinite retention of your storages, assign them a name. This can be done via Apify Console or through our API. First, you'll need your store's ID. You can find it in the details of the run that created it. In Apify Console, head over to your run's details and select the **Dataset**, **Key-value store**, or **Request queue** tab as appropriate. Check that store's details, and you will find its ID among them.

![Finding your store's ID](./images/find-store-id.png)

Find and open your storage by clicking the ID, click on the **Actions** menu, choose **Rename**, and enter its new name in the field. Your storage will now be preserved indefinitely.

To name your storage via API, get its ID from the run that generated it using the [Get run](/api/v2#/reference/actor-runs/run-object-and-its-storages/get-run) endpoint. You can then give it a new name using the `Update [storage]` endpoint. For example, [Update dataset](/api/v2#/reference/datasets/dataset/update-dataset).
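As a rough illustration of the rename step for a dataset, the sketch below builds the HTTP request without sending it. The URL path, the `PUT` method, the `{"name": ...}` body, and the `Bearer` header are assumptions to verify against the Update dataset reference; `build_rename_dataset_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def build_rename_dataset_request(dataset_id, new_name, token):
    """Build (but do not send) a request that gives a dataset a name."""
    body = json.dumps({"name": new_name}).encode("utf-8")
    return Request(
        url=f"{API_BASE}/datasets/{dataset_id}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Updating a storage requires your API token.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_rename_dataset_request("WkzbQMuFYuamGv3YF", "my-results", "MY_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The dataset ID and token shown here are placeholders.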

Our SDKs and clients each have unique naming conventions for storages. For more information, check out their documentation:

SDKs:

* [JavaScript](/sdk/js)
* [Python](/sdk/python)

Clients:

* [JavaScript](/api/client/js/)
* [Python](/api/client/python/)

## Named and unnamed storages {#named-and-unnamed-storages}

The default storages for an Actor run are unnamed, identified only by an _ID_. This allows them to expire after 7 days (or longer on paid plans), conserving your storage space. If you want to preserve a storage, [assign it a name](#preserving-storages), and it will be retained indefinitely.

> Storages' names can be up to 63 characters long.

Named and unnamed storages are identical in all aspects except for their retention period. The key advantage of named storages is that they are easier to identify, which helps you verify that you are using the correct store.

For example, storage names `janedoe~my-storage-1` and `janedoe~web-scrape-results` are easier to tell apart than the alphanumerical IDs `cAbcYOfuXemTPwnIB` and `CAbcsuZbp7JHzkw1B`.

## Sharing {#sharing}

You can grant [access rights](../collaboration/index.md) to other Apify users to view or modify your storages. Check the [full list of permissions](../collaboration/list_of_permissions.md).

### Sharing storages between runs {#sharing-storages-between-runs}

Storage can be accessed from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run, provided you have its _name_ or _ID_. You can access and manage storages from other runs using the same methods or endpoints as with storages from your current run.

[Datasets](./dataset.md) and [key-value stores](./key_value_store.md) support concurrent use by multiple Actors. Thus, several Actors or tasks can simultaneously write data to a single dataset or key-value store. Similarly, multiple runs can read data from datasets and key-value stores at the same time.

[Request queues](./request_queue.md), on the other hand, only allow multiple runs to add new data. A request queue can only be processed by one Actor or task run at any one time.

> When multiple runs try to write data to a storage simultaneously, the order of data writing cannot be controlled. Data is written as each request is processed. <br/>
> A similar principle applies to key-value stores and request queues: when a delete request for a record precedes a read request for the same record, the read request will fail.

## Deleting storages {#deleting-storages}

Named storages are only removed upon your request.<br/>
You can delete storages in the following ways:

* [Apify Console](https://console.apify.com/storage) - using the **Actions** button in the store's detail page.
* [JavaScript SDK](/sdk/js) - using the `.drop()` method of the
[Dataset](/sdk/js/api/apify/class/Dataset#drop),
[Key-value store](/sdk/js/api/apify/class/KeyValueStore#drop),
or [Request queue](/sdk/js/api/apify/class/RequestQueue#drop) class.
* [Python SDK](/sdk/python) - using the `.drop()` method of the
[Dataset](/sdk/python/reference/class/Dataset#drop),
[Key-value store](/sdk/python/reference/class/KeyValueStore#drop),
or [Request queue](/sdk/python/reference/class/RequestQueue#drop) class.
* [JavaScript API client](/api/client/js) - using the `.delete()` method in the
[dataset](/api/client/js/reference/class/DatasetClient),
[key-value store](/api/client/js/reference/class/KeyValueStoreClient),
or [request queue](/api/client/js/reference/class/RequestQueueClient) clients.
* [Python API client](/api/client/python) - using the `.delete()` method in the
[dataset](/api/client/python#datasetclient),
[key-value store](/api/client/python/reference/class/KeyValueStoreClient),
or [request queue](/api/client/python/reference/class/RequestQueueClient) clients.
* [API](/api/v2#/reference/key-value-stores/store-object/delete-store) - using the `Delete [store]` endpoint, where `[store]` is the type of storage you want to delete.
