diff --git a/README.md b/README.md index 269e8f40c..602ae61e3 100644 --- a/README.md +++ b/README.md @@ -11,8 +11,8 @@ This repository is the home of Apify's documentation, which you can find at [doc > [!IMPORTANT] > Before you contribute to Apify documentation with your first pull request, please read the following 2 articles: > -> - [Contributing guidelines](CONTRIBUTING.md), where you learn about the project structure, local development, and testing. -> - [Style guide](#style-guide), here below, where you learn how to keep the documentation style consistent. +> - [Contributing guidelines](CONTRIBUTING.md), where you learn about the project structure, local development, testing, and setting up the [redirects](./CONTRIBUTING.md#redirects) to make sure we keep our SEO juice 🍊. +> - [Style guide](#style-guide), here below 👇, where you learn how to keep the documentation style consistent. ## Style guide diff --git a/nginx.conf b/nginx.conf index 2eee1e0a1..905ebb7bf 100644 --- a/nginx.conf +++ b/nginx.conf @@ -220,6 +220,10 @@ server { rewrite ^/sdk/python$ /sdk/python/ redirect; rewrite ^/cli$ /cli/ redirect; + # legacy links in some Actor READMEs + rewrite ^/scraping/tutorial/introduction$ /academy/apify-scrapers/getting-started permanent; + rewrite ^/scraping/tutorial/web-scraper$ /academy/apify-scrapers/web-scraper permanent; + # Articles moved from the platform documentation to the Academy # Web Scraping 101 rewrite ^/platform/web-scraping-101$ /academy/web-scraping-for-beginners redirect; diff --git a/sources/platform/storage/dataset.md b/sources/platform/storage/dataset.md index 11e3c5d0f..03bf40042 100644 --- a/sources/platform/storage/dataset.md +++ b/sources/platform/storage/dataset.md @@ -1,7 +1,7 @@ --- title: Dataset description: Store and export web scraping, crawling or data processing job results. Learn how to access and manage datasets in Apify Console or via API. 
-sidebar_position: 9.1 +sidebar_position: 9.2 slug: /storage/dataset --- @@ -14,43 +14,48 @@ import TabItem from '@theme/TabItem'; --- -Dataset storage enables you to sequentially save and retrieve data. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. +Dataset storage enables you to sequentially save and retrieve data. A unique dataset is automatically created and assigned to each Actor run when the first item is stored. -Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats. +Typically, datasets comprise results from web scraping, crawling, and data processing jobs. You can visualize this data in a table, where each object forms a row and its attributes are represented as columns. You have the option to export data in various formats, including JSON, CSV, XML, Excel, HTML Table, RSS or JSONL. > Named datasets are retained indefinitely.
> Unnamed datasets expire after 7 days unless otherwise specified.
> [Learn more](./index#named-and-unnamed-storages) -Dataset storage is **append-only** - data can only be added and cannot be changed or deleted. +Dataset storage is _append-only_ - data can only be added and cannot be modified or deleted once stored. ## Basic usage {#basic-usage} -There are five ways to access your datasets: +There are several ways to access your datasets: -* [Apify Console](https://console.apify.com/storage?tab=datasets) - provides an easy-to-understand interface [[details](#apify-console)]. -* [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset) - when building your own JavaScript Actor [[details](#javascript-sdk)]. -* [Python SDK](sdk/python/docs/concepts/storages#working-with-datasets) - when building your own Python Actor [[details](#python-sdk)]. -* [JavaScript API client](/api/client/js/reference/class/DatasetClient) - to access your datasets from any Node.js application [[details](#javascript-api-client)]. -* [Python API client](/api/client/python/reference/class/DatasetClient) - to access your datasets from any Python application [[details](#python-api-client)]. -* [Apify API](/api/v2#/reference/datasets) - for accessing your datasets programmatically [[details](#apify-api)]. +* [Apify Console](https://console.apify.com/storage?tab=datasets) - provides an easy-to-understand interface. +* [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset) - when building your own JavaScript Actor. +* [Python SDK](sdk/python/docs/concepts/storages#working-with-datasets) - when building your own Python Actor. +* [JavaScript API client](/api/client/js/reference/class/DatasetClient) - to access your datasets from any Node.js application. +* [Python API client](/api/client/python/reference/class/DatasetClient) - to access your datasets from any Python application. +* [Apify API](/api/v2#/reference/datasets) - to access your datasets programmatically. 
### Apify Console {#apify-console} In [Apify Console](https://console.apify.com), you can view your datasets in the [Storage](https://console.apify.com/storage) section under the [Datasets](https://console.apify.com/storage?tab=datasets) tab. -Only named datasets are displayed by default. Select the **Include unnamed datasets** checkbox to display all of your datasets. - ![Datasets in app](./images/datasets-app.png) -To view or download a dataset in the above-mentioned formats, click on its **Dataset ID**. Under the **Settings** tab, you can update the dataset's name (and, in turn, its [retention period](./index.md)) and -[access rights](../collaboration/index.md). Click on the `API` button to view and test the dataset's [API endpoints](/api/v2#/reference/datasets). +To view or download a dataset: + +1. Click on its **Dataset ID**. +2. Select the format and, if desired, configure other options in the **Export dataset** section. +3. Click **Download**. + +Use the **Actions** menu to modify the dataset's name, which also affects its [retention period](./index#data-retention-data-retention), and to adjust [access rights](../collaboration/index.md). The **API** button allows you to explore and test the dataset's [API endpoints](/api/v2#/reference/datasets). ![Datasets detail view](./images/datasets-detail.png) ### JavaScript SDK {#javascript-sdk} -If you are building a JavaScript [Actor](../actors/index.mdx), you will be using the [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset). The dataset is represented by a [`Dataset`](/sdk/js/reference/class/Dataset) class. You can use the class to specify whether your data is stored locally or in the Apify cloud and push data to the datasets of your choice using the [`pushData()`](/sdk/js/reference/class/Dataset#pushData) method. 
You could also use other methods such as [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map) and [`reduce()`](/sdk/js/reference/class/Dataset#reduce), see the [example](/sdk/js/docs/examples/map-and-reduce). +When working with a JavaScript [Actor](../actors/index.mdx), the [JavaScript SDK](/sdk/js/docs/guides/result-storage#dataset) is an essential tool, especially for dataset management. It simplifies the tasks of storing and retrieving data, seamlessly integrating with the Actor's workflow. Key features of the SDK include the ability to append data, retrieve what is stored, and manage dataset properties effectively. Central to this functionality is the [`Dataset`](/sdk/js/reference/class/Dataset) class. This class allows you to determine where your data is stored - locally or in the Apify cloud. To add data to your chosen datasets, use the [`pushData()`](/sdk/js/reference/class/Dataset#pushData) method. + +Additionally, the SDK offers other methods like [`getData()`](/sdk/js/reference/class/Dataset#getData), [`map()`](/sdk/js/reference/class/Dataset#map), and [`reduce()`](/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](/sdk/js/docs/examples/map-and-reduce) section. If you have chosen to store your dataset locally, you can find it in the location below. @@ -58,7 +63,7 @@ If you have chosen to store your dataset locally, you can find it in the locatio {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` -**DATASET_ID** refers to the dataset's **name** or **ID**. The default dataset will be stored in the **default** directory. +`DATASET_ID` refers to the dataset's _name_ or _ID_. The default dataset will be stored in the _default_ directory. 
To add data to the default dataset, you can use the example below: @@ -79,9 +84,9 @@ await Actor.pushData([{ foo: 'hotel' }, { foo: 'cafe' }]); await Actor.exit(); ``` -> Make sure to use the `await` keyword when calling `pushData()`, otherwise the Actor process might finish before the data are stored. +> It's crucial to use the `await` keyword when calling `pushData()` to ensure data storage completes before the Actor process terminates. -If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [Actor.openDataset()](/sdk/js/reference/class/Actor#openDataset) method. +If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [`Actor.openDataset()`](/sdk/js/reference/class/Actor#openDataset) method. ```js import { Actor } from 'apify'; @@ -99,7 +104,7 @@ await dataset.pushData({ foo: 'bar' }); await Actor.exit(); ``` -When using the [`getData()`](/sdk/js/reference/class/Dataset#getData) method, you can specify the data you retrieve using the `fields` option. It should be an array of field names (strings) that will be included in the results. To include all the results, exclude the `fields` parameter. +Use the `fields` option in the [`getData()`](/sdk/js/reference/class/Dataset#getData) method to specify which data fields to retrieve. This option accepts an array of field names (strings) to include in your results. ```js import { Actor } from 'apify'; @@ -118,19 +123,19 @@ const hotelAndCafeData = await dataset.getData({ await Actor.exit(); ``` -See the [JavaScript SDK documentation](/sdk/js/docs/guides/result-storage#dataset) and the `Dataset` class's [API reference](/sdk/js/reference/class/Dataset) for details on managing datasets with the JavaScript SDK. 
+Check out the [JavaScript SDK documentation](/sdk/js/docs/guides/result-storage#dataset) and the `Dataset` class's [API reference](/sdk/js/reference/class/Dataset) for details on managing datasets with the JavaScript SDK. ### Python SDK {#python-sdk} -If you are building a Python [Actor](../actors/index.mdx), you will be using the [Python SDK](/sdk/python/docs/concepts/storages#working-with-datasets). The dataset is represented by a [`Dataset`](/sdk/python/reference/class/Dataset) class. You can use the class to specify whether your data is stored locally or in the Apify cloud and push data to the datasets of your choice using the [`push_data()`](/sdk/python/reference/class/Dataset#push_data) method. You could also use other methods such as [`get_data()`](/sdk/python/reference/class/Dataset#get_data), [`map()`](/sdk/python/reference/class/Dataset#map) and [`reduce()`](/sdk/python/reference/class/Dataset#reduce). +For Python [Actors](../actors/index.mdx), the [Python SDK](/sdk/python/docs/concepts/storages#working-with-datasets) is essential. The dataset is represented by a [`Dataset`](/sdk/python/reference/class/Dataset) class. You can use this class to specify whether your data is stored locally or in the Apify cloud and push data to the datasets of your choice using the [`push_data()`](/sdk/python/reference/class/Dataset#push_data) method. For further data manipulation you could also use other methods such as [`get_data()`](/sdk/python/reference/class/Dataset#get_data), [`map()`](/sdk/python/reference/class/Dataset#map) and [`reduce()`](/sdk/python/reference/class/Dataset#reduce). -If you have chosen to store your dataset locally, you can find it in the location below. +For datasets stored locally, the data is located at the following path: ```text {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` -**DATASET_ID** refers to the dataset's **name** or **ID**. The default dataset will be stored in the **default** directory. 
+The `DATASET_ID` refers to the dataset's _name_ or _ID_. The default dataset will be stored in the _default_ directory. To add data to the default dataset, you can use the example below: @@ -146,7 +151,7 @@ async def main(): await Actor.push_data([{'foo': 'hotel'}, {'foo': 'cafe'}]) ``` -If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [Actor.open_dataset()](/sdk/python/reference/class/Actor#open_dataset) method. +If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [`Actor.open_dataset()`](/sdk/python/reference/class/Actor#open_dataset) method. ```python from apify import Actor @@ -160,7 +165,7 @@ async def main(): await dataset.push_data({'foo': 'bar'}) ``` -When using the [`get_data()`](/sdk/python/reference/class/Dataset#get_data) method, you can specify the data you retrieve using the `fields` option. It should be an array of field names (strings) that will be included in the results. To include all the results, exclude the `fields` parameter. +Use the `fields` option in the [`get_data()`](/sdk/python/reference/class/Dataset#get_data) method to specify which data fields to retrieve. This option accepts an array of field names (strings) to include in your results. ```python from apify import Actor @@ -173,11 +178,11 @@ async def main(): hotel_and_cafe_data = await dataset.get_data(fields=['hotel', 'cafe']) ``` -See the [Python SDK documentation](/sdk/python/docs/guides/result-storage#dataset) and the `Dataset` class's [API reference](/sdk/python/reference/class/Dataset) for details on managing datasets with the Python SDK. +For details on managing datasets with the Python SDK, see the [Python SDK documentation](/sdk/python/docs/guides/result-storage#dataset) and the `Dataset` class's [API reference](/sdk/python/reference/class/Dataset). 
### JavaScript API client {#javascript-api-client} -Apify's [JavaScript API client](/api/client/js/reference/class/DatasetClient) (`apify-client`) allows you to access your datasets from any Node.js application, whether it is running on the Apify platform or elsewhere. +The [JavaScript API client](/api/client/js/reference/class/DatasetClient) (`apify-client`) gives you access to your datasets from any Node.js application, whether hosted on the Apify platform or externally. After importing and initiating the client, you can save each dataset to a variable for easier access. @@ -187,13 +192,13 @@ const myDatasetClient = apifyClient.dataset('jane-doe/my-dataset'); You can then use that variable to [access the dataset's items and manage it](/api/client/js/reference/class/DatasetClient). -> When using the [`.listItems()`](/api/client/js/reference/class/DatasetClient#listItems) method, if you mention the same field name in the `field` and `omit` parameters, the `omit` parameter will prevail and the field will not be returned. +> When using the [`.listItems()`](/api/client/js/reference/class/DatasetClient#listItems) method, if you set both the `omit` and `field` parameters to the same value, the `omit` parameter takes precedence and the field is excluded from the results. -See the [JavaScript API client documentation](/api/client/js/reference/class/DatasetClient) for [help with setup](/api/client/js/docs) and more details. +Check out the [JavaScript API client documentation](/api/client/js/reference/class/DatasetClient) for [help with setup](/api/client/js/docs) and more details. ### Python API client {#python-api-client} -Apify's [Python API client](/api/client/python/reference/class/DatasetClient) (`apify-client`) allows you to access your datasets from any Python application, whether it is running on the Apify platform or elsewhere. 
+The [Python API client](/api/client/python/reference/class/DatasetClient) (`apify-client`) gives you access to your datasets from any Python application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each dataset to a variable for easier access. @@ -203,59 +208,57 @@ my_dataset_client = apify_client.dataset('jane-doe/my-dataset') You can then use that variable to [access the dataset's items and manage it](/api/client/python/reference/class/DatasetClient). -> When using the [`.list_items()`](/api/client/python/reference/class/DatasetClient#list_items) method, if you mention the same field name in the `field` and `omit` parameters, the `omit` parameter will prevail and the field will not be returned. +> When using the [`.list_items()`](/api/client/python/reference/class/DatasetClient#list_items) method, if you set both the `omit` and `field` parameters to the same value, the `omit` parameter takes precedence and the field is excluded from the results. -See the [Python API client documentation](/api/client/python/reference/class/DatasetClient) for [help with setup](/api/client/python/docs/quick-start) and more details. +Check out the [Python API client documentation](/api/client/python/reference/class/DatasetClient) for [help with setup](/api/client/python/docs/quick-start) and more details. ### Apify API {#apify-api} -The [Apify API](/api/v2#/reference/datasets) allows you to access your datasets programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results. +The [Apify API](/api/v2#/reference/datasets) gives you programmatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). -If you are accessing your datasets using the **username~store-name** [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). 
You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) page of your Apify account. +If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of the **Settings** page of your Apify account. -> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#introduction/authentication)). +> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](../integrations/api.md#authentication)). -To **get a list of your datasets**, send a GET request to the [Get list of datasets](/api/v2#/reference/datasets/get-list-of-datasets) endpoint. +To retrieve a list of your datasets, send a GET request to the [Get list of datasets](/api/v2#/reference/datasets/get-list-of-datasets) endpoint. ```text https://api.apify.com/v2/datasets ``` -To **get information about a dataset** such as its creation time and **item count**, send a GET request to the [Get dataset](/api/v2#/reference/datasets/dataset/get-dataset) endpoint. +To get information about a dataset such as its creation time and item count, send a GET request to the [Get dataset](/api/v2#/reference/datasets/dataset/get-dataset) endpoint. ```text https://api.apify.com/v2/datasets/{DATASET_ID} ``` -To **view a dataset's data**, send a GET request to the -[Get dataset items](/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. +To view a dataset's data, send a GET request to the [Get dataset items](/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. 
```text https://api.apify.com/v2/datasets/{DATASET_ID}/items ``` -You can **specify which data are exported** by adding a comma-separated list of fields to the **fields** query parameter. Likewise, you can also omit certain fields using the **omit** parameter. +Control the data export by appending a comma-separated list of fields to the `fields` query parameter. Likewise, you can also omit certain fields using the `omit` parameter. -**If you both specify and omit the same field in a request, the**omit**parameter will prevail and the field will not be returned.** +>If you fill both `omit` and `field` parameters with the same value, then >`omit` parameter will take precedence and the field is excluded from the >results. -In addition, you can set the format in which you retrieve the data using the **?format=** parameter. The available formats are **json**, **jsonl**, **csv**, **html**, **xlsx**, **xml** and **rss**. The default value is **json**. +In addition, you can set the format in which you retrieve the data using the `?format=` parameter. The available formats are `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`. -To retrieve the **hotel** and **cafe** fields, you would send your GET request to the URL below. +To retrieve the `hotel` and `cafe` fields, you would send your GET request to the URL below. ```text https://api.apify.com/v2/datasets/{DATASET_ID}/items?format=json&fields=hotel%2Ccafe ``` -> Instead of commas, you will need to use the `%2C` code, which represents `,` in URL encoding.
-> Learn more about URL encoding [here](https://www.url-encode-decode.com). +> In the URL, use `%2C` instead of commas, as `%2C` represents a comma in URL encoding. For more on URL encoding, check out [this page](https://www.url-encode-decode.com). -To **add data to a dataset**, send a POST request, with a JSON object containing the data you want to add as the payload to the [Put items](/api/v2#/reference/datasets/item-collection/put-items) endpoint. +To add data to a dataset, issue a POST request to the [Put items](/api/v2#/reference/datasets/item-collection/put-items) endpoint with the data as a JSON object payload. ```text https://api.apify.com/v2/datasets/{DATASET_ID}/items ``` -> Pushing data to dataset via API is limited to **200** requests per second to prevent our servers from being overloaded. +> Pushing data to a dataset via the API is capped at _200 requests per second_ to avoid overloading our servers. Example payload: @@ -273,14 +276,13 @@ Example payload: ] ``` -See the [API documentation](/api/v2#/reference/datasets) for a detailed breakdown of each API endpoint. +For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/datasets). ## Hidden fields {#hidden-fields} -Top-level fields starting with the `#` character are considered hidden. -These fields may be easily omitted when downloading the data from a dataset by providing the **skipHidden=1** or **clean=1** query parameters. This provides a convenient way to store debug information that should not appear in the final dataset. +Fields in a dataset that begin with a `#` are treated as hidden. You can exclude these fields when downloading data by using either `skipHidden=1` or `clean=1` in your query parameters. This feature is useful for excluding debug information from the final dataset output. -Below is an example of a dataset record containing hidden fields with an HTTP response and error. 
+The following example demonstrates a dataset record with hidden fields, including HTTP response and error details. ```json { @@ -296,11 +298,11 @@ Below is an example of a dataset record containing hidden fields with an HTTP re } ``` -Data without hidden fields are called "clean" and can be downloaded from [Apify Console](https://console.apify.com/storage?tab=datasets) using the "Clean items" link or via API using the **clean=true** or **clean=1** [URL parameters](/api/v2#/reference/datasets/item-collection/get-items). +Data excluding hidden fields, termed "clean" data, can be downloaded from the [Apify Console](https://console.apify.com/storage?tab=datasets) using the **Clean items** option. Alternatively, you can download it via API by applying `clean=true` or `clean=1` as [URL parameters](/api/v2#/reference/datasets/item-collection/get-items). ## XML format extension {#xml-format-extension} -When you export results to XML or RSS formats, object property names become XML tags, while the corresponding values become the tags' children. +In `XML` and `RSS` export formats, object property names are converted into XML tags, and their corresponding values are represented as children of these tags. For example, the JavaScript object: @@ -338,7 +340,7 @@ becomes the following XML snippet: ``` -If the JavaScript object contains a property named `@`, its sub-properties are exported as attributes of the parent XML element. If the parent XML element does not have any child elements, its value is taken from a JavaScript object property named `#`. +In a JavaScript object, if a property is named `@`, its sub-properties are exported as attributes of the corresponding parent XML element. Additionally, when the parent XML element lacks child elements, its value is sourced from a property named `#` in the JavaScript object. 
For example, the following JavaScript object: @@ -372,15 +374,15 @@ will be transformed to the following XML snippet: This feature is also useful when customizing your RSS feeds generated for various websites. -By default, the whole result is wrapped in an `` element, while each page object is contained in an `` element. You can change this using the `xmlRoot` and `xmlRow` URL parameters when GETting your data. +By default, the whole result is wrapped in an `` element, while each page object is contained in an `` element. You can change this using the `xmlRoot` and `xmlRow` URL parameters when retrieving your data with a GET request. ## Sharing {#sharing} -You can invite other Apify users to view or modify your datasets with the [access rights](../collaboration/index.md) system. See the [full list of permissions](../collaboration/list_of_permissions.md). +You can grant [access rights](../collaboration/index.md) to your dataset through the **Share** button under the **Actions** menu. For more details, check the [full list of permissions](../collaboration/list_of_permissions.md). ### Sharing datasets between runs {#sharing-datasets-between-runs} -You can access a dataset from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its **name** or **ID**. +You can access a dataset from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its _name_ or _ID_. To access a dataset from another run using the [JavaScript SDK](/sdk/js) or the [Python SDK](/sdk/python), open it using the same method as you would with any other dataset. @@ -413,34 +415,41 @@ async def main(): -In the [JavaScript API client](/api/client/js), you can access a dataset using [its client](/api/client/js/reference/class/DatasetClient). Once you've opened the dataset, read its contents and add new data like you would do with a dataset from your current run. 
+In the [JavaScript API client](/api/client/js/reference/class/DatasetClient) as well as in the [Python API client](/api/client/python/reference/class/DatasetClient), you can access a dataset using its client. Once you've opened the dataset, you can read its contents and add new data in the same manner as you would for a dataset from your current run. + + + ```js const otherDatasetClient = apifyClient.dataset('jane-doe/old-dataset'); ``` -Likewise, in the [Python API client](/api/client/python), you can access a dataset using [its client](/api/client/python/reference/class/DatasetClient). + + ```python other_dataset_client = apify_client.dataset('jane-doe/old-dataset') ``` + + + The same applies for the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you would normally do. See the [Storage overview](/platform/storage#sharing-storages-between-runs) for details on sharing storages between runs. ## Limits {#limits} -* Tabulated data storage formats (the ones that display the data in columns), such as HTML, CSV, and EXCEL, have a maximum limit of **3000** columns. All data that do not fit into this limit will not be retrieved. +* Data storage formats that use tabulation (like HTML, CSV, and EXCEL) are limited to a maximum of _3000_ columns. Data exceeding this limit will not be retrieved. -* When using the `pushData()` method, the size of the data is limited by the receiving API. Therefore, `pushData()` will only allow objects whose JSON representation is smaller than **9MB**. When an array is passed, none of the included objects may be larger than 9MB, however the array itself may be of any size. +* The `pushData()` method is constrained by the receiving API's size limit. It accepts objects with JSON size under _9MB_. While individual objects within an array must not exceed _9MB_, the overall size of the array is not restricted. -* Dataset names can be up to 63 characters long. +* The maximum length for dataset names is 63 characters. 
### Rate limiting {#rate-limiting} -When pushing data to a dataset via [API](/api/v2#/reference/datasets/item-collection/put-items), the request rate is limited to **200** per second per dataset. This helps protect Apify servers from being overloaded. +Pushing data to a dataset through the [API](/api/v2#/reference/datasets/item-collection/put-items) is limited to _200 requests per second_ for each dataset, a measure to prevent overloading Apify servers. -All other dataset API [endpoints](/api/v2#/reference/datasets) are limited to **30** requests per second per dataset. +For all other dataset [API endpoints](/api/v2#/reference/datasets), the rate limit is _30 requests per second_ for each dataset. -See the [API documentation](/api/v2#/introduction/rate-limiting) for details and to learn what to do if you exceed the rate limit. +Check out the [API documentation](/api/v2#/introduction/rate-limiting) for more information and guidance on actions to take if you exceed these rate limits. 
diff --git a/sources/platform/storage/images/datasets-app.png b/sources/platform/storage/images/datasets-app.png index 06d9d4e75..2fe5d3646 100644 Binary files a/sources/platform/storage/images/datasets-app.png and b/sources/platform/storage/images/datasets-app.png differ diff --git a/sources/platform/storage/images/datasets-detail.png b/sources/platform/storage/images/datasets-detail.png index e9652d08b..42bbaeef7 100644 Binary files a/sources/platform/storage/images/datasets-detail.png and b/sources/platform/storage/images/datasets-detail.png differ diff --git a/sources/platform/storage/images/find-store-id.png b/sources/platform/storage/images/find-store-id.png index ed00950a1..e9d7c2380 100644 Binary files a/sources/platform/storage/images/find-store-id.png and b/sources/platform/storage/images/find-store-id.png differ diff --git a/sources/platform/storage/images/key-value-stores-app.png b/sources/platform/storage/images/key-value-stores-app.png index 94fe63674..0aad087c3 100644 Binary files a/sources/platform/storage/images/key-value-stores-app.png and b/sources/platform/storage/images/key-value-stores-app.png differ diff --git a/sources/platform/storage/images/key-value-stores-detail.png b/sources/platform/storage/images/key-value-stores-detail.png index 5ed3c0d59..0c4eefbd0 100644 Binary files a/sources/platform/storage/images/key-value-stores-detail.png and b/sources/platform/storage/images/key-value-stores-detail.png differ diff --git a/sources/platform/storage/images/overview-api.png b/sources/platform/storage/images/overview-api.png index c3b865104..ddb87a72f 100644 Binary files a/sources/platform/storage/images/overview-api.png and b/sources/platform/storage/images/overview-api.png differ diff --git a/sources/platform/storage/images/request-queue-app.png b/sources/platform/storage/images/request-queue-app.png index 529d727b9..fe3f0b836 100644 Binary files a/sources/platform/storage/images/request-queue-app.png and 
b/sources/platform/storage/images/request-queue-app.png differ diff --git a/sources/platform/storage/images/request-queue-detail.png b/sources/platform/storage/images/request-queue-detail.png index 847937bd3..9995d0dcc 100644 Binary files a/sources/platform/storage/images/request-queue-detail.png and b/sources/platform/storage/images/request-queue-detail.png differ diff --git a/sources/platform/storage/index.md b/sources/platform/storage/index.md index e43b196f7..485c9fd47 100644 --- a/sources/platform/storage/index.md +++ b/sources/platform/storage/index.md @@ -6,229 +6,32 @@ category: platform slug: /storage --- +import Card from "@site/src/components/Card"; +import CardGrid from "@site/src/components/CardGrid"; + # Storage {#storage} -**Store anything from images and key-value pairs to structured output data. Learn how to access and manage your stored data from the Apify platform or via API.** +**Store anything from images and key-value pairs to structured output data. Learn how to access and manage your stored data on the Apify Console or via the API.** --- -The Apify platform includes four types of storage you can use both in your [Actors](../actors/index.mdx) and outside the Apify platform via [API](/api/v2#/): the [JavaScript SDK](/sdk/js), the [Python SDK](/sdk/python), the [JavaScript API client](/api/client/js), and the [Python API client](/api/client/python). - -This page contains a brief introduction of the three types of Apify Storage. - -* [Dataset](#dataset) - storage for data objects such as scraping output. -* [Key-value store](#key-value-store) - storage for arbitrary data records such as files, images, and strings. -* [Request queue](#request-queue) - a queue of URLs for your Actors to visit. - -You will then find [basic usage](#basic-usage) information relating to all types of storage. 
For example, how to manage your storage in [Apify Console](#apify-console), the basics of setting up the [JavaScript SDK and Crawlee](#javascript-sdk-and-crawlee), [Python SDK](#python-sdk), the [JavaScript API client](#javascript-api-client), and the [Python API client](/api/client/python). You will also find general information for using storage with the [Apify API](#apify-api). - -## Dataset {#dataset} - -[Dataset](./dataset.md) storage allows you to store a series of data objects such as results from web scraping, crawling or data processing jobs. You can export your datasets in JSON, CSV, XML, RSS, Excel or HTML formats. - -![Dataset graphic](../images/datasets-overview.png) - -The easiest way to access your datasets is via [Apify Console](https://console.apify.com/storage?tab=datasets), which provides a user-friendly interface for viewing or downloading the data and editing your datasets' properties. - -To manage your datasets, you can use the -[JavaScript SDK](/sdk/js/reference/class/Dataset), -[Python SDK](/sdk/python/reference/class/Dataset), -[JavaScript API client](/api/client/js/reference/class/DatasetClient), -[Python API client](/api/client/python#datasetclient), -or the [Apify API](/api/v2#/reference/datasets). - -[See the dataset documentation](./dataset.md) for details. - -## Key-value store {#key-value-store} - -The [key-value store](./key_value_store.md) is ideal for saving data records such as files, screenshots of web pages, and PDFs or for persisting your Actor's state. The records are accessible under a unique name and can be written and read quickly. - -![Key-value store graphic](../images/key-value-overview.svg) - -The easiest way to access your key-value stores is via -[Apify Console](https://console.apify.com/storage?tab=keyValueStores), which provides a user-friendly interface for viewing or downloading the data and editing your key-value stores' properties. 
- -To manage your key-value stores, you can use the -[JavaScript SDK](/sdk/js/reference/class/KeyValueStore), -[Python SDK](/sdk/python/reference/class/KeyValueStore), -[JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient), -[Python API client](/api/client/python/reference/class/KeyValueStoreClient), -or the [Apify API](/api/v2#/reference/key-value-stores). - -[See the key-value store documentation](./key_value_store.md) for details. - -## Request queue {#request-queue} - -[Request queues](./request_queue.md) allow you to dynamically maintain a queue of URLs of web pages. You can use this when recursively crawling websites: you start from initial URLs and add new links as they are found while skipping duplicates. - -![Request queue graphic](../images/request-queue-overview.svg) - -The easiest way to access your request queues is via -[Apify Console](https://console.apify.com/storage?tab=requestQueues), which provides a user-friendly interface for viewing your request queues and editing your queues' properties. - -To manage your request queues, you can use the -[JavaScript SDK](/sdk/js/reference/class/RequestQueue), -[Python SDK](/sdk/python/reference/class/RequestQueue), -[JavaScript API client](/api/client/js/reference/class/RequestQueueClient), -[Python API client](/api/client/python/reference/class/RequestQueueClient), -or the [Apify API](/api/v2#/reference/request-queues). - -[See the request queue documentation](./request_queue.md) for details. - -## Basic usage {#basic-usage} - -There are five ways to access your storage: - -* [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface [[details](#apify-console)]. -* JavaScript SDK ([Request storage](/sdk/js/docs/guides/request-storage), [Result storage](/sdk/js/docs/guides/result-storage)) - when building your own JavaScript Actor [[details](#javascript-sdk-and-crawlee)]. 
-* Python SDK ([Working with storages](/sdk/python/docs/concepts/storages)) - when building your own Python Actor [[detail](#python-sdk)]. -* [JavaScript API client](/api/client/js) - to access your storages from any Node.js application [[details](#javascript-api-client)]. -* [Python API client](/api/client/python) - to access your storages from any Python application [[details](#python-api-client)]. -* [Apify API](/api/v2#/reference/key-value-stores) - for accessing your storages programmatically [[details](#apify-api)]. - -### Apify Console {#apify-console} - -To access your storages from Apify Console, go to the [**Storage** section](https://console.apify.com/storage) in the left-side menu. From there, you can click through the tabs to view your key-value stores, datasets, request queues and related API endpoints. To view a storage, click its **ID**. - -![Storages in app](./images/datasets-app.png) - -> Only named storages are displayed by default. Select the **Include unnamed store** checkbox to display all of your storages. - -You can edit your stores' names by clicking their caption (ID or name) on their detail page. - -Under the **Settings** tab of their detail page, you can grant [access rights](../collaboration/index.md) to other Apify users. - -You can quickly share your storages' contents and details by sharing the URLs you find under the **API** tab in a store's detail page. - -![Storage API](./images/overview-api.png) - -These URLs provide links to API **endpoints**–the places where your data are stored. Endpoints that allow you to **read** stored information do not require an [authentication token](/api/v2#/introduction/authentication). The calls are authenticated using a hard-to-guess ID, so they can be shared freely. Operations such as **update** or **delete**, however, will need the authentication token. - -> Never share a URL containing your authentication token, as this will compromise your account's security.
-> If the data you want to share requires a token, first download the data, then share it as a file. - -### JavaScript SDK and Crawlee {#javascript-sdk-and-crawlee} - -The [Apify JavaScript SDK](/sdk/js) is a JavaScript/Node.js library providing tools to build your own Actors. [Crawlee](https://crawlee.dev/) is a JavaScript/Node.js library that allows you to build your own web scraping and automation solutions (it was formerly a part of the JavaScript SDK). Both libraries require [Node.js](https://nodejs.org/en/) 16 or later. - -See [Crawlee documentation](https://crawlee.dev/docs/quick-start) for setup instructions and to learn how to build your own crawlers and run them on the [Apify platform](https://crawlee.dev/docs/guides/apify-platform). - -### Python SDK {#python-sdk} - -The [Apify Python SDK](/sdk/python) is a Python library providing tools to build your own Actors. We do not currently have an alternative to Crawlee for Python, but we plan on developing it in the future. - -### JavaScript API client {#javascript-api-client} - -Apify's [JavaScript API client](/api/client/js) (`apify-client`) allows you to access your datasets from any Node.js application, whether it is running on the Apify platform or elsewhere. - -See the [client's documentation](/api/client/js/docs) for help with setup. - -### Python API client {#python-api-client} - -Apify's [Python API client](/api/client/python) (`apify-client`) allows you to access your datasets from any Python application, whether it is running on the Apify platform or elsewhere. - -See the [client's documentation](/api/client/python/docs/quick-start) for help with setup. - -### Apify API {#apify-api} - -The [Apify API](/api/v2#/reference/key-value-stores) allows you to access your storages programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results. 
- -In most cases, when accessing your storages via API, you will need to provide a **store ID**, which you can do in the following formats: - -* **WkzbQMuFYuamGv3YF** - the store's alphanumerical ID if the store is unnamed. -* **~store-name** - the store's name prefixed with tilde (`~`) character if the store is named (e.g. **~ecommerce-scraping-results**) -* **username~store-name** - username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. **janedoe~ecommerce-scraping-results**). Note that in this case, the store's owner needs to grant you access first. - -For read (GET) requests, it is enough to use a store's alphanumerical ID, since the ID is hard to guess and effectively serves as an authentication key. - -With other request types and when using the **username~store-name**, however, you will need to provide your secret API token in your request's [`Authorization`](/api/v2#/introduction/authentication) header or as a query parameter. You can find your token on the [Integrations](https://console.apify.com/account?tab=integrations) page of your Apify account. - -See the [API documentation](/api/v2#/reference/datasets) for details and a breakdown of each storage API endpoint. - -## Rate limiting {#rate-limiting} - -All API endpoints limit their rate of requests to protect Apify servers from overloading. The default rate limit is **30** requests per second per storage object, with a few exceptions, which are limited to **200** requests per second per storage object: - -* [Push items](/api/v2#/reference/datasets/item-collection/put-items) to dataset. 
-* CRUD ([add](/api/v2#/reference/request-queues/request-collection/add-request), -[get](/api/v2#/reference/request-queues/request-collection/get-request), -[update](/api/v2#/reference/request-queues/request-collection/update-request), -[delete](/api/v2#/reference/request-queues/request-collection/delete-request)) -operations of **request queue** requests. - -If a client sends too many requests, the API endpoints respond with the HTTP status code `429 Too Many Requests` and the following body: - -```json -{ - "error": { - "type": "rate-limit-exceeded", - "message": "You have exceeded the rate limit of ... requests per second" - } -} -``` - -See the [API documentation](/api/v2#/introduction/rate-limiting) for details and to learn what to do if you exceed the rate limit. - -## Data retention {#data-retention} - -Unnamed storages expire after 7 days unless otherwise specified. Named storages are retained indefinitely. - -### Preserving your storages {#preserving-storages} - -To preserve your storages indefinitely, give them a name. You can do this in Apify Console or using our API. First, you'll need your store's ID. You can find it in the details of the run that created it. In Apify Console, head over to your run's details and select the **Dataset**, **Key-value store**, or **Request queue** tab as appropriate. Check that store's details, and you will find its ID among them. - -![Finding your store's ID](./images/find-store-id.png) - -Then, head over to the **Storage** menu, select the appropriate tab, and tick the **Include unnamed \[storages\]** box. Find and open your storage using the ID you just found, select the Settings tab, and enter its new name in the field. Your storage will now be preserved indefinitely. - -To name your storage via API, get its ID from the run that generated it using the [Get run](/api/v2#/reference/actor-runs/run-object-and-its-storages/get-run) endpoint. You can then give it a new name using the **Update \[storage\]** endpoint. 
For example, [Update dataset](/api/v2#/reference/datasets/dataset/update-dataset). - -The [JavaScript SDK](/sdk/js), [Crawlee](https://crawlee.dev/), The [Python SDK](/sdk/python), the [JavaScript](/api/client/js/) and [Python](/api/client/python/) clients have their own ways of naming storages - check their docs for details. - -## Named and unnamed storages {#named-and-unnamed-storages} - -The default storages for an Actor run are created without a name (with only an **ID**). This allows them to expire after 7 days (on the free plan, longer on paid plans) and not take up your storage space. If you want to preserve a storage, simply [give it a name](#preserving-storages), and it will be retained indefinitely. - -> Storages' names can be up to 63 characters long. - -Named and unnamed storages are the same in all regards except their retention period. The only difference is that named storages make it easier to verify you are using the correct store. - -For example, the storage names **janedoe~my-storage-1** and **janedoe~web-scrape-results** are easier to tell apart than the alphanumerical IDs **cAbcYOfuXemTPwnIB** and **CAbcsuZbp7JHzkw1B**. - -## Sharing {#sharing} - -You can invite other Apify users to view or modify your storages with the [access rights](../collaboration/index.md) system. See the [full list of permissions](../collaboration/list_of_permissions.md). - -### Sharing storages between runs {#sharing-storages-between-runs} - -Any storage can be accessed from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its **name** or **ID**. You can access and manage storages from other runs using the same methods or endpoints as with storages from your current run. - -[Datasets](./dataset.md) and [key-value stores](./key_value_store.md) can be used concurrently by multiple Actors. This means that multiple Actors or tasks running at the same time can **write** data to a single dataset or key-value store. 
The same applies for reading data – multiple runs can **read** data from datasets and key-value stores concurrently. - -[Request queues](./request_queue.md), on the other hand, only allow multiple runs to **add new data**. A request queue can only be processed by one Actor or task run at any one time. - -> When multiple runs try to write data to a storage at the same time, it isn't possible to control the order in which the data will be written. It will be written whenever the request is processed.
-> In key-value stores and request queues, the same applies for deleting records: if a request to delete a record is made shortly before a request to read that same record, the second request will fail. - -## Deleting storages {#deleting-storages} - -Named storages are only removed when you request it. You can delete storages in the following ways. +The Apify platform provides three types of storage accessible both within our [Apify Console](https://console.apify.com/storage) and externally through our API ([REST API](/api/v2#/), [JavaScript Client](/api/client/js) or [Python Client](/api/client/python)) or SDKs ([JavaScript SDK](/sdk/js) or [Python SDK](/sdk/python)). + + + + + + -* [Apify Console](https://console.apify.com/storage) - using the **Actions** button in the store's detail page. -* [JavaScript SDK](/sdk/js) - using the `.drop()` method of the - [Dataset](/sdk/js/api/apify/class/Dataset#drop), - [Key-value store](/sdk/js/api/apify/class/KeyValueStore#drop), - or [Request queue](/sdk/js/api/apify/class/RequestQueue#drop) class. -* [Python SDK](/sdk/python) - using the `.drop()` method of the - [Dataset](/sdk/python/reference/class/Dataset#drop), - [Key-value store](/sdk/python/reference/class/KeyValueStore#drop), - or [Request queue](/sdk/python/reference/class/RequestQueue#drop) class. -* [JavaScript API client](/api/client/js) - using the `.delete()` method in the -[dataset](/api/client/js/reference/class/DatasetClient), -[key-value store](/api/client/js/reference/class/KeyValueStoreClient), -or [request queue](/api/client/js/reference/class/RequestQueueClient) clients. -* [Python API client](/api/client/python) - using the `.delete()` method in the -[dataset](/api/client/python#datasetclient), -[key-value store](/api/client/python/reference/class/KeyValueStoreClient), -or [request queue](/api/client/python/reference/class/RequestQueueClient) clients.
-* [API](/api/v2#/reference/key-value-stores/store-object/delete-store) using the - **Delete [store]** endpoint, where **[store]** is the type of storage you want to delete. diff --git a/sources/platform/storage/key_value_store.md b/sources/platform/storage/key_value_store.md index f2a5cf56b..0c6d84d2e 100644 --- a/sources/platform/storage/key_value_store.md +++ b/sources/platform/storage/key_value_store.md @@ -1,7 +1,7 @@ --- title: Key-value store description: Store anything from Actor or task run results, JSON documents, or images. Learn how to access and manage key-value stores from Apify Console or via API. -sidebar_position: 9.2 +sidebar_position: 9.3 slug: /storage/key-value-store --- @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -The key-value store is simple storage that can be used for storing any kind of data. It can be JSON or HTML documents, zip files, images, or simply strings. The data are stored along with their [MIME content type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types). +The key-value store is simple storage that can be used for storing any kind of data. It can be JSON or HTML documents, zip files, images, or strings. The data are stored along with their [MIME content type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types). Each Actor run is assigned its own key-value store when it is created. The store contains the Actor's input, and, if necessary, other data such as its output. @@ -26,42 +26,42 @@ Key-value stores are mutable–you can both add entries and delete them. ## Basic usage -There are five ways to access your key-value stores: +There are several ways to access your key-value stores: -* [Apify Console](https://console.apify.com/storage?tab=keyValueStores) - provides an easy-to-understand interface [[details](#apify-console)].
-* [JavaScript SDK](/sdk/js/docs/guides/result-storage#key-value-store) - when building your own JavaScript Actor [[details](#javascript-sdk)]. -* [Python SDK](sdk/python/docs/concepts/storages#working-with-key-value-stores) - when building your own Python Actor [[details](#python-sdk)]. -* [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient) - to access your key-value stores from any Node.js application [[details](#javascript-api-client)]. -* [Python API client](/api/client/python/reference/class/KeyValueStoreClient) - to access your key-value stores from any Python application [[details](#python-api-client)]. -* [Apify API](/api/v2#/reference/key-value-stores/get-items) - for accessing your key-value stores programmatically [[details](#apify-api)]. +* [Apify Console](https://console.apify.com/storage?tab=keyValueStores) - provides an easy-to-understand interface. +* [JavaScript SDK](/sdk/js/docs/guides/result-storage#key-value-store) - when building your own JavaScript Actor. +* [Python SDK](/sdk/python/docs/concepts/storages#working-with-key-value-stores) - when building your own Python Actor. +* [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient) - to access your key-value stores from any Node.js application. +* [Python API client](/api/client/python/reference/class/KeyValueStoreClient) - to access your key-value stores from any Python application. +* [Apify API](/api/v2#/reference/key-value-stores/get-items) - for accessing your key-value stores programmatically. ### Apify Console In [Apify Console](https://console.apify.com), you can view your key-value stores in the [Storage](https://console.apify.com/storage) section under the [Key-value stores](https://console.apify.com/storage?tab=keyValueStores) tab. -Only named key-value stores are displayed by default. Select the **Include unnamed key-value stores** checkbox to display all of your stores.
- ![Key-value stores in app](./images/key-value-stores-app.png) To view a key-value store's content, click on its **Store ID**. -Under the **Settings** tab, you can update the store's name (and, in turn, its [retention period](./index.md)) and [access rights](../collaboration/index.md). -Click on the `API` button to view and test a store's [API endpoints](/api/v2#/reference/key-value-stores). +Under the **Actions** menu, you can rename your store (and, in turn, extend its [retention period](./usage#named-and-unnamed-storages)) and grant [access rights](../collaboration/index.md) using the **Share** button. +Click on the **API** button to view and test a store's [API endpoints](/api/v2#/reference/key-value-stores). ![Key-value stores detail](./images/key-value-stores-detail.png) ### JavaScript SDK -If you are building a JavaScript [Actor](../actors/index.mdx), you will be using the [JavaScript SDK](/sdk/js/docs/guides/result-storage#key-value-store). The key-value store is represented by a [`KeyValueStore`](/sdk/js/reference/class/KeyValueStore) class. You can use the class to specify whether your data is stored locally or in the Apify cloud, and get and set values using the [`getValue()`](/sdk/js/reference/class/KeyValueStore#getValue) and [`setValue()`](/sdk/js/reference/class/KeyValueStore#setValue) methods respectively, or iterate over your key-value store keys using the [`forEachKey()`](/sdk/js/reference/class/KeyValueStore#forEachKey) method. +When working with a JavaScript [Actor](../actors/index.mdx), the [JavaScript SDK](/sdk/js/docs/guides/result-storage#key-value-store) is an essential tool, especially for key-value store management. The primary class for this purpose is the [`KeyValueStore`](/sdk/js/reference/class/KeyValueStore). This class allows you to decide whether your data will be stored locally or in the Apify cloud.
For data manipulation, it offers the [`getValue()`](/sdk/js/reference/class/KeyValueStore#getValue) and [`setValue()`](/sdk/js/reference/class/KeyValueStore#setValue) methods to retrieve and assign values, respectively. + +Additionally, you can iterate over the keys in your store using the [`forEachKey()`](/sdk/js/reference/class/KeyValueStore#forEachKey) method. -Each Actor run is associated with the default key-value store, which is created for the Actor run. When running your Actors and storing data locally, you can pass its [input](../actors/running/input_and_output.md) using the **INPUT.json** file in the default key-value store directory. +Every Actor run is linked to a default key-value store that is automatically created for that specific run. If you're running your Actors and opt to store data locally, you can easily supply the [input](../actors/running/input_and_output.md) by placing an _INPUT.json_ file in the corresponding directory of the default key-value store. This method ensures that your Actor has all the necessary data readily available for its execution. -You can find **INPUT.json** and other key-value store files in the location below. +You can find _INPUT.json_ and other key-value store files in the location below. ```text {APIFY_LOCAL_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` -The default key-value store's ID is **default**. The {KEY} is the record's **key** and {EXT} corresponds to the record value's MIME content type. +The default key-value store's ID is _default_. The `{KEY}` is the record's _key_ and `{EXT}` corresponds to the record value's MIME content type. To manage your key-value stores, you can use the following methods. See the `KeyValueStore` class's [API reference](/sdk/js/reference/class/KeyValueStore) for the full list.
@@ -113,23 +113,23 @@ await Actor.setValue( await Actor.exit(); ``` -The `Actor.getInput()` method is not only a shortcut to `Actor.getValue('INPUT')`; it is also compatible with `Actor.metamorph()` [[docs](../actors/development/programming_interface/metamorph.md)]. This is because a metamorphed Actor run's input is stored in the **INPUT-METAMORPH-1** key instead of **INPUT**, which hosts the original input. +The `Actor.getInput()` method is not only a shortcut to `Actor.getValue('INPUT')`; it is also compatible with [`Actor.metamorph()`](../actors/development/programming_interface/metamorph.md). This is because a metamorphed Actor run's input is stored in the _INPUT-METAMORPH-1_ key instead of _INPUT_, which hosts the original input. -See the [JavaScript SDK documentation](/sdk/js/docs/guides/result-storage#key-value-store) and the `KeyValueStore` class's [API reference](/sdk/js/reference/class/KeyValueStore) for details on managing your key-value stores with the JavaScript SDK. +Check out the [JavaScript SDK documentation](/sdk/js/docs/guides/result-storage#key-value-store) and the `KeyValueStore` class's [API reference](/sdk/js/reference/class/KeyValueStore) for details on managing your key-value stores with the JavaScript SDK. ### Python SDK -If you are building a Python [Actor](../actors/index.mdx), you will be using the [Python SDK](/sdk/python/docs/concepts/storages#working-with-key-value-stores). The key-value store is represented by a [`KeyValueStore`](/sdk/python/reference/class/KeyValueStore) class. You can use the `KeyValueStore` class to specify whether your data is stored locally or in the Apify cloud, and get and set values using the [`get_value()`](/sdk/python/reference/class/KeyValueStore#get_value) and [`set_value()`](/sdk/python/reference/class/KeyValueStore#set_value) methods respectively. +For Python [Actor](../actors/index.mdx), the [Python SDK](/sdk/python/docs/concepts/storages#working-with-key-value-stores) is essential. 
The key-value store is represented by a [`KeyValueStore`](/sdk/python/reference/class/KeyValueStore) class. You can use this class to specify whether your data is stored locally or in the Apify cloud. For further data manipulation, it offers the [`get_value()`](/sdk/python/reference/class/KeyValueStore#get_value) and [`set_value()`](/sdk/python/reference/class/KeyValueStore#set_value) methods to retrieve and assign values, respectively. -Each Actor run is associated with the default key-value store, which is created for the Actor run. When running your Actors and storing data locally, you can pass its [input](../actors/running/input_and_output.md) using the **INPUT.json** file in the default key-value store directory. +Every Actor run is linked to a default key-value store that is automatically created for that specific run. If you're running your Actors and opt to store data locally, you can easily supply the [input](../actors/running/input_and_output.md) by placing an _INPUT.json_ file in the corresponding directory of the default key-value store. This method ensures that your Actor has all the necessary data readily available for its execution. -You can find **INPUT.json** and other key-value store files in the location below. +You can find _INPUT.json_ and other key-value store files in the location below. ```text {APIFY_LOCAL_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` -The default key-value store's ID is **default**. The {KEY} is the record's **key** and {EXT} corresponds to the record value's MIME content type. +The default key-value store's ID is _default_. The `{KEY}` is the record's _key_ and `{EXT}` corresponds to the record value's MIME content type. To manage your key-value stores, you can use the following methods. See the `KeyValueStore` class [documentation](/sdk/python/reference/class/KeyValueStore) for the full list.
@@ -165,13 +165,13 @@ async def main(): await Actor.set_value(key='OUTPUT', value=image_buffer, content_type='image/jpeg') ``` -The `Actor.get_input()` method is not only a shortcut to `Actor.get_value('INPUT')`; it is also compatible with `Actor.metamorph()` [[docs](../actors/development/programming_interface/metamorph.md)]. This is because a metamorphed Actor run's input is stored in the **INPUT-METAMORPH-1** key instead of **INPUT**, which hosts the original input. +The `Actor.get_input()` method is not only a shortcut to `Actor.get_value('INPUT')`; it is also compatible with [`Actor.metamorph()`](../actors/development/programming_interface/metamorph.md). This is because a metamorphed Actor run's input is stored in the _INPUT-METAMORPH-1_ key instead of _INPUT_, which hosts the original input. -See the [Python SDK documentation](/sdk/python/docs/guides/result-storage#key-value-store) and the `KeyValueStore` class's [API reference](/sdk/python/reference/class/KeyValueStore) for details on managing your key-value stores with the Python SDK. +Check out the [Python SDK documentation](/sdk/python/docs/guides/result-storage#key-value-store) and the `KeyValueStore` class's [API reference](/sdk/python/reference/class/KeyValueStore) for details on managing your key-value stores with the Python SDK. ### JavaScript API client -Apify's [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient) (`apify-client`) allows you to access your key-value stores from any Node.js application, whether it is running on the Apify platform or elsewhere. +The Apify [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient) (`apify-client`) enables you to access your key-value stores from any Node.js application, whether hosted on the Apify platform or externally. After importing and initiating the client, you can save each key-value store to a variable for easier access. 
@@ -181,11 +181,11 @@ const myKeyValStoreClient = apifyClient.keyValueStore('jane-doe/my-key-val-store You can then use that variable to [access the key-value store's items and manage it](/api/client/js/reference/class/KeyValueStoreClient). -See the [JavaScript API client documentation](/api/client/js/reference/class/KeyValueStoreClient) for [help with setup](/api/client/js/docs) and more details. +Check out the [JavaScript API client documentation](/api/client/js/reference/class/KeyValueStoreClient) for [help with setup](/api/client/js/docs) and more details. ### Python API client -Apify's [Python API client](/api/client/python/reference/class/KeyValueStoreClient) (`apify-client`) allows you to access your key-value stores from any Python application, whether it is running on the Apify platform or elsewhere. +The Apify [Python API client](/api/client/python/reference/class/KeyValueStoreClient) (`apify-client`) allows you to access your key-value stores from any Python application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each key-value store to a variable for easier access. @@ -195,35 +195,35 @@ my_key_val_store_client = apify_client.key_value_store('jane-doe/my-key-val-stor You can then use that variable to [access the key-value store's items and manage it](/api/client/python/reference/class/KeyValueStoreClient). -See the [Python API client documentation](/api/client/python/reference/class/KeyValueStoreClient) for [help with setup](/api/client/python/docs/quick-start) and more details. +Check out the [Python API client documentation](/api/client/python/reference/class/KeyValueStoreClient) for [help with setup](/api/client/python/docs/quick-start) and more details. 
### Apify API -The [Apify API](/api/v2#/reference/key-value-stores) allows you to access your key-value stores programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results. +The [Apify API](/api/v2#/reference/key-value-stores) gives you programmatic access to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). -If you are accessing your datasets using the **username~store-name** [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) page of your Apify account. +If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of the **Settings** page of your Apify account. -> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#introduction/authentication)). +> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](../integrations/api.md#authentication)). -To **get a list of your key-value stores**, send a GET request to the [Get list of key-value stores](/api/v2#/reference/key-value-stores/store-collection/get-list-of-key-value-stores) endpoint. +To retrieve a list of your key-value stores, send a GET request to the [Get list of key-value stores](/api/v2#/reference/key-value-stores/store-collection/get-list-of-key-value-stores) endpoint. 
```text https://api.apify.com/v2/key-value-stores ``` -To **get information about a key-value store** such as its creation time and item count, send a GET request to the [Get store](/api/v2#/reference/key-value-stores/store-object/get-store) endpoint. +To get information about a key-value store such as its creation time and item count, send a GET request to the [Get store](/api/v2#/reference/key-value-stores/store-object/get-store) endpoint. ```text https://api.apify.com/v2/key-value-stores/{STORE_ID} ``` -To **get a record** (its value) from a key-value store, send a GET request to the [Get record](/api/v2#/reference/key-value-stores/key-collection/get-record) endpoint. +To get a record (its value) from a key-value store, send a GET request to the [Get record](/api/v2#/reference/key-value-stores/key-collection/get-record) endpoint. ```text https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} ``` -To **add a record** with a specific key in a key-value store, send a PUT request to the [Put record](/api/v2#/reference/key-value-stores/record/put-record) endpoint. +To add a record with a specific key in a key-value store, send a PUT request to the [Put record](/api/v2#/reference/key-value-stores/record/put-record) endpoint. ```text https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} @@ -238,29 +238,29 @@ Example payload: } ``` -To **delete a record**, send a DELETE request specifying the key from a key-value store to the [Delete record](/api/v2#/reference/key-value-stores/record/delete-record) endpoint. +To delete a record, send a DELETE request specifying the key from a key-value store to the [Delete record](/api/v2#/reference/key-value-stores/record/delete-record) endpoint. ```text https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} ``` -See the [API documentation](/api/v2#/reference/key-value-stores) for a detailed breakdown of each API endpoint. 
+For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/key-value-stores). ## Compression -In the past, every record uploaded using the [Put record](/api/v2#/reference/key-value-stores/record/put-record) endpoint was compressed using Gzip before uploading. This has changed. **Now, records are stored in the state you upload them. This means it is up to you if the record is stored compressed or uncompressed.** +Previously, when using the [Put record](/api/v2#/reference/key-value-stores/record/put-record) endpoint, every record was automatically compressed with Gzip before being uploaded. However, this process has been updated. _Now, records are stored exactly as you upload them._ This change means that it is up to you whether the record is stored compressed or uncompressed. You can compress a record and use the [Content-Encoding request header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding) to let our platform know which compression it uses. We recommend compressing large key-value records to save storage space and network traffic. -**If you use the [JavaScript SDK](/sdk/js/reference/class/KeyValueStore#setValue) or our [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient#setRecord), your files are compressed automatically by default.** We recommend using the JavaScript API client, which compresses your data before they are sent to our servers and decompresses them when you retrieve them. This makes your storage costs as low as possible. +_Using the [JavaScript SDK](/sdk/js/reference/class/KeyValueStore#setValue) or our [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient#setRecord) automatically compresses your files._ We advise utilizing the JavaScript API client for data compression prior to server upload and decompression upon retrieval, minimizing storage costs. 
## Sharing -You can invite other Apify users to view or modify your key-value stores with the [access rights](../collaboration/index.md) system. See the [full list of permissions](../collaboration/list_of_permissions.md). +You can grant [access rights](../collaboration/index.md) to your key-value store through the **Share** button under the **Actions** menu. For more details check the [full list of permissions](../collaboration/list_of_permissions.md). ### Sharing key-value stores between runs -You can access a key-value store from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its **name** or **ID**. +You can access a key-value store from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its _name_ or _ID_. To access a key-value store from another run using the [JavaScript SDK](/sdk/js) or the [Python SDK](/sdk/python), open it using the same method as you would do with any other store. @@ -293,26 +293,33 @@ async def main(): -In the [JavaScript API client](/api/client/js), you can access a store using [its client](/api/client/js/reference/class/KeyValueStoreClient). Once you've opened a store, read and manage its contents like you would do with a key-value store from your current run. +In the [JavaScript API client](/api/client/js/reference/class/KeyValueStoreClient) as well as in [Python API client](/api/client/python/reference/class/KeyValueStoreClient), you can access a store using its client. Once you've opened a store, read and manage its contents like you would do with a key-value store from your current run. + + + ```js const otherStoreClient = apifyClient.keyValueStore('jane-doe/old-store'); ``` -Likewise, in the [Python API client](/api/client/python), you can access a store using [its client](/api/client/python/reference/class/KeyValueStoreClient). 
+ + ```python other_store_client = apify_client.key_value_store('jane-doe/old-store') ``` + + + The same applies for the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you would normally do. -See the [Storage overview](/platform/storage#sharing-storages-between-runs) for details on sharing storages between runs. +Check out the [Storage overview](/platform/storage#sharing-storages-between-runs) for details on sharing storages between runs. ## Data consistency -Key-value storage uses the [AWS S3](https://aws.amazon.com/s3/) service. According to the [S3 documentation](https://aws.amazon.com/s3/consistency/), it provides **strong read-after-write** consistency. +Key-value storage uses the [AWS S3](https://aws.amazon.com/s3/) service. According to the [S3 documentation](https://aws.amazon.com/s3/consistency/), it provides _strong read-after-write_ consistency. ## Limits -* Key-value store names can be up to 63 characters long. +* The maximum length for key-value store names is 63 characters. diff --git a/sources/platform/storage/request_queue.md b/sources/platform/storage/request_queue.md index dd560f85d..dea335944 100644 --- a/sources/platform/storage/request_queue.md +++ b/sources/platform/storage/request_queue.md @@ -1,7 +1,7 @@ --- title: Request queue description: Queue URLs for an Actor to visit in its run. Learn how to share your queues between Actor runs. Access and manage request queues from Apify Console or via API. -sidebar_position: 9.3 +sidebar_position: 9.4 slug: /storage/request-queue --- @@ -14,9 +14,9 @@ import TabItem from '@theme/TabItem'; --- -Request queues enable you to enqueue and retrieve requests such as URLs with an [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and other parameters. They are useful not only in web crawling, but anywhere you need to process a high number of URLs and enqueue new links. 
+Request queues enable you to enqueue and retrieve requests such as URLs with an [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and other parameters. They prove essential not only in web crawling scenarios but also in any situation requiring the management of a large number of URLs and the addition of new links. -Request queue storage supports both breadth-first and depth-first crawling orders, as well as custom data attributes. It allows you to query whether specific URLs were already found, push new URLs to the queue and fetch the next URLs to process. +The storage system for request queues accommodates both breadth-first and depth-first crawling strategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs for processing. > Named request queues are retained indefinitely.
> Unnamed request queues expire after 7 days unless otherwise specified.
@@ -24,35 +24,35 @@ Request queue storage supports both breadth-first and depth-first crawling order ## Basic usage {#basic-usage} -There are five ways to access your request queues: +There are several ways to access your request queues: -* [Apify Console](https://console.apify.com/storage?tab=requestQueues) - provides an easy-to-understand interface [[details](#apify-console)]. -* [JavaScript SDK](/sdk/js/docs/guides/result-storage#request-queue) - when building your own JavaScript Actor [[details](#javascript-sdk)]. -* [Python SDK](/sdk/python/docs/concepts/storages#working-with-request-queues) - when building your own Python Actor [[details](#python-sdk)]. -* [JavaScript API client](/api/client/js/reference/class/RequestQueueClient) - to access your request queues from any Node.js application [[details](#javascript-api-client)]. -* [Python API client](/api/client/python/reference/class/RequestQueueClient) - to access your request queues from any Python application [[details](#python-api-client)]. -* [Apify API](/api/v2#/reference/request-queues) - for accessing your request queues programmatically [[details](#apify-api)]. +* [Apify Console](https://console.apify.com/storage?tab=requestQueues) - provides an easy-to-understand interface. +* [JavaScript SDK](/sdk/js/docs/guides/result-storage#request-queue) - when building your own JavaScript Actor. +* [Python SDK](/sdk/python/docs/concepts/storages#working-with-request-queues) - when building your own Python Actor. +* [JavaScript API client](/api/client/js/reference/class/RequestQueueClient) - to access your request queues from any Node.js application. +* [Python API client](/api/client/python/reference/class/RequestQueueClient) - to access your request queues from any Python application. +* [Apify API](/api/v2#/reference/request-queues) - for accessing your request queues programmatically. 
### Apify Console {#apify-console} -In [Apify Console](https://console.apify.com), you can view your request queues in the [Storage](https://console.apify.com/storage) section under the [Request queues](https://console.apify.com/storage?tab=requestQueues) tab. - -Only named request queues are displayed by default. Select the **Include unnamed request queues** checkbox to display all of your queues. +In the [Apify Console](https://console.apify.com), you can view your request queues in the [Storage](https://console.apify.com/storage) section under the [Request queues](https://console.apify.com/storage?tab=requestQueues) tab. ![Request queues in app](./images/request-queue-app.png) To view a request queue, click on its **Queue ID**. -Under the **Settings** tab, you can update the queue's name (and, in turn, its -[retention period](./index.md)) and [access rights](../collaboration/index.md). -Click on the `API` button to view and test a queue's [API endpoints](/api/v2#/reference/request-queues). +Under the **Actions** menu, you can rename your queue (and, in turn, change its +[retention period](./usage#named-and-unnamed-storages)) and manage its [access rights](../collaboration/index.md) using the **Share** button. +Click on the **API** button to view and test a queue's [API endpoints](/api/v2#/reference/request-queues). ![Request queues detail](./images/request-queue-detail.png) ### JavaScript SDK {#javascript-sdk} +When working with a JavaScript [Actor](../actors/index.mdx), the [JavaScript SDK](/sdk/js/docs/guides/request-storage#request-queue) is an essential tool, especially for request queue management. The primary class for this purpose is the [`RequestQueue`](/sdk/js/reference/class/RequestQueue) class. Use this class to decide whether your data is stored locally or in the Apify cloud. + If you are building a JavaScript [Actor](../actors/index.mdx), you will be using the [JavaScript SDK](/sdk/js/docs/guides/request-storage#request-queue). 
The request queue is represented by a [`RequestQueue`](/sdk/js/reference/class/RequestQueue) class. You can use the class to specify whether your data is stored locally or in the Apify cloud and [enqueue new URLs](/sdk/js/reference/class/RequestQueue#addRequests). -Each Actor run is associated with the default request queue, which is created for the Actor run when the first request is added to it. Typically, it is used to store URLs to crawl in the specific Actor run, however its usage is optional. You can also create **named queues** which can be shared between Actors or between Actor runs. +Every Actor run is automatically linked with a default request queue, initiated upon adding the first request. This queue is primarily utilized for storing URLs to be crawled during the particular Actor run, though its use is not mandatory. For enhanced flexibility, you can establish named queues. These named queues offer the advantage of being shareable across different Actors or various Actor runs, facilitating a more interconnected and efficient process. If you are storing your data locally, you can find your request queue at the following location. @@ -60,9 +60,9 @@ If you are storing your data locally, you can find your request queue at the fol {APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/{ID}.json ``` -The default request queue's ID is **default**. Each request in the queue is stored as a separate JSON file, where {ID} is a request ID. +The default request queue's ID is _default_. Each request in the queue is stored as a separate JSON file, where `{ID}` is a request ID. -To **open a request queue**, use the [`Actor.openRequestQueue()`](/sdk/js/reference/class/Actor#openRequestQueue) method. +To open a request queue, use the [`Actor.openRequestQueue()`](/sdk/js/reference/class/Actor#openRequestQueue) method. 
```js // Import the JavaScript SDK into your project @@ -82,7 +82,7 @@ const queueWithName = await Actor.openRequestQueue('my-queue'); await Actor.exit(); ``` -Once a queue is open, you can manage it using the following methods. See the `RequestQueue` class's [API reference](/sdk/js/reference/class/RequestQueue) for the full list. +Once a queue is open, you can manage it using the following methods. Check out the `RequestQueue` class's [API reference](/sdk/js/reference/class/RequestQueue) for the full list. ```js // Import the JavaScript SDK into your project @@ -118,13 +118,13 @@ await queue.drop(); await Actor.exit(); ``` -See the [JavaScript SDK documentation](/sdk/js/docs/guides/request-storage#request-queue) and the `RequestQueue` class's [API reference](/sdk/js/reference/class/RequestQueue) for details on managing your request queues with the JavaScript SDK. +Check out the [JavaScript SDK documentation](/sdk/js/docs/guides/request-storage#request-queue) and the `RequestQueue` class's [API reference](/sdk/js/reference/class/RequestQueue) for details on managing your request queues with the JavaScript SDK. ### Python SDK {#python-sdk} -If you are building a Python [Actor](../actors/index.mdx), you will be using the [Python SDK](/sdk/python/docs/concepts/storages#working-with-request-queues). The request queue is represented by a [`RequestQueue`](/sdk/python/reference/class/RequestQueue) class. You can use the class to specify whether your data is stored locally or in the Apify cloud and [enqueue new URLs](/sdk/python/reference/class/RequestQueue#add_requests). +For Python [Actor](../actors/index.mdx) development, the [Python SDK](/sdk/python/docs/concepts/storages#working-with-request-queues) is essential. The request queue is represented by the [`RequestQueue`](/sdk/python/reference/class/RequestQueue) class. Utilize this class to determine whether your data is stored locally or in the Apify cloud. 
For managing your data, it provides the capability to [enqueue new URLs](/sdk/python/reference/class/RequestQueue#add_requests), facilitating seamless integration and operation within your Actor. -Each Actor run is associated with the default request queue, which is created for the Actor run when the first request is added to it. Typically, it is used to store URLs to crawl in the specific Actor run, however its usage is optional. You can also create **named queues** which can be shared between Actors or between Actor runs. +Every Actor run is automatically connected to a default request queue, established specifically for that run upon the addition of the first request. If you choose to use this queue, it typically serves to store URLs for crawling in the respective Actor run, though its use is not mandatory. To extend functionality, you have the option to create named queues, which offer the flexibility to be shared among different Actors or across multiple Actor runs. If you are storing your data locally, you can find your request queue at the following location. @@ -132,9 +132,9 @@ If you are storing your data locally, you can find your request queue at the fol {APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/{ID}.json ``` -The default request queue's ID is **default**. Each request in the queue is stored as a separate JSON file, where {ID} is a request ID. +The default request queue's ID is _default_. Each request in the queue is stored as a separate JSON file, where `{ID}` is a request ID. -To **open a request queue**, use the [`Actor.open_request_queue()`](/sdk/python/reference/class/Actor#open_request_queue) method. +To _open a request queue_, use the [`Actor.open_request_queue()`](/sdk/python/reference/class/Actor#open_request_queue) method. 
```python from apify import Actor @@ -179,11 +179,11 @@ async def main(): await queue.drop() ``` -See the [Python SDK documentation](/sdk/python/docs/guides/request-storage#request-queue) and the `RequestQueue` class's [API reference](/sdk/python/reference/class/RequestQueue) for details on managing your request queues with the Python SDK. +Check out the [Python SDK documentation](/sdk/python/docs/guides/request-storage#request-queue) and the `RequestQueue` class's [API reference](/sdk/python/reference/class/RequestQueue) for details on managing your request queues with the Python SDK. ### JavaScript API client {#javascript-api-client} -Apify's [JavaScript API client](/api/client/js/reference/class/RequestQueueClient) (`apify-client`) allows you to access your request queues from any Node.js application, whether it is running on the Apify platform or elsewhere. +The Apify [JavaScript API client](/api/client/js/reference/class/RequestQueueClient) (`apify-client`) enables you to access your request queues from any Node.js application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each request queue to a variable for easier access. @@ -193,11 +193,11 @@ const myQueueClient = apifyClient.requestQueue('jane-doe/my-request-queue'); You can then use that variable to [access the request queue's items and manage it](/api/client/js/reference/class/RequestQueueClient). -See the [JavaScript API client documentation](/api/client/js/reference/class/RequestQueueClient) for [help with setup](/api/client/js/docs) and more details. +Check out the [JavaScript API client documentation](/api/client/js/reference/class/RequestQueueClient) for [help with setup](/api/client/js/docs) and more details. 
### Python API client {#python-api-client} -Apify's [Python API client](/api/client/python) (`apify-client`) allows you to access your request queues from any Python application, whether it is running on the Apify platform or elsewhere. +The Apify [Python API client](/api/client/python) (`apify-client`) allows you to access your request queues from any Python application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each request queue to a variable for easier access. @@ -207,35 +207,35 @@ my_queue_client = apify_client.request_queue('jane-doe/my-request-queue') You can then use that variable to [access the request queue's items and manage it](/api/client/python/reference/class/RequestQueueClient). -See the [Python API client documentation](/api/client/python/reference/class/RequestQueueClient) for [help with setup](/api/client/python/docs/quick-start) and more details. +Check out the [Python API client documentation](/api/client/python/reference/class/RequestQueueClient) for [help with setup](/api/client/python/docs/quick-start) and more details. ### Apify API {#apify-api} -The [Apify API](/api/v2#/reference/request-queues) allows you to access your request queues programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). +The [Apify API](/api/v2#/reference/request-queues) allows you programmatic access to your request queues using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). -If you are accessing your datasets using the **username~store-name** [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) page of your Apify account. 
+If you are accessing your datasets using the `username~store-name` [store ID format](./index.md), you will need to use your [secret API token](../integrations/index.mdx#api-token). You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) page of your Apify account. -> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#introduction/authentication)). +> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](../integrations/api.md#authentication)). -To **get a list of your request queues**, send a GET request to the [Get list of request queues](/api/v2#/reference/request-queues/store-collection/get-list-of-request-queues) endpoint. +To get a list of your request queues, send a GET request to the [Get list of request queues](/api/v2#/reference/request-queues/store-collection/get-list-of-request-queues) endpoint. ```text https://api.apify.com/v2/request-queues ``` -To **get information about a request queue** such as its creation time and item count, send a GET request to the [Get request queue](/api/v2#/reference/request-queues/queue/get-request-queue) endpoint. +To get information about a request queue such as its creation time and item count, send a GET request to the [Get request queue](/api/v2#/reference/request-queues/queue/get-request-queue) endpoint. ```text https://api.apify.com/v2/request-queues/{QUEUE_ID} ``` -To **get a request from a queue**, send a GET request to the [Get request](/api/v2#/reference/request-queues/request/get-request) endpoint. +To get a request from a queue, send a GET request to the [Get request](/api/v2#/reference/request-queues/request/get-request) endpoint. 
```text https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests/{REQUEST_ID} ``` -To **add a request to a queue**, send a POST request with the request to be added as a JSON object in the request's payload to the [Add request](/api/v2#/reference/request-queues/request-collection/add-request) endpoint. +To add a request to a queue, send a POST request with the request to be added as a JSON object in the request's payload to the [Add request](/api/v2#/reference/request-queues/request-collection/add-request) endpoint. ```text https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests @@ -251,7 +251,7 @@ Example payload: } ``` -To **update a request in a queue**, send a PUT request with the request to update as a JSON object in the request's payload to the [Update request](/api/v2#/reference/request-queues/request/update-request) endpoint. In the payload, specify the request's ID and add the information you want to update. +To update a request in a queue, send a PUT request with the request to update as a JSON object in the request's payload to the [Update request](/api/v2#/reference/request-queues/request/update-request) endpoint. In the payload, specify the request's ID and add the information you want to update. ```text https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests/{REQUEST_ID} @@ -272,15 +272,15 @@ Example payload: > > Example: `client-abc` -See the [API documentation](/api/v2#/reference/request-queues) for a detailed breakdown of each API endpoint. +For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/request-queues). ## Sharing {#sharing} -You can invite other Apify users to view or modify your request queues with the [access rights](../collaboration/index.md) system. See the [full list of permissions](../collaboration/list_of_permissions.md). +You can grant [access rights](../collaboration/index.md) to your request queue through the **Share** button under the **Actions** menu. 
For more details check the [full list of permissions](../collaboration/list_of_permissions.md). ### Sharing request queues between runs {#sharing-request-queues-between-runs} -You can access a request queue from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its **name** or **ID**. +You can access a request queue from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run as long as you know its _name_ or _ID_. To access a request queue from another run using the [JavaScript SDK](/sdk/js) or the [Python SDK](/sdk/python), open it using the same method like you would do with any other request queue. @@ -313,27 +313,34 @@ async def main(): -In the [JavaScript API client](/api/client/js), you can access a request queue using [its client](/api/client/js/reference/class/RequestQueueClient). Once you've opened the request queue, you can use it in your crawler or add new requests like you would do with a queue from your current run. +In the [JavaScript API client](/api/client/js/reference/class/RequestQueueClient) as well as in [Python API client](/api/client/python/reference/class/RequestQueueClient), you can access a request queue using its respective client. Once you've opened the request queue, you can use it in your crawler or add new requests like you would do with a queue from your current run. + + + ```js const otherQueueClient = apifyClient.requestQueue('jane-doe/old-queue'); ``` -Likewise, in the [Python API client](/api/client/python), you can access a request queue using [its client](/api/client/python/reference/class/RequestQueueClient). + + ```python other_queue_client = apify_client.request_queue('jane-doe/old-queue') ``` + + + The same applies for the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you would normally do. -See the [Storage overview](/platform/storage#sharing-storages-between-runs) for details on sharing storages between runs. 
+Check out the [Storage overview](/platform/storage#sharing-storages-between-runs) for details on sharing storages between runs. ## Limits {#limits} -* While multiple Actor or task runs can **add new requests** to a queue concurrently, only one run can **process a queue** at any one time. +* While multiple Actor or task runs can _add new requests_ to a queue concurrently, only one run can _process a queue_ at any one time. -* Request queue names can be up to 63 characters long. +* The maximum length for request queue names is 63 characters. ### Rate limiting {#rate-limiting} @@ -342,8 +349,8 @@ CRUD ([add](/api/v2#/reference/request-queues/request-collection/add-request), [get](/api/v2#/reference/request-queues/request-collection/get-request), [update](/api/v2#/reference/request-queues/request-collection/update-request), [delete](/api/v2#/reference/request-queues/request-collection/delete-request)) -operation requests are limited to **200** per second per request queue. This helps protect Apify servers from being overloaded. +operation requests are limited to _200 requests per second_ per request queue. This helps protect Apify servers from being overloaded. -All other request queue API [endpoints](/api/v2#/reference/request-queues) are limited to _30 requests per second_ per request queue. -See the [API documentation](/api/v2#/introduction/rate-limiting) for details and to learn what to do if you exceed the rate limit. +Check out the [API documentation](/api/v2#/introduction/rate-limiting) for more information and guidance on actions to take if you exceed these rate limits. 
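Reviewer note (not part of the patch): the rate-limiting paragraph in `request_queue.md` points readers to the API documentation for what to do when a limit is exceeded. A common way to handle this is to retry with exponential backoff. The sketch below is only an illustration under stated assumptions: `RateLimitError`, `call_with_backoff`, and the `send` callable are hypothetical names, not part of any Apify SDK or client, and the caller is assumed to raise `RateLimitError` when the API answers with HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by `send` when the API answers HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `send()` and retry with exponential backoff while it is rate limited.

    `send` is any zero-argument callable that performs one API request and
    raises RateLimitError when the request was rejected for rate limiting.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            # Give up after the last allowed attempt.
            if attempt == max_attempts - 1:
                raise
            # Double the wait on each retry and add random jitter so that
            # parallel clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting. The official API clients may already retry rate-limited requests on their own, so a helper like this is mainly relevant when calling the HTTP endpoints directly.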
diff --git a/sources/platform/storage/usage.md b/sources/platform/storage/usage.md new file mode 100644 index 000000000..f42d18f97 --- /dev/null +++ b/sources/platform/storage/usage.md @@ -0,0 +1,196 @@ +--- +title: Usage +description: Learn how to effectively use Apify's storage options. Understand key aspects of data retention, rate limiting, and secure sharing. +sidebar_position: 9.1 +category: platform +slug: /storage/usage +--- + +**Learn how to effectively use Apify's storage options. Understand key aspects of data retention, rate limiting, and secure sharing.** + +--- + +## Dataset {#dataset} + +[Dataset](./dataset.md) storage allows you to store a series of data objects, such as results from web scraping, crawling, or data processing jobs. You can export your datasets in JSON, CSV, XML, RSS, Excel, or HTML formats. + +![Dataset graphic](../images/datasets-overview.png) + +## Key-value store {#key-value-store} + +The [key-value store](./key_value_store.md) is ideal for saving data records such as files, screenshots of web pages, and PDFs or for persisting your Actor's state. The records are accessible under a unique name and can be written and read quickly. + +![Key-value store graphic](../images/key-value-overview.svg) + + +## Request queue {#request-queue} + +[Request queues](./request_queue.md) allow you to dynamically maintain a queue of URLs of web pages. You can use this when recursively crawling websites: you start from initial URLs and add new links as they are found while skipping duplicates. + +![Request queue graphic](../images/request-queue-overview.svg) + +## Basic usage {#basic-usage} + +There are several ways to access your storage: + +* [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface. +* [JavaScript SDK](/sdk/js) - when building your own JavaScript Actor. +* [Python SDK](/sdk/python) - when building your own Python Actor. 
+* [JavaScript API client](/api/client/js) - to access your storages from any Node.js application. +* [Python API client](/api/client/python) - to access your storages from any Python application. +* [Apify API](/api/v2#/reference/key-value-stores) - to access your storages programmatically. + +### Apify Console {#apify-console} + +To access your storages via Apify Console, navigate to the [**Storage**](https://console.apify.com/storage) section in the left-side menu. From there, you can click through the tabs to view your key-value stores, datasets, and request queues, and you can click on the **API** button in the top right corner to view related API endpoints. To view a storage, click its **ID**. + +![Storages in app](./images/datasets-app.png) + +> Use the **Include unnamed storages** checkbox to either display or hide unnamed storages. By default Apify Console will display them. + +You can edit your store's name by clicking on the **Actions** menu and selecting **Rename**. + +Additionally, you can quickly share the contents and details of your storage by selecting **Share** under the **Actions** menu and providing either email, username or user ID. + +![Storage API](./images/overview-api.png) + +These URLs link to API _endpoints_—the places where your data is stored. Endpoints that allow you to _read_ stored information do not require an [authentication token](/api/v2#/introduction/authentication). Calls are authenticated using a hard-to-guess ID, allowing for secure sharing. However, operations such as _update_ or _delete_ require the authentication token. + +> Never share a URL containing your authentication token, to avoid compromising your account's security.
+> If the data you want to share requires a token, first download the data, then share it as a file. + +### JavaScript SDK {#javascript-sdk} + +The Apify [JavaScript SDK](https://github.com/apify/apify-sdk-js) is a JavaScript/Node.js library that provides tools for building your own Actors. Requires [Node.js](https://nodejs.org/en/) 16 or later. + +### Python SDK {#python-sdk} + +The Apify [Python SDK](https://github.com/apify/apify-sdk-python) is a Python library providing tools to build your own Actors. Requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above. + +### JavaScript API client {#javascript-api-client} + +The Apify [JavaScript API client](https://github.com/apify/apify-client-js) (`apify-client`) allows you to access your datasets from any Node.js application, whether it is running on the Apify platform or externally. + +Go to the [client's documentation](/api/client/js/docs) for help with setup. + +### Python API client {#python-api-client} + +The Apify [Python API client](https://github.com/apify/apify-client-python) (`apify-client`) allows you to access your datasets from any Python application, whether it is running on the Apify platform or externally. Requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above. + +Go to the [client's documentation](/api/client/python/docs/quick-start) for help with setup. + +### Apify API {#apify-api} + +The [Apify API](/api/v2#/reference/key-value-stores) allows you to access your storages programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results. + +In most cases, when accessing your storages via API, you will need to provide a `store ID`, which you can do in the following formats: + +* `WkzbQMuFYuamGv3YF` - the store's alphanumerical ID if the store is unnamed. +* `~store-name` - the store's name prefixed with tilde (`~`) character if the store is named (e.g. 
`~ecommerce-scraping-results`). +* `username~store-name` - username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. `janedoe~ecommerce-scraping-results`). Note that in this case, the store's owner needs to grant you access first. + +For read (GET) requests, it is enough to use a store's alphanumerical ID, since the ID is hard to guess and effectively serves as an authentication key. + +With other request types and when using the `username~store-name` format, however, you will need to provide your secret API token in your request's [`Authorization`](/api/v2#/introduction/authentication) header or as a query parameter. You can find your token on the [Integrations](https://console.apify.com/account?tab=integrations) page of your Apify account. + +For further details and a breakdown of each storage API endpoint, refer to the [API documentation](/api/v2#/reference/datasets). + +## Rate limiting {#rate-limiting} + +All API endpoints limit their rate of requests to protect Apify servers from overloading. The default rate limit for storage objects is _30 requests per second_. However, there are exceptions limited to _200 requests per second_ per storage object, including: + +* [Push items](/api/v2#/reference/datasets/item-collection/put-items) to a dataset. +* CRUD ([add](/api/v2#/reference/request-queues/request-collection/add-request), +[get](/api/v2#/reference/request-queues/request-collection/get-request), +[update](/api/v2#/reference/request-queues/request-collection/update-request), +[delete](/api/v2#/reference/request-queues/request-collection/delete-request)) +operations of _request queue_ requests. + +If a client exceeds this limit, the API endpoint responds with the HTTP status code `429 Too Many Requests` and the following body: + +```json +{ + "error": { + "type": "rate-limit-exceeded", + "message": "You have exceeded the rate limit of ... 
requests per second" + } +} +``` + +Go to the [API documentation](/api/v2#/introduction/rate-limiting) for details and to learn what to do if you exceed the rate limit. + +## Data retention {#data-retention} + +Named datasets are retained indefinitely. +Unnamed datasets expire after 7 days unless otherwise specified. + +### Preserving your storages {#preserving-storages} + +To ensure indefinite retention of your storages, assign them a name. This can be done via Apify Console or through our API. First, you'll need your store's ID. You can find it in the details of the run that created it. In Apify Console, head over to your run's details and select the **Dataset**, **Key-value store**, or **Request queue** tab as appropriate. Check that store's details, and you will find its ID among them. + +![Finding your store's ID](./images/find-store-id.png) + +Find and open your storage by clicking the ID, click on the **Actions** menu, choose **Rename**, and enter its new name in the field. Your storage will now be preserved indefinitely. + +To name your storage via API, get its ID from the run that generated it using the [Get run](/api/v2#/reference/actor-runs/run-object-and-its-storages/get-run) endpoint. You can then give it a new name using the `Update [storage]` endpoint, for example, [Update dataset](/api/v2#/reference/datasets/dataset/update-dataset). + +Our SDKs and clients each have unique naming conventions for storages. For more information, check out their documentation: + +SDKs: + +* [JavaScript](/sdk/js) +* [Python](/sdk/python) + +Clients: + +* [JavaScript](/api/client/js/) +* [Python](/api/client/python/) + +## Named and unnamed storages {#named-and-unnamed-storages} + +The default storages for an Actor run are unnamed, identified only by an _ID_. This allows them to expire after 7 days (or longer on paid plans), conserving your storage space. If you want to preserve a storage, [assign it a name](#preserving-storages), and it will be retained indefinitely.
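
As a rough illustration of naming a storage through the API, the sketch below builds (without sending) a request to the Update dataset endpoint. The base URL `https://api.apify.com/v2`, the `PUT` method, the `{"name": ...}` body shape, and the `Bearer` token format are assumptions not spelled out in the text above, and `build_rename_dataset_request` is a hypothetical helper, not part of any Apify library.

```python
import json
import urllib.request

# Assumption: base URL of the Apify API (not stated in this document).
API_BASE_URL = "https://api.apify.com/v2"
# Storage names can be up to 63 characters long.
MAX_NAME_LENGTH = 63


def build_rename_dataset_request(
    dataset_id: str, new_name: str, token: str
) -> urllib.request.Request:
    """Build, but do not send, an Update dataset request (hypothetical helper)."""
    if not new_name or len(new_name) > MAX_NAME_LENGTH:
        raise ValueError(f"Name must be 1 to {MAX_NAME_LENGTH} characters long")
    # Assumption: the endpoint accepts a JSON body with a "name" field.
    body = json.dumps({"name": new_name}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE_URL}/datasets/{dataset_id}",
        data=body,
        method="PUT",  # assumption: Update dataset uses PUT
        headers={
            "Content-Type": "application/json",
            # Update operations require the authentication token.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_rename_dataset_request(
    "WkzbQMuFYuamGv3YF", "ecommerce-scraping-results", "my-secret-token"
)
print(request.get_method(), request.full_url)
```

To actually send the request, you could pass it to `urllib.request.urlopen(request)`. Keeping the builder separate from the network call makes it easy to inspect the URL and headers first.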
+ +> Storage names can be up to 63 characters long. + +Named and unnamed storages are identical in all aspects except for their retention period. The key advantage of named storages is that they make it easier to identify and verify the correct store. + +For example, storage names `janedoe~my-storage-1` and `janedoe~web-scrape-results` are easier to tell apart than the alphanumerical IDs `cAbcYOfuXemTPwnIB` and `CAbcsuZbp7JHzkw1B`. + +## Sharing {#sharing} + +You can grant [access rights](../collaboration/index.md) to other Apify users to view or modify your storages. Check the [full list of permissions](../collaboration/list_of_permissions.md). + +### Sharing storages between runs {#sharing-storages-between-runs} + +A storage can be accessed from any [Actor](../actors/index.mdx) or [task](../actors/running/tasks.md) run, provided you have its _name_ or _ID_. You can access and manage storages from other runs using the same methods or endpoints as with storages from your current run. + +[Datasets](./dataset.md) and [key-value stores](./key_value_store.md) support concurrent use by multiple Actors. Thus, several Actors or tasks can simultaneously write data to a single dataset or key-value store. Similarly, multiple runs can read data from datasets and key-value stores at the same time. + +[Request queues](./request_queue.md), on the other hand, only allow multiple runs to add new data. A request queue can only be processed by one Actor or task run at any one time. + +> When multiple runs try to write data to a storage simultaneously, the order of data writing cannot be controlled. Data is written as each request is processed.
+> A similar principle applies to key-value stores and request queues: when a delete request for a record precedes a read request for the same record, the read request will fail. + +## Deleting storages {#deleting-storages} + +Named storages are only removed upon your request.
+You can delete storages in the following ways: + +* [Apify Console](https://console.apify.com/storage) - using the **Actions** button on the store's detail page. +* [JavaScript SDK](/sdk/js) - using the `.drop()` method of the + [Dataset](/sdk/js/api/apify/class/Dataset#drop), + [Key-value store](/sdk/js/api/apify/class/KeyValueStore#drop), + or [Request queue](/sdk/js/api/apify/class/RequestQueue#drop) class. +* [Python SDK](/sdk/python) - using the `.drop()` method of the + [Dataset](/sdk/python/reference/class/Dataset#drop), + [Key-value store](/sdk/python/reference/class/KeyValueStore#drop), + or [Request queue](/sdk/python/reference/class/RequestQueue#drop) class. +* [JavaScript API client](/api/client/js) - using the `.delete()` method in the +[dataset](/api/client/js/reference/class/DatasetClient), +[key-value store](/api/client/js/reference/class/KeyValueStoreClient), +or [request queue](/api/client/js/reference/class/RequestQueueClient) clients. +* [Python API client](/api/client/python) - using the `.delete()` method in the +[dataset](/api/client/python#datasetclient), +[key-value store](/api/client/python/reference/class/KeyValueStoreClient), +or [request queue](/api/client/python/reference/class/RequestQueueClient) clients. +* [API](/api/v2#/reference/key-value-stores/store-object/delete-store) - using the `Delete [store]` endpoint, where `[store]` is the type of storage you want to delete.
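
The last option in the list above can be sketched as a small request builder that maps each storage type to its `Delete [store]` endpoint. This is only an illustration under stated assumptions: the base URL `https://api.apify.com/v2`, the path segments `datasets`, `key-value-stores`, and `request-queues`, and the `Bearer` token format are assumed here, and `build_delete_storage_request` is a hypothetical helper rather than part of any Apify library.

```python
import urllib.request

# Assumption: base URL of the Apify API (not stated in this document).
API_BASE_URL = "https://api.apify.com/v2"

# Assumption: URL path segment used for each storage type.
STORAGE_PATHS = {
    "dataset": "datasets",
    "key-value-store": "key-value-stores",
    "request-queue": "request-queues",
}


def build_delete_storage_request(
    storage_type: str, store_id: str, token: str
) -> urllib.request.Request:
    """Build, but do not send, a Delete [store] request (hypothetical helper)."""
    if storage_type not in STORAGE_PATHS:
        raise ValueError(f"Unknown storage type: {storage_type!r}")
    return urllib.request.Request(
        url=f"{API_BASE_URL}/{STORAGE_PATHS[storage_type]}/{store_id}",
        method="DELETE",
        # Delete operations require the authentication token.
        headers={"Authorization": f"Bearer {token}"},
    )


request = build_delete_storage_request(
    "key-value-store", "WkzbQMuFYuamGv3YF", "my-secret-token"
)
print(request.get_method(), request.full_url)
```

Nothing is deleted until the request is sent, for example with `urllib.request.urlopen(request)`, so the built request can be reviewed before it is executed.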