
docs: update usage and resources #886

Merged: 10 commits into master on Mar 29, 2024

Conversation

@TC-MO (Contributor) commented Mar 18, 2024

rewrites for clarity
standardized references
removed blockquotes
added docusaurus style admonitions

TC-MO added 2 commits March 18, 2024 08:08
rewrites for clarity
standardized references
removed blockquotes
added docusaurus style admonitions
@github-actions github-actions bot added the t-docs Issues owned by technical writing team. label Mar 18, 2024
@TC-MO TC-MO requested a review from valekjo March 18, 2024 13:11
@valekjo (Member) left a comment:

Added some comments :)

Maybe it would be worth splitting the page in two?

- Usage & resources
- Optimization & tips?

sources/platform/actors/running/usage_and_resources.md (2 outdated threads, resolved)

## Requirements

- Actors built on top of the [Apify JS SDK](/sdk/js) and [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can with the memory they have allocated. So, if you allocate 2 times more memory, the run should be 2 times faster and consume the same amount of compute units (1 * 1 = 0.5 * 2). Autoscaling for Python is not yet available, but it is planned for the near future.
+ Actors built with [Apify JS SDK](/sdk/js) and [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can based on the allocated memory. So, if you double the allocated memory, the run should be twice as fast and consume the same amount of compute units (1 * 1 = 0.5 * 2).
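As a quick sanity check of the memory/time trade-off described above, here is a back-of-the-envelope sketch. It assumes, as the surrounding text implies, that one compute unit corresponds to 1 GB of memory allocated for one hour; `computeUnits` is an illustrative helper, not part of any Apify SDK.

```javascript
// Rough sketch: 1 compute unit (CU) ~= 1 GB of memory allocated for 1 hour.
// computeUnits() is an illustrative helper, not an Apify SDK function.
function computeUnits(memoryGB, hours) {
  return memoryGB * hours;
}

// With autoscaling, doubling the memory should roughly halve the run time,
// so the consumed compute units stay the same (1 * 1 === 2 * 0.5):
console.log(computeUnits(1, 1));   // 1 GB for a full hour
console.log(computeUnits(2, 0.5)); // 2 GB for half an hour
```

This is exactly the `1 * 1 = 0.5 * 2` equality the paragraph refers to.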
Member:

This should be checked with someone from delivery / tooling. Not sure if this is 100% true, as it contradicts a bit the cheerio section later (with mention of node threads)

@TC-MO (Contributor, Author):

@B4nan could we get some input here and in other issues mentioned by @valekjo

@B4nan (Member):

it feels correct. 4g memory = 1cpu core, so the note about max memory of 4g for cheerio makes sense, the node process will only use a single thread there (sidenote: that's how we have it implemented, we could try to leverage worker threads or other similar parallelization features of node to get around this).

cc @metalwarrior665 for field experience :]

@metalwarrior665 (Member):

  1. Yes, 4 GB is a target for non-browser Node.js actors because of the single core restriction.

  2. Also, the Apify SDK doesn't use any autoscaling; there is nothing to scale there. That only applies to Crawlee.

@TC-MO (Contributor, Author):

So, as I understand it, the mention of the Apify JS SDK should be removed from this paragraph, since it applies only to Crawlee?

Member:

Yep.

sources/platform/actors/running/usage_and_resources.md (outdated, resolved)
Comment on lines +88 to +90
- Actors using [Puppeteer](https://pptr.dev/) or [Playwright](https://playwright.dev/) for real web browser rendering require at least `1024MB` of memory.
- Large and complex sites like [Google Maps](https://apify.com/drobnikj/crawler-google-places) require at least `4096MB` for optimal speed and [concurrency](https://crawlee.dev/api/core/class/AutoscaledPool#minConcurrency).
- Projects involving large amounts of data in memory.
Member:

This would be better to be discussed with tooling / delivery.

Member:

I think this is fine. The idea is that 1 GB is just not very useful for a browser Actor to start scaling concurrency, but it can work as a minimum.


### Maximum memory

- Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than 4 GB of memory because Node.js cannot use more than 1 core.
+ Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than `4096MB` of memory because Node.js cannot use more than 1 core.
Member:

I'd change the link to something else: it's quite old, and at the referenced site you have to create an account to read it.

I'd again discuss with tooling/delivery which articles would be nice to refer to. (I also mean the next one on multiple threads.)

Comment on lines +106 to +111
  When you run an Actor it generates platform usage that's charged to the user account. Platform usage comprises four main parts:

  - **Compute units**: CPU and memory resources consumed by the Actor.
- - **Data transfer**: Amount of data you transfered between web, Apify platform, and other external systems.
+ - **Data transfer**: The amount of data transferred between the web, Apify platform, and other external systems.
  - **Proxy costs**: Residential or SERP proxy usage.
- - **Storage operations**: Read, write, and other operations towards key-value store, dataset, and request queue.
+ - **Storage operations**: Read, write, and other operations performed on the Key-value store, Dataset, and Request queue.
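The four parts above combine additively into the total bill. A minimal sketch of that idea follows; the rates and field names are made-up placeholders for illustration, not Apify's actual pricing or API response shape.

```javascript
// Illustrative only: RATES below are hypothetical placeholders, not real
// Apify prices, and the usage field names are invented for this sketch.
const RATES = {
  computeUnit: 0.25,    // USD per compute unit
  dataTransferGB: 0.2,  // USD per GB transferred
  proxyGB: 10.0,        // USD per GB of residential proxy traffic
  storageOp: 0.000005,  // USD per storage read/write operation
};

// Total platform usage cost is simply the sum of the four parts.
function usageCostUsd(usage) {
  return (
    usage.computeUnits * RATES.computeUnit +
    usage.dataTransferGB * RATES.dataTransferGB +
    usage.proxyGB * RATES.proxyGB +
    usage.storageOps * RATES.storageOp
  );
}
```

For real numbers, check the usage breakdown the platform shows for each run rather than anything hard-coded like this.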
Member:

I'd add some info on where to find the run usage: it's visible on the run detail and in the run list.

@TC-MO TC-MO requested a review from valekjo March 26, 2024 12:59

### Maximum memory

- Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than 4 GB of memory because Node.js cannot use more than 1 core.
+ Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://dev.to/arealesramirez/is-node-js-single-threaded-or-multi-threaded-and-why-ab1). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than `4096MB` of memory because Node.js cannot use more than 1 core.
Member:

i believe its "single thread process", not "single process thread"



@TC-MO TC-MO requested a review from metalwarrior665 March 28, 2024 16:18

- > It is possible to [use multiple threads in Node.js-based Actor](https://dev.to/reevranj/multiple-threads-in-nodejs-how-and-what-s-new-b23) with some configuration. This can be useful if you need to offload a part of your workload.
+ It's possible to [use multiple threads in Node.js-based Actor](https://dev.to/reevranj/multiple-threads-in-nodejs-how-and-what-s-new-b23) with some configuration. This can be useful if you need to offload a part of your workload.
Member:

We also have a guide directly for Crawlee: https://crawlee.dev/docs/3.7/guides/parallel-scraping

Member:

Later we can add a horizontal scaling guide, especially since we now have RequestQueueV2, which supports multiple Actors running simultaneously.

@TC-MO TC-MO merged commit 7c35d4d into master Mar 29, 2024
7 checks passed
@TC-MO TC-MO deleted the update-usage-and-resources branch March 29, 2024 10:00