Skip to content

Commit

Permalink
Merge branch 'master' into update-actors-in-store
Browse files Browse the repository at this point in the history
  • Loading branch information
TC-MO authored Mar 13, 2024
2 parents acd538c + 279bd6a commit b9c3272
Show file tree
Hide file tree
Showing 38 changed files with 247 additions and 141 deletions.
4 changes: 2 additions & 2 deletions .github/styles/Apify/Capitalization.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@ message: "The word '%s' should always be capitalized."
ignorecase: false
level: error
tokens:
- '\bactor\b'
- '\bactors\b'
- '(?<!\W)\bactor\b'
- '(?<!\W)\bactors\b'
- '(?<!@)\bapify\b(?!-\w+)'
- '(?<!\()\bhttps?://[^\s]*\bapify\b[^\s]*\b(?!\))|(?<!\[)\bhttps?://[^\s]*\bapify\b[^\s]*\b(?!\])'

Expand Down
34 changes: 34 additions & 0 deletions .github/workflows/lychee.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
name: Lychee Link Checker

on: [pull_request]

jobs:
link-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Use Node.js 18
uses: actions/setup-node@v4
with:
node-version: 18
cache: 'npm'
cache-dependency-path: 'package-lock.json'
always-auth: 'true'
registry-url: 'https://npm.pkg.github.com/'
scope: '@apify-packages'

- name: Build docs
run: |
npm ci --force
npm run build
env:
APIFY_SIGNING_TOKEN: ${{ secrets.APIFY_SIGNING_TOKEN }}
NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

- uses: lycheeverse/[email protected]
env:
GITHUB_TOKEN: ${{ secrets.APIFY_SERVICE_ACCOUNT_GITHUB_TOKEN }}
with:
fail: true
args: --base https://docs.apify.com --exclude-path 'build/versions.html' --max-retries 6 --verbose --no-progress --accept '100..=103,200..=299,403..=403, 429' './build/**/*.html'
2 changes: 1 addition & 1 deletion .github/workflows/test.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ jobs:

- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v42.0.5
uses: tj-actions/changed-files@v42.1.0
with:
files: |
**/*.md
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/vale.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ jobs:

- name: Get changed files
id: changed-files
uses: tj-actions/changed-files@v42.0.5
uses: tj-actions/changed-files@v42.1.0
with:
files: |
**/*.{md,mdx}
Expand Down
9 changes: 9 additions & 0 deletions .lycheeignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
http:\/\/localhost:3000.*
https:\/\/www\.youtube.*
\.(jpg|jpeg|png|gif|bmp|webp|svg)$
https:\/\/github\.com\/apify\/apify-docs\/edit\/[^ ]*a
https:\/\/docs\.apify\.com\/assets\/[^ ]*
file:\/\/\/.*
https://chrome\.google\.com/webstore/.*
https?:\/\/(www\.)?npmjs\.com\/.*
^https://apify\.com/api/og-image.*
2 changes: 1 addition & 1 deletion apify-docs-theme/package.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "@apify/docs-theme",
"version": "1.0.106",
"version": "1.0.108",
"description": "",
"main": "./src/index.js",
"files": [
Expand Down
4 changes: 2 additions & 2 deletions apify-docs-theme/src/config.js
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ const themeConfig = ({
items: [
{
label: 'Reference',
href: `${absoluteUrl}/api/v2/`,
href: `${absoluteUrl}/api/v2`,
target: '_self',
rel: 'dofollow',
},
Expand Down Expand Up @@ -170,7 +170,7 @@ const themeConfig = ({
items: [
{
label: 'Reference',
href: `${absoluteUrl}/api/v2/`,
href: `${absoluteUrl}/api/v2`,
target: '_self',
rel: 'dofollow',
},
Expand Down
8 changes: 8 additions & 0 deletions apify-docs-theme/src/theme/custom.css
Original file line number Diff line number Diff line change
Expand Up @@ -922,6 +922,14 @@ html[data-theme='dark'] .actionLink:hover::after {
align-items: flex-start;
gap: 1.6rem;
align-self: stretch;
height: 100%;
}

.cardContentWrapperText {
display: flex;
flex-direction: column;
align-items: flex-start;
gap: 0.4rem;
}

.cardContentList {
Expand Down
2 changes: 1 addition & 1 deletion sources/academy/glossary/concepts/http_cookies.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,4 @@ HTTP cookies are small pieces of data sent by the server to the user's web brows
2. To make the website show location-specific data (works for websites where you could set a zip code or country directly on the page, but unfortunately doesn't work for some location-based ads).
3. To make the website less suspicious of the crawler and let the crawler's traffic blend in with regular user traffic.

For local testing, we recommend using the [**EditThisCookie**](https://chrome.google.com/webstore/detail/editthiscookie/fngmhnnpilhplaeedifhccceomclgfbg?hl=en) Chrome extension.
For local testing, we recommend using the [**EditThisCookie**](https://chrome.google.com/webstore/detail/fngmhnnpilhplaeedifhccceomclgfbg) Chrome extension.
2 changes: 1 addition & 1 deletion sources/academy/glossary/tools/modheader.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ slug: /tools/modheader

If you read about [Postman](./postman.md), you might remember that you can use it to modify request headers before sending a request. This is great, but the main problem is that Postman can only make static requests - meaning, it is unable to load JavaScript or any [dynamic content](../concepts/dynamic_pages.md).

[ModHeader](https://chrome.google.com/webstore/detail/modheader/idgpnmonknjnojddfkpgkljpfnnfcklj?hl=en) is a Chrome extension which can be used to modify the HTTP headers of the requests you make with your browser. This means that, for example, if your scraper using a headless browser Puppeteer is being blocked due to an improper **User-Agent** header, you can use ModHeader to test the target website and quickly solve the issue.
[ModHeader](https://chrome.google.com/webstore/detail/idgpnmonknjnojddfkpgkljpfnnfcklj) is a Chrome extension which can be used to modify the HTTP headers of the requests you make with your browser. This means that, for example, if your scraper using a headless browser Puppeteer is being blocked due to an improper **User-Agent** header, you can use ModHeader to test the target website and quickly solve the issue.

## The ModHeader interface {#interface}

Expand Down
2 changes: 1 addition & 1 deletion sources/academy/glossary/tools/switchyomega.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ slug: /tools/switchyomega

---

SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/proxy-switchyomega/padekgcemlokbadohgkifijomclgjgif).
SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/padekgcemlokbadohgkifijomclgjgif).

After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various differnt connection profiles, as well as open the extension's options.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ Before moving on, give these valuable resources a quick lookover:
- Refamiliarize with the various available data on the [Request object](https://crawlee.dev/api/core/class/Request).
- Learn about the [`failedRequestHandler` function](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#failedRequestHandler).
- Understand how to use the [`errorHandler`](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#errorHandler) function to handle request failures.
- Ensure you are comfortable using [key-value stores](/sdk/js/docs/guides/data-storage#key-value-store) and [datasets](/sdk/js/docs/api/dataset#__docusaurus), and understand the differences between the two storage types.
- Ensure you are comfortable using [key-value stores](/sdk/js/docs/guides/result-storage#key-value-store) and [datasets](/sdk/js/docs/guides/result-storage#dataset), and understand the differences between the two storage types.

## Knowledge check 📝 {#quiz}

Expand Down
10 changes: 5 additions & 5 deletions sources/academy/platform/get_most_of_actors/actor_readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,11 +26,11 @@ slug: /get-most-of-actors/actor-readme
## What should you add to your Actor README?

Aim for sections 1-6 below and try to include at least 300 words. You can move the sections around to some extent if it makes sense, e.g. 3 might come after 6. Consider using emojis as bullet points or otherwise trying to break up the text.
Aim for sections 16 below and try to include at least 300 words. You can move the sections around to some extent if it makes sense, e.g. 3 might come after 6. Consider using emojis as bullet points or otherwise trying to break up the text.

1. **What does (Actor name) do?**

- in 1-2 sentences describe what the Actor does and what it does not do
- in 12 sentences describe what the Actor does and what it does not do
- consider adding keywords like API, e.g. Instagram API
- always have a link to the target website in this section

Expand All @@ -43,12 +43,12 @@ Aim for sections 1-6 below and try to include at least 300 words. You can move t
3. **How much will it cost to scrape (target site)?**

- Simple text explaining what type of proxies are needed and how many platform credits (calculated mainly from consumption units) are needed for 1000 results.
- This is calculated from carrying out several runs (or from runs saved in the DB). @Zuzka can help if needed. [Information in this table](https://docs.google.com/spreadsheets/d/1NOkob1eYqTsRPTVQdltYiLUsIipvSFXswRcWQPtCW9M/edit#gid=1761542436), tab "cost of usage".
- This is calculated from carrying out several runs (or from runs saved in the DB).<!-- @Zuzka can help if needed. [Information in this table](https://docs.google.com/spreadsheets/d/1NOkob1eYqTsRPTVQdltYiLUsIipvSFXswRcWQPtCW9M/edit#gid=1761542436), tab "cost of usage". -->
- Here’s an example for this section:

> ## How much will it cost me to scrape Google Maps reviews?
>
> <br/> Apify provides you with $5 free usage credits to use every month on the Apify Free plan and you can get up to 100,000 reviews from this Google Maps Reviews Scraper for those credits. So 100k results will be completely free!
> <br/> Apify provides you with $5 free usage credits to use every month on the Apify Free plan and you can get up to 100,000 reviews from this Google Maps Reviews Scraper for those credits. This means 100k results will be completely free!
> <br/> But if you need to get more data or to get your data regularly you should grab an Apify subscription. We recommend our $49/month Starter plan - you can get up to 1 million Google Maps reviews every month with the $49 monthly plan! Or 10 million with the $499 Scale plan - wow!
4. **How to scrape (target site)**
Expand Down Expand Up @@ -94,4 +94,4 @@ If you want some general tips on how to make GitHub README that stands out, chec

## Next up {#next}

If you followed all the tips described above, your Actor README is almost good to go! In the [next lesson](./guidelines_for_writing.md) we will give you a few instructions on how you can create a tutorial for your Actor.
If you followed all the tips described above, your Actor README is almost good to go! In the [next lesson](./guidelines_for_writing.md) we will give you a few instructions on how you can create a tutorial for your Actor.
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,12 @@ slug: /api/run-actor-and-retrieve-data-via-api

---

The most popular way of [integrating](https://help.apify.com/en/collections/1669767-integrating-with-apify) the Apify platform with an external project/application is by programmatically running an [Actor](/platform/actors) or [task](/platform/actors/running/tasks), waiting for it to complete its run, then collecting its data and using it within the project. Though this process sounds somewhat complicated, it's actually quite easy to do; however, due to the plethora of features offered on the Apify platform, new users may not be sure how exactly to implement this type of integration. So, let's dive in and see how you can do it.
The most popular way of [integrating](https://help.apify.com/en/collections/1669769-integrations) the Apify platform with an external project/application is by programmatically running an [Actor](/platform/actors) or [task](/platform/actors/running/tasks), waiting for it to complete its run, then collecting its data and using it within the project. Though this process sounds somewhat complicated, it's actually quite easy to do; however, due to the plethora of features offered on the Apify platform, new users may not be sure how exactly to implement this type of integration. Let's dive in and see how you can do it.

> Remember to check out our [API documentation](/api/v2) with examples in different languages and a live API console. We also recommend testing the API with a nice desktop client like [Postman](https://www.getpostman.com/) or [Insomnia](https://insomnia.rest).
There are 2 main ways of using the Apify API:

Apify API offers two ways of interacting with it:

- [Synchronously](#synchronous-flow)
- [Asynchronously](#asynchronous-flow)
Expand All @@ -36,7 +37,7 @@ To run, or **call**, an Actor/task, you will need a few things:

- Some other optional settings if you'd like to change the default values (such as allocated memory or the build).

The URL for a [POST request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) to run an actor looks like this:
The URL of [POST request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) to run an actor looks like this:

```cURL
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN
Expand Down Expand Up @@ -261,9 +262,9 @@ https://api.apify.com/v2/datasets/DATASET_ID/items

By default, it will return the data in JSON format with some metadata. The actual data are in the `items` array.

There are plenty of additional parameters that you can use. You can learn about them in the [documentation](/api/v2#/reference/datasets/item-collection/get-items). We will only mention that you can pass a `format` parameter that transforms the response into popular formats like CSV, XML, Excel, RSS, etc.
You can use plenty of additional parameters, to learn more about them, visit our API reference [documentation](/api/v2#/reference/datasets/item-collection/get-items). We will only mention that you can pass a `format` parameter that transforms the response into popular formats like CSV, XML, Excel, RSS, etc.

The items are paginated, which means you can ask only for a subset of the data. Specify this using the `limit` and `offset` parameters. There is actually an overall limit of 250,000 items that the endpoint can return per request. To retrieve more, you will need to send more requests incrementing the `offset` parameter.
The items are paginated, which means you can ask only for a subset of the data. Specify this using the `limit` and `offset` parameters. This endpoint has a limit of 250,000 items that it can return per request. To retrieve more, you will need to send more requests incrementing the `offset` parameter.

```cURL
https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&offset=250000
Expand Down
6 changes: 3 additions & 3 deletions sources/academy/tutorials/node_js/debugging_web_scraper.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,15 +23,15 @@ jq.src = 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js';
document.getElementsByTagName('head')[0].appendChild(jq);
```

If that doesn't work because of CORS violation, you can install [this extension](https://chrome.google.com/webstore/detail/jquery-inject/iibfbhlfimdnkinkcenncoeejnmpemof) that injects jQuery on a button click.
If that doesn't work because of CORS violation, you can install [this extension](https://chrome.google.com/webstore/detail/ekkjohcjbjcjjifokpingdbdlfekjcgi) that injects jQuery on a button click.

There are 2 main ways how to test a pageFunction code in your console:
You can test a `pageFunction` code in two ways in your console:

## Pasting and running a small code snippet

Usually, you don't need to paste in the whole pageFunction as you can simply isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1.

I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). So my declaration will change from:
I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). My declaration will change from:

```js
const results = [];
Expand Down
Loading

0 comments on commit b9c3272

Please sign in to comment.