From fe6357b25544002ec9f0397127afd617e5c4f1d7 Mon Sep 17 00:00:00 2001
From: TCMO <92638966+TC-MO@users.noreply.github.com>
Date: Thu, 22 Feb 2024 14:48:48 +0100
Subject: [PATCH] docs: remove dead link (#865)

---
 .../anti_scraping/mitigation/cloudflare_challenge.md | 4 ++--
 vale.ini                                             | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
index 25b547318..81aa5f119 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
@@ -11,7 +11,7 @@ slug: /anti-scraping/mitigation/cloudflare-challenge.md
 
 ---
 
-There are a few strategies that can be employed if you find yourself stuck. One key strategy is to ensure that your browser fingerprint is consistent. In some cases, the default browser fingerprint may actually be more effective than an inconsistently generated fingerprint. Additionally, it may be beneficial to avoid masking a Linux browser to look like a Windows or macOS browser, although this will depend on the specific configuration of the website you are targeting.
+If you find yourself stuck, there are few strategies that you can employ. One key strategy is to ensure that your browser fingerprint is consistent. In some cases, the default browser fingerprint may actually be more effective than an inconsistently generated fingerprint. Additionally, it may be beneficial to avoid masking a Linux browser to look like a Windows or macOS browser, although this will depend on the specific configuration of the website you are targeting.
 
 For those using Crawlee, the library provides out-of-the-box support for generating consistent fingerprints that are able to pass the Cloudflare challenge. However, it's important to note that in some cases, the Cloudflare challenge screen may return a 403 status code even if it is evaluating the fingerprint and the request is not blocked. This can cause the default Crawlee browser crawlers to throw an error and not wait until the challenge is submitted and the page is redirected to the target webpage.
 
@@ -28,7 +28,7 @@ const crawler = new PlaywrightCrawler({
 
 It's important to note that by removing default blocked status code handling, you should also add custom session retire logic on blocked pages to reduce retries. Additionally, you should add waiting logic to start the automation logic only after the Cloudflare challenge is solved and the page is redirected. This can be accomplished by waiting for a common selector that is available on all pages, such as a header logo.
 
-In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies you can retire the session and use another IP. However if you have small pool of proxies you might want to whitelist the IP. To do this, you'll need to solve the captcha to improve your IP address's reputation. There are various captcha-solving services available, such as [AntiCaptcha](https://anti-captcha.com/) or [AnyCaptcha](https://anycaptcha.com/), that you can use for this purpose. For more info check the section about [Captchas](../techniques/captchas.md).
+In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies you can retire the session and use another IP. However if you have small pool of proxies you might want to whitelist the IP. To do this, you'll need to solve the captcha to improve your IP address's reputation. You can find various captcha-solving services, such as [AntiCaptcha](https://anti-captcha.com/), that you can use for this purpose. For more info check the section about [Captchas](../techniques/captchas.md).
 
 ![Cloudflare captcha](https://images.ctfassets.net/slt3lc6tev37/6sN2VXiUaJpjxqVfTbZEJd/9a4e13cbf08ce29797167c133c534e1f/image1.png)
 
diff --git a/vale.ini b/vale.ini
index c150c4543..837d46457 100644
--- a/vale.ini
+++ b/vale.ini
@@ -20,3 +20,4 @@ Microsoft.Contractions = NO
 Microsoft.Foreign = NO
 Microsoft.We = NO
 Microsoft.Quotes = NO
+Microsoft.ThereIs = NO