The Chrome variant of PlaywrightCrawler does not follow JavaScript redirects. Firefox does. #877

Closed
matecsaj opened this issue Jan 6, 2025 · 8 comments
Labels: t-tooling (issues with this label are in the ownership of the tooling team)

matecsaj (Contributor) commented Jan 6, 2025

Run this with Chromium, then switch to Firefox and run it again. Chromium does not follow the JavaScript redirect as it should. The target website is sensitive to bots; you may find it necessary to add a proxy.

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0

async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_type='chromium',  # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',  # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info('Response URL Expected: https://pinside.com/pinball/machine/addams-family')
        context.log.info(f'Response URL Actual: {context.response.url}')
        context.log.info(f'HTML {await context.page.content()}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])

if __name__ == '__main__':
    asyncio.run(main())

I have a vague memory of solving this exact problem on an old project; here is a snippet of that code.

from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=self.user_agent,  # defined elsewhere in the original class
        proxy=self.proxy,            # defined elsewhere in the original class
    )
    page = await context.new_page()
    await page.goto(url, wait_until='networkidle')  # This might be what you need to fix the problem.
janbuchar (Collaborator) commented:

Hi @matecsaj! Did you try this with headless=False to see what is actually going on? Or investigate what kind of redirect mechanism they use?

Asking just in case, otherwise we can do that as the next step when investigating this 🙂

matecsaj (Contributor, Author) commented Jan 7, 2025

Yes, I did.

When not running headless, something flashed on the screen. It was hard to read because it appeared only for a moment, so I used context.log.info(f'HTML {await context.page.content()}') to get a longer look. From the HTML dump, I determined that JavaScript likely performs the redirect.
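
For reference, here is a minimal sketch of how one could scan such a dump for common client-side redirect mechanisms; the regex patterns below are illustrative, not exhaustive:

import re

def find_client_redirects(html: str) -> list[str]:
    """Scan an HTML dump for common client-side redirect mechanisms."""
    patterns = [
        r'<meta[^>]+http-equiv=["\']refresh["\'][^>]*>',          # meta-refresh redirect
        r'window\.location(?:\.href)?\s*=\s*["\'][^"\']+["\']',   # JS assignment redirect
        r'location\.replace\(["\'][^"\']+["\']\)',                # JS location.replace() redirect
    ]
    return [m for p in patterns for m in re.findall(p, html, flags=re.IGNORECASE)]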

B4nan (Member) commented Jan 7, 2025

Waiting for networkidle is something you should do as part of the request handler. Not everyone needs to wait for it, and it significantly slows down page processing, which is why it cannot be done by default.

I think you should be able to do await context.page.wait_for_load_state('networkidle') as the first thing in your handler to achieve that.
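
A minimal sketch of that suggestion applied to the reproduction above (the log message is illustrative):

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(browser_type='chromium')

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for in-flight network activity (including any JS-driven redirect)
        # to settle before reading the page.
        await context.page.wait_for_load_state('networkidle')
        context.log.info(f'Landed on: {context.page.url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())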

matecsaj (Contributor, Author) commented Jan 9, 2025

The target website is rejecting Chromium today, so I haven't been able to determine whether your recommendation works there. I've tried multiple times throughout the day with different proxies, but the issue persists.

I also ran a larger test using Firefox and Camoufox, and found that they do trigger the redirect, though not consistently. I've reduced the code to the essentials to clearly demonstrate the problem while incorporating your recommendation.

It’s possible that Crawlee is working as intended and the website’s anti-bot protection is employing clever tactics to discourage my attempts. Since I don’t specifically need to use Chromium, I’m content to let this go. Would you prefer to close this issue, or continue troubleshooting together?

import asyncio

from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service, replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128']

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # Lowest tier: no proxy (optional, in case you do not want to use any proxy on the lowest tier).
        [None],
        # Lower tier: cheaper proxies, preferred as long as they work.
        proxies,
        # A higher, more expensive tier could be added here as a fallback.
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_type='chromium',    # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',   # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:

        requested_url = context.request.url

        # Wait for the page to load completely.
        try:
            await context.page.wait_for_load_state("networkidle")
            await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.wait_for_selector("div#someFinalContent")
        except PlaywrightTimeoutError as e:  # Playwright raises its own TimeoutError, not the builtin one
            context.log.error(
                f"Timeout waiting for the page {requested_url} to load: {e}"
            )
            return
        else:
            await asyncio.sleep(5)  # Wait an additional five seconds for good measure.

        # redirect check
        loaded_url = context.response.url
        if requested_url == loaded_url:
            context.log.error(f"Redirect failed on {context.request.url}")
        else:
            context.log.info(f'Redirect succeeded on {context.request.url} to {loaded_url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])

if __name__ == '__main__':
    asyncio.run(main())

Output when using Chromium.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.054043 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1007, in __run_task_function
          await wait_for(
          ...<5 lines>...
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1105, in __run_request_handler
          await self._context_pipeline(context, self.router)
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 266, in _handle_blocked_request
          raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
      crawlee.errors.SessionError: Assuming the session is blocked based on HTTP status code 403
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=1 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌─────────────────────────────┬────────────────────────────┐
│ requests_finished           │ 0                          │
│ requests_failed             │ 1                          │
│ retry_histogram             │ [0, 0, 0, 0, 0, 0, 0, 0,   │
│                             │ 0, 1]                      │
│ request_avg_failed_duration │ 0.746194                   │
│ request_avg_finished_durat… │ None                       │
│ requests_finished_per_minu… │ 0                          │
│ requests_failed_per_minute  │ 6                          │
│ request_total_duration      │ 0.746194                   │
│ requests_total              │ 1                          │
│ crawler_runtime             │ 9.300798                   │
└─────────────────────────────┴────────────────────────────┘

Process finished with exit code 0

Output when using Firefox.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.044338 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Redirect succeeded on https://pinside.com/pinball/machine/2 to https://pinside.com/pinball/machine/addams-family
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.993283 │
│ requests_finished_per_minute  │ 5         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.993283 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 13.239656 │
└───────────────────────────────┴───────────┘

Process finished with exit code 0

vdusek (Collaborator) commented Jan 17, 2025

Hi @matecsaj,

As you mentioned, the website is rejecting Chromium, even when using proxies. I tried switching to Firefox, which worked fine and returned the expected output. Another alternative is to use Camoufox. Below is an example:

import asyncio

from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info('Response URL Expected: https://pinside.com/pinball/machine/addams-family')
        context.log.info(f'Response URL Actual: {context.response.url}')
        context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())
Output:

[crawlee.crawlers._playwright._playwright_crawler] INFO  Request URL: https://pinside.com/pinball/machine/2
[crawlee.crawlers._playwright._playwright_crawler] INFO  Response URL Expected: https://pinside.com/pinball/machine/addams-family
[crawlee.crawlers._playwright._playwright_crawler] INFO  Response URL Actual: https://pinside.com/pinball/machine/addams-family
[crawlee.crawlers._playwright._playwright_crawler] INFO  HTML <!DOCTYPE html><html lang="en" slick-uniqueid="3"><head> ...
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish

Unfortunately, I was not able to reproduce the issue with Chromium redirects.

matecsaj (Contributor, Author) commented:

Greetings @vdusek,

Thanks for taking the time to review this. Your code ran fine on my machine as well.

However, when I increase the number of request URLs, only the first one works.

  • I added proxies, but still only the first request worked.
  • I set use_session_pool=False, yet the issue persisted.

How can I ensure every request behaves like the first, using a new proxy, new session, and resetting everything else?
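
One untested idea: instead of disabling sessions entirely, keep the session pool but retire every session after a single use, so each request gets a fresh session and a freshly picked proxy. A sketch, assuming SessionPool's create_session_settings accepts max_usage_count (I have not verified this against the site):

from crawlee.crawlers import PlaywrightCrawler
from crawlee.sessions import SessionPool

# Each session is discarded after one use, so every request starts fresh.
crawler = PlaywrightCrawler(
    session_pool=SessionPool(create_session_settings={'max_usage_count': 1}),
    # proxy_configuration=...,  # as in the modified code below
)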

Modified code

import asyncio

from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

proxy_urls = ['http://102.223.186.246:8888', 'http://2.56.179.115:3128', 'http://178.48.68.61:18080', 'http://98.8.195.160:443', 'http://38.156.73.56:8080', 'http://54.210.223.246:8888', 'http://101.255.210.2:1111', 'http://37.46.62.98:8080', 'http://67.43.228.253:8247', 'http://147.45.45.71:8481', 'http://62.33.53.248:3128', 'http://31.56.78.197:8080', 'http://103.148.45.184:8080', 'http://157.20.244.77:8080', 'http://188.124.230.43:32199', 'http://103.153.246.129:3125', 'http://124.121.2.152:8080', 'http://95.215.8.225:3128', 'http://103.105.78.137:8080', 'http://77.242.98.39:8080', 'http://67.43.227.226:2499', 'http://202.145.10.251:8080', 'http://193.178.203.140:8080', 'http://67.43.227.227:3653', 'http://38.183.146.31:8080', 'http://103.156.86.76:8080', 'http://103.26.108.118:83', 'http://200.63.107.118:8089', 'http://35.73.28.87:3128', 'http://189.125.109.66:3128', 'http://72.10.160.93:17819', 'http://67.43.227.226:4309', 'http://103.154.77.73:89', 'http://128.199.254.13:9090', 'http://91.205.69.126:8080', 'http://143.107.199.248:8080', 'http://177.23.176.58:8080', 'http://67.43.236.20:3333', 'http://103.211.26.94:22', 'http://152.70.235.185:9002', 'http://67.43.236.19:20217', 'http://121.227.203.189:8089', 'http://114.223.62.84:8089', 'http://121.227.203.169:8089', 'http://119.8.182.222:3128', 'http://54.212.22.168:1080', 'http://72.10.164.178:11989', 'http://139.84.155.98:3129', 'http://122.53.59.191:8082', 'http://103.253.127.186:3125', 'http://92.45.196.83:3310', 'http://67.43.227.227:26905', 'http://185.138.120.109:8080', 'http://67.43.236.21:19573', 'http://188.132.221.170:8080', 'http://121.227.203.154:8089', 'http://78.187.15.92:3310', 'http://79.106.79.154:8989', 'http://67.43.228.250:10149', 'http://78.189.148.73:3310', 'http://160.202.42.156:8080', 'http://72.10.160.170:23177', 'http://72.10.164.178:12437', 'http://115.186.185.6:3128', 'http://103.91.206.107:8805', 'http://67.43.227.228:16327', 'http://170.239.205.188:999', 'http://113.192.3.42:8050', 'http://5.9.198.34:55555', 'http://124.6.168.26:8282', 'http://67.43.236.20:21331', 'http://36.50.11.198:8080', 'http://203.150.128.72:8080', 'http://92.60.190.79:3128', 'http://202.12.245.133:8080', 'http://72.10.160.90:13427', 'http://103.169.41.236:8080', 'http://43.243.140.58:10001', 'http://169.239.85.121:8080', 'http://103.165.157.79:8090', 'http://179.109.156.19:8080', 'http://103.141.105.74:55', 'http://177.234.192.14:999', 'http://45.123.142.69:8181', 'http://121.227.203.190:8089', 'http://103.242.107.226:8098', 'http://103.42.228.56:8080', 'http://169.159.128.116:8082', 'http://103.220.23.217:8080', 'http://103.189.254.2:8080', 'http://103.172.120.219:9999', 'http://103.105.78.10:3125', 'http://154.6.189.35:3128', 'http://43.136.68.232:8888', 'http://14.232.192.26:10947', 'http://177.39.139.14:9999', 'http://222.127.55.155:8082', 'http://103.111.207.138:32650', 'http://103.127.220.74:8181', 'http://103.80.98.16:8080', 'http://67.43.228.250:16373', 'http://103.67.85.74:8080', 'http://154.0.157.103:8080', 'http://103.177.235.207:84', 'http://103.115.242.194:8080', 'http://103.170.22.137:8089', 'http://78.187.45.169:3310', 'http://78.187.125.78:3310', 'http://119.82.242.200:8080', 'http://45.174.248.11:999', 'http://178.214.80.28:1981', 'http://103.159.96.146:3128', 'http://103.180.119.182:8087', 'http://67.43.228.254:20267', 'http://103.172.249.234:80', 'http://125.99.106.250:3128', 'http://80.78.75.80:8080', 'http://103.166.9.110:8080', 'http://190.97.226.44:999', 'http://178.252.136.27:8080', 
'http://72.10.160.173:14783', 'http://121.227.203.191:8089', 'http://80.248.77.125:8081', 'http://103.75.96.70:8080', 'http://103.81.158.130:8080', 'http://131.196.219.128:8080', 'http://103.153.191.209:8080', 'http://103.36.8.37:8080', 'http://103.151.140.124:10609', 'http://110.136.61.7:8080', 'http://187.111.144.102:8080', 'http://103.86.117.53:1080', 'http://103.46.10.27:7777', 'http://45.178.55.2:999', 'http://190.82.105.122:43949', 'http://139.219.239.14:8080', 'http://177.11.190.84:8080', 'http://38.51.188.31:999', 'http://120.28.137.232:8082', 'http://61.247.185.50:8080', 'http://193.105.123.195:8123', 'http://38.159.229.98:999', 'http://112.19.241.37:19999', 'http://181.209.125.186:999', 'http://94.75.76.10:8080', 'http://72.10.160.173:24721', 'http://77.235.31.24:8080', 'http://67.43.236.20:20217', 'http://185.191.236.162:3128', 'http://78.187.9.206:3310', 'http://181.78.99.31:8080', 'http://103.36.10.223:8080', 'http://103.118.175.42:8080', 'http://203.153.121.130:8080', 'http://103.172.42.177:1111', 'http://190.94.212.82:999', 'http://103.154.91.250:8081', 'http://188.136.143.85:7060', 'http://188.132.222.4:8080', 'http://136.233.136.41:48976', 'http://103.217.213.124:32650', 'http://182.253.26.196:8080', 'http://202.5.60.46:5020', 'http://72.10.164.178:19695', 'http://202.51.214.134:8080', 'http://223.206.51.172:8080', 'http://102.164.252.150:8080', 'http://101.255.208.246:7888', 'http://103.154.77.204:89', 'http://102.0.9.114:8080', 'http://103.247.21.44:1111', 'http://103.70.79.3:8080', 'http://103.155.196.144:8080', 'http://43.130.15.168:3128', 'http://203.142.71.50:8080', 'http://38.50.165.51:999', 'http://38.50.165.56:999', 'http://103.59.44.213:8083', 'http://103.111.207.138:80', 'http://58.209.139.175:8089', 'http://103.46.11.27:7777', 'http://102.67.101.242:8080', 'http://103.158.121.38:8080', 'http://200.59.10.50:999', 'http://208.87.243.199:7878', 'http://51.75.86.68:3128', 'http://103.169.187.29:3125', 'http://150.107.245.121:8080', 'http://202.137.122.4:8082', 'http://213.149.182.98:8080', 'http://103.169.254.45:6080', 'http://190.186.1.126:999', 'http://186.167.80.234:8090', 'http://31.56.78.137:8080', 'http://200.10.28.185:8083', 'http://190.211.250.132:999', 'http://111.1.61.47:3128', 'http://181.209.82.195:999', 'http://103.190.171.213:8181', 'http://90.156.194.75:8026', 'http://190.95.202.210:999', 'http://67.43.227.226:3925', 'http://180.178.95.142:8080', 'http://103.126.86.29:9090', 'http://54.37.207.54:3128', 'http://101.255.208.238:7888', 'http://43.230.197.213:8080', 'http://14.177.236.212:55443', 'http://103.162.63.198:8181', 'http://126.209.9.30:8080', 'http://112.78.131.6:8080', 'http://175.106.10.227:7878', 'http://157.66.16.43:8070', 'http://203.210.84.100:8182', 'http://78.130.246.65:1881', 'http://41.203.213.211:8105', 'http://181.78.79.97:999', 'http://157.66.84.17:8080', 'http://96.0.147.177:443', 'http://109.201.13.186:8080', 'http://165.16.27.109:1981', 'http://190.52.97.27:999', 'http://77.228.182.122:8080', 'http://122.50.6.186:80', 'http://114.130.175.18:8080', 'http://186.0.144.141:9595', 'http://103.247.14.103:1111', 'http://103.151.47.90:3127', 'http://182.253.216.39:1080', 'http://121.200.48.58:8080', 'http://222.252.194.204:8080', 'http://202.57.25.91:1111', 'http://103.157.78.162:8080', 'http://124.217.39.83:8080', 'http://103.39.247.205:8080', 'http://101.251.204.174:8080', 'http://213.169.33.7:8001', 'http://38.43.150.69:999', 'http://72.10.160.90:20949', 'http://165.16.27.42:1981', 'http://171.4.116.113:8080', 
'http://51.159.159.73:80', 'http://171.6.88.79:8080', 'http://124.106.173.56:8082', 'http://103.227.186.69:6080', 'http://171.228.130.95:26639', 'http://131.100.51.247:999', 'http://122.51.39.108:20051', 'http://27.131.250.150:8080', 'http://38.52.221.186:999', 'http://47.90.205.231:33333', 'http://182.253.140.250:8080', 'http://183.88.241.167:8080', 'http://103.151.17.201:8080', 'http://157.100.57.180:999', 'http://105.28.176.41:9812', 'http://103.48.70.57:83', 'http://179.1.110.88:8080', 'http://45.153.165.67:999', 'http://103.162.221.3:8080', 'http://177.154.37.197:9090', 'http://103.63.26.231:8080', 'http://187.103.105.22:8999', 'http://101.37.12.43:1080', 'http://103.211.107.62:8080', 'http://157.15.144.250:8080', 'http://103.190.78.12:8082', 'http://167.86.99.29:3128', 'http://103.180.118.138:1111', 'http://46.243.9.113:8080', 'http://126.209.2.2:8081', 'http://121.101.135.46:8089', 'http://103.143.169.9:84', 'http://186.96.97.203:999', 'http://67.43.236.19:12719', 'http://103.48.160.42:96', 'http://45.190.76.115:999', 'http://103.173.72.97:1111', 'http://103.152.100.221:8080', 'http://170.81.77.132:2222', 'http://45.167.23.31:999', 'http://89.135.59.71:8090', 'http://190.95.183.242:2020', 'http://67.43.236.22:14019', 'http://145.40.97.148:9401', 'http://103.88.90.117:8080', 'http://46.161.196.10:8077', 'http://122.51.39.108:20021', 'http://58.209.137.233:8089', 'http://204.157.185.4:999', 'http://217.66.215.86:8080', 'http://36.92.60.234:8080', 'http://154.70.135.87:8080', 'http://101.255.165.130:1111', 'http://24.49.117.86:8888', 'http://38.159.232.148:999', 'http://103.124.138.76:1111', 'http://114.130.153.70:58080', 'http://103.69.60.8:8080', 'http://103.145.149.36:8080', 'http://103.155.168.88:8299', 'http://129.151.233.36:3128', 'http://103.153.96.100:8181', 'http://103.172.42.227:8080', 'http://110.78.186.81:8080', 'http://103.193.144.123:8080', 'http://216.229.112.25:8080', 'http://103.186.204.52:8089', 'http://103.144.18.86:8080', 'http://103.160.182.33:8080', 'http://38.156.73.229:8080', 'http://201.77.98.131:999', 'http://180.180.151.65:8080', 'http://120.28.216.166:8081', 'http://180.191.36.128:8181', 'http://103.123.168.202:3932', 'http://202.179.69.216:58080', 'http://200.24.153.151:999', 'http://38.45.45.61:999', 'http://102.214.165.241:1981', 'http://170.239.205.185:999', 'http://103.148.130.37:8090', 'http://190.52.100.8:999', 'http://88.80.150.3:8080', 'http://31.43.52.216:41890', 'http://112.198.131.71:8082', 'http://58.136.170.59:8080', 'http://145.40.97.148:9400', 'http://103.180.118.207:7777', 'http://190.97.238.82:999', 'http://13.234.24.116:3128', 'http://103.133.26.106:8080', 'http://113.192.31.135:8080', 'http://46.161.194.88:8085', 'http://103.234.31.0:8080', 'http://200.34.227.28:8080', 'http://79.127.118.57:8080', 'http://45.179.71.90:3180', 'http://43.249.143.242:3128', 'http://58.208.159.211:8089', 'http://154.90.49.84:9090', 'http://103.159.46.45:83', 'http://45.5.116.145:999', 'http://103.156.74.154:8080', 'http://181.78.86.105:999', 'http://41.33.219.131:1981', 'http://119.95.237.19:8080', 'http://1.0.170.50:8080', 'http://217.52.247.85:1976']

request_urls = [f"https://pinside.com/pinball/machine/{pinside_id}" for pinside_id in range(2, 12)]

class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
        proxy_configuration=ProxyConfiguration(proxy_urls=proxy_urls),
        use_session_pool=False,  # don't use sessions, start every fetch fresh
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        if context.request.url != context.response.url:
            context.log.warning(f'Success - {context.request.url} redirected to {context.response.url}')
        else:
            context.log.info(f'Failure - Request URL: {context.request.url}  Proxy: {context.proxy_info}')
            context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(request_urls)


if __name__ == '__main__':
    asyncio.run(main())

Summary Output

[crawlee.crawlers._playwright._playwright_crawler] WARN  Success - https://pinside.com/pinball/machine/2 redirected to https://pinside.com/pinball/machine/addams-family
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/5  Proxy: ProxyInfo(url='http://77.242.98.39:8080', scheme='http', hostname='77.242.98.39', port=8080, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/4  Proxy: ProxyInfo(url='http://67.43.227.226:2499', scheme='http', hostname='67.43.227.226', port=2499, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/10  Proxy: ProxyInfo(url='http://200.63.107.118:8089', scheme='http', hostname='200.63.107.118', port=8089, username='', password='', session_id=None, proxy_tier=None)

Complete Output

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch.py 
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.001004 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 0         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [0]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ None      │
│ requests_finished_per_minute  │ 0         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 0.0       │
│ requests_total                │ 0         │
│ crawler_runtime               │ 60.004844 │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 7; desired_concurrency = 7; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
      Call log:
        - navigating to "https://pinside.com/pinball/machine/3", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/5  Proxy: ProxyInfo(url='http://77.242.98.39:8080', scheme='http', hostname='77.242.98.39', port=8080, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/4  Proxy: ProxyInfo(url='http://67.43.227.226:2499', scheme='http', hostname='67.43.227.226', port=2499, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  HTML <!DOCTYPE html><html lang="en-US" dir="ltr"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131;font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}body{display:flex;flex-direction:column;height:100vh;min-height:100vh}.main-content{margin:8rem auto;max-width:60rem;padding-left:1.5rem}@media (width <= 720px){.main-content{margin-top:4rem}}.h2{font-size:1.5rem;font-weight:500;line-height:2.25rem}@media (width <= 720px){.h2{font-size:1.25rem;line-height:1.5rem}}#challenge-error-text{background-image:url(data:image/svg+xml;base64,PHN2Zy
[crawlee.crawlers._playwright._playwright_crawler] INFO  HTML <!DOCTYPE html><html lang="en-US" dir="ltr"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131;font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}body{display:flex;flex-direction:column;height:100vh;min-height:100vh}.main-content{margin:8rem auto;max-width:60rem;padding-left:1.5rem}@media (width <= 720px){.main-content{margin-top:4rem}}.h2{font-size:1.5rem;font-weight:500;line-height:2.25rem}@media (width <= 720px){.h2{font-size:1.25rem;line-height:1.5rem}}#challenge-error-text{background-image:url(data:image/svg+xml;base64,PHN2Zy
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
      Call log:
        - navigating to "https://pinside.com/pinball/machine/9", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] INFO  Failure - Request URL: https://pinside.com/pinball/machine/10  Proxy: ProxyInfo(url='http://200.63.107.118:8089', scheme='http', hostname='200.63.107.118', port=8089, username='', password='', session_id=None, proxy_tier=None)
[crawlee.crawlers._playwright._playwright_crawler] INFO  HTML <!DOCTYPE html><html lang="en-US" dir="ltr"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131;font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}body{display:flex;flex-direction:column;height:100vh;min-height:100vh}.main-content{margin:8rem auto;max-width:60rem;padding-left:1.5rem}@media (width <= 720px){.main-content{margin-top:4rem}}.h2{font-size:1.5rem;font-weight:500;line-height:2.25rem}@media (width <= 720px){.h2{font-size:1.25rem;line-height:1.5rem}}#challenge-error-text{background-image:url(data:image/svg+xml;base64,PHN2Zy
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
      Call log:
        - navigating to "https://pinside.com/pinball/machine/7", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
      Call log:
        - navigating to "https://pinside.com/pinball/machine/6", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
      Call log:
        - navigating to "https://pinside.com/pinball/machine/11", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 179, in _navigate
          response = await context.page.goto(context.request.url)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 8973, in goto
          await self._impl_obj.goto(
              url=url, timeout=timeout, waitUntil=wait_until, referer=referer
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_page.py", line 551, in goto
          return await self._main_frame.goto(**locals_to_params(locals()))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_frame.py", line 145, in goto
          await self._channel.send("goto", locals_to_params(locals()))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 61, in send
          return await self._connection.wrap_api_call(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ...<2 lines>...
          )
          ^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
          raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
      playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
      Call log:
        - navigating to "https://pinside.com/pinball/machine/8", waiting until "load"
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 3          │
│ requests_failed               │ 6          │
│ retry_histogram               │ [0, 0, 9]  │
│ request_avg_failed_duration   │ 30.405075  │
│ request_avg_finished_duration │ 23.125161  │
│ requests_finished_per_minute  │ 1          │
│ requests_failed_per_minute    │ 2          │
│ request_total_duration        │ 251.805933 │
│ requests_total                │ 9          │
│ crawler_runtime               │ 120.006515 │
└───────────────────────────────┴────────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 1; desired_concurrency = 9; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] WARN  Success - https://pinside.com/pinball/machine/2 redirected to https://pinside.com/pinball/machine/addams-family
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=26 unique_errors=2
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 4          │
│ requests_failed               │ 6          │
│ retry_histogram               │ [0, 0, 10] │
│ request_avg_failed_duration   │ 30.405075  │
│ request_avg_finished_duration │ 28.976533  │
│ requests_finished_per_minute  │ 2          │
│ requests_failed_per_minute    │ 2          │
│ request_total_duration        │ 298.336582 │
│ requests_total                │ 10         │
│ crawler_runtime               │ 153.183921 │
└───────────────────────────────┴────────────┘

Process finished with exit code 0

vdusek (Collaborator) commented Jan 22, 2025

Hi @matecsaj,

I used your range of URLs with PlaywrightCrawler and integrated it with Camoufox. Here's the code:

import asyncio

from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

URLS = [f'https://pinside.com/pinball/machine/{pinside_id}' for pinside_id in range(2, 12)]


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    browser_pool = BrowserPool(plugins=[CamoufoxPlugin()])
    crawler = PlaywrightCrawler(browser_pool=browser_pool)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info(f'Response URL actual: {context.response.url}')
        context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(URLS)


if __name__ == '__main__':
    asyncio.run(main())

I wasn't blocked and didn't even need to use proxies. Here's the output:

[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 10         │
│ requests_failed               │ 0          │
│ retry_histogram               │ [10]       │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 12.222674  │
│ requests_finished_per_minute  │ 13         │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 122.226744 │
│ requests_total                │ 10         │
│ crawler_runtime               │ 44.703722  │
└───────────────────────────────┴────────────┘

The issue likely lies with the proxies you're using. When I tested them, I also encountered frequent blocks and timeouts. Output:

┌───────────────────────────────┬──────────────┐
│ requests_finished             │ 5            │
│ requests_failed               │ 5            │
│ retry_histogram               │ [4, 1, 4, 1] │
│ request_avg_failed_duration   │ 30.854656    │
│ request_avg_finished_duration │ 15.771269    │
│ requests_finished_per_minute  │ 2            │
│ requests_failed_per_minute    │ 2            │
│ request_total_duration        │ 233.129626   │
│ requests_total                │ 10           │
│ crawler_runtime               │ 138.829328   │
└───────────────────────────────┴──────────────┘

These proxies appear to be of low quality and may already be blacklisted by anti-bot/scraping systems.
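
For what it's worth, a quick pre-check can weed out obviously dead proxies before a crawl. A sketch using httpx (assuming a recent httpx that accepts the proxy= client argument):

import asyncio

import httpx


async def filter_live_proxies(proxies: list[str], test_url: str = 'https://example.com') -> list[str]:
    """Keep only proxies that can fetch a known-good URL within a short timeout."""

    async def check(proxy: str) -> str | None:
        try:
            async with httpx.AsyncClient(proxy=proxy, timeout=10) as client:
                response = await client.get(test_url)
        except httpx.HTTPError:
            return None
        return proxy if response.status_code == 200 else None

    results = await asyncio.gather(*(check(p) for p in proxies))
    return [p for p in results if p is not None]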

matecsaj (Contributor, Author) commented:

Thank you, @vdusek, for identifying the root cause. I apologize for mistakenly reporting this as an issue with Crawlee and for any time I may have wasted.
