The Chrome variant of PlaywrightCrawler does not follow JavaScript redirects. Firefox does. #877
Hi @matecsaj! Did you try this with Asking just in case, otherwise we can do that as the next step when investigating this 🙂 |
Yes, I did. When not running headless, something flashes on the screen. It appeared only for a moment and was hard to read, so I used context.log.info(f'HTML {await context.page.content()}') to get a longer look at the page. From the HTML dump, I determined that JavaScript LIKELY performs a redirect. |
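For anyone inspecting a similar HTML dump, here is a minimal sketch of how the common client-side redirect mechanisms could be flagged automatically. The function name `find_redirect_hints` and the pattern list are assumptions made for illustration; they are not part of Crawlee or Playwright, and a regex scan like this can only hint at a redirect (it will miss obfuscated or dynamically built scripts).

```python
import re

# Hypothetical pattern list: a few common ways a page redirects on the client side.
REDIRECT_PATTERNS = {
    'meta refresh': re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.IGNORECASE),
    # Matches `location = ...` and `location.href = ...`, but not `==` comparisons.
    'location assignment': re.compile(
        r'(?:window\.|document\.)?location(?:\.href)?\s*=[^=]', re.IGNORECASE
    ),
    'location method': re.compile(r'location\.(?:replace|assign)\s*\(', re.IGNORECASE),
}


def find_redirect_hints(html: str) -> list[str]:
    """Return the names of the redirect mechanisms that appear in the HTML."""
    return [name for name, pattern in REDIRECT_PATTERNS.items() if pattern.search(html)]
```

Inside a request handler this could be called as `find_redirect_hints(await context.page.content())`; an empty list does not prove that no redirect happens.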
Waiting for I think you should be able to do |
The target website is rejecting Chromium today, so I haven't been able to determine whether your recommendation is effective with Chromium. I tried multiple times throughout the day with different proxies, but the issue persists. I also ran a large test using Firefox and Camoufox and found that they do trigger the redirect, though not consistently. I've reduced the code to the essentials to demonstrate the problem clearly while incorporating your recommendation. It's possible that Crawlee is working as intended and the website's anti-bot protection is employing clever tactics to discourage my attempts. Since I don't specifically need to use Chromium, I'm content to let this go. Would you prefer to close this issue, or continue troubleshooting together?
import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext  # v0.5.0
from crawlee.proxy_configuration import ProxyConfiguration
# Playwright raises its own TimeoutError, which the built-in TimeoutError does not catch.
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

# If these go out of service then replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128']

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
        [None],
        # lower tier, cheaper, preferred as long as they work
        proxies,
        # higher tier, more expensive, used as a fallback
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_type='chromium',  # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',  # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        requested_url = context.request.url

        # wait for the page to completely load
        try:
            await context.page.wait_for_load_state('networkidle')
            await context.page.wait_for_load_state('domcontentloaded')
            # await context.page.wait_for_selector("div#someFinalContent")
        except PlaywrightTimeoutError as e:
            context.log.error(
                f'Timeout waiting for the page {requested_url} to load: {e}'
            )
            return
        else:
            await asyncio.sleep(5)  # Wait an additional five seconds for good measure.

        # redirect check
        loaded_url = context.response.url
        if requested_url == loaded_url:
            context.log.error(f'Redirect failed on {context.request.url}')
        else:
            context.log.info(f'Redirect succeeded on {context.request.url} to {loaded_url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())
Output when using Chromium.
Output when using Firefox.
|
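As an alternative to the fixed asyncio.sleep(5) in the script above, here is a rough sketch of polling until the page URL actually changes. The helper name `wait_for_url_change` and its parameters are assumptions for illustration, not a Crawlee or Playwright API. With Playwright one would pass `lambda: context.page.url`, since `page.url` reflects client-side navigations; Playwright's own `page.wait_for_url(...)` is probably the better choice when the target URL or pattern is known in advance.

```python
import asyncio
import time
from typing import Callable, Optional


async def wait_for_url_change(
    get_url: Callable[[], str],
    original_url: str,
    timeout: float = 10.0,
    interval: float = 0.25,
) -> Optional[str]:
    """Poll get_url() until it differs from original_url.

    Returns the new URL, or None if nothing changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_url()
        if current != original_url:
            return current
        await asyncio.sleep(interval)
    return None
```

In the handler this might look like `new_url = await wait_for_url_change(lambda: context.page.url, context.request.url)`, treating a `None` result as "no redirect observed" rather than as proof that the browser ignored one.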
Hi @matecsaj, as you mentioned, the website is rejecting
import asyncio

from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info('Response URL Expected: https://pinside.com/pinball/machine/addams-family')
        context.log.info(f'Response URL Actual: {context.response.url}')
        context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())
Unfortunately, I was not able to reproduce the issues with Chrome redirects. |
Greetings @vdusek, Thanks for taking the time to review this. Your code ran fine on my machine as well. However, when I increase the number of request URLs, only the first one works.
How can I ensure every request behaves like the first, using a new proxy, new session, and resetting everything else?
Modified code
import asyncio
from camoufox import AsyncNewBrowser
from typing_extensions import override
from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
proxy_urls = ['http://102.223.186.246:8888', 'http://2.56.179.115:3128', 'http://178.48.68.61:18080', 'http://98.8.195.160:443', 'http://38.156.73.56:8080', 'http://54.210.223.246:8888', 'http://101.255.210.2:1111', 'http://37.46.62.98:8080', 'http://67.43.228.253:8247', 'http://147.45.45.71:8481', 'http://62.33.53.248:3128', 'http://31.56.78.197:8080', 'http://103.148.45.184:8080', 'http://157.20.244.77:8080', 'http://188.124.230.43:32199', 'http://103.153.246.129:3125', 'http://124.121.2.152:8080', 'http://95.215.8.225:3128', 'http://103.105.78.137:8080', 'http://77.242.98.39:8080', 'http://67.43.227.226:2499', 'http://202.145.10.251:8080', 'http://193.178.203.140:8080', 'http://67.43.227.227:3653', 'http://38.183.146.31:8080', 'http://103.156.86.76:8080', 'http://103.26.108.118:83', 'http://200.63.107.118:8089', 'http://35.73.28.87:3128', 'http://189.125.109.66:3128', 'http://72.10.160.93:17819', 'http://67.43.227.226:4309', 'http://103.154.77.73:89', 'http://128.199.254.13:9090', 'http://91.205.69.126:8080', 'http://143.107.199.248:8080', 'http://177.23.176.58:8080', 'http://67.43.236.20:3333', 'http://103.211.26.94:22', 'http://152.70.235.185:9002', 'http://67.43.236.19:20217', 'http://121.227.203.189:8089', 'http://114.223.62.84:8089', 'http://121.227.203.169:8089', 'http://119.8.182.222:3128', 'http://54.212.22.168:1080', 'http://72.10.164.178:11989', 'http://139.84.155.98:3129', 'http://122.53.59.191:8082', 'http://103.253.127.186:3125', 'http://92.45.196.83:3310', 'http://67.43.227.227:26905', 'http://185.138.120.109:8080', 'http://67.43.236.21:19573', 'http://188.132.221.170:8080', 'http://121.227.203.154:8089', 'http://78.187.15.92:3310', 'http://79.106.79.154:8989', 'http://67.43.228.250:10149', 'http://78.189.148.73:3310', 'http://160.202.42.156:8080', 'http://72.10.160.170:23177', 'http://72.10.164.178:12437', 'http://115.186.185.6:3128', 'http://103.91.206.107:8805', 'http://67.43.227.228:16327', 'http://170.239.205.188:999', 
'http://113.192.3.42:8050', 'http://5.9.198.34:55555', 'http://124.6.168.26:8282', 'http://67.43.236.20:21331', 'http://36.50.11.198:8080', 'http://203.150.128.72:8080', 'http://92.60.190.79:3128', 'http://202.12.245.133:8080', 'http://72.10.160.90:13427', 'http://103.169.41.236:8080', 'http://43.243.140.58:10001', 'http://169.239.85.121:8080', 'http://103.165.157.79:8090', 'http://179.109.156.19:8080', 'http://103.141.105.74:55', 'http://177.234.192.14:999', 'http://45.123.142.69:8181', 'http://121.227.203.190:8089', 'http://103.242.107.226:8098', 'http://103.42.228.56:8080', 'http://169.159.128.116:8082', 'http://103.220.23.217:8080', 'http://103.189.254.2:8080', 'http://103.172.120.219:9999', 'http://103.105.78.10:3125', 'http://154.6.189.35:3128', 'http://43.136.68.232:8888', 'http://14.232.192.26:10947', 'http://177.39.139.14:9999', 'http://222.127.55.155:8082', 'http://103.111.207.138:32650', 'http://103.127.220.74:8181', 'http://103.80.98.16:8080', 'http://67.43.228.250:16373', 'http://103.67.85.74:8080', 'http://154.0.157.103:8080', 'http://103.177.235.207:84', 'http://103.115.242.194:8080', 'http://103.170.22.137:8089', 'http://78.187.45.169:3310', 'http://78.187.125.78:3310', 'http://119.82.242.200:8080', 'http://45.174.248.11:999', 'http://178.214.80.28:1981', 'http://103.159.96.146:3128', 'http://103.180.119.182:8087', 'http://67.43.228.254:20267', 'http://103.172.249.234:80', 'http://125.99.106.250:3128', 'http://80.78.75.80:8080', 'http://103.166.9.110:8080', 'http://190.97.226.44:999', 'http://178.252.136.27:8080', 'http://72.10.160.173:14783', 'http://121.227.203.191:8089', 'http://80.248.77.125:8081', 'http://103.75.96.70:8080', 'http://103.81.158.130:8080', 'http://131.196.219.128:8080', 'http://103.153.191.209:8080', 'http://103.36.8.37:8080', 'http://103.151.140.124:10609', 'http://110.136.61.7:8080', 'http://187.111.144.102:8080', 'http://103.86.117.53:1080', 'http://103.46.10.27:7777', 'http://45.178.55.2:999', 'http://190.82.105.122:43949', 
'http://139.219.239.14:8080', 'http://177.11.190.84:8080', 'http://38.51.188.31:999', 'http://120.28.137.232:8082', 'http://61.247.185.50:8080', 'http://193.105.123.195:8123', 'http://38.159.229.98:999', 'http://112.19.241.37:19999', 'http://181.209.125.186:999', 'http://94.75.76.10:8080', 'http://72.10.160.173:24721', 'http://77.235.31.24:8080', 'http://67.43.236.20:20217', 'http://185.191.236.162:3128', 'http://78.187.9.206:3310', 'http://181.78.99.31:8080', 'http://103.36.10.223:8080', 'http://103.118.175.42:8080', 'http://203.153.121.130:8080', 'http://103.172.42.177:1111', 'http://190.94.212.82:999', 'http://103.154.91.250:8081', 'http://188.136.143.85:7060', 'http://188.132.222.4:8080', 'http://136.233.136.41:48976', 'http://103.217.213.124:32650', 'http://182.253.26.196:8080', 'http://202.5.60.46:5020', 'http://72.10.164.178:19695', 'http://202.51.214.134:8080', 'http://223.206.51.172:8080', 'http://102.164.252.150:8080', 'http://101.255.208.246:7888', 'http://103.154.77.204:89', 'http://102.0.9.114:8080', 'http://103.247.21.44:1111', 'http://103.70.79.3:8080', 'http://103.155.196.144:8080', 'http://43.130.15.168:3128', 'http://203.142.71.50:8080', 'http://38.50.165.51:999', 'http://38.50.165.56:999', 'http://103.59.44.213:8083', 'http://103.111.207.138:80', 'http://58.209.139.175:8089', 'http://103.46.11.27:7777', 'http://102.67.101.242:8080', 'http://103.158.121.38:8080', 'http://200.59.10.50:999', 'http://208.87.243.199:7878', 'http://51.75.86.68:3128', 'http://103.169.187.29:3125', 'http://150.107.245.121:8080', 'http://202.137.122.4:8082', 'http://213.149.182.98:8080', 'http://103.169.254.45:6080', 'http://190.186.1.126:999', 'http://186.167.80.234:8090', 'http://31.56.78.137:8080', 'http://200.10.28.185:8083', 'http://190.211.250.132:999', 'http://111.1.61.47:3128', 'http://181.209.82.195:999', 'http://103.190.171.213:8181', 'http://90.156.194.75:8026', 'http://190.95.202.210:999', 'http://67.43.227.226:3925', 'http://180.178.95.142:8080', 
'http://103.126.86.29:9090', 'http://54.37.207.54:3128', 'http://101.255.208.238:7888', 'http://43.230.197.213:8080', 'http://14.177.236.212:55443', 'http://103.162.63.198:8181', 'http://126.209.9.30:8080', 'http://112.78.131.6:8080', 'http://175.106.10.227:7878', 'http://157.66.16.43:8070', 'http://203.210.84.100:8182', 'http://78.130.246.65:1881', 'http://41.203.213.211:8105', 'http://181.78.79.97:999', 'http://157.66.84.17:8080', 'http://96.0.147.177:443', 'http://109.201.13.186:8080', 'http://165.16.27.109:1981', 'http://190.52.97.27:999', 'http://77.228.182.122:8080', 'http://122.50.6.186:80', 'http://114.130.175.18:8080', 'http://186.0.144.141:9595', 'http://103.247.14.103:1111', 'http://103.151.47.90:3127', 'http://182.253.216.39:1080', 'http://121.200.48.58:8080', 'http://222.252.194.204:8080', 'http://202.57.25.91:1111', 'http://103.157.78.162:8080', 'http://124.217.39.83:8080', 'http://103.39.247.205:8080', 'http://101.251.204.174:8080', 'http://213.169.33.7:8001', 'http://38.43.150.69:999', 'http://72.10.160.90:20949', 'http://165.16.27.42:1981', 'http://171.4.116.113:8080', 'http://51.159.159.73:80', 'http://171.6.88.79:8080', 'http://124.106.173.56:8082', 'http://103.227.186.69:6080', 'http://171.228.130.95:26639', 'http://131.100.51.247:999', 'http://122.51.39.108:20051', 'http://27.131.250.150:8080', 'http://38.52.221.186:999', 'http://47.90.205.231:33333', 'http://182.253.140.250:8080', 'http://183.88.241.167:8080', 'http://103.151.17.201:8080', 'http://157.100.57.180:999', 'http://105.28.176.41:9812', 'http://103.48.70.57:83', 'http://179.1.110.88:8080', 'http://45.153.165.67:999', 'http://103.162.221.3:8080', 'http://177.154.37.197:9090', 'http://103.63.26.231:8080', 'http://187.103.105.22:8999', 'http://101.37.12.43:1080', 'http://103.211.107.62:8080', 'http://157.15.144.250:8080', 'http://103.190.78.12:8082', 'http://167.86.99.29:3128', 'http://103.180.118.138:1111', 'http://46.243.9.113:8080', 'http://126.209.2.2:8081', 
'http://121.101.135.46:8089', 'http://103.143.169.9:84', 'http://186.96.97.203:999', 'http://67.43.236.19:12719', 'http://103.48.160.42:96', 'http://45.190.76.115:999', 'http://103.173.72.97:1111', 'http://103.152.100.221:8080', 'http://170.81.77.132:2222', 'http://45.167.23.31:999', 'http://89.135.59.71:8090', 'http://190.95.183.242:2020', 'http://67.43.236.22:14019', 'http://145.40.97.148:9401', 'http://103.88.90.117:8080', 'http://46.161.196.10:8077', 'http://122.51.39.108:20021', 'http://58.209.137.233:8089', 'http://204.157.185.4:999', 'http://217.66.215.86:8080', 'http://36.92.60.234:8080', 'http://154.70.135.87:8080', 'http://101.255.165.130:1111', 'http://24.49.117.86:8888', 'http://38.159.232.148:999', 'http://103.124.138.76:1111', 'http://114.130.153.70:58080', 'http://103.69.60.8:8080', 'http://103.145.149.36:8080', 'http://103.155.168.88:8299', 'http://129.151.233.36:3128', 'http://103.153.96.100:8181', 'http://103.172.42.227:8080', 'http://110.78.186.81:8080', 'http://103.193.144.123:8080', 'http://216.229.112.25:8080', 'http://103.186.204.52:8089', 'http://103.144.18.86:8080', 'http://103.160.182.33:8080', 'http://38.156.73.229:8080', 'http://201.77.98.131:999', 'http://180.180.151.65:8080', 'http://120.28.216.166:8081', 'http://180.191.36.128:8181', 'http://103.123.168.202:3932', 'http://202.179.69.216:58080', 'http://200.24.153.151:999', 'http://38.45.45.61:999', 'http://102.214.165.241:1981', 'http://170.239.205.185:999', 'http://103.148.130.37:8090', 'http://190.52.100.8:999', 'http://88.80.150.3:8080', 'http://31.43.52.216:41890', 'http://112.198.131.71:8082', 'http://58.136.170.59:8080', 'http://145.40.97.148:9400', 'http://103.180.118.207:7777', 'http://190.97.238.82:999', 'http://13.234.24.116:3128', 'http://103.133.26.106:8080', 'http://113.192.31.135:8080', 'http://46.161.194.88:8085', 'http://103.234.31.0:8080', 'http://200.34.227.28:8080', 'http://79.127.118.57:8080', 'http://45.179.71.90:3180', 'http://43.249.143.242:3128', 
'http://58.208.159.211:8089', 'http://154.90.49.84:9090', 'http://103.159.46.45:83', 'http://45.5.116.145:999', 'http://103.156.74.154:8080', 'http://181.78.86.105:999', 'http://41.33.219.131:1981', 'http://119.95.237.19:8080', 'http://1.0.170.50:8080', 'http://217.52.247.85:1976']
request_urls = [f'https://pinside.com/pinball/machine/{pinside_id}' for pinside_id in range(2, 12)]


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
        proxy_configuration=ProxyConfiguration(proxy_urls=proxy_urls),
        use_session_pool=False,  # don't use sessions, start every fetch fresh
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        if context.request.url != context.response.url:
            context.log.warning(f'Success - {context.request.url} redirected to {context.response.url}')
        else:
            context.log.info(f'Failure - Request URL: {context.request.url} Proxy: {context.proxy_info}')
            context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(request_urls)


if __name__ == '__main__':
    asyncio.run(main())
Summary Output
Complete Output
|
Hi @matecsaj, I used your range of URLs with
import asyncio

from camoufox import AsyncNewBrowser
from typing_extensions import override

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

URLS = [f'https://pinside.com/pinball/machine/{pinside_id}' for pinside_id in range(2, 12)]


class CamoufoxPlugin(PlaywrightBrowserPlugin):
    @override
    async def new_browser(self) -> PlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        return PlaywrightBrowserController(
            browser=await AsyncNewBrowser(self._playwright, **self._browser_launch_options),
            max_open_pages_per_browser=1,
            header_generator=None,  # This turns off the crawlee header_generation. Camoufox has its own.
        )


async def main() -> None:
    browser_pool = BrowserPool(plugins=[CamoufoxPlugin()])
    crawler = PlaywrightCrawler(browser_pool=browser_pool)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info(f'Response URL actual: {context.response.url}')
        context.log.info(f'HTML {(await context.page.content())[:1000]}')

    await crawler.run(URLS)


if __name__ == '__main__':
    asyncio.run(main())
I wasn't blocked and didn't even need to use proxies. Here's the output:
The issue likely lies with the proxies you're using. When I tested them, I also encountered frequent blocks and timeouts. Output:
These proxies appear to be of low quality and may already be blacklisted by anti-bot/scraping systems. |
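Given that diagnosis, one option is to weed out dead proxies before handing the list to ProxyConfiguration(proxy_urls=...). Below is a minimal sketch under stated assumptions: `filter_working_proxies` and the `probe` callback are hypothetical names, not Crawlee APIs. The `probe` coroutine is something you would supply yourself, for example a short HTTP request through the proxy with a tight timeout, returning True on success.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def filter_working_proxies(
    proxy_urls: Iterable[str],
    probe: Callable[[str], Awaitable[bool]],
    max_concurrency: int = 20,
) -> List[str]:
    """Return the proxies for which probe(url) succeeds, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check(url: str) -> bool:
        async with semaphore:
            try:
                return bool(await probe(url))
            except Exception:
                # Any probe error (refused, timed out, ...) counts as a dead proxy.
                return False

    urls = list(proxy_urls)
    results = await asyncio.gather(*(check(url) for url in urls))
    return [url for url, ok in zip(urls, results) if ok]
```

A proxy that passes such a probe can still be blacklisted by the target site, so this only removes the ones that are unreachable outright.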
Thank you, @vdusek, for identifying the root cause. I apologize for mistakenly reporting this as an issue with Crawlee and for any time I may have wasted. |
Run this, switch to Firefox, and run again. Chrome does not follow the JavaScript redirect as it should. The target website is sensitive to bots; you might find it necessary to add a proxy.
I have a vague memory of solving this exact problem on an old project; here is a snippet of the code.