Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
Hi all,
Is the following a bug? I'm noticing that context.enqueueLinks seems to fail when the URL being crawled goes through a WWW redirect. When this occurs, the selector appears to extract the URLs successfully, but a "createFilteredRequests" call within enqueue_links.js uses an enqueueStrategyPattern of "{glob: 'http{s,}://domain.com/**'}" and, because the glob has no WWW prefix, the extracted URLs don't match it and are filtered out.
It looks like this may be caused by resolveBaseUrlForEnqueueLinksFiltering() in enqueue_links.js, which assumes that "same origin" is the sanest option. Wouldn't "same domain" be a saner default for a typical crawler, since http->https and www redirects are common?
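To illustrate the distinction I mean, here is a minimal sketch using only the built-in URL class. The helper names (sameOrigin, stripWww, sameDomainIgnoringWww) are hypothetical and are not Crawlee's actual code. Stripping a leading "www." is a simplification of a real "same domain" check, and the extracted comment URL is a made-up example.

```javascript
// Hypothetical helpers for illustration only; not Crawlee's implementation.

// Strict check: protocol, hostname and port must all be identical.
function sameOrigin(a, b) {
    return new URL(a).origin === new URL(b).origin;
}

// Simplified "same domain" check: ignores the protocol and a leading "www." label.
function stripWww(hostname) {
    return hostname.replace(/^www\./, '');
}

function sameDomainIgnoringWww(a, b) {
    return stripWww(new URL(a).hostname) === stripWww(new URL(b).hostname);
}

// The start URL as requested (no "www.") and a made-up link extracted after the redirect.
const requested = 'https://reddit.com/r/legal';
const extracted = 'https://www.reddit.com/r/legal/comments/abc123/example/';

console.log(sameOrigin(requested, extracted)); // false
console.log(sameDomainIgnoringWww(requested, extracted)); // true
```

Under the strict comparison the link extracted after the redirect is rejected, while the looser comparison accepts it, which is the behaviour I would have expected by default.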
Thanks for your help!
Code sample
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler(context) {
        await context.enqueueLinks({
            selector: 'a[slot="full-post-link"]', // fails
            //globs: ['**/comments/**'], // succeeds
        });
    },
    headless: false,
    launchContext: {
        launchOptions: {
            slowMo: 500,
        },
    },
});

await crawler.run(['https://reddit.com/r/legal']); // note: this is missing "www."
Package version
3.10.2
Node.js version
20.13.1
Operating system
Win11
Apify platform
Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Hi, we are having the same issue as well. According to the docs, the default configuration should not filter out links to the same hostname but a different subdomain (www). Hope this gets fixed soon.