Bug? PlaywrightCrawler enqueueLinks fails after WWW redirect. #2513

Open
1 task
obsidience opened this issue Jun 5, 2024 · 1 comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@obsidience

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

Hi all,

Is the following a bug? context.enqueueLinks seems to fail when the URL being crawled goes through a WWW redirect. When this happens, the selector appears to extract the URLs successfully, but the "createFilteredRequests" call in enqueue_links.js then uses an enqueueStrategyPattern of "{glob: 'http{s,}://domain.com/**'}". Because that glob has no WWW prefix, the extracted URLs don't match it and the enqueue fails.
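To illustrate what I think is happening, here is a minimal sketch of the mismatch. It is a simplified stand-in for the glob filter, not Crawlee's actual code: the function name matchesSameHostname and the example comment URL are made up for illustration, and I'm assuming the filter is built from the URL that was originally requested.

```javascript
// Simplified stand-in for the hostname-based filter; NOT Crawlee's actual code.
// Assumption: the filter is derived from the originally requested URL.
function matchesSameHostname(requestedUrl, candidateUrl) {
	const requested = new URL(requestedUrl);
	const candidate = new URL(candidateUrl);
	// Mirrors the glob 'http{s,}://domain.com/**': http or https, exact hostname.
	const isHttp = candidate.protocol === 'http:' || candidate.protocol === 'https:';
	return isHttp && candidate.hostname === requested.hostname;
}

const requestedUrl = 'https://reddit.com/r/legal'; // no "www."
// Hypothetical link found on the page after the redirect to www.
const extractedLink = 'https://www.reddit.com/r/legal/comments/abc123/example/';

const hostnameMatch = matchesSameHostname(requestedUrl, extractedLink);
console.log(hostnameMatch); // false
```

The link is rejected only because "www.reddit.com" is not literally "reddit.com", even though the selector found it.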

This looks like it may be caused by resolveBaseUrlForEnqueueLinksFiltering() in enqueue_links.js, which assumes that "same origin" is the sanest default. Wouldn't "same domain" be a saner default for a typical crawler, given that http->https and www redirects are common?
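For comparison, a looser check along the lines of "same domain" would accept the redirected links. The sketch below is only a rough approximation that ignores a leading "www."; a real same-domain comparison would need to work out the registrable domain properly. The helper names stripWww and matchesIgnoringWww and the example URLs are hypothetical. I believe enqueueLinks also accepts a `strategy` option (for example 'same-domain') that may serve as a workaround, but I have not verified that against this case.

```javascript
// Rough approximation of a "same domain" check; NOT Crawlee's implementation.
// It only ignores a leading "www." and does not compute the registrable domain.
function stripWww(hostname) {
	return hostname.replace(/^www\./i, '');
}

function matchesIgnoringWww(requestedUrl, candidateUrl) {
	const requested = new URL(requestedUrl);
	const candidate = new URL(candidateUrl);
	const isHttp = candidate.protocol === 'http:' || candidate.protocol === 'https:';
	return isHttp && stripWww(candidate.hostname) === stripWww(requested.hostname);
}

const startUrl = 'https://reddit.com/r/legal'; // no "www."
// Hypothetical links for illustration.
const wwwLink = 'https://www.reddit.com/r/legal/comments/abc123/example/';
const otherSiteLink = 'https://example.com/r/legal';

const wwwAccepted = matchesIgnoringWww(startUrl, wwwLink);
const otherSiteAccepted = matchesIgnoringWww(startUrl, otherSiteLink);
console.log(wwwAccepted); // true
console.log(otherSiteAccepted); // false
```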

Thanks for your help!

Code sample

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
	async requestHandler(context) {
		await context.enqueueLinks({
			selector: 'a[slot="full-post-link"]', // fails
			//globs: ['**/comments/**'], // succeeds
		});
	},
	headless: false,
	launchContext: {
		launchOptions: {
			slowMo: 500,
		},
	},
});

await crawler.run(['https://reddit.com/r/legal']); // note: this is missing "www."

Package version

3.10.2

Node.js version

20.13.1

Operating system

Win11

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@obsidience obsidience added the bug Something isn't working. label Jun 5, 2024
@B4nan B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label Jun 5, 2024
@toanphan19

Hi, we are having the same issue. According to the docs, the default configuration should not filter out links to the same hostname but a different subdomain (www). Hope this gets fixed soon.
