Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Sitemap-based request list implementation #2498

Merged
merged 27 commits into from
Jul 4, 2024
Merged

Conversation

janbuchar
Copy link
Contributor

@janbuchar janbuchar commented May 24, 2024

This introduces an alternative RequestList implementation based on sitemaps. It should be possible to use this in tandem with RequestProvider in BasicCrawler, just like with the current RequestList.

In the future, this will make it possible to start crawling before the sitemap is finished loading.

TODO

  • make sure that the API is usable in actors that require this
  • update examples, docs

@janbuchar janbuchar requested review from B4nan and barjin May 24, 2024 09:56
@github-actions github-actions bot added this to the 90th sprint - Tooling team milestone May 24, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 24, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few initial comments

packages/core/src/storages/request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
@janbuchar janbuchar requested a review from B4nan May 31, 2024 15:10
@@ -444,7 +444,7 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
* A reference to the underlying {@apilink RequestList} class that manages the crawler's {@apilink Request|requests}.
* Only available if used by the crawler.
*/
requestList?: RequestList;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this is breaking for anybody who extends BasicCrawler and accesses this. But anyone who does that should be able to deal with the change IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@B4nan is this cool?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess it is fine, by breaking you mean only if they would depend on a specific API that is not part of the new interface, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. It seems improbable to me that anyone would do that, but life is full of surprises.

@@ -79,52 +285,6 @@ class SitemapTxtParser extends Writable {
export class Sitemap {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has become pretty hollow. I'd welcome any opinions on how to make it useful. Or can we remove it?

@janbuchar janbuchar marked this pull request as ready for review May 31, 2024 15:25
@janbuchar janbuchar requested a review from barjin June 3, 2024 13:57
Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one idea, approving as long as you can say the migrations on platform are working as expected

packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
@B4nan B4nan marked this pull request as draft June 6, 2024 13:40
@barjin
Copy link
Contributor

barjin commented Jun 12, 2024

How's this looking? Anything we can help with?

@janbuchar
Copy link
Contributor Author

How's this looking? Anything we can help with?

Well, I want to add timeouts/cancellation. Also, we need to test if the SitemapRequestList survives migration. So if you'd like to do any of that, that would be super awesome. Otherwise, I think I'll finish it next week, or at the end of this one if I'm very lucky.

@barjin
Copy link
Contributor

barjin commented Jun 26, 2024

Alright, ready for the next round of reviews!

I simplified the parsing logic quite a lot (in my eyes) - in SitemapRequestList, there is now just one queue of parsed URLs and we're just keeping track of the remaining sitemaps to process (this is persisted on migrations).

9e4a620 adds a new helper method waitForNextRequest - basically an async generator for fetchNextRequest(). Blocks until there is a new request parsed from the sitemap (or ends if all sitemaps have been loaded and all URLs have been handled). It's implemented with active waiting, I'm not too happy about that, would be nice to have something better there.

b1498a8 adds signal and timeoutMillis options - signal expects AbortSignal (see usage here), timeoutMillis expect a number (see usage here). Both options only apply to the sitemap download / parsing - if any of them are triggered, the user is left with a SitemapRequestList containing incomplete contents of the sitemap. Once you hit isFinished() === true (you mark all the requests as handled), you can check isSitemapFullyLoaded() - if this is false, the timeout / abort has cut the download + parsing short, if isSitemapFullyLoaded() === true, you have handled all the URLs from the sitemap.

@barjin barjin self-assigned this Jun 26, 2024
packages/core/src/storages/request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
packages/utils/src/internals/sitemap.ts Show resolved Hide resolved
packages/utils/src/internals/sitemap.ts Show resolved Hide resolved
packages/core/src/storages/sitemap_request_list.ts Outdated Show resolved Hide resolved
@barjin barjin marked this pull request as ready for review June 27, 2024 11:53
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Jul 3, 2024
@barjin
Copy link
Contributor

barjin commented Jul 3, 2024

Alright, time for the (yet another) final review!

My previous comment should provide enough guidance for the top-level ideas.

@barjin barjin requested a review from B4nan July 3, 2024 15:31
@barjin barjin merged commit 7bf8f0b into master Jul 4, 2024
9 checks passed
@barjin barjin deleted the sitemap-request-list branch July 4, 2024 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for all tags defined by the sitemap protocol
3 participants