
Conversation

@Pijukatel (Collaborator) commented Oct 15, 2025

Description

  • Ensure that BasicCrawler persists statistics by default.
  • Ensure that BasicCrawler recovers existing statistics by default when Configuration.purge_on_start is False.
  • Make BasicCrawler emit Event.PERSIST_STATE when finishing.

Issues

Testing

@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 15, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Oct 15, 2025
@Pijukatel Pijukatel force-pushed the crawler-persistance branch from 55e7316 to ebc350d on October 15, 2025 14:31
@Pijukatel Pijukatel requested review from janbuchar and vdusek October 16, 2025 13:27
@Pijukatel Pijukatel marked this pull request as ready for review October 16, 2025 13:27
@vdusek (Collaborator) left a comment


I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?

Comment on lines -44 to +47
crawler = HttpCrawler(
configuration=configuration,
storage_client=storage_client,
)
service_locator.set_configuration(configuration)
service_locator.set_storage_client(storage_client)

crawler = HttpCrawler()
Collaborator

Why?

Collaborator Author

This is because the RecoverableState of statistics persists to, and recovers from, the global storage_client. And since statistics are now persisted by default, they would be persisted to the default global storage_client, which is FileSystem... regardless of the crawler-specific storage_client.

Mentioned here:
#1438 (comment)

I am open to discussion about this.
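A toy sketch of the issue (hypothetical names only, not Crawlee's actual classes): when persistence always goes through a globally registered service, a client passed only to the crawler constructor never takes effect, which is why the test now registers services via the locator first.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the global service-locator pattern.
@dataclass
class ServiceLocator:
    storage_client: str = "file_system"  # global default

services = ServiceLocator()

class RecoverableState:
    """Persists through the *global* locator, ignoring per-crawler clients."""
    def persist_target(self) -> str:
        return services.storage_client

state = RecoverableState()
# A client passed only to the crawler constructor never reaches this path:
assert state.persist_target() == "file_system"

# Registering the client globally, before building the crawler, does:
services.storage_client = "memory"
assert state.persist_target() == "memory"
```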

self._statistics = statistics or cast(
'Statistics[TStatisticsState]',
Statistics.with_default_state(
persistence_enabled=True,
Collaborator

I suppose changing the default values (persistence_enabled: bool = True) is a no-go in patch releases, right?

Collaborator Author

Well, it is changing the default value of an internal attribute... and since we consider the previous behavior a bug, this is probably OK?

Comment on lines 742 to 747
for context in contexts_to_enter:
await exit_stack.enter_async_context(context) # type: ignore[arg-type]

await self._autoscaled_pool.run()
self._crawler_state_rec_task.start()
try:
await self._autoscaled_pool.run()
finally:
await self._crawler_state_rec_task.stop()
Collaborator

Maybe RecurringTask could also be an async context manager (two usage options: start/stop and aenter/aexit), and then it could be just another entry in contexts_to_enter.
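A rough sketch of that idea (hypothetical shape; the real RecurringTask signature may differ): keep start()/stop() and layer the async-context-manager protocol on top, so the task can be handed to an exit stack like any other context.

```python
import asyncio
from typing import Callable, List, Optional

class RecurringTask:
    """Sketch: a periodic task usable via start()/stop() *or* `async with`."""

    def __init__(self, func: Callable[[], None], delay: float) -> None:
        self._func = func
        self._delay = delay
        self._task: Optional[asyncio.Task] = None

    async def _loop(self) -> None:
        while True:
            await asyncio.sleep(self._delay)
            self._func()

    def start(self) -> None:
        self._task = asyncio.get_running_loop().create_task(self._loop())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

    # The context-manager face, so it can join contexts_to_enter:
    async def __aenter__(self) -> 'RecurringTask':
        self.start()
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        await self.stop()

async def main() -> List[int]:
    calls: List[int] = []
    async with RecurringTask(lambda: calls.append(1), delay=0.01):
        await asyncio.sleep(0.05)
    return calls

calls = asyncio.run(main())
assert calls, 'the task fired at least once inside the context'
```

The `try/except CancelledError` in stop() makes shutdown clean whether the task is mid-sleep or mid-call.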

Collaborator Author

Done

Comment on lines 754 to 756
self._service_locator.get_event_manager().emit(
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False)
)
Collaborator

Suggested change
self._service_locator.get_event_manager().emit(
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False)
)
event_manager.emit(
event=Event.PERSIST_STATE,
event_data=EventPersistStateData(is_migrating=False),
)

Collaborator Author

Done

Comment on lines 749 to 753
# Emit PERSIST_STATE event when crawler is finishing to allow listeners to persist their state if needed
if not self.statistics.state.crawler_last_started_at:
raise RuntimeError('Statistics.state.crawler_last_started_at not set.')
run_duration = datetime.now(timezone.utc) - self.statistics.state.crawler_last_started_at
self._statistics.state.crawler_runtime = self.statistics.state.crawler_runtime + run_duration
Collaborator

This doesn't seem to belong here. This part handles high-level crawler behavior, so computing the run duration directly here feels wrong. Maybe move it to a separate method, or to Statistics?
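One possible shape for that refactor, sketched with hypothetical method names (not the actual Statistics API): the start/end bookkeeping lives on Statistics, and the crawler only calls into it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical helper; the real Statistics/StatisticsState API may differ.
class StatisticsState:
    def __init__(self) -> None:
        self.crawler_last_started_at: Optional[datetime] = None
        self.crawler_runtime = timedelta()

class Statistics:
    def __init__(self) -> None:
        self.state = StatisticsState()

    def record_run_start(self) -> None:
        self.state.crawler_last_started_at = datetime.now(timezone.utc)

    def record_run_end(self) -> None:
        """Fold the just-finished run into the accumulated runtime."""
        if self.state.crawler_last_started_at is None:
            raise RuntimeError('Statistics.state.crawler_last_started_at not set.')
        self.state.crawler_runtime += (
            datetime.now(timezone.utc) - self.state.crawler_last_started_at
        )

stats = Statistics()
stats.record_run_start()
stats.record_run_end()
assert stats.state.crawler_runtime >= timedelta(0)
```

This keeps the RuntimeError guard next to the state it guards, so the crawler's run method stays free of duration arithmetic.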

TODO: Figure out reason for stats difference in request_total_finished_duration



Development

Successfully merging this pull request may close these issues.

Fix Crawler on migration not remembering statistics
