-
Notifications
You must be signed in to change notification settings - Fork 502
fix: Fix BasicCrawler
statistics persistance
#1490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
55e7316
to
ebc350d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised we use the SDK_CRAWLER_STATISTICS_...
key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING
, __RQ_STATE_...
) for other cases. Could we update the key name, please?
crawler = HttpCrawler( | ||
configuration=configuration, | ||
storage_client=storage_client, | ||
) | ||
service_locator.set_configuration(configuration) | ||
service_locator.set_storage_client(storage_client) | ||
|
||
crawler = HttpCrawler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because RecoverableState
of statistics persists to/recovers from global storage_client
. And since statistics is persisted by default now, it will try to persist to default global service_client, which is FileSystem... regardless of the crawler-specific storage_client
Mentioned here:
#1438 (comment)
I am open to discussion about this.
self._statistics = statistics or cast( | ||
'Statistics[TStatisticsState]', | ||
Statistics.with_default_state( | ||
persistence_enabled=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose changing the default values (persistence_enabled: bool = True
) is a no-go in patch releases, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it is changing the default value of the internal attribute... since we consider previous behavior a bug, this is probably ok?
for context in contexts_to_enter: | ||
await exit_stack.enter_async_context(context) # type: ignore[arg-type] | ||
|
||
await self._autoscaled_pool.run() | ||
self._crawler_state_rec_task.start() | ||
try: | ||
await self._autoscaled_pool.run() | ||
finally: | ||
await self._crawler_state_rec_task.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe RecurringTask
could also be an async context manager (two options of usage - start
/stop
and aenter
/aexit
) and then it can be just another contexts_to_enter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
self._service_locator.get_event_manager().emit( | ||
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._service_locator.get_event_manager().emit( | |
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False) | |
) | |
event_manager.emit( | |
event=Event.PERSIST_STATE, | |
event_data=EventPersistStateData(is_migrating=False), | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# Emit PERSIST_STATE event when crawler is finishing to allow listeners to persist their state if needed | ||
if not self.statistics.state.crawler_last_started_at: | ||
raise RuntimeError('Statistics.state.crawler_last_started_at not set.') | ||
run_duration = datetime.now(timezone.utc) - self.statistics.state.crawler_last_started_at | ||
self._statistics.state.crawler_runtime = self.statistics.state.crawler_runtime + run_duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to belong here - this part handles high-level crawler behavior, so computing the run duration directly here feels wrong - Maybe move it to a separate method or to Statistics
?
TODO: Figure out reason for stats difference in request_total_finished_duration
Description
BasicCrawler
is persisting statistics by default.BasicCrawler
is recovering existing statistics by default ifConfiguration.purge_on_start
is False.BasicCrawler
emitEvent.PERSIST_STATE
when finishing.Issues
Testing