
Conversation

@Pijukatel (Collaborator) commented Oct 15, 2025

Description

  • Ensure that BasicCrawler persists statistics by default.
  • Ensure that BasicCrawler recovers existing statistics by default when Configuration.purge_on_start is False.
  • Make BasicCrawler emit Event.PERSIST_STATE when finishing.

Issues

Testing

@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 15, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Oct 15, 2025
@Pijukatel Pijukatel force-pushed the crawler-persistance branch from 55e7316 to ebc350d on October 15, 2025 14:31
@Pijukatel Pijukatel requested review from janbuchar and vdusek October 16, 2025 13:27
@Pijukatel Pijukatel marked this pull request as ready for review October 16, 2025 13:27
@vdusek (Collaborator) left a comment


I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?

Comment on lines -44 to +47
crawler = HttpCrawler(
configuration=configuration,
storage_client=storage_client,
)
service_locator.set_configuration(configuration)
service_locator.set_storage_client(storage_client)

crawler = HttpCrawler()
Collaborator

Why?

Collaborator Author

This is because the RecoverableState of statistics persists to, and recovers from, the global storage_client. And since statistics are now persisted by default, they would be persisted to the default global storage_client, which is FileSystem... regardless of the crawler-specific storage_client.

Mentioned here:
#1438 (comment)

I am open to discussion about this.
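A toy sketch of the issue (hypothetical names only, not Crawlee's actual classes): when persistence always goes through a globally registered service, a client passed only to the crawler constructor never takes effect, which is why the test now registers services via the locator first.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the global service-locator pattern.
@dataclass
class ServiceLocator:
    storage_client: str = "file_system"  # global default

services = ServiceLocator()

class RecoverableState:
    """Persists through the *global* locator, ignoring per-crawler clients."""
    def persist_target(self) -> str:
        return services.storage_client

state = RecoverableState()
# A client passed only to the crawler constructor never reaches this path:
assert state.persist_target() == "file_system"

# Registering the client globally, before building the crawler, does:
services.storage_client = "memory"
assert state.persist_target() == "memory"
```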

self._statistics = statistics or cast(
'Statistics[TStatisticsState]',
Statistics.with_default_state(
persistence_enabled=True,
Collaborator

I suppose changing the default values (persistence_enabled: bool = True) is a no-go in patch releases, right?

Collaborator Author

Well, it is changing the default value of an internal attribute... and since we consider the previous behavior a bug, this is probably OK?

Comment on lines 742 to 747
for context in contexts_to_enter:
await exit_stack.enter_async_context(context) # type: ignore[arg-type]

await self._autoscaled_pool.run()
self._crawler_state_rec_task.start()
try:
await self._autoscaled_pool.run()
finally:
await self._crawler_state_rec_task.stop()
Collaborator

Maybe RecurringTask could also be an async context manager (two usage options: start/stop and aenter/aexit), and then it could be just another entry in contexts_to_enter.
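A rough sketch of that idea (hypothetical shape; the real RecurringTask signature may differ): keep start()/stop() and layer the async-context-manager protocol on top, so the task can be handed to an exit stack like any other context.

```python
import asyncio
from typing import Callable, List, Optional

class RecurringTask:
    """Sketch: a periodic task usable via start()/stop() *or* `async with`."""

    def __init__(self, func: Callable[[], None], delay: float) -> None:
        self._func = func
        self._delay = delay
        self._task: Optional[asyncio.Task] = None

    async def _loop(self) -> None:
        while True:
            await asyncio.sleep(self._delay)
            self._func()

    def start(self) -> None:
        self._task = asyncio.get_running_loop().create_task(self._loop())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None

    # The context-manager face, so it can join contexts_to_enter:
    async def __aenter__(self) -> 'RecurringTask':
        self.start()
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        await self.stop()

async def main() -> List[int]:
    calls: List[int] = []
    async with RecurringTask(lambda: calls.append(1), delay=0.01):
        await asyncio.sleep(0.05)
    return calls

calls = asyncio.run(main())
assert calls, 'the task fired at least once inside the context'
```

The `try/except CancelledError` in stop() makes shutdown clean whether the task is mid-sleep or mid-call.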

Collaborator Author

Done

Comment on lines 754 to 756
self._service_locator.get_event_manager().emit(
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False)
)
Collaborator

Suggested change
self._service_locator.get_event_manager().emit(
event=Event.PERSIST_STATE, event_data=EventPersistStateData(is_migrating=False)
)
event_manager.emit(
event=Event.PERSIST_STATE,
event_data=EventPersistStateData(is_migrating=False),
)

Collaborator Author

Done

Comment on lines 749 to 753
# Emit PERSIST_STATE event when crawler is finishing to allow listeners to persist their state if needed
if not self.statistics.state.crawler_last_started_at:
raise RuntimeError('Statistics.state.crawler_last_started_at not set.')
run_duration = datetime.now(timezone.utc) - self.statistics.state.crawler_last_started_at
self._statistics.state.crawler_runtime = self.statistics.state.crawler_runtime + run_duration
Collaborator

This doesn't seem to belong here. This part handles high-level crawler behavior, so computing the run duration directly here feels wrong. Maybe move it to a separate method, or to Statistics?
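One possible shape for that refactor, sketched with hypothetical method names (not the actual Statistics API): the start/end bookkeeping lives on Statistics, and the crawler only calls into it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical helper; the real Statistics/StatisticsState API may differ.
class StatisticsState:
    def __init__(self) -> None:
        self.crawler_last_started_at: Optional[datetime] = None
        self.crawler_runtime = timedelta()

class Statistics:
    def __init__(self) -> None:
        self.state = StatisticsState()

    def record_run_start(self) -> None:
        self.state.crawler_last_started_at = datetime.now(timezone.utc)

    def record_run_end(self) -> None:
        """Fold the just-finished run into the accumulated runtime."""
        if self.state.crawler_last_started_at is None:
            raise RuntimeError('Statistics.state.crawler_last_started_at not set.')
        self.state.crawler_runtime += (
            datetime.now(timezone.utc) - self.state.crawler_last_started_at
        )

stats = Statistics()
stats.record_run_start()
stats.record_run_end()
assert stats.state.crawler_runtime >= timedelta(0)
```

This keeps the RuntimeError guard next to the state it guards, so the crawler's run method stays free of duration arithmetic.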

TODO: Figure out reason for stats difference in request_total_finished_duration



Development

Successfully merging this pull request may close these issues.

Fix Crawler on migration not remembering statistics
