Recover faster after network outage #49

bajtos · 2023-12-14T17:00:00Z

#47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.

Workaround: restart the Station after coming online.

Proposed fix:

- Detect whether we are online (see ActivityState.#healthy).
- If we are offline, reduce the delay between iterations to something like 3-5 seconds.
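
A minimal sketch of how the delay could be chosen, to make the proposal concrete. The constant names and the `nextDelayMs` helper are hypothetical and do not exist in SPARK; the 60-second value comes from #47 and the 3-5 second range from the proposal above.

```javascript
// Hypothetical constants, not taken from the SPARK codebase.
const ONLINE_DELAY_MS = 60_000
const OFFLINE_MIN_DELAY_MS = 3_000
const OFFLINE_MAX_DELAY_MS = 5_000

/**
 * Pick the pause before the next iteration.
 * @param {boolean} healthy true when the last iteration reached spark-api
 * @param {() => number} random source of randomness in [0, 1), injectable for tests
 * @returns {number} delay in milliseconds
 */
function nextDelayMs (healthy, random = Math.random) {
  if (healthy) return ONLINE_DELAY_MS
  // While offline, retry quickly; spread retries across the 3-5 second range.
  const span = OFFLINE_MAX_DELAY_MS - OFFLINE_MIN_DELAY_MS
  return OFFLINE_MIN_DELAY_MS + Math.round(random() * span)
}
```

Picking a random value inside the range is only one option; a fixed short delay would satisfy the proposal equally well.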

bajtos · 2023-12-14T17:01:03Z

Possibly related:

SPARK does not resume checking after computer awakes from sleep #39

juliangruber · 2023-12-15T11:27:17Z

> #47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.
>
> Workaround: restart the Station after coming online.

I want to make sure I understand the problem statement. Why is it a problem to wait 60 seconds after having been offline? Isn't it ok to be offline, then wait 60 seconds, then try again? And why does restarting Station fix this?

bajtos · 2024-01-08T14:04:33Z

Here is what I observed:

Sometimes, when I wake my computer from sleep and connect to the network, I see that the Station icon in the tray/menubar indicates offline status.
I know my computer is online because I can browse the web, but the Station still stays offline.
When I restart the Station, it comes online almost immediately.

This behaviour creates an impression that the Station cannot correctly detect the transition of the computer from offline to online. (Personally, I perceive such behaviour as the app developers' sloppiness, and I don't want to perceive myself as a sloppy person.)

> why does restarting Station fix this?

IIUC, the Station decides whether we are offline or online based on the outcome of a SPARK iteration. The Station goes offline when SPARK cannot fetch round details or submit the measurement. When we are offline, and SPARK reports that it was able to fetch round details, we go back online.
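
To illustrate the transitions described above, here is a rough model. `HealthTracker` and its method names are made up for this example; the real `ActivityState` class may well look different.

```javascript
// Illustrative model only, not the actual ActivityState implementation.
class HealthTracker {
  #healthy = null // unknown until the first iteration finishes
  #onChange

  /** @param {(status: 'online' | 'offline') => void} onChange */
  constructor (onChange) {
    this.#onChange = onChange
  }

  get healthy () {
    return this.#healthy
  }

  // Call when SPARK managed to fetch round details and submit the measurement.
  onIterationSucceeded () {
    this.#set(true)
  }

  // Call when fetching round details or submitting the measurement failed.
  onIterationFailed () {
    this.#set(false)
  }

  #set (value) {
    if (this.#healthy === value) return // report transitions only
    this.#healthy = value
    this.#onChange(value ? 'online' : 'offline')
  }
}
```

The point of the model: the status can only change when an iteration finishes, so the time to notice that the network is back is bounded by the delay between iterations.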

This worked well when the delay between jobs was ~10 seconds. It no longer works with the current ~60-second delay because it can take up to 60 seconds before Station/SPARK can detect that we are back online.

When I restart the Station, SPARK starts the next job immediately and therefore the Station quickly transitions to the online status.

Here is the main SPARK loop:

https://github.com/filecoin-station/spark/blob/fc756cf9720a31af11148df77ce2d716569a84ff/lib/spark.js#L165-L187

I propose modifying the following line to calculate different delays based on whether we are in a healthy (online) state.

https://github.com/filecoin-station/spark/blob/fc756cf9720a31af11148df77ce2d716569a84ff/lib/spark.js#L181
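
Roughly, the change could look like the sketch below. This is not the actual code from `lib/spark.js`; `runLoop` and all of its parameters (`runIteration`, `isHealthy`, `delayFor`, `sleep`, `maxIterations`) are hypothetical names, injected so that the loop can be exercised without a network or real timers.

```javascript
/**
 * Simplified main loop: run one iteration, then pause for a delay
 * that depends on the current health state.
 */
async function runLoop ({ runIteration, isHealthy, delayFor, sleep, maxIterations = Infinity }) {
  for (let i = 0; i < maxIterations; i++) {
    const started = Date.now()
    try {
      await runIteration()
    } catch (err) {
      console.error('Iteration failed:', err.message)
    }
    const elapsed = Date.now() - started
    // The proposed change: the base delay depends on whether we are online.
    const delay = delayFor(isHealthy()) - elapsed
    if (delay > 0) await sleep(delay)
  }
}
```

A caller would pass something like `delayFor: healthy => (healthy ? 60_000 : 4_000)` and `sleep: ms => new Promise(resolve => setTimeout(resolve, ms))`. Subtracting the time already spent in the iteration keeps the overall cadence close to the chosen delay.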

bajtos added the good first issue 🤗 Good for newcomers label Dec 14, 2023

github-project-automation bot added this to Space Meridian Dec 14, 2023

bajtos moved this to 🗃 backlog in Space Meridian Jan 8, 2024
