Refactor build process #59

Merged: 16 commits into master on Oct 26, 2021

Conversation

@marvinmarnold (Contributor) commented Oct 20, 2021

Why
The hm-pktfwd build process is confusing. We have decided to split up the build process by repo instead of having a mono-Dockerfile in hm-pktfwd.

How
nebraltd/lora_gateway, nebraltd/packet_forwarder, and nebraltd/sx1302_hal are now built independently and the necessary assets are pulled into the hm-pktfwd build process.

References

For clarity: no files have been edited, only deleted or renamed.

- packet_forwarder, lora_gateway, and sx1302_hal will be built independently, so files related to them have been removed.
- Renamed files -> pktfwd
- Moved regional conf files to pktfwd/config
build-essential \
python3 \
python3-pip \
python3-venv && \
Contributor Author:

During a recent engineering call, we decided not to use venv moving forward. The overall feeling is that Docker is already portable so venv is an unnecessary level of encapsulation. One advantage of the venv is that the python version is also copied.

- Pull in built assets from other repos
- Added unit tests
@marvinmarnold force-pushed the marvinmarnold/total-refactor-2 branch from 03ff228 to 1091867 on October 20, 2021 20:52
Dockerfile Outdated
# Copy sx1302 hal from builder
COPY --from=builder /opt/iotloragateway/dev/sx1302_hal-2.1.0 .
# Copy sx1301 lora_pkt_fwd_SPI_BUS
COPY --from=nebraltd/packet_forwarder:83fd72d2db4c69faffd387d757452cbe9d594a08 "$SX1301_PACKET_FORWARDER_OUTPUT_DIR" "$SX1301_DIR"
Member:

Should update these all to the master versions in other repos before merging

pktfwd/utils.py Outdated
LOGGER = get_logger(__name__)
LOGLEVEL_INT = getattr(logging, LOGLEVEL)
LORA_PKT_FWD_RETRY_SLEEP_SECONDS = 2
LORA_PKT_FWD_MAX_TRIES = 5
Member:

See #60 - I don't think this is the best way for this to work.

But right now, I am having a bit of a mind blank as to what could be a better solution.

Contributor Author:

We don't need to set a max_tries. We can also use exponential backoff.

Looks like tenacity (I just found this, have been using unmaintained retry up till now) also lets you chain waits. So we could wait 2 seconds the first 5 times, then exponential backoff from 2 to 600 seconds until the end of time.
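
For illustration, a minimal sketch of what chained waits could look like with tenacity (the tenacity imports are real API; the helper name and the lora_pkt_fwd invocation are assumptions, not the actual pktfwd code):

```python
import subprocess

from tenacity import retry, wait_chain, wait_exponential, wait_fixed

# Wait 2 seconds between each of the first 5 attempts, then fall back to
# exponential backoff bounded between 2 and 600 seconds, retrying forever
# (tenacity's default is no stop condition).
RETRY_WAIT = wait_chain(*([wait_fixed(2)] * 5),
                        wait_exponential(multiplier=2, min=2, max=600))


@retry(wait=RETRY_WAIT)
def start_concentrator(lora_pkt_fwd_path):
    # check=True raises CalledProcessError on a non-zero exit,
    # which triggers the next retry.
    subprocess.run([lora_pkt_fwd_path], check=True)  # nosec (B603)
```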

Member (@shawaj), Oct 21, 2021:

From what I can tell of the existing code, we don't actually ever stop trying to start the concentrator; it's just that after failing 5 times we write false into this file:

if "concentrator EUI:" in euiTest:
print("SX1302")
print("Frequency " + regionID)
writeRegionConfSx1302(regionID, spi_bus)
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1302/packet_forwarder/lora_pkt_fwd") # nosec (B603)
print("Software crashed, restarting")
failTimes += 1
else:
print("SX1301")
print("Frequency " + regionID)
writeRegionConfSx1301(regionID)
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1301/lora_pkt_fwd_{}".format(spi_bus)) # nosec (B603)
print("Software crashed, restarting")
failTimes += 1
if(failTimes == 5):
with open("/var/pktfwd/diagnostics", 'w') as diagOut:
diagOut.write("false")
# Sleep forever

which is then consumed by hm-diag here:

https://github.com/NebraLtd/hm-diag/blob/9ba78481584cacf8bb5aede8faae136da80085e1/hw_diag/utilities/hardware.py#L76-L94

The problem with this is that after the first 5 failures the file never gets updated, so if the packet forwarder does eventually start working (for whatever reason), diagnostics will show an error forever, even though the packet forwarder is now running.

Contributor Author:

Addressed here: #59 (comment)

sentry_sdk.set_user({"id": balena_id})


def write_diagnostics(diagnostics_filepath, is_running):
Member:

We could perhaps use this as a healthcheck for the container, and also pull it into hm-diag.
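
As a rough illustration of the healthcheck idea, a script along these lines could be run by the container's healthcheck (a sketch only: the file path comes from the existing code above, and the true/false semantics are assumptions from this discussion, not the actual implementation):

```python
#!/usr/bin/env python3
"""Sketch: exit 0 (healthy) if the diagnostics file reads "true", else exit 1."""
import sys

DIAGNOSTICS_FILEPATH = "/var/pktfwd/diagnostics"


def main():
    try:
        with open(DIAGNOSTICS_FILEPATH) as diag_file:
            status = diag_file.read().strip().lower()
    except OSError:
        # A missing or unreadable file counts as unhealthy.
        status = ""
    sys.exit(0 if status == "true" else 1)


if __name__ == "__main__":
    main()
```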

Contributor Author:

I'm not sure what this file is trying to communicate. That the lora packet_forwarder is running successfully? That the python hm-pktfwd app is running? The former seems useful, but we wouldn't want to block diag from starting up just because the packet forwarder won't start, would we? Also, that's not what the file currently represents.

If we are trying to communicate the latter, that the python app is running, that doesn't seem like very useful information, because the python app can run even if the concentrator fails to start. I'd also argue that the python app will typically always start as long as the container starts. In that case, we might as well use a normal depends_on without a healthcheck and remove the file logic altogether.

Finally, whatever we decide for max_retries will affect the meaning of this file.

Member:

I was sleepy and misunderstood what this file was for, and didn't realise it wasn't a new addition.

As mentioned above, it is already being used in hm-diag:

https://github.com/NebraLtd/hm-diag/blob/9ba78481584cacf8bb5aede8faae136da80085e1/hw_diag/utilities/hardware.py#L76-L94

Basically, this is used in hm-diag to show whether there is an error with the packet_forwarder / lora concentrator starting up.

Essentially this is a duplicate of my comment above.

We just need a way to decouple retrying the start of the LoRa concentrator from reading the status of the LoRa module starting up, if that makes sense.

Contributor Author:

Here's a proposal: the initial state is now set to False, then we poll the process and update the diagnostics file accordingly. Currently the polling interval is the same as the retry-start interval. Maybe we should make the polling interval longer?
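
A minimal sketch of that proposal, assuming lora_pkt_fwd is launched with subprocess (the interval, path, and helper bodies here are illustrative; only the write_diagnostics signature matches the diff above):

```python
import subprocess
import time

DIAGNOSTICS_FILEPATH = "/var/pktfwd/diagnostics"
POLL_INTERVAL_SECONDS = 15  # illustrative; could differ from the retry interval


def write_diagnostics(diagnostics_filepath, is_running):
    # Persist the current state so hm-diag always sees something up to date.
    with open(diagnostics_filepath, "w") as diag_file:
        diag_file.write("true" if is_running else "false")


def run_and_poll(lora_pkt_fwd_cmd):
    write_diagnostics(DIAGNOSTICS_FILEPATH, False)   # initial state: not running
    proc = subprocess.Popen(lora_pkt_fwd_cmd)        # nosec (B603)
    while proc.poll() is None:                       # still running
        write_diagnostics(DIAGNOSTICS_FILEPATH, True)
        time.sleep(POLL_INTERVAL_SECONDS)
    write_diagnostics(DIAGNOSTICS_FILEPATH, False)   # exited; caller may retry
    return proc.returncode
```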

Member:

Yeah that looks good.

Yeah the retry and polling interval probably don't need to be the same.

And yeah usually if the concentrator starts up it's less likely that it then fails (although of course possible). It's more likely to be the starting up that fails. But both need to be handled.

The only thing I'm concerned about is slowing down the time to return True to initFile.txt, which is used in the Hotspot-Production-Tool, so maybe the retry and polling intervals could be tailored accordingly.

@shawaj (Member) commented Oct 21, 2021

Generally looks really good apart from the questions / comments above 👍🥳

- Add modified reset_lgw.sh that can be used by all concentrators.
- PktfwdApp only expects one version of the script and correctly invokes it based on the concentrator version.
Set the initial state in the diagnostics file to False.
Only update to True when lora_pkt_fwd is running successfully.
Set back to False when it exits.
- For consistency with hm-config
- Also fixed linting
@marvinmarnold force-pushed the marvinmarnold/total-refactor-2 branch from 56fb81c to 1547f8b on October 22, 2021 18:49
- Ensure pktfwd always cleans up by stopping in `finally` block
- Correctly handle reset_pin argument passing and env setting
- reset_lgw does not attempt reset if GPIO has not been exported
- Set UTIL_CHIP_ID_FILEPATH envvar
@marvinmarnold marked this pull request as ready for review on October 22, 2021 23:52
@marvinmarnold requested a review from a team as a code owner on October 22, 2021 23:52
Created issue to track: #63
Use different waits when trying to restart lora_pkt_fwd vs. when polling its status after it has started running.
@marvinmarnold (Contributor Author):

Great, going to do one more round of tests then merge.

@marvinmarnold (Contributor Author):

@shawaj I think this also closes #60

@shawaj (Member) commented Oct 25, 2021

> @shawaj I think this also closes #60

Yep 👍

- LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS: before checking the first time.
- LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS: between polls while the concentrator is running.
- LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS: before restarting after a failure without error.
- Exit the loop and restart the container if the concentrator exited with an error (see the sketch below).
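
To make the intended behaviour concrete, here is a rough sketch of the loop these variables describe (assumed logic pieced together from the commit message; the helper name and default values are illustrative, not the actual pktfwd code):

```python
import os
import subprocess
import time

BEFORE_CHECK = int(os.environ.get("LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS", "5"))
AFTER_SUCCESS = int(os.environ.get("LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS", "30"))
AFTER_FAILURE = int(os.environ.get("LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS", "5"))


def run_concentrator_forever(lora_pkt_fwd_cmd):
    while True:
        proc = subprocess.Popen(lora_pkt_fwd_cmd)  # nosec (B603)
        time.sleep(BEFORE_CHECK)                   # wait before the first check
        while proc.poll() is None:
            time.sleep(AFTER_SUCCESS)              # poll while it keeps running
        if proc.returncode != 0:
            # Exited with an error: leave the loop so the container restarts.
            raise RuntimeError("lora_pkt_fwd exited with %d" % proc.returncode)
        time.sleep(AFTER_FAILURE)                  # clean exit: wait, then restart
```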
@marvinmarnold force-pushed the marvinmarnold/total-refactor-2 branch from 900925b to 27d6589 on October 25, 2021 23:30
@marvinmarnold force-pushed the marvinmarnold/total-refactor-2 branch from 27d6589 to f948c6f on October 26, 2021 16:48
@marvinmarnold merged commit 1af2d8b into master on Oct 26, 2021
@marvinmarnold deleted the marvinmarnold/total-refactor-2 branch on October 26, 2021 16:55
@shawaj (Member) commented Oct 26, 2021

woop 🍾 🥳

@vpetersson (Contributor):

Good stuff, @marvinmarnold!

Successfully merging this pull request may close these issues:

- Improve failure checking
- Refactor and add RockPi support
- Add unit tests

3 participants