Refactor build process #59
Conversation
For clarity, no files have been edited; they were only deleted or renamed.
- packet_forwarder, lora_gateway, and sx1302_hal will be built independently, so files related to those have been removed.
- Renamed files -> pktfwd
- Moved regional conf files to pktfwd/config
Force-pushed from ebdbaf1 to 03ff228
build-essential \
python3 \
python3-pip \
python3-venv && \
During a recent engineering call, we decided not to use venv moving forward. The overall feeling is that Docker is already portable so venv is an unnecessary level of encapsulation. One advantage of the venv is that the python version is also copied.
- Pull in built assets from other repos
- Added unit tests
Force-pushed from 03ff228 to 1091867
Dockerfile (Outdated)
# Copy sx1302 hal from builder
COPY --from=builder /opt/iotloragateway/dev/sx1302_hal-2.1.0 .
# Copy sx1301 lora_pkt_fwd_SPI_BUS
COPY --from=nebraltd/packet_forwarder:83fd72d2db4c69faffd387d757452cbe9d594a08 "$SX1301_PACKET_FORWARDER_OUTPUT_DIR" "$SX1301_DIR"
Should update these all to the master versions in other repos before merging
pktfwd/utils.py (Outdated)
LOGGER = get_logger(__name__)
LOGLEVEL_INT = getattr(logging, LOGLEVEL)
LORA_PKT_FWD_RETRY_SLEEP_SECONDS = 2
LORA_PKT_FWD_MAX_TRIES = 5
See #60 - I don't think this is the best way for this to work.
But right now, I am having a bit of a mind blank as to what could be a better solution.
We don't need to set a max_tries. We can also use exponential backoff.
Looks like tenacity (I just found this; I have been using the unmaintained retry library up till now) also lets you chain waits. So we could wait 2 seconds for the first 5 attempts, then exponential backoff from 2 to 600 seconds until the end of time.
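For what it's worth, a minimal sketch of that chained wait with tenacity (the function name and body here are placeholders, not code from this PR):

```python
from tenacity import retry, wait_chain, wait_fixed, wait_exponential


# Retry forever (no stop condition): fixed 2 s wait for the first 5 retries,
# then exponential backoff between 2 s and 600 s.
@retry(wait=wait_chain(*[wait_fixed(2)] * 5,
                       wait_exponential(multiplier=1, min=2, max=600)))
def start_concentrator():
    # Hypothetical placeholder for whatever launches lora_pkt_fwd;
    # raising an exception here triggers the next retry.
    raise RuntimeError("concentrator not up yet")
```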
From what I can tell of the existing code, we don't actually ever stop trying to start the concentrator; it's just that after failing 5 times we write false into this file:
hm-pktfwd/files/configurePktFwd.py
Lines 140 to 160 in 76897fd
if "concentrator EUI:" in euiTest: | |
print("SX1302") | |
print("Frequency " + regionID) | |
writeRegionConfSx1302(regionID, spi_bus) | |
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1302/packet_forwarder/lora_pkt_fwd") # nosec (B603) | |
print("Software crashed, restarting") | |
failTimes += 1 | |
else: | |
print("SX1301") | |
print("Frequency " + regionID) | |
writeRegionConfSx1301(regionID) | |
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1301/lora_pkt_fwd_{}".format(spi_bus)) # nosec (B603) | |
print("Software crashed, restarting") | |
failTimes += 1 | |
if(failTimes == 5): | |
with open("/var/pktfwd/diagnostics", 'w') as diagOut: | |
diagOut.write("false") | |
# Sleep forever |
which is then consumed by hm-diag here:
The problem with this is that the packet forwarder may eventually work (for whatever reason). But if it only starts working after those first 5 failures, the file never gets updated, and therefore diagnostics will show an error forever, even though the packet forwarder is now working.
Addressed here: #59 (comment)
sentry_sdk.set_user({"id": balena_id})


def write_diagnostics(diagnostics_filepath, is_running):
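For anyone skimming, the helper itself boils down to something like this (a simplified sketch, assuming the same "true"/"false" file format that configurePktFwd.py used; the real implementation lives in pktfwd/utils.py):

```python
def write_diagnostics(diagnostics_filepath, is_running):
    # Sketch only: record the packet forwarder state as "true"/"false"
    # so hm-diag can read it from the shared diagnostics file.
    with open(diagnostics_filepath, 'w') as diagnostics_stream:
        diagnostics_stream.write("true" if is_running else "false")
```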
We could perhaps use this for a healthcheck for the container and also pull it into hm-diag as well
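If we went that route, the healthcheck could be as simple as a script that exits non-zero unless the diagnostics file reads true. A rough sketch (the file path is taken from the snippet above and may not match the refactored layout):

```python
#!/usr/bin/env python3
# Hypothetical healthcheck sketch: succeed only if the diagnostics file says "true".
import sys

DIAGNOSTICS_FILEPATH = "/var/pktfwd/diagnostics"  # assumed path

try:
    with open(DIAGNOSTICS_FILEPATH) as diagnostics_stream:
        sys.exit(0 if diagnostics_stream.read().strip() == "true" else 1)
except FileNotFoundError:
    sys.exit(1)
```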
I'm not sure what this file is trying to communicate. That the lora packet_forwarder is running successfully? That the python hm-pktfwd app is running? The former seems useful, but we wouldn't want to block diag from starting up just because the packet forwarder won't start, would we? Also, that's not what the file currently represents.
If we are trying to communicate the latter, that the python app is running, that doesn't seem like very useful information, because the python app can run even if the concentrator fails to start. I'd also argue that the python app will typically start as long as the container starts. In that case, we might as well use a normal depends_on without a healthcheck and remove the file logic altogether.
Finally, whatever we decide for max_retries will affect the meaning of this file.
I was sleepy and misunderstood what this file was for; I didn't realize it wasn't a new addition.
As mentioned above, it is already being used in hm-diag.
Basically, hm-diag uses this file to show whether there is an error with the packet_forwarder / lora concentrator starting up.
Essentially this is a duplicate of my comment above: we just need a way for the retrying of starting the lora concentrator not to be coupled to the reading of the status of the lora module starting up, if that makes sense.
Here's a proposal: the initial state is now set to False, and then we poll the process and update the diagnostics file accordingly. Currently the polling interval is the same as the retry-start interval. Maybe we should make the polling interval longer?
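Roughly, the loop looks like this (a simplified sketch, not the exact code in this PR; the import path is an assumption):

```python
import subprocess
import time

from pktfwd.utils import write_diagnostics  # assumed import path


def run_and_monitor(lora_pkt_fwd_cmd, diagnostics_filepath, poll_seconds=2):
    # Start pessimistic: diagnostics reads False until lora_pkt_fwd is confirmed up.
    write_diagnostics(diagnostics_filepath, False)
    process = subprocess.Popen(lora_pkt_fwd_cmd)  # nosec (B603)
    while True:
        time.sleep(poll_seconds)
        is_running = process.poll() is None
        write_diagnostics(diagnostics_filepath, is_running)
        if not is_running:
            # lora_pkt_fwd exited; the caller decides whether to retry or bail out.
            return process.returncode
```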
Yeah that looks good.
Yeah the retry and polling interval probably don't need to be the same.
And yeah usually if the concentrator starts up it's less likely that it then fails (although of course possible). It's more likely to be the starting up that fails. But both need to be handled.
The only thing I'm concerned about is slowing down the time to return True to initFile.txt, which is used in the Hotspot-Production-Tool, so maybe the retry and polling intervals could be tailored accordingly.
Generally looks really good apart from the questions / comments above 👍🥳
- Add modified reset_lgw.sh that can be used by all concentrators.
- PktfwdApp only expects one version of the script and correctly invokes it based on the concentrator version.
Set the initial state in the diagnostics file to False. Only update to True when lora_pkt_fwd is successfully running. Set it back to False when lora_pkt_fwd exits.
- For consistency with hm-config
- Also fixed linting
Force-pushed from 56fb81c to 1547f8b
- Ensure pktfwd always cleans up by stopping in `finally` block
- Correctly handle reset_pin argument passing and env setting
- reset_lgw does not attempt reset if GPIO has not been exported
- Set UTIL_CHIP_ID_FILEPATH envvar
Created issue to track: #63
Use different waits when trying to restart lora_pkt_fwd vs. when polling its status after it has started running.
Great, going to do one more round of tests then merge.
- LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS: before checking the first time.
- LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS: between polls if the concentrator is running.
- LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS: before restarting on failure without error.
- Exit loop and restart container if concentrator exited with error.
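As a rough sketch of how those knobs fit together (the env var names come from the commit message; the defaults and loop body are illustrative only, not the code in this PR):

```python
import os
import subprocess
import sys
import time

# Defaults are made up for illustration; real values come from the container environment.
BEFORE_CHECK = int(os.environ.get("LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS", "5"))
AFTER_SUCCESS = int(os.environ.get("LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS", "30"))
AFTER_FAILURE = int(os.environ.get("LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS", "2"))


def supervise(lora_pkt_fwd_cmd):
    while True:
        process = subprocess.Popen(lora_pkt_fwd_cmd)  # nosec (B603)
        time.sleep(BEFORE_CHECK)           # wait before the first status check
        while process.poll() is None:
            time.sleep(AFTER_SUCCESS)      # concentrator running: poll slowly
        if process.returncode != 0:
            sys.exit(process.returncode)   # exited with error: let the container restart
        time.sleep(AFTER_FAILURE)          # exited without error: restart lora_pkt_fwd
```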
Force-pushed from 900925b to 27d6589
Force-pushed from 27d6589 to f948c6f
woop 🍾 🥳
Good stuff, @marvinmarnold!
Why
The hm-pktfwd build process is confusing. We have decided to split up the build process by repo instead of having a mono-Dockerfile in hm-pktfwd.
How
nebraltd/lora_gateway, nebraltd/packet_forwarder, and nebraltd/sx1302_hal are now built independently and the necessary assets are pulled into the hm-pktfwd build process.
References