Refactor build process #59
Conversation
For clarity, no files have been edited; they were only deleted or renamed.
- packet_forwarder, lora_gateway, and sx1302_hal will be built independently, so files related to those have been removed.
- Renamed files -> pktfwd
- Moved regional conf files to pktfwd/config
Force-pushed from ebdbaf1 to 03ff228
build-essential \
python3 \
python3-pip \
python3-venv && \
During a recent engineering call, we decided not to use venv moving forward. The overall feeling is that Docker is already portable so venv is an unnecessary level of encapsulation. One advantage of the venv is that the python version is also copied.
- Pull in built assets from other repos
- Added unit tests
Force-pushed from 03ff228 to 1091867
Dockerfile (Outdated)
# Copy sx1302 hal from builder
COPY --from=builder /opt/iotloragateway/dev/sx1302_hal-2.1.0 .
# Copy sx1301 lora_pkt_fwd_SPI_BUS
COPY --from=nebraltd/packet_forwarder:83fd72d2db4c69faffd387d757452cbe9d594a08 "$SX1301_PACKET_FORWARDER_OUTPUT_DIR" "$SX1301_DIR"
Should update these all to the master versions in other repos before merging
pktfwd/utils.py (Outdated)
LOGGER = get_logger(__name__)
LOGLEVEL_INT = getattr(logging, LOGLEVEL)
LORA_PKT_FWD_RETRY_SLEEP_SECONDS = 2
LORA_PKT_FWD_MAX_TRIES = 5
See #60 - I don't think this is the best way for this to work.
But right now, I am having a bit of a mind blank as to what could be a better solution.
We don't need to set a max_tries. We can also use exponential backoff.
Looks like tenacity (I just found this; I have been using the unmaintained retry library up till now) also lets you chain waits. So we could wait 2 seconds for the first 5 attempts, then exponential backoff from 2 to 600 seconds until the end of time.
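For what it's worth, a minimal sketch of that chained wait with tenacity (the function name and body here are placeholders, not code from this PR):

```python
from tenacity import retry, wait_chain, wait_fixed, wait_exponential


# Retry forever (no stop condition): fixed 2 s wait for the first 5 retries,
# then exponential backoff between 2 s and 600 s.
@retry(wait=wait_chain(*[wait_fixed(2)] * 5,
                       wait_exponential(multiplier=1, min=2, max=600)))
def start_concentrator():
    # Hypothetical placeholder for whatever launches lora_pkt_fwd;
    # raising an exception here triggers the next retry.
    raise RuntimeError("concentrator not up yet")
```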
From what I can tell of the existing code, we don't actually ever stop trying to start the concentrator; it's just that after failing 5 times we write false into this file:
hm-pktfwd/files/configurePktFwd.py
Lines 140 to 160 in 76897fd
if "concentrator EUI:" in euiTest: | |
print("SX1302") | |
print("Frequency " + regionID) | |
writeRegionConfSx1302(regionID, spi_bus) | |
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1302/packet_forwarder/lora_pkt_fwd") # nosec (B603) | |
print("Software crashed, restarting") | |
failTimes += 1 | |
else: | |
print("SX1301") | |
print("Frequency " + regionID) | |
writeRegionConfSx1301(regionID) | |
subprocess.run("/opt/iotloragateway/packet_forwarder/sx1301/lora_pkt_fwd_{}".format(spi_bus)) # nosec (B603) | |
print("Software crashed, restarting") | |
failTimes += 1 | |
if(failTimes == 5): | |
with open("/var/pktfwd/diagnostics", 'w') as diagOut: | |
diagOut.write("false") | |
# Sleep forever |
which is then consumed by hm-diag here:
The problem with this is that the packet forwarder may eventually work (for whatever reason). But if it only starts working after those first 5 failures, the file never gets updated, and therefore diagnostics will show an error forever, even though the packet forwarder is now working.
Addressed here: #59 (comment)
sentry_sdk.set_user({"id": balena_id})


def write_diagnostics(diagnostics_filepath, is_running):
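For anyone skimming, the helper itself boils down to something like this (a simplified sketch, assuming the same "true"/"false" file format that configurePktFwd.py used; the real implementation lives in pktfwd/utils.py):

```python
def write_diagnostics(diagnostics_filepath, is_running):
    # Sketch only: record the packet forwarder state as "true"/"false"
    # so hm-diag can read it from the shared diagnostics file.
    with open(diagnostics_filepath, 'w') as diagnostics_stream:
        diagnostics_stream.write("true" if is_running else "false")
```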
We could perhaps use this for a healthcheck for the container and also pull it into hm-diag as well
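If we went that route, the healthcheck could be as simple as a script that exits non-zero unless the diagnostics file reads true. A rough sketch (the file path is taken from the snippet above and may not match the refactored layout):

```python
#!/usr/bin/env python3
# Hypothetical healthcheck sketch: succeed only if the diagnostics file says "true".
import sys

DIAGNOSTICS_FILEPATH = "/var/pktfwd/diagnostics"  # assumed path

try:
    with open(DIAGNOSTICS_FILEPATH) as diagnostics_stream:
        sys.exit(0 if diagnostics_stream.read().strip() == "true" else 1)
except FileNotFoundError:
    sys.exit(1)
```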
I'm not sure what this file is trying to communicate. That the lora packet_forwarder is running successfully? That the python hm-pktfwd app is running? The former seems useful, but we wouldn't want to block diag from starting up just because the packet forwarder won't start, would we? Also, that's not what the file currently represents.
If we are trying to communicate the latter, that the python app is running, that doesn't seem like very useful information, because the python app can run even if the concentrator fails to start. I'd also argue that the python app will typically start as long as the container starts. In that case, we might as well use a normal depends_on without a healthcheck and remove the file logic altogether.
Finally, whatever we decide for max_retries will affect the meaning of this file.
I was sleepy and misunderstood what this file was for; I didn't realize it wasn't a new addition.
As mentioned above, it is already being used in hm-diag.
Basically, hm-diag uses this file to show whether there is an error with the packet_forwarder / lora concentrator starting up.
Essentially this is a duplicate of my comment above: we just need a way for the retrying of starting the lora concentrator not to be coupled to the reading of the status of the lora module starting up, if that makes sense.
Here's a proposal: the initial state is now set to False, and then we poll the process and update the diagnostics file accordingly. Currently the polling interval is the same as the retry-start interval. Maybe we should make the polling interval longer?
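Roughly, the loop looks like this (a simplified sketch, not the exact code in this PR; the import path is an assumption):

```python
import subprocess
import time

from pktfwd.utils import write_diagnostics  # assumed import path


def run_and_monitor(lora_pkt_fwd_cmd, diagnostics_filepath, poll_seconds=2):
    # Start pessimistic: diagnostics reads False until lora_pkt_fwd is confirmed up.
    write_diagnostics(diagnostics_filepath, False)
    process = subprocess.Popen(lora_pkt_fwd_cmd)  # nosec (B603)
    while True:
        time.sleep(poll_seconds)
        is_running = process.poll() is None
        write_diagnostics(diagnostics_filepath, is_running)
        if not is_running:
            # lora_pkt_fwd exited; the caller decides whether to retry or bail out.
            return process.returncode
```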
Yeah that looks good.
Yeah the retry and polling interval probably don't need to be the same.
And yeah usually if the concentrator starts up it's less likely that it then fails (although of course possible). It's more likely to be the starting up that fails. But both need to be handled.
The only thing I'm concerned about is slowing down the time to return True to initFile.txt, which is used in the Hotspot-Production-Tool, so maybe the retry and polling intervals could be tailored accordingly.
Generally looks really good apart from the questions / comments above 👍🥳
- Add modified reset_lgw.sh that can be used by all concentrators.
- PktfwdApp only expects one version of the script and correctly invokes it based on the concentrator version.
Set the initial state in the diagnostics file to False. Only update to True when lora_pkt_fwd is successfully running. Set it back to False when lora_pkt_fwd exits.
- For consistency with hm-config
- Also fixed linting
Force-pushed from 56fb81c to 1547f8b
- Ensure pktfwd always cleans up by stopping in `finally` block
- Correctly handle reset_pin argument passing and env setting
- reset_lgw does not attempt reset if GPIO has not been exported
- Set UTIL_CHIP_ID_FILEPATH envvar
Created issue to track: #63
Use different waits when trying to restart lora_pkt_fwd vs. when polling its status after it has started running.
Great, going to do one more round of tests then merge.
- LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS: before checking the first time.
- LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS: between polls if the concentrator is running.
- LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS: before restarting on failure without error.
- Exit loop and restart container if concentrator exited with error.
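As a rough sketch of how those knobs fit together (the env var names come from the commit message; the defaults and loop body are illustrative only, not the code in this PR):

```python
import os
import subprocess
import sys
import time

# Defaults are made up for illustration; real values come from the container environment.
BEFORE_CHECK = int(os.environ.get("LORA_PKT_FWD_BEFORE_CHECK_SLEEP_SECONDS", "5"))
AFTER_SUCCESS = int(os.environ.get("LORA_PKT_FWD_AFTER_SUCCESS_SLEEP_SECONDS", "30"))
AFTER_FAILURE = int(os.environ.get("LORA_PKT_FWD_AFTER_FAILURE_SLEEP_SECONDS", "2"))


def supervise(lora_pkt_fwd_cmd):
    while True:
        process = subprocess.Popen(lora_pkt_fwd_cmd)  # nosec (B603)
        time.sleep(BEFORE_CHECK)           # wait before the first status check
        while process.poll() is None:
            time.sleep(AFTER_SUCCESS)      # concentrator running: poll slowly
        if process.returncode != 0:
            sys.exit(process.returncode)   # exited with error: let the container restart
        time.sleep(AFTER_FAILURE)          # exited without error: restart lora_pkt_fwd
```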
Force-pushed from 900925b to 27d6589
Force-pushed from 27d6589 to f948c6f
woop 🍾 🥳
Good stuff, @marvinmarnold!
Why
The hm-pktfwd build process is confusing. We have decided to split up the build process by repo instead of having a mono-Dockerfile in hm-pktfwd.
How
nebraltd/lora_gateway, nebraltd/packet_forwarder, and nebraltd/sx1302_hal are now built independently and the necessary assets are pulled into the hm-pktfwd build process.
References