job-status-consumer: improve logging of "not alive" workflows #443

VMois · 2022-04-04T09:07:38Z

addresses #437

How to test:

1. Start the helloworld workflow: reana-client run -w test
2. As soon as the workflow starts, wait 3-5 seconds (it should be in the pending state), then delete it: reana-client delete -w test
3. Check reana-client list; it will show that the test workflow is deleted.
4. Check kubectl get pods; you will find the batch pod in the NotReady state.
5. Check kubectl logs deployment/reana-workflow-controller job-status-consumer; the logs now report that the workflow was not in an alive state in the DB:

2022-04-04 11:27:58,215 | root | MainThread | WARNING | Event for not alive workflow 7c081fbc-42b2-4889-841d-a48ccd68dda2 with DB status RunStatus.deleted received:
{"workflow_uuid": "7c081fbc-42b2-4889-841d-a48ccd68dda2", "logs": "", "status": 2, "message": {"progress": {"finished": {"total": 1, "job_ids": ["17aeb786-69ec-4ddd-911a-542c59d7bf14"]}}}}
Ignoring...

This doesn't solve the core issue, but at least we get better logs.
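
For illustration, the logging behaviour in the output above could be sketched roughly as follows. This is a minimal standalone sketch, not the actual consumer.py code: the RunStatus members, ALIVE_STATUSES and handle_status_event are hypothetical names chosen for the example, and the set of "alive" statuses is an assumption.

```python
import enum
import json
import logging


class RunStatus(enum.Enum):
    """Hypothetical subset of workflow statuses, for illustration only."""

    created = enum.auto()
    pending = enum.auto()
    running = enum.auto()
    finished = enum.auto()
    deleted = enum.auto()


# Assumption: which DB statuses count as "alive" is chosen for this sketch.
ALIVE_STATUSES = {RunStatus.created, RunStatus.pending, RunStatus.running}


def handle_status_event(db_status, body):
    """Return True if the event would be processed, False if it is ignored.

    db_status is the workflow status currently stored in the DB;
    body is the raw JSON message taken from the queue.
    """
    event = json.loads(body)
    if db_status not in ALIVE_STATUSES:
        # Log the DB status together with the full event before dropping it.
        logging.warning(
            "Event for not alive workflow %s with DB status %s received:\n%s\nIgnoring...",
            event["workflow_uuid"],
            db_status,
            body,
        )
        return False
    return True
```

Calling handle_status_event(RunStatus.deleted, body) with a message like the one in the log above emits a single warning that includes the DB status and the raw event, and leaves the workflow untouched.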

tiborsimko · 2022-04-04T09:29:02Z

reana_workflow_controller/consumer.py

@@ -240,46 +232,6 @@ def _update_job_progress(workflow_uuid, msg):
                    job.status = job_status


-def _update_job_cache(msg):


Hmm, whilst it is true that caching has been disabled for a long time, we may want to resurrect and retest it one day perhaps... Hence let me muse a bit on this.

Does this code actually run on the server side? We have cache=off by default on the client side, so I think this code should not even be triggered on the server side. If it is not, then we don't seem to have a problem here? If it is, then we should fix it so that it is not (or inactivate it via a server-side configuration setting without removing the code).

That said, philosophically speaking, I like the suckless approach arguing that "the more code lines you have removed, the more progress you have made" 😉 Removing old inactive code is a good thing, especially if we investigate something like Armada or Kueue soon... Then it might be a perfect time for cleaning job cache everywhere. Until then, perhaps "inactivation" may still have some advantages.
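
As a rough idea of what such "inactivation" could look like, here is a minimal sketch that guards the cache update behind a server-side flag. Everything here is an assumption made for illustration: the REANA_JOB_CACHE_ENABLED variable, the "caching_info" message key and the process_event helper are hypothetical and do not describe the existing configuration or consumer code.

```python
import os


def cache_enabled(environ=os.environ):
    """Return True only if the (hypothetical) server-side flag is switched on."""
    value = environ.get("REANA_JOB_CACHE_ENABLED", "false")
    return value.lower() in ("1", "true", "yes")


def process_event(msg, update_job_cache, environ=os.environ):
    """Call update_job_cache(msg) only when caching is enabled on the server.

    Returns True if the cache update ran, False if it was skipped.
    The "caching_info" key is an assumed marker for cache-related events.
    """
    if cache_enabled(environ) and "caching_info" in msg:
        update_job_cache(msg)
        return True
    return False
```

With the flag unset, the cache code path stays in the repository but is never reached, which keeps the option of retesting it later by flipping one setting.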

Honestly, there is a lack of documentation for cache internals so I have no idea how it works or where it is configured exactly. Need to dig into the code.

Doesn't look like we have a cache config in the helm chart for this. Also, user docs only have a few things mentioned like the "CACHE=off" parameter for r-client. There are some Python tests for cache.

Looks like only the serial engine supports cache. Other engines do not check if a job is cached from the job-controller.

I heard that the cache was not working so, in my opinion, we should remove it. Otherwise, this is just dead code. But I agree, maybe I should create a separate issue to remove the cache and we can address it later (especially if we plan to work on improving consumers and internals).

tiborsimko

Logging works nicely; the core issue to be addressed later.

VMois force-pushed the improve-job-status-consumer branch from b1a73dc to c33bcc8 Compare April 4, 2022 09:08

tiborsimko reviewed Apr 4, 2022

VMois changed the title ~~job-status-consumer: improve logging and handling of "not alive" workflows~~ job-status-consumer: improve logging of "not alive" workflows Apr 4, 2022

VMois force-pushed the improve-job-status-consumer branch 3 times, most recently from 9282a46 to 3ca6537 Compare April 4, 2022 10:51

VMois marked this pull request as ready for review April 4, 2022 10:51

VMois mentioned this pull request Apr 4, 2022

job-status-consumer: improve handling of "not alive" workflows #437

Open

job-status-consumer, log DB status of "not alive" workflows

ae32eb8

addresses reanahub#437

VMois force-pushed the improve-job-status-consumer branch from 3ca6537 to ae32eb8 Compare April 4, 2022 10:53

tiborsimko self-assigned this Apr 12, 2022

tiborsimko approved these changes Apr 12, 2022

tiborsimko merged commit ae32eb8 into reanahub:master Apr 12, 2022
