Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-status-consumer: improve logging of "not alive" workflows #443

Merged
merged 1 commit into from
Apr 12, 2022

Conversation

VMois
Copy link

@VMois VMois commented Apr 4, 2022

addresses #437

How to test:

  1. Start helloworld workflow - reana-client run -w test
  2. As soon as workflow starts, wait 3-5 seconds (it should go to the pending state) and delete it reana-client delete -w test
  3. Check reana-client list, it will show you that test workflow is deleted
  4. Check kubectl get pods, you will find batch pod in NotReady state
  5. Check kubectl logs deployment/reana-workflow-controller job-status-consumer, you will see that workflow was not in an alive state in DB and logs indicate that.
2022-04-04 11:27:58,215 | root | MainThread | WARNING | Event for not alive workflow 7c081fbc-42b2-4889-841d-a48ccd68dda2 with DB status RunStatus.deleted received:
{"workflow_uuid": "7c081fbc-42b2-4889-841d-a48ccd68dda2", "logs": "", "status": 2, "message": {"progress": {"finished": {"total": 1, "job_ids": ["17aeb786-69ec-4ddd-911a-542c59d7bf14"]}}}}
Ignoring...

This doesn't solve a core issue but, at least, we can have better logs.

@VMois VMois force-pushed the improve-job-status-consumer branch from b1a73dc to c33bcc8 Compare April 4, 2022 09:08
@@ -240,46 +232,6 @@ def _update_job_progress(workflow_uuid, msg):
job.status = job_status


def _update_job_cache(msg):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, whilst it is true that caching has been disabled since a long time, we may want to resurrect and retest it one day perhaps... Hence let me muse a bit on this.

Does this code actually runs on the server-side? We have cache=off by default on the client side, so I think this code should not even be triggered on the server-side. If it is not, then we don't seem to have a problem here? If it is, then we should fix it so that it would not be? (or inactivate it via a server-side configuration settings without removal of the code).

That said, philosophically speaking, I like suckless approach arguing that "the more code lines you have removed, the more progress you have made" 😉 Removing old inactive code is a good thing, especially if we investigate something like Armada or Kueue soon... Then it might be a perfect time for cleaning job cache everywhere. Until then, perhaps "inactivation" may still be having some advantages.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, there is a lack of documentation for cache internals so I have no idea how it works or where it is configured exactly. Need to dig into the code.

Doesn't look like we have a cache config in the helm chart for this. Also, user docs only have a few things mentioned like the "CACHE=off" parameter for r-client. There are some Python tests for cache.

Looks like only the serial engine supports cache. Other engines do not check if a job is cached from the job-controller.

I heard that cache was not working so, in my opinion, we should remove it. Otherwise, this is just a dead code. But, I agree, maybe, I should create a separate issue to remove the cache and we can address it later (espcecially if we plan work on improving consumers and internals)

@VMois VMois changed the title job-status-consumer: improve logging and handling of "not alive" workflows job-status-consumer: improve logging of "not alive" workflows Apr 4, 2022
@VMois VMois force-pushed the improve-job-status-consumer branch 3 times, most recently from 9282a46 to 3ca6537 Compare April 4, 2022 10:51
@VMois VMois marked this pull request as ready for review April 4, 2022 10:51
@VMois VMois force-pushed the improve-job-status-consumer branch from 3ca6537 to ae32eb8 Compare April 4, 2022 10:53
@tiborsimko tiborsimko self-assigned this Apr 12, 2022
Copy link
Member

@tiborsimko tiborsimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging works nicely; the core issue to be addressed later.

@tiborsimko tiborsimko merged commit ae32eb8 into reanahub:master Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants