Implement an exception handler for api client #72

Merged 2 commits on Sep 23, 2024

abhi18av
Member

This PR

  1. Adds a baseline exception handler for the Nomad server connection
  2. Exposes some of the configuration values used to create the API client (for the server)

@abhi18av abhi18av requested a review from jagedn July 17, 2024 09:09
@abhi18av abhi18av linked an issue Jul 17, 2024 that may be closed by this pull request
@abhi18av abhi18av self-assigned this Jul 17, 2024
@jagedn
Collaborator

jagedn commented Jul 17, 2024

I'm not sure this implementation avoids the connection error when the server is restarted. Did you test it?

@jagedn
Collaborator

jagedn commented Jul 17, 2024

Since the Nomad API client uses OkHttp, maybe we can evaluate how to add an interceptor and implement retry logic.

https://square.github.io/okhttp/features/interceptors/

@abhi18av
Member Author

> Since the Nomad API client uses OkHttp, maybe we can evaluate how to add an interceptor and implement retry logic.
>
> https://square.github.io/okhttp/features/interceptors/

Agreed, there are multiple options in the HTTP client that we could expose.

Regarding testing the PR, I'll check on a local cluster when I'm back at my desk. That said, I'm starting to think that we should not use that Process.. Exception.. , because I think it intrinsically terminates the execution 🤔

@abhi18av
Member Author

I ran the experiment with this branch, and this is what I observed: a different failure compared to #71.

Experimental setup:

  1. Run ./start-nomad.sh in validation
  2. Trigger ./run-all.sh --build
  3. When the execution is underway, kill the nomad process.
  4. Restart the nomad process, without clearing any cache.

With the current server-exception branch

executor >  nomad (4)
[c2/463003] sayHello (4) [100%] 4 of 4, failed: 4 ✘
WARN: [NOMAD] Cannot read exit status for task: `sayHello (2)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (3)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/18/e8010df35254f73ce6767a7857d917/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (4)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c2/463003f44e018b65d32d94a5aabf8f/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (1)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/56/04ff36f7635d14c721c7019df8d2c1/.exitcode
ERROR ~ Error executing process > 'sayHello (2)'

Caused by:
  Process `sayHello (2)` terminated for an unknown reason -- Likely it has been terminated by the external system


Command executed:

  echo 'Ciao world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details


@jagedn jagedn marked this pull request as draft July 20, 2024 11:30
@abhi18av abhi18av mentioned this pull request Jul 20, 2024
16 tasks
@abhi18av abhi18av added the help wanted and good first issue labels Jul 21, 2024
@abhi18av abhi18av linked an issue Sep 20, 2024 that may be closed by this pull request
@abhi18av abhi18av removed the help wanted and good first issue labels Sep 20, 2024
@jagedn
Collaborator

jagedn commented Sep 20, 2024

@abhi18av

I've refactored our work, and we are now using the Failsafe approach.

I still need to test a little more (stopping the cluster and so on), but it looks good.

@jagedn
Collaborator

jagedn commented Sep 21, 2024

I want to implement a more robust test, but manually stopping and restarting the nomad process during a bactopia pipeline run (which takes longer to complete than a simple hello) seems to work:

  • start the cluster
  • run the bactopia pipeline and when a job is started kill the nomad process
  • check the .nextflow.log to see how the plugin is retrying
  • start the nomad server
  • the pipeline ends successfully

cc @matthdsm

sept-21 12:27:06.156 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:06.161 [TaskFinalizer-2] DEBUG nextflow.processor.TaskProcessor - Process BACTOPIA:GATHER:CSVTK_CONCAT > Skipping output binding because one or more optional files are missing: fileoutparam<1>
sept-21 12:27:06.161 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:11.129 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] determineClientNode: jobName:nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; clientName:slimbook


sept-21 12:27:11.344 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 1; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:11.958 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 2; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:12.797 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 3; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:15.098 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 4; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:18.115 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 5; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:26.336 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 6; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:44.490 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 7; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646


sept-21 12:27:44.493 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.495 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.497 [Task monitor] DEBUG n.nomad.executor.NomadService - Task nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR , state=dead
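
The retry behaviour visible in the log above (repeated attempts against a refused connection, with growing gaps between them, until the server is reachable again) can be sketched in plain Java. This is only a minimal illustration, not the plugin's `FailsafeExecutor`, which is built on the Failsafe library. The class name `RetryWithBackoffSketch`, the method `callWithRetry`, and all delay values are hypothetical.

```java
import java.net.ConnectException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a call on connection errors with capped exponential backoff.
final class RetryWithBackoffSketch {

    /** Runs the action; on ConnectException, waits and retries up to maxAttempts times. */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts,
                               Duration initialDelay, Duration maxDelay) throws Exception {
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (ConnectException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: surface the last failure
                }
                System.out.printf("connection error - attempt: %d; reason: %s%n", attempt, e);
                Thread.sleep(delay.toMillis());
                // Double the delay for the next attempt, but never exceed maxDelay.
                Duration doubled = delay.multipliedBy(2);
                delay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated server: refuses the first three connections, then answers.
        String status = callWithRetry(() -> {
            if (++calls[0] < 4) {
                throw new ConnectException("Failed to connect to localhost/127.0.0.1:4646");
            }
            return "dead";
        }, 10, Duration.ofMillis(10), Duration.ofMillis(100));
        System.out.println("status=" + status + " after " + calls[0] + " calls");
    }
}
```

Only `ConnectException` is retried here; any other exception propagates immediately, which keeps task-level failures out of the retry loop.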

@abhi18av abhi18av marked this pull request as ready for review September 23, 2024 09:28
Member Author

@abhi18av abhi18av left a comment

Thanks Jorge! To me this looks good to merge ✅

I have just added a minor comment regarding the exit codes we're relying upon.

config.jobOpts().region,
config.jobOpts().namespace,
null, null, null)
safeExecutor.apply {
Member Author

I really like this clean design @jagedn 🤩

.build()
}

final private static List<Integer> RETRY_CODES = List.of(408, 429, 500, 502, 503, 504)
Member Author

@jagedn, is there a specific dictionary/reference you used to focus only on these exit codes?

It may be a good idea to add that in the comments, or just explain the thought process behind these exit codes.

Collaborator

hahaha, copy-pasted from the Azure plugin

Member Author

Ah, in that case these must be specific to the Azure Batch API 🤔

@tomiles @matthdsm @jhaezebr, are you aware of any Nomad-specific error codes, or does a Nomad client/server just propagate the task-level and OS-level error code?

Collaborator

I don't think so; they are common HTTP error codes.

I've added descriptions for them.
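
As a rough illustration of what such descriptions could look like, the sketch below pairs each status code from `RETRY_CODES` with its standard HTTP reason phrase. The class name `RetryableStatusSketch` and the method `isRetryable` are hypothetical, and the comments actually added in the PR may be worded differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the retryable HTTP status codes with their standard reason phrases.
final class RetryableStatusSketch {

    static final Map<Integer, String> RETRY_CODES = new LinkedHashMap<>();
    static {
        RETRY_CODES.put(408, "Request Timeout");       // server gave up waiting for the request
        RETRY_CODES.put(429, "Too Many Requests");     // client is being rate limited
        RETRY_CODES.put(500, "Internal Server Error"); // unexpected server-side failure
        RETRY_CODES.put(502, "Bad Gateway");           // upstream returned an invalid response
        RETRY_CODES.put(503, "Service Unavailable");   // server temporarily unable to respond
        RETRY_CODES.put(504, "Gateway Timeout");       // upstream did not answer in time
    }

    /** True when the HTTP status suggests a transient failure worth retrying. */
    static boolean isRetryable(int httpStatus) {
        return RETRY_CODES.containsKey(httpStatus);
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 404, 429, 503}) {
            System.out.println(status + " retryable=" + isRetryable(status));
        }
    }
}
```

These are HTTP-level statuses returned by the server API; task exit codes are a separate concern.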

Member Author

Mmm, I was under the impression that Nomad would have its own set of error codes as well, but judging from this issue, hashicorp/nomad#17782, I think that for client/task errors it stores them in the exit code.

Could be related to #77 🤔

Collaborator

Yes, I agree, but it may be better to handle the exit code in #77 and let this PR handle the infra/HTTP errors.

Member Author

Sounds good - then let's merge and make an edge release :)

@abhi18av abhi18av removed the request for review from jagedn September 23, 2024 09:30
@abhi18av abhi18av assigned abhi18av and unassigned abhi18av Sep 23, 2024
Signed-off-by: Jorge Aguilera <[email protected]>
@abhi18av abhi18av merged commit c14220c into master Sep 23, 2024
2 checks passed
@abhi18av abhi18av deleted the server-exception branch September 23, 2024 13:14
Development

Successfully merging this pull request may close these issues.

  • Implement time-out for unallocated jobs
  • Nexflow crashes when querying jobstate (from a dead server)
2 participants