Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: delete-rule returns HTTP 504 (timeout) for some rules #787

Open
haozturk opened this issue Apr 29, 2024 · 3 comments
Open

Bug: delete-rule returns HTTP 504 (timeout) for some rules #787

haozturk opened this issue Apr 29, 2024 · 3 comments
Labels

Comments

@haozturk
Copy link
Contributor

haozturk commented Apr 29, 2024

Bug Description

UERJ site admin reported this problem and I was able to reproduce it with transfer_ops account. See the details below. Not all rule deletions fail:

[2024-04-29 12:45:46]    10.100.245.113    188.185.31.22    137.138.44.133    Zi-WegtBQFmMN72oWXvvMQAAAe8    200    676    0    9740    "DELETE /rules/8603bb9f74194f07a70e92906f5f10b8 HTTP/1.1"    "tier0_prod-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0-unknown-87178f21cbef497da6d3ca0bae992763"    "wmcore-client/1.29.10"    wmcoreD::--start
[2024-04-29 12:45:58]    10.100.245.113    10.100.61.128    152.92.255.251    Zi-WhgtBQFmMN72oWXvvOwAAAYg    504    682    247    60060086    "DELETE /rules/39a6afe366144e0db800f7bb67ba2cd8 HTTP/1.1"    "rvaladao-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=rvaladao/CN=783687/CN=Renan Bernardo Valadao-unknown-5f116761642e43dd98ac4b4e46ec9a36"    "rucio-clients/1.28.7"    rucio::delete-rule
[2024-04-29 12:49:44]    10.100.245.113    188.185.31.22    137.138.52.11    Zi-XaFHYoA_wJt9RrzO60AAAADE    200    678    0    28449    "DELETE /rules/295287aad56343d0a8c4d4caba783449 HTTP/1.1"    "tier0_replay-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0-unknown-0adf0333b0104ac5a96e03881cf3dd97"wmcore-client/1.29.10"    wmcoreD::--start 

Reproduction Steps

[haozturk@lxplus936 dm]$ rucio delete-rule 39a6afe366144e0db800f7bb67ba2cd8
2024-04-29 15:08:49,887	WARNING	Waiting 0.5s due to reason: server returned 504 
2024-04-29 15:11:50,393	WARNING	Waiting 1.0s due to reason: server returned 504 
2024-04-29 15:11:51,394	ERROR	An unknown exception occurred.
$ rucio whoami
suspended_at : None
account    : transfer_ops
...

Expected Behavior

The rule deletion above should succeed.

Possible Solution

No response

Related Issues

No response

@haozturk haozturk added the bug label Apr 29, 2024
@haozturk
Copy link
Contributor Author

Actually the expiry date of the rule is set properly, but the rule hasn't been cleaned even if it's expired 3 days ago:

$ rucio rule-info 39a6afe366144e0db800f7bb67ba2cd8
Id:                         39a6afe366144e0db800f7bb67ba2cd8
Account:                    rvaladao
Scope:                      user.rvaladao
Name:                       /UERJ/Datasets/USER
RSE Expression:             T2_BR_UERJ
Copies:                     1
State:                      STUCK
Locks OK/REPLICATING/STUCK: 3768/568949/685
Grouping:                   DATASET
Expires at:                 2024-04-26 19:21:33
Locked:                     False
Weight:                     None
Created at:                 2024-04-24 16:54:46
Updated at:                 2024-04-29 14:11:58
Error:                      RequestErrMsg.TRANSFER_FAILED:TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Request cancellation was requested.
Subscription Id:            None
Source replica expression:  None
Activity:                   User Subscriptions
Comment:                    None
Ignore Quota:               False
Ignore Availability:        False
Purge replicas:             False
Notification:               NO
End of life:                None
Child Rule Id:              None

@haozturk
Copy link
Contributor Author

This is what I see in judge-cleaner:

{"process": {"pid": 9}, "@timestamp": "2024-04-29T00:00:14.233Z", "message": "[7/8]: Deleting rule 39a6afe366144e0db800f7bb67ba2cd8 with expression T2_BR_UERJ", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"level": "INFO", "logger": "root"}}
{"process": {"pid": 9}, "@timestamp": "2024-04-29T00:00:14.242Z", "message": "[7/8]: Locks detected for 39a6afe366144e0db800f7bb67ba2cd8", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"level": "WARNING", "logger": "root"}}

Checking further to understand what Locks detected means

@haozturk
Copy link
Contributor Author

haozturk commented May 3, 2024

This is ORA-00054 db error whose message is resource busy and acquire with nowait specified or timeout expired. I suspect this is related to the size of the rule (~0.5M files). It might be that the DB call is timing out or there is an active query locking the table. Will investigate further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant