pages/Run/index.tsx: Add kill run button #51

Merged
merged 15 commits into from
Sep 30, 2024

Conversation

Devansh3712
Member

@Devansh3712 Devansh3712 commented Aug 13, 2023

This PR resolves issue #2 by adding a kill run button that kills all jobs in the current run. It currently uses a mock API endpoint for the success alert and still needs to be connected to teuthology-api.

Screen.Recording.2024-02-12.at.3.09.07.PM.mov

@netlify

netlify bot commented Aug 13, 2023

Deploy Preview for pulpito ready!

Name Link
🔨 Latest commit 53d2449
🔍 Latest deploy log https://app.netlify.com/sites/pulpito/deploys/66f46d7cd4afbd000805a0bd
😎 Deploy Preview https://deploy-preview-51--pulpito.netlify.app

@Devansh3712 Devansh3712 self-assigned this Aug 13, 2023
@Devansh3712 Devansh3712 added the feature New feature or request label Aug 13, 2023
@VallariAg
Member

VallariAg commented Dec 1, 2023

Changes made:

  • Rebased the branch to main
  • Added a commit which
    • moves the code from Run Page to two new components Alert and KillButton
    • uses react-query's useMutation to send a POST request at t-api's /kill endpoint
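As a rough illustration of the request that such a mutation would send, here is a minimal sketch of a request builder. The payload keys (`--run`, `--owner`, `--machine-type`) are an assumption modelled on the `teuthology-kill` flags that appear in the logs later in this thread, and `buildKillRequest` and `apiBase` are hypothetical names, not the PR's actual code. The trailing slash in `/kill/` reflects the redirect noted in a later commit.

```typescript
// Hypothetical payload shape: keys mirror teuthology-kill's CLI flags.
// This is an assumption for illustration, not the confirmed t-api schema.
type KillRunPayload = {
  "--run": string;
  "--owner": string;
  "--machine-type": string;
};

// Plain description of the HTTP request, independent of any fetch library.
type KillRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Build the POST request for the kill endpoint.
// The path ends in "/kill/" so the server does not need to redirect
// from "/kill" to "/kill/" before handling the request.
function buildKillRequest(apiBase: string, payload: KillRunPayload): KillRequest {
  const base = apiBase.replace(/\/+$/, ""); // drop trailing slashes from the base URL
  return {
    url: `${base}/kill/?logs=true`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

The returned object could be handed to `fetch(req.url, req)` inside the mutation function passed to `useMutation`, which keeps the request construction testable on its own.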

TODO:

  • show kill button only when signed-in-user == scheduler-user
  • add kill button on Job Page (to kill a job)
  • fix teuthology-api's kill function (errors occurring there)
  • add a pop-up dialog box which says something like "are you sure you want to kill this run/job?"

@VallariAg
Member

To test this PR, follow instructions in the description of this PR: ceph/teuthology-api#47

Screen.Recording.2024-02-12.at.3.52.16.PM.mov

Member

@zmc zmc left a comment

This is looking better than ever!

Something's not working properly if the api returns an error, though; if /kill/ returns a 500, the UI transitions from "Killing run..." right back to "Are you sure...", without displaying anything related to the failure.

src/components/KillButton/index.tsx Outdated Show resolved Hide resolved
@VallariAg VallariAg requested a review from zmc February 14, 2024 04:39
Member

@zmc zmc left a comment

Sorry for the delay in retesting this!
I think we need to allow org admins to kill, and I'm getting errors trying to kill runs I own. I can't find details aside from the "Failed!" message that pops up in the UI.
If I dismiss the error dialog and click the button a second time, it skips the confirmation and displays the "Failed!" message again immediately, without making another API request.
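The two behaviours reported in these reviews (an error silently falling back to the confirmation step, and a second click skipping confirmation) can both be framed as state-transition problems. The following is a minimal sketch of a dialog state machine under that framing; the state and event names are hypothetical and this is not the component's actual implementation.

```typescript
// Hypothetical dialog states for the kill flow.
type KillDialogState =
  | { step: "closed" }
  | { step: "confirm" }
  | { step: "killing" }
  | { step: "success" }
  | { step: "error"; message: string };

// Hypothetical events: user actions plus the outcome of the API request.
type KillDialogEvent =
  | { type: "open" }
  | { type: "confirmKill" }
  | { type: "resolved" }
  | { type: "rejected"; message: string }
  | { type: "dismiss" };

function killDialogReducer(state: KillDialogState, event: KillDialogEvent): KillDialogState {
  switch (event.type) {
    case "open":
      // Always restart at the confirmation step, discarding any stale result,
      // so a second click never replays an old "Failed!" message.
      return { step: "confirm" };
    case "confirmKill":
      return state.step === "confirm" ? { step: "killing" } : state;
    case "resolved":
      return state.step === "killing" ? { step: "success" } : state;
    case "rejected":
      // A failed request moves to an explicit error step that carries the
      // message, instead of falling back to "confirm".
      return state.step === "killing" ? { step: "error", message: event.message } : state;
    case "dismiss":
      return { step: "closed" };
  }
}
```

With react-query, the "discard stale result" step would roughly correspond to resetting the mutation when the button is clicked, which is what a later commit in this PR describes.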

src/components/KillButton/index.tsx Outdated Show resolved Hide resolved
@VallariAg VallariAg self-assigned this Apr 15, 2024
@VallariAg
Member

VallariAg commented Apr 24, 2024

I have added the logic for letting admins kill all runs in ceph/teuthology-api#55
If the run belongs to the logged-in user, pulpito-ng will show a "Kill" button. If not, it will show a "Kill As Admin" button.
For now, pulpito-ng does not verify whether the user has admin privileges. That check happens on t-api after the request is sent by clicking "Kill As Admin".

I have improved error handling on t-api, so it should now give more details instead of just showing "Failed!". Please let me know if it still doesn't work!
The problem with retrying on a second button click should also be fixed now.

Edit: To test this PR, use this docker-compose: ceph/teuthology#1957


Or to manually test this PR:

Set up teuthology-api inside the teuthology container:

  1. Add instructions to the teuthology Dockerfile to set up t-api inside the teuthology container:
       SHELL ["/bin/bash", "-c"]
       WORKDIR /
       RUN git clone -b admin-kill https://github.com/ceph/teuthology-api && \
           cd /teuthology-api/ && \
           python3 -m venv venv && source ./venv/bin/activate && pip install -e . && \
           export TEUTHOLOGY_API_SERVER_PORT=8083 && \
           exit
       WORKDIR /teuthology
       ....
  2. In the docker-compose file, expose TEUTHOLOGY_API_SERVER_PORT for the teuthology service:
       teuthology:
          ports:
             - 8083:8083
  3. Start the containers.
  4. exec into the teuthology container and run these commands:
       cd /teuthology-api
       source ./venv/bin/activate
       <add your teuthology-api's .env file> (add `TEUTHOLOGY_PATH=/teuthology` if not present)
       uvicorn teuthology_api.main:app --reload --port 8083 --host 0.0.0.0
  5. Schedule a run on the /suite route, using the README payload.
  6. Navigate to the run's page on pulpito-ng and click "Kill run"/"Kill As Admin".

@kamoltat
Member

kamoltat commented May 28, 2024

Hi Vallari,
after testing out your PRs, there are some extra steps that I needed to take in order to schedule the teuthology jobs with http://localhost:8083/docs

2024-05-28 14:39:26,375.375 INFO:root:teuthology version: 1.1.1.dev713+gb9e3da87
2024-05-28 14:39:26,376.376 INFO:teuthology.suite:Using random seed=2820
2024-05-28 14:39:26,376.376 INFO:teuthology.suite.run:kernel sha1: distro
2024-05-28 14:39:26,644.644 INFO:teuthology.suite.run:ceph sha1 explicitly supplied
2024-05-28 14:39:26,644.644 INFO:teuthology.suite.run:ceph sha1: 0cd602b1d11a1cc56441fbb1faec9e7a9b1cd7d5
2024-05-28 14:39:26,645.645 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=0cd602b1d11a1cc56441fbb1faec9e7a9b1cd7d5
2024-05-28 14:39:27,885.885 DEBUG:teuthology.packaging:looking for centos/8 x86_64 default
2024-05-28 14:39:27,885.885 DEBUG:teuthology.packaging:build: centos/8 arm64 default
2024-05-28 14:39:27,885.885 DEBUG:teuthology.packaging:build: ubuntu/22.04 x86_64 default
2024-05-28 14:39:27,885.885 DEBUG:teuthology.packaging:build: centos/8 x86_64 default
2024-05-28 14:39:27,887.887 INFO:teuthology.suite.run:ceph version: 19.0.0-3953.g0cd602b1
2024-05-28 14:39:28,169.169 DEBUG:teuthology.repo_utils:git ls-remote https://github.com/ceph/ceph-ci.git main -> 1fb5d10608dccee73edd6ef8a2c7b804a097c324
2024-05-28 14:39:28,169.169 INFO:teuthology.suite.run:ceph-ci branch: main 1fb5d10608dccee73edd6ef8a2c7b804a097c324
2024-05-28 14:39:28,170.170 DEBUG:teuthology.repo_utils:Resetting repo at /root/src/github.com_ceph_ceph-c_main to origin/main
Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/teuthology-api/src/teuthology_api/services/helpers.py", line 51, in _execute_with_logs
    func(args)
  File "/teuthology-api/venv/lib/python3.12/site-packages/teuthology/suite/__init__.py", line 143, in main
    run = Run(conf)
          ^^^^^^^^^
  File "/teuthology-api/venv/lib/python3.12/site-packages/teuthology/suite/run.py", line 56, in __init__
    self.base_config = self.create_initial_config()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/teuthology-api/venv/lib/python3.12/site-packages/teuthology/suite/run.py", line 102, in create_initial_config
    teuthology_branch, teuthology_sha1 = self.choose_teuthology_branch()
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/teuthology-api/venv/lib/python3.12/site-packages/teuthology/suite/run.py", line 270, in choose_teuthology_branch
    raise BranchMismatchError(
teuthology.exceptions.BranchMismatchError: Cannot use branch main with repo /teuthology because config.teuthology_path is set
2024-05-28 14:39:28,929.929 ERROR:teuthology_api.services.helpers:404 Client Error: Not Found for url: http://paddles:8080/runs/kamoltat-2024-05-28_14:39:24-teuthology:no-ceph-main-distro-default-testnode/
2024-05-28 14:39:28,930.930 ERROR:teuthology_api.services.suite:teuthology.suite.main failed with the error: HTTPException(status_code=404, detail="HTTPError('404 Client Error: Not Found for url: http://paddles:8080/runs/kamoltat-2024-05-28_14:39:24-teuthology:no-ceph-main-distro-default-testnode/')")
INFO:     192.168.65.1:26583 - "POST /suite/?logs=false HTTP/1.1" 500 Internal Server Error

I had to remove teuthology_path from ~/.teuthology.yaml in order to get the run scheduled successfully.

Also, after executing the kill button, I get this error

Screenshot 2024-05-28 at 10 52 18 AM

However, even with the error above, we were still able to successfully kill the run we intended to kill.

Not sure if I was doing something wrong, but let me know if you know anything about these errors.

@kamoltat
Member

Here is a prettier version of the logs in the kill confirmation box above.

INFO:     192.168.65.1:27242 - "POST /kill?logs=true HTTP/1.1" 307 Temporary Redirect
2024-05-28 14:52:00,524.524 INFO:teuthology_api.services.kill:['/teuthology/virtualenv/bin/teuthology-kill', '--owner', 'kamoltat', '--run', 'kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode', '--machine-type', 'testnode']
2024-05-28 14:52:02,804.804 INFO:teuthology_api.services.kill:b'2024-05-28 14:52:01,525.525 INFO:teuthology.kill:Checking Beanstalk Queue...\n2024-05-28 14:52:01,551.551 INFO:teuthology.kill:Deleting job from queue. ID: 10 Name: kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode Desc: None\n2024-05-28 14:52:01,561.561 INFO:teuthology.kill:Deleting job from queue. ID: 11 Name: kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode Desc: teuthology:no-ceph/{clusters/single tasks/teuthology}\n2024-05-28 14:52:01,578.578 INFO:teuthology.kill:Deleting jobs from paddles: [\'11\']\n2024-05-28 14:52:01,614.614 INFO:teuthology.kill:No teuthology processes running\n/teuthology/teuthology/lock/cli.py:136: SyntaxWarning: invalid escape sequence \'\\w\'\n  mo = re.match(\'\\w+@(\\w+?)\\..*\', s[\'name\'])\n2024-05-28 14:52:02,692.692 INFO:teuthology.kill:No locked machines. Not nuking anything\nTraceback (most recent call last):\n  File "/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>\n    sys.exit(main())\n             ^^^^^^\n  File "/teuthology/scripts/kill.py", line 44, in main\n    teuthology.kill.main(args)\n  File "/teuthology/teuthology/kill.py", line 40, in main\n    kill_run(run_name, archive_base, owner, machine_type,\n  File "/teuthology/teuthology/kill.py", line 78, in kill_run\n    report.try_mark_run_dead(run_name)\n  File "/teuthology/teuthology/report.py", line 579, in try_mark_run_dead\n    jobs = reporter.get_jobs(run_name, fields=[\'status\'])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/teuthology/teuthology/report.py", line 374, in get_jobs\n    response.raise_for_status()\n  File "/teuthology/virtualenv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 404 Client Error: Not Found for url: 
http://paddles:8080/runs/kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode/jobs/?fields=status,job_id\n'
2024-05-28 14:52:02,804.804 ERROR:teuthology_api.services.kill:teuthology-kill command failed with the error: Exception(b'2024-05-28 14:52:01,525.525 INFO:teuthology.kill:Checking Beanstalk Queue...\n2024-05-28 14:52:01,551.551 INFO:teuthology.kill:Deleting job from queue. ID: 10 Name: kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode Desc: None\n2024-05-28 14:52:01,561.561 INFO:teuthology.kill:Deleting job from queue. ID: 11 Name: kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode Desc: teuthology:no-ceph/{clusters/single tasks/teuthology}\n2024-05-28 14:52:01,578.578 INFO:teuthology.kill:Deleting jobs from paddles: [\'11\']\n2024-05-28 14:52:01,614.614 INFO:teuthology.kill:No teuthology processes running\n/teuthology/teuthology/lock/cli.py:136: SyntaxWarning: invalid escape sequence \'\\w\'\n  mo = re.match(\'\\w+@(\\w+?)\\..*\', s[\'name\'])\n2024-05-28 14:52:02,692.692 INFO:teuthology.kill:No locked machines. Not nuking anything\nTraceback (most recent call last):\n  File "/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>\n    sys.exit(main())\n             ^^^^^^\n  File "/teuthology/scripts/kill.py", line 44, in main\n    teuthology.kill.main(args)\n  File "/teuthology/teuthology/kill.py", line 40, in main\n    kill_run(run_name, archive_base, owner, machine_type,\n  File "/teuthology/teuthology/kill.py", line 78, in kill_run\n    report.try_mark_run_dead(run_name)\n  File "/teuthology/teuthology/report.py", line 579, in try_mark_run_dead\n    jobs = reporter.get_jobs(run_name, fields=[\'status\'])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/teuthology/teuthology/report.py", line 374, in get_jobs\n    response.raise_for_status()\n  File "/teuthology/virtualenv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 404 Client Error: Not Found for url: 
http://paddles:8080/runs/kamoltat-2024-05-28_14:51:14-teuthology:no-ceph-main-distro-default-testnode/jobs/?fields=status,job_id\n')
INFO:     192.168.65.1:27242 - "POST /kill/?logs=true HTTP/1.1" 500 Internal Server Error
INFO:     192.168.65.1:27397 - "GET / HTTP/1.1" 200 OK

Member

@kamoltat kamoltat left a comment

Just added some questions regarding exceptions I encountered while testing locally myself.

@VallariAg
Member

Thank you for reviewing this, I'll take a look at both of the above problems! It looks like error handling can be improved further too.

@VallariAg
Member

VallariAg commented Jun 11, 2024

@kamoltat I have improved error handling so the logs are more readable.

About the traceback, I think your local teuthology branch might not be up to date with the remote branch, because Zack removed the nuke command from main and I still see it logged in that traceback. The latest teuthology commit might not have the problem.
Also, if you see this tracker's traceback (https://tracker.ceph.com/issues/66440) when killing runs, please ignore it: it's a separate problem in teuthology in production which I observed when killing queued runs.

As for editing the ".teuthology.yaml" file: there's a note in the t-api README that we can use "--teuthology-branch": "", // necessary for docker setup in the suite request. Ref:
https://github.com/ceph/teuthology-api?tab=readme-ov-file#route-suite

I am also trying to think of ways to improve the setup for running t-api inside the teuthology container so it doesn't need a lot of manual steps.
Update: I have updated the docker setup so we wouldn't need to do the above manual setup, please use this docker-compose file: ceph/teuthology#1956

@kamoltat
Member

kamoltat commented Jun 24, 2024

@VallariAg
I'm pretty sure this is the issue you were talking about.
I was trying to kill runs with all queued jobs:
image

However, running jobs should be expected to be killed successfully, correct?

@VallariAg
Member

@kamoltat both (running and waiting) seem to have a regression at the moment: https://tracker.ceph.com/issues/66440
I was able to debug the problem, I'll open a PR for a fix tomorrow.

Though I see in that screenshot that the log trace is not very readable. I have improved that in a recent commit in this PR. Pulling the latest commit of this branch should fix that.

@kamoltat
Member

> @kamoltat both (running and waiting) seems to have a regression at the moment: https://tracker.ceph.com/issues/66440 I was able to debug the problem, I'll open a PR for a fix tomorrow.
>
> Though I see in that screenshot that the log trace is not very readable. I have improved that in a recent commit in this PR. Pulling the latest commit of this branch should fix that.

@VallariAg sounds good thanks

@kamoltat
Member

Putting https://github.com/ceph/teuthology-api/pull/63/files
here so it's easier for me to query when trying to test PR

@VallariAg
Member

I have updated the docker-compose file to only mount "teuthology_api/src": ceph/teuthology#1957
Now you wouldn't need to manually edit that for the next review, hope that helps!

@kamoltat
Member

kamoltat commented Jun 26, 2024

@VallariAg ceph/teuthology#1959
This should resolve the issue with killing (running + waiting) jobs, right?

@VallariAg
Member

VallariAg commented Jun 26, 2024

@kamoltat It'll resolve for running jobs, yes! For waiting there are more regressions (mentioned in the comments of that ticket).
I haven't gotten around to testing my fix for "waiting" jobs yet, I'll update here soon.

Update: we should be able to kill both running and queued runs.

@zmc
Member

zmc commented Aug 13, 2024

The kill button is greyed out for me, on this run: https://pulpito-ng-zmc.ceph.com/runs/zmc-2024-08-13_20:31:57-teuthology:no-ceph-main-distro-default-smithi
I deployed this branch at the URL above. The run is currently queued. Here's the redacted response from t-api's root endpoint:
{"root":"success","session":{"id": "XXX","username":"zmc","state":"active","role":"admin","access_token":"XXX"}}

Edit: Pointing at ceph/teuthology-api#55, the button is enabled, though the tooltip says I'm not an admin. Clicking it, and the confirmation after, shows hitting: https://teuthology.front.sepia.ceph.com:8999/kill?logs=true
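To make the admin check concrete, here is a small sketch that derives admin status and username from a root-endpoint response shaped like the one quoted above. The field names come from that redacted response; the helper names and the handling of a missing `session` (treated as logged out) are assumptions.

```typescript
// Session fields as they appear in the redacted response quoted above.
type Session = {
  id: string;
  username: string;
  state: string;
  role: string;
  access_token: string;
};

// Assumption: a logged-out response simply lacks a usable "session" object.
type RootResponse = {
  root: string;
  session?: Session | null;
};

// True only when a session exists and its role is exactly "admin".
function isUserAdmin(resp: RootResponse): boolean {
  return resp.session?.role === "admin";
}

// Username of the logged-in user, or undefined when there is no session.
function sessionUsername(resp: RootResponse): string | undefined {
  return resp.session?.username;
}
```

A check like this only decides what the UI displays; per the earlier comments, the authoritative permission check still happens on t-api.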

@VallariAg
Member

@zmc I think something might not be right with the t-api deployment. I tried to sign in on the above pulpito URL and it showed "internal server error". Maybe some env variable is missing; I don't have access to t-api's traceback, so I'm not sure why it failed.

Devansh3712 and others added 10 commits August 28, 2024 19:40
1. create `Alert` and `KillButton` components
2. lib/teuthologyAPI: add useRunKill hook
    (which uses `useMutation` to send POST req at /kill to t-api)
3. create lib/teuthologyAPI.d.ts and add 'KillRun'

Signed-off-by: Vallari Agrawal <[email protected]>
new features:
1. Disable kill-run button for finished runs
2. Hide kill-run button for users who don't own the run

Signed-off-by: Vallari Agrawal <[email protected]>
This commit also removes "--user" from the t-api /kill
route request.

Signed-off-by: Vallari Agrawal <[email protected]>
Show error message received in response from api
when killing a run.

This commit also includes some minor UI improvements
on KillButtonDialog. And removes "dry_run" query
param from "useRunKill".

Signed-off-by: Vallari Agrawal <[email protected]>
If the run belongs to the logged-in user, then it'll
show "Kill" button. If not, then  "Kill As Admin" btn.
For now, pulpito-ng does not verify if user has admin
privileges. That check happens on t-api after the request is sent.

This commit also refreshes the mutation obj for kill-button,
upon button click. This fixes the issue of retrying to kill runs
by clicking the button again.

Signed-off-by: Vallari Agrawal <[email protected]>
…logy"

"scheduled_<username>@teuthology" is the default owner name if
run is scheduled from teuthology CLI tool. This commit allows
users of same github username to recognize it as their jobs.

Signed-off-by: Vallari Agrawal <[email protected]>
Read "isUserAdmin" from useSession and
disable "Kill as Admin" button if isUserAdmin is false

Signed-off-by: Vallari Agrawal <[email protected]>
Request get redirected to /kill/?logs=true if trailing
slash is missing.

Signed-off-by: Vallari Agrawal <[email protected]>
Add getHelperMessage() to display better
tooltip message on kill button.

The message should help the user understand why
the button is disabled and know if they are admin user.

Signed-off-by: Vallari Agrawal <[email protected]>
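One of the commits above treats `scheduled_<username>@teuthology` as belonging to the GitHub user `<username>`. A minimal sketch of that ownership rule follows; the function name is hypothetical, and exact, case-sensitive matching is an assumption rather than something the commit message states.

```typescript
// Decide whether a run's owner string belongs to the logged-in GitHub user.
// Accepts either the bare username or the default owner name used when a run
// is scheduled from the teuthology CLI: "scheduled_<username>@teuthology".
// Assumption: comparison is exact and case-sensitive.
function isRunOwner(runOwner: string, githubUsername: string): boolean {
  if (githubUsername === "") {
    return false; // no logged-in user, so nobody owns the run from the UI's view
  }
  return (
    runOwner === githubUsername ||
    runOwner === `scheduled_${githubUsername}@teuthology`
  );
}
```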
@zmc zmc requested a review from kamoltat September 18, 2024 15:55
@zmc
Member

zmc commented Sep 18, 2024

I was able to kill two runs that matched my github username: one with a queued job, and one with a running job. Both behaved properly. A tiny issue is that we don't refresh the job status on the page after killing, but we don't have to block this PR on that.

Screenshot 2024-09-18 at 11 37 28
Screenshot 2024-09-18 at 11 30 25

Member

@zmc zmc left a comment

Now that the feature works reliably and we've sorted out the CORS and other issues getting in the way, I'm reminded that we're still showing the button when we know it can't be used either because of permissions, or because the run has completed and there's nothing to kill.

We should hide the button when:

  1. The user is logged out
  2. The run is finished
  3. The user is not an admin, and the run belongs to someone else (90% sure on this one)

I think the above might actually be all the cases where the button would normally be greyed out.

Sorry to retract an approval @VallariAg, but I bet you'd agree that 1 and 2 are unhelpful, and likely 3 as well. Curious what @kamoltat thinks.

@VallariAg
Member

VallariAg commented Sep 20, 2024

@zmc my logic for greying out the button is that I wanted to give a message to the user (with tooltip) explaining why they can't kill a run (otherwise they might wonder why they can't see the button). But I 100% agree that not having a button for 1 and 2 would be more intuitive. I can change that!

I think for 3, the benefits of having a disabled button are that:

  1. people who don't have admin access (but would want to have it) can know they're missing something. Otherwise, they wouldn't even know that users can have admin access, especially because it's a new feature. And admin-kill might become a hidden feature.
  2. the disabled button would be a good indicator that a user doesn't own the run and needs to use --owner while scheduling. If there's no button, it might confuse users as to why they don't see it (especially if there's a difference between owner and github-username). We can help explain this in the tooltip, and even tell them who the owner is. Example: "Run owned by another user <owner name>, you need to be an admin user to kill this run"

@zmc
Member

zmc commented Sep 23, 2024

Thanks @VallariAg, I think I agree with you about keeping the button in scenario 3 - adding the value to the tooltip is a good idea too.
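The outcome of this discussion can be summarised as a small decision function: hide the button when the user is logged out or the run is finished, and otherwise show it, disabled with an explanatory tooltip when a non-admin views someone else's run. The sketch below assumes hypothetical names and tooltip wording; it is not the merged `KillButton` code.

```typescript
// Inputs the decision needs; all names here are hypothetical.
type KillButtonInput = {
  loggedIn: boolean;
  runFinished: boolean;
  isOwner: boolean;
  isAdmin: boolean;
  runOwner: string;
};

// Either the button is hidden, or it is shown with a label, state and tooltip.
type KillButtonView =
  | { visible: false }
  | { visible: true; label: "Kill" | "Kill As Admin"; disabled: boolean; tooltip: string };

function killButtonView(input: KillButtonInput): KillButtonView {
  // Scenarios 1 and 2: nothing useful to show, so hide the button entirely.
  if (!input.loggedIn || input.runFinished) {
    return { visible: false };
  }
  if (input.isOwner) {
    return { visible: true, label: "Kill", disabled: false, tooltip: "Kill this run" };
  }
  if (input.isAdmin) {
    return {
      visible: true,
      label: "Kill As Admin",
      disabled: false,
      tooltip: `Run owned by ${input.runOwner}; you can kill it as an admin`,
    };
  }
  // Scenario 3: keep a disabled button so the user learns why they cannot kill.
  return {
    visible: true,
    label: "Kill As Admin",
    disabled: true,
    tooltip: `Run owned by another user (${input.runOwner}); you need to be an admin user to kill this run`,
  };
}
```

Keeping this logic in one pure function makes each of the three scenarios easy to cover with a unit test, independent of how the component renders the result.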

Send Run query as KillButton props. Then do all
processing of payload and ownership in KillButton
component.

This also refreshes Run query after killing a run.

Signed-off-by: Vallari Agrawal <[email protected]>
@VallariAg VallariAg requested a review from zmc September 25, 2024 20:10
Member

@zmc zmc left a comment

Thank you so much @Devansh3712 and @VallariAg for making this happen!

@zmc zmc dismissed kamoltat’s stale review September 30, 2024 23:18

never re-reviewed

@zmc zmc merged commit e2bad8f into ceph:main Sep 30, 2024
7 checks passed
Labels
feature New feature or request
4 participants