Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding serve support for Api server #1728

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

blublinsky
Copy link
Contributor

Why are these changes needed?

Extending API server to support serve

Related issue number

Checks

  • [x ] I've made sure the tests are passing.
  • Testing Strategy
    • [ x] Unit tests
    • [ x] Manual tests
    • This PR is not tested :(

@blublinsky
Copy link
Contributor Author

@tedhtchang, @z103cb please take a look

cc: @kevin85421

Copy link
Contributor

@tedhtchang tedhtchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very Nice! I was able to verify the PR following the instruction. There are some typos otherwise LGTM. Thanks

apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
apiserver/ServeSubmission.md Outdated Show resolved Hide resolved
@blublinsky
Copy link
Contributor Author

@kevin85421 its ready for you

@kevin85421 kevin85421 self-assigned this Dec 12, 2023
@kevin85421
Copy link
Member

The CI fails consistently. Would you mind taking a look? Thanks!

@tedhtchang
Copy link
Contributor

tedhtchang commented Dec 13, 2023

This test sometimes PASSED locally.

pytest --log-level=info -v -s tests/test_sample_rayservice_yamls.py::TestRayService::test_zero_downtime_rollout

Sometimes failed with

Failed to connect to rayservice-sample-serve-svc.default.svc.cluster.local port 8000: Connection refused
command terminated with exit code 7
....
INFO     framework.utils:utils.py:375 Execute command (check_output): kubectl exec curl -n default -- curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc.default.svc.cluster.local:8000/fruit/ -d '["MANGO", 2]'
INFO     framework.utils:utils.py:382 Exception: b''
====================================================

or

....
INFO     test_sample_rayservice_yamls:test_sample_rayservice_yamls.py:149 Ray service transitioned to status Running.
INFO     test_sample_rayservice_yamls:test_sample_rayservice_yamls.py:154 Ray service has moved to cluster "rayservice-sample-raycluster-xtvwr"
INFO     framework.utils:utils.py:375 Execute command (check_output): kubectl exec curl -n default -- curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc.default.svc.cluster.local:8000/fruit/ -d '["MANGO", 2]'
INFO     framework.utils:utils.py:379 Output: b'6'
============================================================================== short test summary info ==============================================================================
FAILED tests/test_sample_rayservice_yamls.py::TestRayService::test_zero_downtime_rollout - AssertionError

@blublinsky
Copy link
Contributor Author

ANd I do not think this PR changes anything there

@kevin85421
Copy link
Member

Hmm, I'm not referring to the test-rayservice-sample-yamls-latest-release. More than one CI check has failed. You may need to rebase the branch or open a new PR instead.

@blublinsky
Copy link
Contributor Author

@kevin85421 we are all set

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the KubeRay API server should construct its own HTTP client instead of using the existing one in KubeRay.

  • The dashboard HTTP client in KubeRay is not a user-facing API, so no one promises its backward compatibility. Constructing a new one in KubeRay API server can improve the stability.
  • Reducing the dependencies between KubeRay API server and KubeRay can avoid a lot of issues for users and developers. For example, KubeRay API server may want to support multiple versions of KubeRay and Ray. It is hard to achieve if there are a lot of dependencies between them.
  • The fact that these functions are not used by KubeRay means they are not tested by end-to-end tests and not utilized by users. Therefore, it is difficult to ascertain whether they work or not.
  • Using the dashboard HTTP client in KubeRay for communication with Ray doesn't seem to make much sense to me. The KubeRay API server can directly talk to Ray.

@blublinsky
Copy link
Contributor Author

blublinsky commented Dec 14, 2023

Sorry @kevin85421 , I completely disagree with your statements, see point-by-point replies below

Perhaps the KubeRay API server should construct its own HTTP client instead of using the existing one in KubeRay.
I do not think it makes sense to maintain 2 copies of identical code
The dashboard HTTP client in KubeRay is not a user-facing API, so no one promises its backward compatibility. Constructing a new one in the KubeRay API server can improve the stability.
It is not user-facing in the API server either. API Server uses these APIs to communicate with the Ray cluster and then convert them to the internal representation defined by its own APIs
Reducing the dependencies between KubeRay API server and KubeRay can avoid a lot of issues for users and developers. For example, the KubeRay API server may want to support multiple versions of KubeRay and Ray. It is hard to achieve if there are a lot of dependencies between them.
Theoretically, it might be right, but practically will lead to a huge duplication of the code. The API server does depend on the KubeRay operator
The fact that these functions are not used by KubeRay means they are not tested by end-to-end tests and not utilized by users. Therefore, it is difficult to ascertain whether they work or not.
That's why I was testing them. Besides, they are used by KubeRay operator
Using the dashboard HTTP client in KubeRay for communication with Ray doesn't seem to make much sense to me. The KubeRay API server can directly talk to Ray.
It does talk to Ray directly using the dashboard HTTP client as a code to do this

@blublinsky
Copy link
Contributor Author

@kevin85421, so where do we go from here. We disagree, should we ask for the third opinion?

@blublinsky
Copy link
Contributor Author

@kevin85421 we need to come to some kind of resolution here. Leaving it hanging is not a solution @architkulkarni ?
Also, Ray job support is using the same class

@blublinsky blublinsky force-pushed the servesupport branch 3 times, most recently from 6b4e022 to 900f6c2 Compare January 8, 2024 09:41
@blublinsky
Copy link
Contributor Author

@kevin85421 it has been over a month. Any chance you can get to this?

@anyscalesam anyscalesam added the enhancement New feature or request label Jan 29, 2024
@kevin85421
Copy link
Member

I will remove all functions in dashboard_httpclient.go that are not used by the KubeRay operator itself before the release of KubeRay v1.1.0. I don't want to have any unnecessary dependencies between KubeRay and KubeRay API server. In addition, I may also consider to move the API server to a separate repository.

We have 3 possible options to clean up dashboard_httpclient.go

@blublinsky
Copy link
Contributor Author

blublinsky commented Feb 1, 2024

I will remove all functions in dashboard_httpclient.go that are not used by the KubeRay operator itself before the release of KubeRay v1.1.0. I don't want to have any unnecessary dependencies between KubeRay and KubeRay API server. In addition, I may also consider to move the API server to a separate repository.

There are 2 methods added there:

  1. DeployApplication(context.Context, string) error which was copied from driver, thus making it simpler and cleaner trivial 10 lines of code
  2. DeleteServeApplications(context.Context) error, which is 15 trivial lines of code

The biggest change is updating the return definition to align it with the definition of the serve implementation

Do whatever you have to do. This is taking way too long

@blublinsky
Copy link
Contributor Author

@kevin85421 moved dashboard_httpclient.go API server specific version and supporting files to the API server. Can we, please, finally merge it

@blublinsky
Copy link
Contributor Author

@kevin85421 the only change to the ray operator is in pod.go to capitalize initLivenessAndReadinessProbe to make it visible

@@ -239,7 +239,7 @@ func DefaultWorkerPodTemplate(ctx context.Context, instance rayv1.RayCluster, wo
return podTemplate
}

func initLivenessAndReadinessProbe(rayContainer *corev1.Container, rayNodeType rayv1.RayNodeType, creatorCRDType utils.CRDType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to avoid having the KubeRay API server call functions within KubeRay. I will only take care of the CRD compatibility, not the internal functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, here we go again. There is plenty of dependencies already. Do you really assume reproducing all of the code in the API server? This makes no sense to me, sorry. Can we, please, get a second opinion from one of your colleges?
We have been sitting on this PR for several month right now and its getting counter productive. I can do requested enhancements anymore, because we are stuck here for several month

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants