[issue-368] knative integration with DataIndex and JobService #467

Merged · 3 commits merged into apache:main on Sep 19, 2024

Conversation

@jianrongzhang89 (Contributor) commented May 20, 2024

Implements the Knative integration for DataIndex and JobService as well as the workflow as outlined in the ADR:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit#heading=h.ds8q4xtkmu64

Unit test cases are updated. See the ADR for common test cases:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing

Motivation for the change:
Knative integration

Fix #368

Checklist

  • Add or Modify a unit test for your change
  • Have you verified that all the tests are passing?
How to backport a pull request to a different branch?

If something goes wrong, the author will be notified, and at that point a manual backport is needed.

NOTE: automated backporting is triggered whenever a pull request on the main branch is labeled or closed, but both conditions must be satisfied for the new PR to be created.

@jianrongzhang89 marked this pull request as draft on May 20, 2024 02:54
@ricardozanini (Member) left a comment

Thank you for taking care of this implementation. I'm sorry for the number of comments, but I guess you're getting familiar with the code base.

Let's try not to change the base API for object management. The client reference is cached for use across the code base.

Review threads (resolved): api/v1alpha08/sonataflowplatform_types.go, controllers/knative/knative.go, controllers/profiles/common/mutate_visitors.go, controllers/profiles/common/object_creators.go, controllers/profiles/common/properties/managed.go, workflowproj/operator.go
@ricardozanini (Member)

To pass the file generation check, make sure to run `make generate-all`, then `make vet fmt` before committing.

@jianrongzhang89 (Contributor, Author)

> To pass the file generation check, make sure to run `make generate-all`, then `make vet fmt` before committing.

Thanks!

@jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 3c95aac to e6b4601 (May 21, 2024 01:39)
@wmedvede (Contributor)

Hi @jianrongzhang89 , this starts to look good!

Some results here:

Case 1:

DI, JS, and workflows take eventing configuration from the platform

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The use case works, events are produced/registered as expected.

  2. Expected triggers are created

NAME BROKER SINK AGE CONDITIONS READY REASON
callbackstatetimeouts-callbackevent-trigger default service:callbackstatetimeouts 50m 7 OK / 7 True
sonataflow-platform-data-index-jobs-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-definition-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-error-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-node-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-sla-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-state-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-variable-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-job-service-create-job-trigger default service:sonataflow-platform-jobs-service 52m 7 OK / 7 True
sonataflow-platform-job-service-delete-job-trigger default service:sonataflow-platform-jobs-service 52m 7 OK / 7 True

  3. Expected sinkbindings are created

callbackstatetimeouts-sb SinkBinding sinkbindings.sources.knative.dev broker:default True
jobs-service-sb SinkBinding sinkbindings.sources.knative.dev broker:default True

@jianrongzhang89 to be continued tomorrow.

@ricardozanini (Member)

@jianrongzhang89 I'll do another review round tomorrow.

@wmedvede (Contributor)

Hi @jianrongzhang89, I was executing the following use case in OpenShift (before, I had only worked with minikube) and I can see the following weird behaviour.

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The Job service deployment is produced.
  2. And after some minutes, when the callbackstatecallback workflow build finishes, the corresponding deployment is also produced.

But, if I wait some time, I start to see the workflow and the JS restarting forever.

See some cases in the attached screenshots (2024-05-30 11-22-50, 11-22-41, and 11-24-17; images not included here).

@jianrongzhang89 (Contributor, Author) commented May 30, 2024

@wmedvede this issue is now fixed. Please check again.

> Hi @jianrongzhang89, I was executing the following use case in OpenShift (before, I had only worked with minikube) and I can see the following weird behaviour.
>
> https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1
>
>   1. The Job service deployment is produced.
>   2. And after some minutes, when the callbackstatecallback workflow build finishes, the corresponding deployment is also produced.
>
> But, if I wait some time, I start to see the workflow and the JS restarting forever.

@ricardozanini (Member) left a comment

Overall, looks good! I think we can turn a few knobs here and there and it should be ready to go for QE to verify.

Review threads (resolved): api/condition_types.go, controllers/knative/knative.go, controllers/profiles/dev/object_creators_dev.go, controllers/profiles/dev/profile_dev_test.go, controllers/profiles/dev/states_dev.go, controllers/profiles/preview/states_preview.go
@jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 35bffcc to 13b1bbb (June 5, 2024 02:38)
@ricardozanini (Member) left a comment

I think we can move on now to validate this PR. @domhanak can you take a look?

@wmedvede (Contributor)

Hi @jianrongzhang89, here are some more test results with the latest changes:

Case 2) is still failing:
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case2

After deploying the DI, JS, and the workflow, we have the same outcome as before:

kubectl get workflow -n case2-kn-eventing
 
NAME                    PROFILE   VERSION   URL   READY   REASON
callbackstatetimeouts   preview   0.0.1           False   DeploymentFailure

Case 5) is now working, with the following considerations:
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Trigger name collisions:

When we use the SFCP with the cluster-wide instance of the DI, JS, and Broker, all the triggers must be created in the namespace of the SFP that defines these objects (in the example, case5-kn-eventing). This is necessary from the point of view of Knative Eventing.

And thus, if we deploy a workflow "callbackstatetimeouts" in namespace1, a trigger named callbackstatetimeouts-callbackevent-trigger will be created in "case5-kn-eventing".

Now, if we deploy "callbackstatetimeouts" in namespace2, the operator will try to create callbackstatetimeouts-callbackevent-trigger in "case5-kn-eventing" again and will produce an error.

Of course, the error only happens when we use that cluster-wide configuration.

Considering that we are doing work in parallel to support deploying the same workflow name in different namespaces, I think this trigger issue must be fixed as part of this PR. It shouldn't take much.

Note: I think we should include the namespace in the trigger name only for workflows that are deployed in a different namespace.

Trigger name length:

It looks like Trigger names, I don't know why, must be no longer than 63 characters. With a longer name the trigger appears to be created, but it doesn't work. By querying triggers with "kn trigger list", we can see the following output.

sonataflow-broker service:sonataflow-platform-data-index-service 9s 3 OK / 6 False NotSubscribed : Subscription.messaging.knative.dev "sonataflow-broker-sonataflow-pla6ff7c12b8027f7bef859225cc5ef7cf" is invalid: metadata.labels: Invalid value: "sonataflow-platform-data-index-service-process-def-triggerssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff": must be no more than 63 characters

Considering that we are not that far from reaching that limit, I think we must:

  1. Ask whether this is a bug in Knative Eventing. Why such short trigger names? Maybe this was fixed in a newer version, or maybe it is really needed.

Maybe the rationale for that limit is this: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#:~:text=in%20RFC%201035.-,This%20means%20the%20name%20must%3A,start%20with%20an%20alphabetic%20character

  2. If that limit is there for good reasons, we must apply smart trimming of the generated Trigger names; otherwise the probability of a failure is huge and the integration won't work.

On the other hand, SinkBinding names look to support 253 characters like other k8s objects.

@ricardozanini (Member) commented Jun 13, 2024

@wmedvede @jianrongzhang89 For the naming problem, let's use trigger-<random-id>, as pods do. Then we just have to make sure that this info is tied to the parent object. For a workflow, we can add a .status.triggers attribute holding an array of Knative triggers, and the same for DI and JS. Then the name won't matter anymore and we have a good way of finding the relation between them.

@jianrongzhang89 (Contributor, Author)

For case 2, there is no broker defined in the platform's spec.eventing field and the workflow does not have a sink defined. In this case, based on our previous discussions, the test is intended to fail. Please see the ADR https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing, use case #1.1.
@ricardozanini could you please confirm this is the desired behavior?

> Hi @jianrongzhang89, here are some more test results with the latest changes:
>
> Case 2) is still failing: https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case2
>
> After deploying the DI, JS, and the workflow, we have the same outcome as before:
>
> kubectl get workflow -n case2-kn-eventing
>
> NAME                    PROFILE   VERSION   URL   READY   REASON
> callbackstatetimeouts   preview   0.0.1           False   DeploymentFailure
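
For reference, the kind of platform-level broker configuration that case 2 deliberately omits would look roughly like the sketch below. This is illustrative only: the spec.eventing.broker shape follows the API this PR introduces as described in the ADR, but the broker name and namespace are placeholders and the exact field names should be checked against the merged SonataFlowPlatform types.

```yaml
# Illustrative sketch only: a SonataFlowPlatform that does define a broker
# under spec.eventing, so DI/JS and workflows inherit a sink instead of
# failing with DeploymentFailure. The broker "default" and the namespace
# are placeholders taken from the case examples above.
apiVersion: sonataflow.org/v1alpha08
kind: SonataFlowPlatform
metadata:
  name: sonataflow-platform
  namespace: case2-kn-eventing
spec:
  eventing:
    broker:
      ref:
        apiVersion: eventing.knative.dev/v1
        kind: Broker
        name: default
```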

@jianrongzhang89 (Contributor, Author)

> @wmedvede @jianrongzhang89 For the naming problem, let's use trigger-<random-id>, as pods do. Then we just have to make sure that this info is tied to the parent object. For a workflow, we can add a .status.triggers attribute holding an array of Knative triggers, and the same for DI and JS. Then the name won't matter anymore and we have a good way of finding the relation between them.

@ricardozanini @wmedvede this is a great idea. This limitation is indeed required by Knative Eventing itself, as it uses a label with the trigger name in the Subscription objects. I updated the PR with a slightly different implementation that uses Knative's ChildName() function to generate the trigger name. It is based on the UID of the parent object (data index, job service, or workflow) and makes sure that the generated name won't be longer than 63 characters. Here is an example of the generated trigger names:
data-index-jobs-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-definition-73a585bd4a43f0ee289b57dfbedf39dd9
data-index-process-error-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-node-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-sla-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-state-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-variable-5fa0f63cae14d1df1c1bd23aaa599b559e8
jobs-service-create-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
jobs-service-delete-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
greetingtest5-greetingevent-be7d8bd8dce1feb49e4e476c2842512b8ea
greetingtest5-greetingevent2-0a73c1906bdc359d653befc487a9f7048e

Since these triggers already have labels or owner references that indicate which workflow (or data index or job service) they belong to, it does not seem to me that we need to add this information to the workflow's status. Please let me know if I missed something and this is really required. Thank you.
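
For illustration, a minimal Go sketch of such a naming scheme using kmeta.ChildName from knative.dev/pkg is shown below. The helper name, the prefix argument, and the exact way the PR composes prefix and UID are assumptions; only the ChildName call and its 63-character guarantee come from the Knative library.

```go
package naming

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"knative.dev/pkg/kmeta"
)

// triggerName derives a Trigger name from an event-specific prefix and the
// owning object's UID. kmeta.ChildName truncates and hashes the combination
// whenever the plain concatenation would exceed the 63-character DNS label
// limit that Knative Eventing enforces, so the result is always a valid name.
func triggerName(prefix string, owner metav1.Object) string {
	// e.g. triggerName("data-index-jobs", dataIndex)
	//   -> "data-index-jobs-9e8d24ce-5032-4771-9934-da0c15fd69cc"
	// or a hashed variant when prefix+UID would be longer than 63 characters.
	return kmeta.ChildName(prefix+"-", string(owner.GetUID()))
}
```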

@ricardozanini (Member)

@jianrongzhang89 agreed with the naming approach, many thanks for looking into this!

Regarding the test failure, please see my changes here: https://github.com/apache/incubator-kie-kogito-serverless-operator/pull/487/files#diff-4782f3846c145011deb7edc727cc7dd3016d1683615d1f426ce246e57908daf2

It should help fix this bug and provide access to utils.GetClient() across all tests.

@ricardozanini (Member)

> @ricardozanini could you please confirm this is the desired behavior?

It is, but we have to add an event or a clear status to the object alerting users about this behavior.

@jianrongzhang89 (Contributor, Author)

> @ricardozanini could you please confirm this is the desired behavior?
>
> It is, but we have to add an event or a clear status to the object alerting users about this behavior.

@ricardozanini the SonataFlow status already has an error message in the condition:
status:
  address: {}
  conditions:
  - lastUpdateTime: "2024-06-15T01:16:08Z"
    status: "True"
    type: Built
  - lastUpdateTime: "2024-06-15T01:16:08Z"
    message: 'Error in deploy the workflow:no sink configured in the workflow or
      the platform when Job Service or Data Index Service is enabled'
    reason: DeploymentFailure
    status: "False"
    type: Running
  observedGeneration: 1
  services:
    dataIndexRef:
      url: http://sonataflow-platform-data-index-service.sonataflow-infra
    jobServiceRef:
      url: http://sonataflow-platform-jobs-service.sonataflow-infra

Is there anything additional needed?

@ricardozanini (Member)

@jianrongzhang89 please squash these commits so we have only one to merge. And your rebase will be way easier.

@wmedvede (Contributor) commented Sep 4, 2024

Hi @jakubschwan, @domhanak, I think this is ready for you to try from the execution perspective.
Would you mind prioritizing it?

Thanks!

@wmedvede (Contributor) commented Sep 5, 2024

Guys, @jianrongzhang89 @ricardozanini

I have executed most of the use cases in https://github.com/flows-examples/techpreview2/tree/0ac3f5285fac169bafa0ae84661e6c608e88283c/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing

and in general I'd say that most of the issues are fixed and working. Great job @jianrongzhang89!

However, I have identified the following situation that does not look good, especially considering that this is probably one of the most frequent setups used by Parodos.

Full executable use case here:

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Summarized use case:

  1. The Broker is configured in the SFP and deployed in namespace case5-kn-eventing.
  2. We also configure a SFCP that points to the SFP in 1.
  3. We deploy WF callbackstatetimeouts in a different namespace, case5-kn-eventing-workflows. The WF takes the Broker config from the SFCP as expected.
  4. We deploy WF callbackstatetimeouts2 in namespace case5-kn-eventing. The WF takes the Broker config from the SFP as expected.
  5. DI, JS, and both workflows work fine; all Triggers, SBs, etc. are created. Events are produced well.

After some time, we try to delete the WFs created before.

  1. We execute kubectl delete workflow callbackstatetimeouts2 -n case5-kn-eventing
     The deletion proceeds as expected, and the associated Triggers and SBs are also deleted as expected.

  2. We execute kubectl delete workflow callbackstatetimeouts -n case5-kn-eventing-workflows
     This time the deletion never finishes. It basically looks stuck.

We can query the current workflows and still see it:

kubectl get workflows -n case5-kn-eventing-workflows

NAME PROFILE VERSION URL READY REASON
callbackstatetimeouts preview 0.0.1 True

  3. The controller log is attached here:
     https://gist.github.com/wmedvede/6da7d3fb3fb9f213ed62c9dbe2609a28

@jianrongzhang89 Would you mind trying to reproduce?

@jakubschwan @domhanak would you mind double-checking this execution to see if you get the same results?
Thanks!

@jakubschwan left a comment

Code changes in the PR look good to me. Now reviewing the execution of the use cases.

@jianrongzhang89 (Contributor, Author)

@wmedvede @ricardozanini the workflow deletion hanging issue is now fixed with a minor code change. Could you please test again? Thanks.

@ricardozanini merged commit e91495d into apache:main on Sep 19, 2024
4 checks passed
@jianrongzhang89 deleted the issue-368 branch on September 19, 2024 15:26
rgdoliveira pushed a commit to rgdoliveira/kogito-serverless-operator that referenced this pull request Oct 3, 2024
Labels: none · Projects: none

Successfully merging this pull request may close these issues: Create knative resources for DataIndex and JobService (#368)

5 participants