
[KOGITO-9940] Add E2E test cases for platform configured with Job Service and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles #337

Merged
9 commits merged into apache:main from kogito_9940_e2e_tests on Jan 24, 2024

Conversation

jordigilh (Contributor)

Extends the E2E tests to include coverage for the scenarios where Job Service and Data Index are deployed:

  • Enabled field set to false for both services, with the workflow in the dev profile:
    • With ephemeral persistence.
    • With PostgreSQL persistence.
  • Enabled field set to true for both services, with the workflow in the prod profile:
    • With ephemeral persistence.
    • With PostgreSQL persistence.

Each test case takes between 4 and 5 minutes to run, so I have limited the coverage to these four cases.
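
For illustration, the four combinations can be sketched as a table-driven spec (assuming Ginkgo v2; the entry layout and the runWorkflowTest helper below are hypothetical, not the actual test code in this PR):

package e2e

import (
    . "github.com/onsi/ginkgo/v2"
)

// runWorkflowTest is a hypothetical helper standing in for the real assertions:
// deploy the SonataFlowPlatform and SonataFlow CRs for the given combination,
// wait for the service and workflow pods, then probe their health endpoints.
func runWorkflowTest(profile, persistence string) {
    // deployment and readiness checks elided in this sketch
}

var _ = DescribeTable("platform with Job Service and Data Index",
    runWorkflowTest,
    Entry("dev profile with ephemeral persistence", "dev", "ephemeral"),
    Entry("dev profile with PostgreSQL persistence", "dev", "postgresql"),
    Entry("prod profile with ephemeral persistence", "prod", "ephemeral"),
    Entry("prod profile with PostgreSQL persistence", "prod", "postgresql"),
)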

@ricardozanini (Member) left a comment

Really nice! Just a few minor comments. Thank you!

// Requests/Limits grouping reconstructed; only the values below appeared in the review fragment.
Requests: corev1.ResourceList{
    corev1.ResourceCPU:    resource.MustParse("100m"),
    corev1.ResourceMemory: resource.MustParse("256Mi"),
},
Limits: corev1.ResourceList{
    corev1.ResourceCPU:    resource.MustParse("500m"),
    corev1.ResourceMemory: resource.MustParse("1Gi"),
},
Member

Is there any justification for this change? Have you run a benchmark? @wmedvede, do you have any idea about the DI/JS resource consumption? Can we have a follow-up task to get a more accurate number? I feel this could be too much for a default setup, especially when running locally.

Contributor Author

I got an OOMKill with 256Mi for the DI. I increased it to 512Mi and also raised the CPU limits to try to speed up the deployment, since it takes around 2 minutes 40 seconds for the container to reach ready status, which is significantly more than the Job Service (90 seconds).

It didn't help much on either account. I can reduce it to 512Mi and the CPU limit to 100m if you think that aligns better with your expectations.

Has there been any previous testing of the resource limits for the DI container, or were these values selected on a best-effort basis?

Member

Don't worry about changing these numbers now; we can run a benchmark later and get closer numbers and a good approximation for users depending on their environment.

Contributor Author

Ack. I'll set the memory request and limit to 512Mi to avoid random OOMKills. I wonder if the startup time is caused by the JVM resizing its memory capacity as it runs the code...
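
For reference, a minimal sketch of what a 512Mi request and limit on the Data Index container's ResourceRequirements could look like (the CPU values here are placeholders, not necessarily what ends up in the PR):

package profiles // hypothetical package name, for illustration only

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// Sketch only: the memory request equals the limit (512Mi) so the container is
// scheduled with the memory it needs and random OOMKills are avoided.
// CPU values are placeholders for this example.
var dataIndexResources = corev1.ResourceRequirements{
    Requests: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("100m"),
        corev1.ResourceMemory: resource.MustParse("512Mi"),
    },
    Limits: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("500m"),
        corev1.ResourceMemory: resource.MustParse("512Mi"),
    },
}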

Contributor

I can confirm that on an OCP cluster the data index service failed to start with the default values set by the operator, and I had to set the resource limits explicitly in the platform CR to make it start:

  services:
    dataIndex:
      enabled: true
      podTemplate:
        container:
          image: "quay.io/kiegroup/kogito-data-index-postgresql-nightly:latest" # To be removed when stable version is released
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
      persistence:

test/e2e/workflow_test.go (outdated review thread, resolved)
@ricardozanini (Member)

Please don't merge until @domhanak reviews it, as I'm on PTO.

@ricardozanini (Member)

@jordigilh were you able to run the tests locally? It seems that we have a build problem.

@jordigilh (Contributor Author)

@jordigilh were you able to run the tests locally? It seems that we have a build problem.

Yes, but with Go 1.20. I see that one of the functions I used is not supported in 1.19. I'll fix that.

@jordigilh (Contributor Author)

@ricardozanini Running the e2e test suite locally, I'm getting an error for the existing e2e tests:

$> kubectl logs -f greeting-748857df-6tztv -n sonataflow-operator-system
Starting the Java application using /opt/jboss/container/java/run/run-java.sh ...
INFO exec -a "java" java -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager -cp "." -jar /deployments/quarkus-run.jar
INFO running in /deployments
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at io.quarkus.bootstrap.runner.QuarkusEntryPoint.doRun(QuarkusEntryPoint.java:61)
	at io.quarkus.bootstrap.runner.QuarkusEntryPoint.main(QuarkusEntryPoint.java:32)
Caused by: java.lang.UnsupportedClassVersionError: org/kie/kogito/addons/quarkus/k8s/config/KubernetesAddonConfigSource has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1022)
	at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
	at io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:65)
	at io.quarkus.runtime.configuration.RuntimeConfigSource.getConfigSources(RuntimeConfigSource.java:19)

It seems the greeting container has been rebuilt for a newer version of Java (class file version 61 corresponds to Java 17, while the runtime in the image only supports up to 55, i.e. Java 11).

@wmedvede (Contributor) left a comment

LGTM

Just one comment from my side: I rebased this locally with main, executed the e2e tests, and got the error below. It looks like one test is not passing, but maybe it's my local environment.

SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile
/home/wmedvede/development/projects/kogito/kogito-serverless-operator/test/e2e/workflow_test.go:357

[FAILED] Timed out after 300.002s.
Expected success, but got an error:
<*errors.errorString | 0xc00037cab0>:
kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found

  {
      s: "kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found\n",
  }

In [It] at: /home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586 @ 01/12/24 16:20:22.891

Summarizing 1 Failure:
[FAIL] SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile
/home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586

Ran 7 of 7 Specs in 1476.462 seconds
FAIL! -- 6 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestE2E (1476.46s)
FAIL
FAIL command-line-arguments 1476.476s
FAIL
make: *** [Makefile:349: test-e2e] Error 1

@jordigilh jordigilh force-pushed the kogito_9940_e2e_tests branch 2 times, most recently from 656c722 to ebb4dc3 on January 13, 2024 04:23
@domhanak (Contributor)

The PR check also complains about missing headers in some files.

@jordigilh (Contributor Author)

The PR check also complains about missing headers in some files.

Fixed 😄

@jordigilh (Contributor Author)

LGTM

Just one comment from my side: I rebased this locally with main, executed the e2e tests, and got the error below. It looks like one test is not passing, but maybe it's my local environment.

SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile /home/wmedvede/development/projects/kogito/kogito-serverless-operator/test/e2e/workflow_test.go:357

[FAILED] Timed out after 300.002s. Expected success, but got an error: <*errors.errorString | 0xc00037cab0>: kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found

  {
      s: "kubectl wait pod -n test-459 -l sonataflow.org/workflow-app --for condition=Ready --timeout=30s failed with error: (exit status 1) error: no matching resources found\n",
  }

In [It] at: /home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586 @ 01/12/24 16:20:22.891

Summarizing 1 Failure: [FAIL] SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and postgreSQL persistence and the workflow in a production profile /home/wmedvede/development/go/go1.20.4/src/reflect/value.go:586

Ran 7 of 7 Specs in 1476.462 seconds FAIL! -- 6 Passed | 1 Failed | 0 Pending | 0 Skipped --- FAIL: TestE2E (1476.46s) FAIL FAIL command-line-arguments 1476.476s FAIL make: *** [Makefile:349: test-e2e] Error 1

Re-ran them one more time with success...

Ran 7 of 7 Specs in 1033.880 seconds
SUCCESS! -- 7 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (1033.88s)
PASS
ok  	command-line-arguments	1034.430s

I can only guess that the problem you found was that the data index or job service pods failed to deploy. If you see this again, please capture the operator logs and the platform CR so I can troubleshoot it.

@domhanak (Contributor)

domhanak commented Jan 16, 2024

So it looks like there is a consistent failure on the PR check (3 out of 3 reruns):

SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and ephemeral persistence and the workflow in a dev profile:
  [FAILED] No container was found that could respond to the health endpoint failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!!(MISSING)w(<nil>)
  Unexpected error:
      <*errors.errorString | 0xc00051a2d0>: 
      failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!w(<nil>)
      {
          s: "failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!w(<nil>)",
      }
  occurred

I am currently not sure why this is happening; locally it passes. It should be investigated to keep the CI stable.

@domhanak (Contributor) left a comment

LGTM. Thank you @jordigilh! Please rebase so the CI executes these tests after the kind migration.

@jordigilh (Contributor Author)

So it looks like there is a consistent failure on the PR check (3 out of 3 reruns):

SonataFlow Operator Validate that Platform services and flows are running successfully when creating a simple workflow [It] with both Job Service and Data Index and ephemeral persistence and the workflow in a dev profile:
  [FAILED] No container was found that could respond to the health endpoint failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!!(MISSING)w(<nil>)
  Unexpected error:
      <*errors.errorString | 0xc00051a2d0>: 
      failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!w(<nil>)
      {
          s: "failed to execute curl command against health endpoint in container data-index-service:invalid character 'I' looking for beginning of value; %!w(<nil>)",
      }
  occurred

I am currently not sure why this is happening; locally it passes. It should be investigated to keep the CI stable.

It's because the e2e job sets DEBUG=true as an environment variable, which makes kubectl add log entries to the output of the kubectl exec -it command and causes parsing issues. This is an example of what is returned from the command in a test run:

  running: kubectl --v=0 exec -t callbackstatetimeouts-7bf6f9f7f6-hg48m -n test-511 -c workflow -- curl -s localhost:8080/q/health
  I0118 18:50:33.207120   46502 log.go:194] (0x140000e6420) (0x140004a4e60) Create stream
  I0118 18:50:33.207262   46502 log.go:194] (0x140000e6420) (0x140004a4e60) Stream added, broadcasting: 1
  I0118 18:50:33.208525   46502 log.go:194] (0x140000e6420) Reply frame received for 1
  I0118 18:50:33.208535   46502 log.go:194] (0x140000e6420) (0x1400072e000) Create stream
  I0118 18:50:33.208538   46502 log.go:194] (0x140000e6420) (0x1400072e000) Stream added, broadcasting: 3
  I0118 18:50:33.209173   46502 log.go:194] (0x140000e6420) Reply frame received for 3
  I0118 18:50:33.209181   46502 log.go:194] (0x140000e6420) (0x140004f6460) Create stream
  I0118 18:50:33.209183   46502 log.go:194] (0x140000e6420) (0x140004f6460) Stream added, broadcasting: 5
  I0118 18:50:33.209632   46502 log.go:194] (0x140000e6420) Reply frame received for 5
  I0118 18:50:33.245994   46502 log.go:194] (0x140000e6420) Data frame received for 3
  I0118 18:50:33.246002   46502 log.go:194] (0x1400072e000) (3) Data frame handling
  I0118 18:50:33.246006   46502 log.go:194] (0x1400072e000) (3) Data frame sent
  {
      "status": "UP",
      "checks": [
          {
              "name": "SmallRye Reactive Messaging - liveness check",
              "status": "UP"
          },
          {
              "name": "alive",
              "status": "UP"
          },
          {
              "name": "Database connections health check",
              "status": "UP",
              "data": {
                  "<default>": "UP"
              }
          },
          {
              "name": "SmallRye Reactive Messaging - readiness check",
              "status": "UP"
          },
          {
              "name": "SmallRye Reactive Messaging - startup check",
              "status": "UP"
          }
      ]
  }I0118 18:50:33.246382   46502 log.go:194] (0x140000e6420) Data frame received for 3
  I0118 18:50:33.246389   46502 log.go:194] (0x1400072e000) (3) Data frame handling
  I0118 18:50:33.246398   46502 log.go:194] (0x140000e6420) Data frame received for 5
  I0118 18:50:33.246400   46502 log.go:194] (0x140004f6460) (5) Data frame handling
  I0118 18:50:33.247459   46502 log.go:194] (0x140000e6420) Data frame received for 1
  I0118 18:50:33.247468   46502 log.go:194] (0x140004a4e60) (1) Data frame handling
  I0118 18:50:33.247471   46502 log.go:194] (0x140004a4e60) (1) Data frame sent
  I0118 18:50:33.247475   46502 log.go:194] (0x140000e6420) (0x140004a4e60) Stream removed, broadcasting: 1
  I0118 18:50:33.247478   46502 log.go:194] (0x140000e6420) Go away received
  I0118 18:50:33.247571   46502 log.go:194] (0x140000e6420) (0x140004a4e60) Stream removed, broadcasting: 1
  I0118 18:50:33.247579   46502 log.go:194] (0x140000e6420) (0x1400072e000) Stream removed, broadcasting: 3
  I0118 18:50:33.247583   46502 log.go:194] (0x140000e6420) (0x140004f6460) Stream removed, broadcasting: 5

I noticed this while troubleshooting #322 . I removed the env variable in the job and the problem disappeared. If that variable is required for other reasons, I can resort to setting it to false before calling the make test-e2e target. Let me know what you think.
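
For illustration only (the fix here was simply dropping DEBUG=true), the sketch below shows why the klog-prefixed output breaks json.Unmarshal with the "invalid character 'I'" error, and one hypothetical way a parser could strip those lines and keep just the JSON payload:

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// extractJSON returns the substring from the first '{' to the last '}',
// discarding the "I0118 ..." klog lines that kubectl prints around the
// payload when verbose logging is enabled. Hypothetical helper, not the
// operator's actual parsing code.
func extractJSON(raw string) (string, error) {
    start := strings.Index(raw, "{")
    end := strings.LastIndex(raw, "}")
    if start == -1 || end == -1 || end < start {
        return "", fmt.Errorf("no JSON object found in output")
    }
    return raw[start : end+1], nil
}

func main() {
    raw := "I0118 18:50:33.207120 log.go:194] Create stream\n" +
        "{\"status\": \"UP\"}\n" +
        "I0118 18:50:33.246382 log.go:194] Data frame received"

    var health map[string]interface{}
    // Unmarshalling the raw output fails: the first byte is 'I', hence
    // "invalid character 'I' looking for beginning of value".
    fmt.Println(json.Unmarshal([]byte(raw), &health))

    // Stripping the log lines first makes the payload parseable again.
    body, _ := extractJSON(raw)
    fmt.Println(json.Unmarshal([]byte(body), &health), health["status"]) // <nil> UP
}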

@ricardozanini (Member)

We should remove this DEBUG=true var for now; once we migrate to BDD we will have a much better debugging/logging approach. Thank you, @jordigilh!

@jordigilh jordigilh force-pushed the kogito_9940_e2e_tests branch 2 times, most recently from 07accbf to ed83260 on January 24, 2024 03:56
@ricardozanini (Member)

@jordigilh just one last generation check and we should be good!

@jordigilh jordigilh force-pushed the kogito_9940_e2e_tests branch 4 times, most recently from 888df95 to d6993c1 on January 24, 2024 18:56
…vice and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles

Signed-off-by: Jordi Gil <[email protected]>
…hen running the ephemeral postgres

Signed-off-by: Jordi Gil <[email protected]>
…lth status since some finish quicker than the time it takes for the logic to evaluate the health endpoint and causes a test failure

Signed-off-by: Jordi Gil <[email protected]>
@jordigilh (Contributor Author)

@ricardozanini can we merge this PR? It's green.

@ricardozanini ricardozanini merged commit cccb7b2 into apache:main Jan 24, 2024
4 checks passed
rgdoliveira pushed a commit to rgdoliveira/kogito-serverless-operator that referenced this pull request Jan 29, 2024
…vice and Data Index in a combination of scenarios with ephemeral and postgreSQL persistence in dev and production profiles (apache#337)