drools.weekly-deploy jobs frequently fail with Request Timeout (408) #1444

Open
tkobayas opened this issue Aug 19, 2024 · 9 comments
Labels
area:cicd Related to pipelines, automation. Community GitHub Actions or internal area:rules Related to Rules (DRL, DROOLS)

Comments

@tkobayas

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/

07-14: SUCCESS
07-21: FAILURE
07-28: FAILURE
08-04: FAILURE
08-11: FAILURE
08-18: FAILURE

For example:

[2024-08-18T04:42:15.583Z] [INFO] Retrying deployment attempt 5 of 5
[2024-08-18T04:44:56.049Z] [WARNING] Failed to upload checksum to org/drools/drools-tms/999-20240818-SNAPSHOT/drools-tms-999-20240818-20240818.030803-1-sources.jar.sha1
[2024-08-18T04:44:56.049Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.handleStatus (HttpTransporter.java:619)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.execute (HttpTransporter.java:488)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.implPut (HttpTransporter.java:469)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.spi.connector.transport.AbstractTransporter.put (AbstractTransporter.java:107)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.uploadChecksum (BasicRepositoryConnector.java:608)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.uploadChecksums (BasicRepositoryConnector.java:591)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask (BasicRepositoryConnector.java:565)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2024-08-18T04:44:56.050Z]     at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
[2024-08-18T04:44:56.050Z]     at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
[2024-08-18T04:44:56.050Z]     at java.lang.Thread.run (Thread.java:840)
...
@tkobayas tkobayas changed the title main/drools.weekly-deploy jobs frequently fail with Request Timeout (408) main/other/drools.weekly-deploy jobs frequently fail with Request Timeout (408) Aug 19, 2024
@tkobayas
Author

The 10.0.x/other/drools.weekly-deploy jobs have the same issue, but let's focus on main for now.

@tkobayas tkobayas changed the title main/other/drools.weekly-deploy jobs frequently fail with Request Timeout (408) drools.weekly-deploy jobs frequently fail with Request Timeout (408) Aug 19, 2024
@tkobayas
Author

A thought:

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/nightly/job/drools.build-and-deploy/
The nightly job also deploys. I see:
08-17: SUCCESS
08-18: FAILURE (Request Timeout (408))
08-19: SUCCESS
08-20: SUCCESS
08-21: SUCCESS

Hmm, Sunday night may bring high load (even within drools alone, both the nightly and the weekly jobs ran "deploy" around 4:00 AM on 08-18).

@tkobayas
Author

tkobayas commented Aug 26, 2024

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/14/

08-25: SUCCESS (65 WARNINGs in 5 attempts)

However, we still see many timeout WARNINGs and retries.

Also, I doubt whether the configured 300-second timeout took effect: in the log below, the 408 arrives less than 120 seconds after the retry starts.

[2024-08-25T05:24:54.309Z] [INFO] Retrying deployment attempt 4 of 5
[2024-08-25T05:26:31.776Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240825-SNAPSHOT/kie-core-bom-999-20240825-20240825.030947-1.pom.sha1
[2024-08-25T05:26:31.776Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-08-25T05:26:31.776Z]     at org.eclipse.aether.transport.http.HttpTransporter.handleStatus (HttpTransporter.java:619)

@tkobayas
Author

tkobayas commented Sep 2, 2024

09-01: SUCCESS

3 WARNINGs in the 1st attempt. 2nd attempt successful.

(Note: "Failed to upload checksum" does not stop the whole task. "Could not transfer artifact" stops the task and triggers a retry.)

[2024-09-01T04:17:31.407Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-01T04:26:48.086Z] [WARNING] Failed to upload checksum to org/drools/drools-examples/999-20240901-SNAPSHOT/drools-examples-999-20240901-20240901.030335-1-javadoc.jar.md5
[2024-09-01T04:26:48.086Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:26:48.086Z]     ...
[2024-09-01T04:28:36.241Z] [WARNING] Failed to upload checksum to org/drools/kiebase-inclusion/999-20240901-SNAPSHOT/kiebase-inclusion-999-20240901-20240901.030335-1-tests.jar.md5
[2024-09-01T04:28:36.241Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:28:36.241Z]     ...
[2024-09-01T04:46:57.694Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:drools-canonical-model:jar:999-20240901-20240901.030335-1 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:46:57.694Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T05:20:53.254Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T05:20:53.254Z] [INFO] Reactor Summary for Drools :: Parent 999-20240901-SNAPSHOT:

The change to aether.connector.basic.parallelPut=false seems to be effective, but let's see next week.

@tkobayas
Author

tkobayas commented Sep 9, 2024

On 09-05, Jan and Rodrigo manually triggered the job.

09-05 (1st): SUCCESS. No WARNING
09-05 (2nd): Upload was successful. No WARNING. The job failed because of the duplicate tag name, not related to uploading.

09-08: FAILURE. The job was cancelled when it hit the job timeout (3 hours), in the middle of the 2nd upload attempt. The 1st upload attempt hit 30 WARNINGs.

[2024-09-08T03:02:07.906Z] Timeout set to expire in 3 hr 0 min
...
[2024-09-08T04:37:57.001Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-08T04:39:20.860Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240908-SNAPSHOT/kie-core-bom-999-20240908-20240908.030343-1.pom.md5
[2024-09-08T04:39:20.860Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
...
[2024-09-08T05:52:38.558Z] [INFO] Retrying deployment attempt 2 of 5
...
[2024-09-08T06:02:07.906Z] Cancelling nested steps due to timeout

With aether.connector.basic.parallelPut=false, usually one round of uploading all artifacts takes around 35 minutes (e.g. 09-01 Sunday). But on 09-08, the 1st attempt took around 75 minutes. The network was probably unusually unstable.
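The arithmetic behind this can be worked through from the timestamps in the 09-08 excerpt. This is a small sketch, not part of the pipeline: `minutes_between` and `TYPICAL_ROUND_MIN` are hypothetical names, and the 35-minute figure is the estimate stated above, not a measured constant.

```python
from datetime import datetime

# Format of the Jenkins timestamps in the log excerpts (UTC, millisecond precision).
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

# Timestamps copied from the 09-08 log excerpt.
deploy_start = "2024-09-08T04:37:57.001Z"   # deploy goal begins
second_attempt = "2024-09-08T05:52:38.558Z"  # "Retrying deployment attempt 2 of 5"
job_cancelled = "2024-09-08T06:02:07.906Z"   # 3-hour job timeout fires

first_attempt_min = minutes_between(deploy_start, second_attempt)
remaining_min = minutes_between(second_attempt, job_cancelled)

TYPICAL_ROUND_MIN = 35  # rough figure quoted above for one full upload round

print(f"1st attempt: {first_attempt_min:.1f} min")
print(f"left for 2nd attempt: {remaining_min:.1f} min")
print("2nd attempt could finish:", remaining_min >= TYPICAL_ROUND_MIN)
```

With these inputs the first attempt comes to roughly 75 minutes and the second attempt had under 10 minutes left before the timeout, well short of a typical 35-minute round.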

So far, aether.connector.basic.parallelPut=false seems to have a positive effect, but not yet perfect.

How can we improve further?
A) Increase the job timeout. But note that the drools weekly deployment is a dependency of other projects' weekly deployments.
B) Disable deployAtEnd

@jstastny-cz

Please also discuss this on the mailing list. -DdeployAtEnd was a decision taken to unify how we deploy across the KIE projects, and disabling it would deviate from that goal again. See https://lists.apache.org/thread/d6oxh6qtm6mm4hc2zv1pwcqqb2kfmv70

@tkobayas
Author

tkobayas commented Sep 9, 2024

Sorry that I missed the discussion, @jstastny-cz . I won't push the "Disable deployAtEnd" option. Instead, I'll watch the timeout trend for a while.

@jstastny-cz

What I don't understand is why the nightly deploy takes minutes while the weekly one takes hours.
I think we can compare the Maven commands used by the two jobs and check whether they differ in any significant way.

@tkobayas
Author

tkobayas commented Sep 13, 2024

Hi @jstastny-cz ,

nightly

mvn dependency:tree clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/nightly/drools.build-and-deploy@tmp/config17784539189338421883tmp -Dmaven.wagon.http.ssl.insecure=true -Dmaven.test.failure.ignore=true -nsu -ntp -fae -e -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 -Dmaven.wagon.http.retryHandler.count=3 -Dfull -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -B

weekly

mvn -B -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/other/drools.weekly-deploy@tmp/config6925231661979507139tmp -fae -ntp -Dfull clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false -Dfull -Dmaven.test.failure.ignore=true -DskipTests=false

nightly has:

  • -Dmaven.wagon.http.ssl.insecure=true
  • -Dmaven.wagon.http.pool=false
  • -Dmaven.wagon.httpconnectionManager.ttlSeconds=120
  • -Dmaven.wagon.http.retryHandler.count=3
  • -Dhttp.keepAlive=false

As I understand it, Wagon is no longer the default transport (since Maven 3.9.0), so the maven.wagon.* properties may have no effect. https://stackoverflow.com/questions/71099771/how-do-i-use-transport-http-instead-of-wagon-in-maven

-Dhttp.keepAlive=false has pros and cons. It may help in an unstable network environment.

weekly has:

  • -Daether.connector.basic.parallelPut=false
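The comparison can also be done mechanically. Below is a minimal sketch that diffs the option tokens of the two commands quoted above. One stated simplification: the `-s <settings path>` argument, which differs per workspace, is left out of both strings. The names `flags`, `NIGHTLY`, and `WEEKLY` are hypothetical.

```python
import shlex

def flags(command: str) -> set:
    """Return the set of option tokens (anything starting with '-') in a command line."""
    return {tok for tok in shlex.split(command) if tok.startswith("-")}

# Commands as quoted above, minus the workspace-specific "-s <settings path>".
NIGHTLY = (
    "mvn dependency:tree clean deploy -DdeployAtEnd "
    "-Dapache.repository.username=**** -Dapache.repository.password=**** "
    "-DretryFailedDeploymentCount=5 -Dmaven.wagon.http.ssl.insecure=true "
    "-Dmaven.test.failure.ignore=true -nsu -ntp -fae -e -Dhttp.keepAlive=false "
    "-Dmaven.wagon.http.pool=false "
    "-Dmaven.wagon.httpconnectionManager.ttlSeconds=120 "
    "-Dmaven.wagon.http.retryHandler.count=3 -Dfull "
    "-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -B"
)
WEEKLY = (
    "mvn -B -fae -ntp -Dfull clean deploy -DdeployAtEnd "
    "-Dapache.repository.username=**** -Dapache.repository.password=**** "
    "-DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false "
    "-Dfull -Dmaven.test.failure.ignore=true -DskipTests=false"
)

only_nightly = flags(NIGHTLY) - flags(WEEKLY)
only_weekly = flags(WEEKLY) - flags(NIGHTLY)

print("nightly only:", sorted(only_nightly))
print("weekly only: ", sorted(only_weekly))
```

Besides the transport-related options, the diff also surfaces differences that are unlikely to affect uploads, such as -nsu, -e, the transfer-listener log level on the nightly side, and -DskipTests=false on the weekly side.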


By the way, I think the day of the week and the time of day seem to matter.

The nightly job on 09-01 (Sunday) was slow and unstable.

[2024-09-01T04:09:11.322Z] [INFO] --- install:3.1.1:install (default-install) @ drools-distribution ---
[2024-09-01T04:15:13.638Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502)
[2024-09-01T04:15:13.639Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T04:21:48.410Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:kie-pmml-evaluator-api:pom:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:21:48.410Z] [INFO] Retrying deployment attempt 3 of 5
...
[2024-09-01T04:30:06.008Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools.testcoverage:test-integration-ruleunits-tests:jar:tests:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:30:06.008Z] [INFO] Retrying deployment attempt 4 of 5
[2024-09-01T04:36:40.259Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:efesto-compilation-manager-core:jar:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:36:40.259Z] [INFO] Retrying deployment attempt 5 of 5
...
[2024-09-01T04:43:04.104Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T04:43:04.104Z] [INFO] Reactor Summary for Drools :: Parent 999-SNAPSHOT:
...
[2024-09-01T04:43:04.108Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:3.1.1:deploy (default-deploy) on project drools-distribution: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502) -> [Help 1]

In contrast, the weekly job on 09-12 (Thursday), manually triggered by Rodrigo, succeeded without any timeout.

[2024-09-12T12:18:06.159Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-12T12:50:46.172Z] [INFO] ------------------------------------------------------------------------
...
[2024-09-12T12:50:46.176Z] [INFO] BUILD SUCCESS

I guess that not only the KIE projects but also many other Apache projects contribute to this "unstable Sunday night" (I don't know whether we have a CPU/network quota). If many projects run a nightly deployment every night and also a weekly deployment on Sunday night, the load would roughly double on Sunday night.

So... how about moving the weekly build to Saturday daytime or Sunday daytime (or weekday daytime)? Do you think it's a good idea?
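As a rough check on the day-of-week idea, the weekly-deploy outcomes reported earlier in this thread can be grouped by weekday. This is only a sketch over a very small sample, and the parallelPut change landed partway through, so it cannot separate the two effects. The names `WEEKLY_RUNS` and `by_weekday` are hypothetical; the dates and outcomes are the ones listed in the comments above (for 09-05 only the first manual run is included, since the second failed for a reason unrelated to uploading).

```python
from collections import Counter, defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# (MM-DD, outcome) for drools.weekly-deploy on main, as reported in this thread.
WEEKLY_RUNS = [
    ("07-14", "SUCCESS"), ("07-21", "FAILURE"), ("07-28", "FAILURE"),
    ("08-04", "FAILURE"), ("08-11", "FAILURE"), ("08-18", "FAILURE"),
    ("08-25", "SUCCESS"), ("09-01", "SUCCESS"),
    ("09-05", "SUCCESS"),  # manual trigger, 1st run only
    ("09-08", "FAILURE"),
    ("09-12", "SUCCESS"),  # manual trigger
]

def by_weekday(runs, year=2024):
    """Group outcomes by the weekday name of each run date."""
    table = defaultdict(Counter)
    for mmdd, outcome in runs:
        month, day = (int(part) for part in mmdd.split("-"))
        weekday = DAY_NAMES[date(year, month, day).weekday()]
        table[weekday][outcome] += 1
    return table

for weekday, outcomes in by_weekday(WEEKLY_RUNS).items():
    print(weekday, dict(outcomes))
```

If the weekly job is moved to a different slot, appending the new runs to the list would show over time whether the failure rate actually drops.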

Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants