[WIP] CRaC POC #8743

Draft: wants to merge 7 commits into base: main

@danielkec (Contributor) commented May 10, 2024

Helidon MP on CRaC

Coordinated Restore at Checkpoint

Helidon MP Implicit example on CRaC

examples/crac/README.md

mvn clean package
docker build --build-arg CR_DIR=~/cr -t crac-helloworld . -f Dockerfile.crac
# First run: a checkpoint is created; stop with Ctrl-C
docker run --privileged -p 7001:7001 --name crac-helloworld crac-helloworld
# Second run: starts from the checkpoint; stop with Ctrl-C
docker start -i crac-helloworld

Workaround for: Error (criu/cr-dump.c:203): 18 has rseq but kernel lacks get_rseq_conf feature

Signed-off-by: Daniel Kec <[email protected]>
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label May 10, 2024
@danielkec danielkec mentioned this pull request May 24, 2024
@rvansa commented May 28, 2024

Hi @danielkec, I've played with this a bit to let CRaC checkpoint the webserver after it starts: https://github.com/rvansa/helidon/tree/crac-poc
I've manually checked that I can curl localhost:7001, then execute `docker exec -it crac-helloworld jcmd helidon-examples-microprofile-hello-world-implicit.jar JDK.checkpoint`, and then curl it again. Some synchronization peculiarities might be missing, but from the code it looks like the only synchronization I need is with the LoomServer.start()/stop() methods, as provided. Let me know if I am wrong.

@rvansa commented May 29, 2024

Referring to my changes ^: Actually, the case where stop() (or start()) is called between beforeCheckpoint() and afterRestore() might need a bit more love. I am not sure whether that would blow up or pass with an error log, leaving the executors running, but I think it's an invalid sequence of operations. So we might either explicitly throw an error, or keep the lifecycleLock locked until afterRestore().
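
The first option mentioned above (explicitly throwing an error) could look roughly like the sketch below. It is a minimal, self-contained illustration and not the code from the crac-poc branch: `GuardedServer`, its `State` enum, and the `runningBeforeCheckpoint` field are hypothetical names, and `beforeCheckpoint()`/`afterRestore()` only mirror the shape of the CRaC resource callbacks without depending on the CRaC API.

```java
/**
 * Hypothetical server wrapper that rejects lifecycle calls made while a
 * checkpoint is in progress. Not Helidon's LoomServer; names are illustrative.
 */
public class GuardedServer {
    public enum State { STOPPED, RUNNING, CHECKPOINTING }

    private State state = State.STOPPED;
    // Remembers whether the server was running when the checkpoint began.
    private boolean runningBeforeCheckpoint;

    public synchronized void start() {
        rejectDuringCheckpoint("start");
        state = State.RUNNING;
    }

    public synchronized void stop() {
        rejectDuringCheckpoint("stop");
        state = State.STOPPED;
    }

    /** Plays the role of a beforeCheckpoint callback. */
    public synchronized void beforeCheckpoint() {
        rejectDuringCheckpoint("beforeCheckpoint");
        runningBeforeCheckpoint = (state == State.RUNNING);
        // A real server would close listening sockets and executors here.
        state = State.CHECKPOINTING;
    }

    /** Plays the role of an afterRestore callback. */
    public synchronized void afterRestore() {
        if (state != State.CHECKPOINTING) {
            throw new IllegalStateException("afterRestore() without beforeCheckpoint()");
        }
        // A real server would reopen sockets and executors here.
        state = runningBeforeCheckpoint ? State.RUNNING : State.STOPPED;
    }

    public synchronized State state() {
        return state;
    }

    // Fails fast instead of leaving the server in a half-stopped state.
    private void rejectDuringCheckpoint(String operation) {
        if (state == State.CHECKPOINTING) {
            throw new IllegalStateException(
                    operation + "() called between beforeCheckpoint() and afterRestore()");
        }
    }

    public static void main(String[] args) {
        GuardedServer server = new GuardedServer();
        server.start();
        server.beforeCheckpoint();
        try {
            server.stop(); // invalid while the checkpoint is in progress
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        server.afterRestore();
        System.out.println("State after restore: " + server.state());
    }
}
```

The other option, keeping the lock held until afterRestore(), would make callers wait instead of fail. If the lock is thread-owned (for example a `ReentrantLock`), that approach only works when both callbacks run on the same thread, which is an assumption worth verifying.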


curl --retry 10 --retry-all-errors --retry-delay 1 http://localhost:7001
printf "\n==== Warming up ...\n"
wrk -c 16 -t 16 -d 10s http://localhost:7001
@danielkec (Contributor, Author) commented:

Hi @rvansa, thanks for the cool fix and sorry for the delay. It seems to work for me, but when I run a short warmup before the snapshot, the snapshot fails with:

An exception during a checkpoint operation:
jdk.internal.crac.mirror.CheckpointException
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenResourceException: FD fd=165 type=unknown path=anon_inode:[eventpoll]
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:117)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:188)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:286)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:299)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenResourceException: FD fd=183 type=unknown path=anon_inode:[eventpoll]
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:117)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:188)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:286)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:299)

@rvansa replied:

Looks like a native component is opening those epoll FDs; I wonder why this didn't show up with a single warmup request. FDs opened in native code need to be investigated with strace: https://github.com/CRaC/docs/blob/master/debugging.md#file-descriptors-in-native-code
I'll try to reproduce locally.
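
As a complement to strace, the open descriptors can also be listed from inside the JVM before attempting a checkpoint. The sketch below is an assumption-laden illustration, not part of the PR or of CRaC: `OpenFdLister` and `openFds` are hypothetical names, and it relies on the Linux `/proc/self/fd` directory. It shows which descriptors exist (for example those pointing at `anon_inode:[eventpoll]`), not which code opened them.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists this process's open file descriptors on Linux. */
public class OpenFdLister {

    /** Returns "fd -> target" entries whose link target contains the filter. */
    public static List<String> openFds(String filter) {
        List<String> result = new ArrayList<>();
        Path fdDir = Path.of("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return result; // not Linux, or /proc is not mounted
        }
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                try {
                    // Each entry is a symlink to the file, socket, or anon inode.
                    String target = Files.readSymbolicLink(fd).toString();
                    if (target.contains(filter)) {
                        result.add(fd.getFileName() + " -> " + target);
                    }
                } catch (IOException e) {
                    // The descriptor may have been closed since listing; skip it.
                }
            }
        } catch (IOException | DirectoryIteratorException e) {
            // Listing failed part-way; return whatever was collected so far.
        }
        return result;
    }

    public static void main(String[] args) {
        // Print only epoll descriptors, the kind reported in the exception above.
        openFds("eventpoll").forEach(System.out::println);
    }
}
```

Calling `OpenFdLister.openFds("eventpoll")` after the warmup and comparing the result with a run that sends a single request may help narrow down when the extra epoll descriptors appear.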

@rvansa replied:

Hi @danielkec, I can confirm this is an issue on the JDK (CRaC) side. Some code paths in sun.nio are triggered when the socket is created from a virtual thread, and we did not have test coverage for that case.

@rvansa commented Aug 7, 2024

@rvansa replied:

Any luck with the latest release?

@danielkec (Contributor, Author) replied:

I can confirm that 22.32.17 fixes the issue.
