Can't make CRaC run on AWS Lambda #7
Comments
The dump error looks like you're not having I was trying to follow https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/example-lambda but image build failed at The restore error about |
Hi Radim, |
Well, it's a known current limitation of CRaC using CRIU under the hood: you cannot checkpoint without some extra privileges. We know that for some use cases it's inconvenient, and there can be a solution in the future, but not ATM.
That's why I tried to follow your linked example locally, but there was a failure. Can you fix it up so that the provided steps work?
Err, why do you think that? When trying to follow the tutorial I've executed |
"Well, it's a known current limitation of CRaC using CRIU under the hood: you cannot checkpoint without some extra privileges. We know that for some use cases it's inconvenient, and there can be a solution in the future, but not ATM." I meant that your own example https://github.com/CRaC/example-lambda uses Java 8 (see https://github.com/CRaC/example-lambda/blob/master/template.yml with Runtime: java8, and pom.xml). So this is not a valid tutorial, as AWS Lambda doesn't support Java 8 anymore. That's why I suggested that you update your own example to Java 21 and retest whether it is still supposed to work or whether something has changed on the Firecracker VM. |
I see what you meant - it's a bit confusing as the example is based on another example, but AFAIU the |
@rvansa Thanks, do you mean it works locally with the AWS Lambda emulator, or with what I tested on AWS Lambda itself (this is where I get the error 166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it Error (criu/cr-restore.c:1548): 166 exited, status=1 Error (criu/cr-restore.c:2605): Restoring FAILED.)? |
@rvansa Thanks, OK, it seems to work in general, which is good. Can you please use this example https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container and https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/blob/master/spring-crac-with-serverless-container/crac-steps.sh? I worked on this several months ago and had to modify some scripts, but I couldn't make it work on an AWS Cloud 9 instance the way it was described in your example. Maybe I missed something important there. |
@Vadym79 I need a bit more detailed steps to follow to reproduce this; the readme looks like a mix of local setup and outdated information. I've started a fresh Cloud9, cloned the repository and fetched latest JDK 22 with CRaC, ensuring that So another attempt, I've built the image using
Running
I've added
So in what step exactly is the problem? |
@rvansa Thanks! Do I understand correctly that the last console output is from AWS Lambda and not from a local run? I personally had 166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it Error (criu/cr-restore.c:1548): 166 exited, status=1 Error (criu/cr-restore.c:2605): Restoring FAILED. during the CRaC restore on AWS Lambda. But maybe I missed several steps. For example, you wrote "ensuring that criu is owned by root and has SUID set." I didn't do that. Can you please give me the exact commands to execute? Also, what Docker image have you used for the latest JDK 22 with CRaC? |
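The condition quoted above (criu owned by root, with the SUID bit set) can be verified with a small check. This is only a sketch: the function names `owner_uid` and `is_suid_root` are made up for illustration, and the path you pass in is an assumption that depends on where your JDK keeps its criu binary.

```shell
#!/bin/sh
# Sketch: report whether a criu binary is owned by root and has the SUID bit set.
# Function names are hypothetical helpers, not part of CRaC or CRIU.

# owner_uid FILE: print the numeric UID of the file owner.
owner_uid() {
  # "ls -ldn" prints numeric IDs; the third field is the owner UID.
  ls -ldn "$1" | awk '{print $3}'
}

# is_suid_root FILE: succeed only if FILE has SUID set and is owned by UID 0.
is_suid_root() {
  [ -u "$1" ] && [ "$(owner_uid "$1")" = "0" ]
}
```

A possible use is `is_suid_root "$JAVA_HOME/lib/criu" || echo "criu is not SUID root"`, assuming criu lives under `$JAVA_HOME/lib`. If the check fails, the generic Unix way to establish that state is `chown root:root` followed by `chmod u+s` on the file, run with sufficient privileges.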
Most CRaC instructions suggest
This question does not make any sense. The image is selected by the Docker scripts that are part of the repo. I've just added Zulu 22.30.13 from https://www.azul.com/downloads |
No, your repository does not have
I should have reacted to this part earlier. This looks like you did the checkpoint in a container with GLIBC >= 2.35 and a modern kernel (I'm not exactly sure which version, but Amazon offers the AMI 2023 with 6.something, which counts as modern). If you later try to restore with AMI 2, which features kernel 5.10, you run into this trouble. The solution is either to base this on an older image (e.g. Ubuntu 18 or CentOS 7) that does not use rseq, or to perform the checkpoint with environment variable |
OK, it worked for me locally too, but not on AWS Lambda. Yes, I see that there are indeed no instructions on how to run it on AWS, but basically there are 2 steps
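To see whether a checkpoint environment falls into the case described above, one can compare its glibc version against the 2.35 threshold and note the kernel version on both sides. The following is a minimal sketch under that assumption; `glibc_at_least_2_35` and `describe_env` are hypothetical helper names, and the script only reports versions, it does not change how rseq is used.

```shell
#!/bin/sh
# Sketch: helpers for comparing checkpoint and restore environments.
# Both function names are illustrative, not part of any tool.

# glibc_at_least_2_35 VERSION: succeed when VERSION (e.g. "2.35") is >= 2.35,
# the glibc threshold named in the comment above.
glibc_at_least_2_35() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 35 ]; }
}

# describe_env: print the glibc and kernel versions of the current environment.
describe_env() {
  # getconf prints e.g. "glibc 2.35"; keep only the version number.
  libc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
  echo "glibc=${libc:-unknown} kernel=$(uname -r)"
}
```

Running `describe_env` inside the checkpoint container and wherever the restore environment can be inspected gives two lines to compare; a glibc at or above 2.35 on the checkpoint side combined with a 5.10 kernel on the restore side matches the situation described in this comment.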
I will also look later and add the exact instructions. |
See my message above. Here are the instructions for 1). I use the eu-central-1 AWS region. Please replace {aws_account_id} with your AWS account ID.
a) aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com
b) In case the repository doesn't exist: aws ecr create-repository --repository-name aws-spring-boot-3.2-java21-crac-custom-docker-image --image-scanning-configuration scanOnPush=true --region eu-central-1
c) docker tag crac-lambda-restore-zulu-spring-boot {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com/aws-spring-boot-3.2-java21-crac-custom-docker-image:v1
docker push {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com/aws-spring-boot-3.2-java21-crac-custom-docker-image:v1
Then deploy the stack with AWS SAM: sam deploy -g |
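The login, tag and push steps above repeat the same registry host name several times, so they can be wrapped in a few shell functions. This is a sketch only: the function names are invented for illustration, `push_image` is defined but never called by the script, and it assumes the aws and docker CLIs are installed with valid credentials.

```shell
#!/bin/sh
# Sketch: build the ECR names used in the steps above from their parts.
# Function names are hypothetical helpers.

# ecr_registry ACCOUNT_ID REGION: print the ECR registry host name.
ecr_registry() {
  echo "$1.dkr.ecr.$2.amazonaws.com"
}

# ecr_image_ref ACCOUNT_ID REGION REPOSITORY TAG: print the full image reference.
ecr_image_ref() {
  echo "$(ecr_registry "$1" "$2")/$3:$4"
}

# push_image ACCOUNT_ID REGION REPOSITORY TAG LOCAL_IMAGE: log in, tag and push.
# Not called automatically; it needs the aws and docker CLIs and credentials.
push_image() {
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$(ecr_registry "$1" "$2")" &&
    docker tag "$5" "$(ecr_image_ref "$1" "$2" "$3" "$4")" &&
    docker push "$(ecr_image_ref "$1" "$2" "$3" "$4")"
}
```

With the values from this comment, the call would be `push_image {aws_account_id} eu-central-1 aws-spring-boot-3.2-java21-crac-custom-docker-image v1 crac-lambda-restore-zulu-spring-boot`, with the placeholder replaced by a real account ID.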
Thanks for detailing those steps; I wanted to see what exactly you're doing to avoid any mismatch. I was able to reproduce the rseq error and tried to fix adding
to However now I also observe the
The Thanks for your help so far! |
So far I found that Lambda is probably running a differently built kernel |
@rvansa thanks for your efforts to investigate the issue! |
I went down a couple of dead ends, with the following observations:
I was able to upload coredumps to an S3 bucket and download them; the addresses are not resolved (so the stack traces aren't really useful). The process is killed with It's difficult to track what is happening in the process before the segfault. We cannot use |
@rvansa thanks for your efforts once again. I'm just wondering whether your own example at https://github.com/CRaC/example-lambda would work after updating the stack from Ubuntu 18 to a newer version (22 or 24) and the Azul JDK from 8 to 21 or 22. I'm not sure whether the problem lies there or in the fact that I'm using the Spring Boot stack and DynamoDB instead of SQS and so on. Maybe we can check iteratively: first do what I suggested, then I convert the pure Java Lambda to a Spring Boot Lambda with SQS instead of DynamoDB and test. |
Hi, I have tried to update to Ubuntu 22, setting |
Hi @Vadym79 the fix should be in the last release on https://www.azul.com/downloads/#downloads-table-zulu , I've checked |
Hi @rvansa Thanks for your efforts. I took the Zulu CRaC from https://cdn.azul.com/zulu/bin/zulu21.36.19-ca-crac-jdk21.0.4-linux_x64.tar.gz and re-ran the steps. I now get the following error on AWS Lambda during restore... shm_open: No such file or directory I'll have to look a bit deeper into whether I did everything right and didn't miss any steps, as my last attempts were months ago.... |
The |
I don't clearly see it, as besides the messages provided above I only see |
I've seen those timeouts as well; I thought these could be attributed to the image pull to the local node. The second invocation of the lambda worked in my tests (well, to an extent - it responded with some message about an invalid format of the request...). |
Oh, I wanted to improve the
Looks like you happen to trigger this on a code path that would fail an assertion and terminate the process - this didn't occur to me before. |
As we have different results, let's verify that we're following the same steps
s00_init() {
} Do you use the same JDK version as me?
|
I can confirm that when adhering to those steps I see the same error; previously I was running with a local build of CRIU, and I wonder if it may have missed an (unrelated) fix that causes this trouble (it's unlikely since it was integrated in June, but I don't have a better explanation). To be honest, the demo has gotten a bit convoluted, especially because of the packing of libjvm.so: IIUC the jlink'ed JDK is not used at all, only what's in the image. As a workaround for the CRIU issue, I've used the one from the previous release: https://cdn.azul.com/zulu/bin/zulu21.34.19-ca-crac-jdk21.0.3-linux_x64.tar.gz This fails in your resource:
I have commented that out and now it restores - when I've tried the
And sorry about the inconvenience with CRIU versions. |
One more thing: please don't forget to use |
Thanks for providing further suggestions. Unfortunately, I get the same error using https://cdn.azul.com/zulu/bin/zulu21.34.19-ca-crac-jdk21.0.3-linux_x64.tar.gz. Also, -XX:CPUFeatures:generic wasn't recognized as a valid JVM option. |
I meant using only the I made a typo: should be |
Sorry, I completely forgot that I'm using FROM azul/zulu-openjdk:21-jdk-crac-latest as builder in the Dockerfiles, and not the Azul CRaC JDK I had previously downloaded and installed locally. With the change in Dockerfile-zulu-spring-boot.restore that copies the old criu into $JAVA_HOME/lib, I can now get past restore. I now have other application-specific errors, which I will dig into. Thanks a lot! I hope to have at least some working demo soon. |
I took azul/zulu-openjdk:21-jdk-crac-latest and deployed it as a Docker image on Lambda (see https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container or https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/example-lambda)
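The workaround mentioned above, copying the criu binary from the previous release into $JAVA_HOME/lib, could look roughly like the sketch below. The function name `replace_criu` is hypothetical, and the assumption that the older JDK also keeps criu under `lib/criu` is not confirmed in this thread; adjust the paths to your unpacked archive.

```shell
#!/bin/sh
# Sketch: swap in the criu binary from an older, already unpacked JDK.
# Assumes both JDK directories contain lib/criu (an assumption, see above).

# replace_criu OLD_JDK_DIR JAVA_HOME_DIR: copy lib/criu from the older JDK
# over the one in JAVA_HOME_DIR, keeping a backup of the original binary.
replace_criu() {
  old="$1/lib/criu"
  new="$2/lib/criu"
  [ -f "$old" ] || { echo "no criu found at $old" >&2; return 1; }
  # Keep the original so the change can be reverted.
  if [ -f "$new" ]; then cp -p "$new" "$new.orig" || return 1; fi
  cp "$old" "$new"
}
```

Note that a plain `cp` does not carry over ownership or the SUID bit, so if criu is expected to be owned by root with SUID set, that state has to be re-established after the copy.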
At the CRaC checkpoint I get the following error:
Detected cgroup V1 freezer
Warn (compel/src/lib/infect.c:129): Unable to interrupt task: 8 (Operation not permitted)
Unlock network
Unfreezing tasks into 1
Unseizing 8 into 1
Error (compel/src/lib/infect.c:358): Unable to detach from 8: Operation not permitted
Error (criu/cr-dump.c:2063): Dumping FAILED.
I have also tried Anton's way from this example (but using Java 21, not Java 8), namely using the AWS Lambda emulator on AWS Cloud9, starting the application there, making a checkpoint, and packaging its data into a Docker image deployed on AWS Lambda, in the hope that the CRaC restore would then work (see the same repo: https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container). It does not; it fails with the following error:
166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it
Error (criu/cr-restore.c:1548): 166 exited, status=1
Error (criu/cr-restore.c:2605): Restoring FAILED.
Sure, it may be the case that the kernel of the AWS Cloud 9 instance is a different one than on the Firecracker VM where Lambda is running and CRIU is very sensitive to it.
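Since two different failures appear in this report (a privilege problem at checkpoint and the rseq problem at restore), a small triage helper can tell them apart from the CRIU output. This is an illustrative sketch: the function name and the labels it prints are made up, and it only matches the message texts shown in the logs above.

```shell
#!/bin/sh
# Sketch: classify CRIU output by the error messages quoted in this issue.
# The labels printed here are hypothetical, chosen for illustration.

# classify_criu_log: read CRIU output on stdin and print one label.
classify_criu_log() {
  log=$(cat)
  case "$log" in
    *"rseq: can't restore as kernel doesn't support it"*) echo "rseq-unsupported" ;;
    *"Operation not permitted"*) echo "missing-privileges" ;;
    *"Dumping FAILED"*|*"Restoring FAILED"*) echo "failed-other" ;;
    *) echo "no-known-error" ;;
  esac
}
```

For example, `classify_criu_log < criu.log` could be run on a saved log, where `criu.log` is a placeholder file name. The rseq pattern is checked first because a failed restore also prints the generic "Restoring FAILED." line.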
Can you please advise? Did the Firecracker VM stop supporting CRIU/CRaC? I got an unofficial comment from AWS that the Firecracker VM doesn't support CRaC, but I couldn't verify it.
Thanks for your support!