Can't make CRaC run on AWS Lambda #7

Open
Vadym79 opened this issue Jul 6, 2024 · 33 comments

@Vadym79

Vadym79 commented Jul 6, 2024

I took azul/zulu-openjdk:21-jdk-crac-latest and deployed it as a Docker image on Lambda (see https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container or https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/example-lambda).

At the CRaC checkpoint I get the following error:

Detected cgroup V1 freezer

Warn (compel/src/lib/infect.c:129): Unable to interrupt task: 8 (Operation not permitted)

Unlock network

Unfreezing tasks into 1

Unseizing 8 into 1

Error (compel/src/lib/infect.c:358): Unable to detach from 8: Operation not permitted

Error (criu/cr-dump.c:2063): Dumping FAILED.

I have also tried Anton's way from this example (but using Java 21, not Java 8): using the AWS Lambda emulator on AWS Cloud9, starting the application there, taking a checkpoint, and packaging its data into a Docker image deployed on AWS Lambda, in the hope that the CRaC restore would then work (see the same repo: https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container). It does not; it fails with the following error:

166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it

Error (criu/cr-restore.c:1548): 166 exited, status=1

Error (criu/cr-restore.c:2605): Restoring FAILED.

Sure, the kernel of the AWS Cloud9 instance may differ from the one on the Firecracker VM where Lambda runs, and CRIU is very sensitive to that.

Can you please advise? Did the Firecracker VM stop supporting CRIU/CRaC? I got an unofficial comment from AWS that the Firecracker VM doesn't support CRaC, but I couldn't verify it.

Thanks for your support !

@rvansa
Member

rvansa commented Jul 8, 2024

The dump error looks like you lack the SYS_PTRACE capability. Possible causes:
a) not running as root (either directly, or through jdk/lib/criu being owned by root and having the SUID bit set)
b) running in a containerized environment without either --privileged or --cap-add SYS_PTRACE --cap-add CHECKPOINT_RESTORE
c) having a wrong value in /proc/sys/kernel/yama/ptrace_scope
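
Cause b) can be checked from inside the container by decoding the effective capability mask. The following is a minimal sketch, assuming /proc is mounted and readable; the helper names has_cap and report_caps are hypothetical and not part of CRaC or CRIU. The capability numbers are the ones defined in the Linux kernel headers.

```shell
#!/bin/sh
# Sketch: test capability bits in a hex mask such as the CapEff line of
# /proc/<pid>/status. Helper names are hypothetical.

# Capability numbers as defined in <linux/capability.h>.
CAP_SYS_PTRACE=19
CAP_CHECKPOINT_RESTORE=40

# has_cap MASK_HEX BIT: succeeds when bit BIT is set in the hex mask.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_caps STATUS_FILE: prints whether the two capabilities are effective.
report_caps() {
  capeff=$(awk '$1 == "CapEff:" { print $2 }' "$1")
  if [ -z "$capeff" ]; then
    echo "CapEff not found in $1"
    return 1
  fi
  if has_cap "$capeff" "$CAP_SYS_PTRACE"; then
    echo "SYS_PTRACE: yes"
  else
    echo "SYS_PTRACE: no"
  fi
  if has_cap "$capeff" "$CAP_CHECKPOINT_RESTORE"; then
    echo "CHECKPOINT_RESTORE: yes"
  else
    echo "CHECKPOINT_RESTORE: no"
  fi
}
```

Running report_caps /proc/self/status in the process that launches the JVM shows whether either capability is available there.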

I was trying to follow https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/example-lambda but image build failed at COPY checkpoint.cmd.sh - the file is not present (I wasn't really looking into why you use a modified version of the scripts).

The restore error about rseq is not important, though it indicates that you either use a very old kernel or have some seccomp policies that will affect feature detection. What AMI is that?

@Vadym79
Author

Vadym79 commented Jul 8, 2024

Hi Radim,
thanks for your reply. My example is probably difficult to follow, as I published the Docker image in my private Amazon ECR repository. Nevertheless, if I run a Docker image on AWS Lambda I don't control the docker run command and can set neither --privileged nor anything else. To create the Docker image I used an m5 AWS Cloud9 instance, which has Amazon Linux 2023 and Docker pre-installed.
But you can first try it on your own in this example-lambda. You currently use Java 1.8 there (also in template.yaml), which AWS Lambda has not supported since the end of 2023. You can neither deploy nor update a Lambda function with this runtime. So please update everything to Java 21 and re-test whether this still works. You can also skip this very complex example of packaging a "local" checkpoint file into a Docker image and then restoring from it. Simply take the checkpoint in the Lambda function itself during its execution first, to test whether CRaC/CRIU works on AWS Lambda in general.

@rvansa
Member

rvansa commented Jul 8, 2024

But nevertheless if I run Docker image on AWS Lambda I don't control the docker run command and can set neither --privileged nor anything else.

Well, it's a known current limitation of CRaC using CRIU under the hood: you cannot checkpoint without some extra privileges. We know that this is inconvenient for some use cases, and there may be a solution in the future, but not ATM.

But you first can try it on your own in this example-lambda.

That's why I tried to follow your linked example locally, but there was the failure. Can you fix it up so that the provided steps work?

You currently use Java 1.8

Err, why do you think that? When trying to follow the tutorial I've executed docker run -it --rm azul/zulu-openjdk:21-jdk-crac (the image you mentioned in the beginning) in one terminal and then copied /opt/zulu21.34.19-ca-crac-jdk21.0.3-linux_x64 and provided this to jlink.

@Vadym79
Author

Vadym79 commented Jul 8, 2024

"Well it's the known current limitation of CRaC using CRIU underhood: you cannot checkpoint without some extra privileges. We know that for some use cases it's inconvenient, and there can be a solution in the future, but not ATM."
->
Yes, I suppose that this example was able to run on AWS Lambda with Java 8 several years ago with exactly the same limitations. But I couldn't make it work with Java 21.

I meant that your own example https://github.com/CRaC/example-lambda uses Java 8 (see https://github.com/CRaC/example-lambda/blob/master/template.yml with Runtime: java8, and pom.xml). So this is not a valid tutorial, as AWS Lambda doesn't support Java 8 anymore. That's why I suggested that you update your own example to Java 21 and retest whether it is still supposed to work or whether something has changed on the Firecracker VM.

@rvansa
Member

rvansa commented Jul 8, 2024

I see what you meant - it's a bit confusing as the example is based on another example, but AFAIU the template.yml is not used in the CRaCed version. We should clean it up of non-relevant parts. By following the tutorial I meant the steps as provided in the README

@rvansa
Member

rvansa commented Jul 8, 2024

@Vadym79 I've filed #8 with removal of those unneeded files & some convenience changes. I've verified that this works locally; can you check where you diverge from the steps or at what phase does the example stop working?

@Vadym79
Author

Vadym79 commented Jul 8, 2024

@rvansa Thanks. Do you mean it works locally with the AWS Lambda emulator, or did you test on AWS Lambda itself? That is where I get the error:

166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it

Error (criu/cr-restore.c:1548): 166 exited, status=1

Error (criu/cr-restore.c:2605): Restoring FAILED.

@rvansa
Member

rvansa commented Jul 8, 2024

Now I've tested in AWS as well; created function crac-test from the container image, and in the Test tab use the SQS template:
image

It prints Error (criu/cr-restore.c:2009): Can't attach to 129: Operation not permitted in the log, but this doesn't seem to prevent it from working. I'll check the CRIU sources to see why this is reported as an error...

@Vadym79
Author

Vadym79 commented Jul 8, 2024

@rvansa Thanks. OK, it seems to work in general, which is good. Can you please try this example https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/tree/master/spring-crac-with-serverless-container and https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/blob/master/spring-crac-with-serverless-container/crac-steps.sh? I worked on this several months ago and had to modify some scripts, but I couldn't make it work on an AWS Cloud9 instance the way it was described in your example. Maybe I missed something important there.

@rvansa
Member

rvansa commented Jul 10, 2024

@Vadym79 I need more detailed steps to reproduce this; the README looks like a mix of local setup and outdated information.

I've started a fresh Cloud9, cloned the repository and fetched the latest JDK 22 with CRaC, ensuring that criu is owned by root and has SUID set. I installed Maven and started a local DynamoDB.
I ran ./crac-steps.sh s00_init and s01_build_aws. The s02_start_checkpoint step does not use the image with aws, and when I simply replaced the image in the docker command, it failed with Could not get environment variable AWS_LAMBDA_RUNTIME_API.

So, another attempt: I built the image using s01_build and started it using s02_start_checkpoint. After executing ./crac-steps.sh s03_checkpoint in another terminal, the local checkpoint succeeded:

CR: Checkpoint ...
END RequestId: 5c9a9c0e-502f-4511-a24e-f5030de5e075
REPORT RequestId: 5c9a9c0e-502f-4511-a24e-f5030de5e075  Init Duration: 0.60 ms  Duration: 19632.44 ms   Billed Duration: 19633 ms       Memory Size: 3008 MB    Max Memory Used: 3008 MB
10 Jul 2024 08:06:32,621 [WARNING] (rapid) First fatal error stored in appctx: Runtime.ExitError
10 Jul 2024 08:06:32,621 [WARNING] (rapid) Process 165(bash) exited: Runtime exited without providing a reason

I then ran ./crac-steps.sh s04_prepare_restore, but a local deploy fails because of

Suppressed: java.nio.file.NoSuchFileException: /tmp/crac/dump4.log
                at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
                at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
                at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261) ~[?:?]
                at java.base/java.nio.file.Files.newByteChannel(Files.java:379) ~[?:?]
                at java.base/java.nio.file.Files.newByteChannel(Files.java:431) ~[?:?]
                at java.base/java.nio.file.Files.readAllBytes(Files.java:3268) ~[?:?]
                at software.amazonaws.PrimingResource.afterRestore(PrimingResource.java:49) ~[function/:?]

I've added RUN mkdir -p /tmp/crac && touch /tmp/crac/dump4.log to the restore Dockerfile and tried again. This time it seems to work:

[main] INFO software.amazonaws.PrimingResource - 
2024-07-10 08:16:55  INFO  DefaultLifecycleProcessor:576 - Restarting Spring-managed lifecycle beans after JVM restore
2024-07-10 08:16:55  INFO  AWSLambda:56 - Restored AWSLambda in 0.157 seconds (process running for 0.164)
[main] INFO com.amazonaws.serverless.proxy.internal.servlet.AwsServletContext - Initializing Spring DispatcherServlet 'dispatcherServlet'
2024-07-10 08:16:55  INFO  DispatcherServlet:532 - Initializing Servlet 'dispatcherServlet'
2024-07-10 08:16:55  INFO  DispatcherServlet:554 - Completed initialization in 2 ms
[main] INFO software.amazonaws.example.product.handler.StreamLambdaHandler - entered generic stream lambda handler
[main] INFO software.amazonaws.example.product.controller.ProductController - entered getProductById method with id  crac0
[main] INFO software.amazonaws.example.product.dao.DynamoProductDao - product table name AWSLambdaSpringBoot32Java21DockerImageAndCRaCProductsTable
[main] INFO software.amazonaws.example.product.dao.DynamoProductDao - endpoint http://172.17.0.1:8000
[main] INFO software.amazonaws.example.product.dao.DynamoProductDao - end get item 
[main] INFO software.amazonaws.example.product.controller.ProductController -  product not found 
[main] INFO com.amazonaws.serverless.proxy.internal.LambdaContainerHandler - null null- null [10/07/2024:08:16:56Z] "GET /products/0 null" 200 4 "-" "-" combined
END RequestId: 9b618ec0-c4fc-4619-b10e-6099c13d28cd
REPORT RequestId: 9b618ec0-c4fc-4619-b10e-6099c13d28cd  Init Duration: 0.64 ms  Duration: 1148.88 ms    Billed Duration: 1149 ms        Memory Size: 3008 MB    Max Memory Used: 3008 MB

So in what step exactly is the problem?

@Vadym79
Author

Vadym79 commented Jul 10, 2024

@rvansa Thanks! Do I understand correctly that the last console output is on AWS Lambda and not locally?

I personally had

166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it

Error (criu/cr-restore.c:1548): 166 exited, status=1

Error (criu/cr-restore.c:2605): Restoring FAILED.

during the CRaC restore on AWS Lambda. But maybe I missed several steps. For example, you wrote "ensuring that criu is owned by root and has SUID set." I didn't do that. Can you please give me the exact commands to execute?

Also what Docker image have you used for the latest JDK 22 with CRaC?

@rvansa
Member

rvansa commented Jul 10, 2024

during the CRaC restore on AWS Lambda. But maybe I missed several steps. For example, you wrote "ensuring that criu is owned by root and has SUID set." I didn't do that. Can you please give me the exact commands to execute?

sudo chown root $JAVA_HOME/lib/criu && sudo chmod u+s $JAVA_HOME/lib/criu

Most CRaC instructions suggest sudo tar ... when unpacking the JDK but I personally prefer to assert it this way.
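
To verify the result of the chown/chmod command above, the ownership and SUID bit can be checked with standard shell tests. This is a minimal sketch; the function name criu_suid_status is hypothetical, and the path passed to it is whatever $JAVA_HOME/lib/criu resolves to in your setup.

```shell
#!/bin/sh
# Sketch: report whether a criu binary is owned by root and has SUID set.
# criu_suid_status is a hypothetical helper name.

# criu_suid_status PATH: prints "ok", "missing", "no-suid" or "not-root-owned".
criu_suid_status() {
  if [ ! -e "$1" ]; then
    echo "missing"
    return 1
  fi
  # -u is the POSIX test for the set-user-ID bit.
  if [ ! -u "$1" ]; then
    echo "no-suid"
    return 1
  fi
  # Third column of "ls -ldn" is the numeric owner id.
  owner_uid=$(ls -ldn "$1" | awk '{ print $3 }')
  if [ "$owner_uid" != "0" ]; then
    echo "not-root-owned"
    return 1
  fi
  echo "ok"
}
```

For example, criu_suid_status "$JAVA_HOME/lib/criu" prints ok once both conditions hold. As noted later in the thread, the SUID bit has no effect when the container runs with no-new-privileges set.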

Also what Docker image have you used for the latest JDK 22 with CRaC?

This question does not make any sense. The image is selected by the Docker scripts that are part of the repo. I've just added Zulu 22.30.13 from https://www.azul.com/downloads

@rvansa
Member

rvansa commented Jul 10, 2024

Thanks! Do I understand correctly that the last console output is on AWS Lambda and not locally?

No, your repository does not have s06_init_aws and later steps, contrary to https://github.com/CRaC/example-lambda

166: Error (criu/cr-restore.c:3151): rseq: can't restore as kernel doesn't support it

I should have reacted to this part earlier. It looks like you did the checkpoint in a container with GLIBC >= 2.35 and a modern kernel (I'm not exactly sure which version, but Amazon offers the AMI 2023 with 6.something = modern). If you later try to restore with AMI 2, which features kernel 5.10, you run into this trouble. The solution is either to base the image on an older distribution (e.g. Ubuntu 18 or CentOS 7) that does not use rseq, or to perform the checkpoint with the environment variable GLIBC_TUNABLES set to glibc.pthread.rseq=0.
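
A quick way to tell whether a checkpoint image falls into the affected range is to compare its glibc version against 2.35. The sketch below assumes a glibc-based image where getconf GNU_LIBC_VERSION is available; the function name glibc_registers_rseq is hypothetical, and the 2.35 threshold is taken from the explanation above.

```shell
#!/bin/sh
# Sketch: decide whether a glibc version is in the range that uses rseq
# (>= 2.35, per the explanation above). Helper name is hypothetical.

# glibc_registers_rseq VERSION: succeeds when VERSION (e.g. "2.35") >= 2.35.
glibc_registers_rseq() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 35 ]; }
}

# current_glibc_version: prints e.g. "2.35" on glibc systems, nothing otherwise.
current_glibc_version() {
  getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }'
}
```

For instance, glibc_registers_rseq 2.27 fails (the glibc in ubuntu:18.04), while glibc_registers_rseq 2.35 succeeds, which means the checkpoint should be taken with rseq disabled.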

@Vadym79
Author

Vadym79 commented Jul 10, 2024

OK, it worked for me locally too, but not on AWS Lambda. Yes, I see that there are indeed no instructions on how to run it on AWS, but basically there are 2 steps:

  1. Push the Docker image containing Checkpoint file to Amazon ECR and give it the name aws-spring-boot-3.2-java21-crac-custom-docker-image:v1 to execute 2) without any changes
  2. execute sam deploy -g (Cloud 9 has SAM installed). This docker image is referenced in the template.yaml, see https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/blob/master/spring-crac-with-serverless-container/template.yaml

I will also look later and add the exact instructions.

@Vadym79
Author

Vadym79 commented Jul 10, 2024

See my message above

Here are instructions for 1) I use eu-central-1 AWS region. Please replace {aws_account_id} with your aws account id.

a) aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com

b) in case the repository doesn't exist

aws ecr create-repository --repository-name aws-spring-boot-3.2-java21-crac-custom-docker-image --image-scanning-configuration scanOnPush=true --region eu-central-1

c) docker tag crac-lambda-restore-zulu-spring-boot {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com/aws-spring-boot-3.2-java21-crac-custom-docker-image:v1

docker push {aws_account_id}.dkr.ecr.eu-central-1.amazonaws.com/aws-spring-boot-3.2-java21-crac-custom-docker-image:v1

Then deploy the stack with AWS SAM: sam deploy -g
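
Steps a) to c) can be collected into one script so that the account id and region are passed in only once. This is a sketch under the assumption that the aws CLI and docker are installed and authenticated as on Cloud9; the function names are hypothetical, and the commands are the ones listed above plus aws ecr describe-repositories to skip creation when the repository already exists.

```shell
#!/bin/sh
# Sketch: build the ECR names used in steps a) to c) and push the image.
# Function names are hypothetical.

# ecr_registry ACCOUNT_ID REGION: prints the registry host name.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG: prints the full image URI.
ecr_image_uri() {
  printf '%s/%s:%s\n' "$(ecr_registry "$1" "$2")" "$3" "$4"
}

# push_restore_image ACCOUNT_ID REGION REPOSITORY TAG LOCAL_IMAGE
push_restore_image() {
  registry=$(ecr_registry "$1" "$2")
  uri=$(ecr_image_uri "$1" "$2" "$3" "$4")
  # a) log in to the registry
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$registry" || return 1
  # b) create the repository only if it does not exist yet
  aws ecr describe-repositories --repository-names "$3" --region "$2" >/dev/null 2>&1 ||
    aws ecr create-repository --repository-name "$3" \
      --image-scanning-configuration scanOnPush=true --region "$2" || return 1
  # c) tag and push the locally built restore image
  docker tag "$5" "$uri" && docker push "$uri"
}
```

With the names from this thread, the call would be push_restore_image {aws_account_id} eu-central-1 aws-spring-boot-3.2-java21-crac-custom-docker-image v1 crac-lambda-restore-zulu-spring-boot, followed by sam deploy -g.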

@rvansa
Member

rvansa commented Jul 10, 2024

Thanks for detailing those steps; I wanted to see what exactly you're doing to avoid any mismatch. I was able to reproduce the rseq error and tried to fix it by adding

ENV GLIBC_TUNABLES=glibc.pthread.rseq=0

to Dockerfile-zulu-spring-boot.checkpoint. This was not necessary with the 'official' lambda example as this uses FROM ubuntu:18.04 which features older glibc. Note that we've realised the inconvenience and the next Azul builds will set the env variable automatically.
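
If the checkpoint is started from a script rather than via ENV in the Dockerfile, an existing GLIBC_TUNABLES value should be kept rather than overwritten. The sketch below relies on the documented colon-separated format of GLIBC_TUNABLES; the function name with_rseq_disabled is hypothetical, and it deliberately does not handle the case where the variable already contains a conflicting glibc.pthread.rseq=1 entry.

```shell
#!/bin/sh
# Sketch: add glibc.pthread.rseq=0 to a GLIBC_TUNABLES value without
# dropping tunables that are already set. Helper name is hypothetical.
# Not handled: an existing, conflicting glibc.pthread.rseq=1 entry.

# with_rseq_disabled EXISTING_VALUE: prints the value to export.
with_rseq_disabled() {
  case ":$1:" in
    *:glibc.pthread.rseq=0:*) printf '%s\n' "$1" ;;                   # already present
    ::)                       printf '%s\n' 'glibc.pthread.rseq=0' ;; # nothing set yet
    *)                        printf '%s\n' "$1:glibc.pthread.rseq=0" ;;
  esac
}
```

In a checkpoint script this could be used as export GLIBC_TUNABLES="$(with_rseq_disabled "${GLIBC_TUNABLES:-}")" before the JVM is launched.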

However now I also observe the

Error (criu/cr-restore.c:2009): Can't attach to 167: Operation not permitted
/restore.cmd-zulu-spring-boot.sh: line 9:    15 Segmentation fault      (core dumped) $JAVA_HOME/bin/java -XX:CRaCRestoreFrom=/tmp/cr

The Can't attach message is probably a red herring, since I can see that in some successful restores, too (though I haven't been able to reproduce this running manually, yet), but the segfault is really unfortunate. I'll see if I can get coredump out of lambda and analyze.

Thanks for your help so far!

@rvansa
Member

rvansa commented Jul 10, 2024

So far I found that Lambda is probably running a differently built kernel, 5.10.216-225.855.amzn2.x86_64. This one does not have CONFIG_IKCONFIG set, so I can't read /proc/config.gz for other options, and /boot/ is not mounted into the container... But apparently it does not have the YAMA LSM that normally controls ptrace access through /proc/sys/kernel/yama/ptrace_scope (the file does not exist at all). The container does not even mount /sys, so I can't check which LSMs are in use through /sys/kernel/security/lsm.
Another observation is that the container is not executed as root but as another user. This creates another set of problems in containers, as the SUID bit on criu is not almighty: there are effective UID/GID checks that somehow prevent a non-root user from restoring using criu with SUID, even in regular unprivileged Docker containers.
I'll continue with the investigation.
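
The observations above can be reproduced with a small probe that reports which of the mentioned files exist, plus the kernel release and the user id. This is only a sketch: the function names are hypothetical, and how the probe is run inside Lambda (for example from the container entrypoint, with output going to the function log) is an assumption.

```shell
#!/bin/sh
# Sketch: report which kernel introspection files are visible in the
# current environment. Function names are hypothetical.

# probe PATH: prints "PATH: present" or "PATH: missing".
probe() {
  if [ -e "$1" ]; then
    echo "$1: present"
  else
    echo "$1: missing"
  fi
}

# probe_environment: checks the files discussed above.
probe_environment() {
  echo "kernel: $(uname -r)"
  echo "uid: $(id -u)"
  probe /proc/config.gz
  probe /proc/sys/kernel/yama/ptrace_scope
  probe /sys/kernel/security/lsm
}
```

Comparing the output of probe_environment on Cloud9 and on Lambda makes the differences between the checkpoint and restore environments explicit.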

@Vadym79
Author

Vadym79 commented Jul 11, 2024

@rvansa thanks for your efforts to investigate the issue!

@rvansa
Member

rvansa commented Jul 12, 2024

I hit a couple of dead ends, with the following observations:

  • the container is executed with PR_SET_NO_NEW_PRIVS, which means that the SUID flags on CRIU are not applied and CRIU operates completely without privileges
  • there's probably a seccomp policy that disables the ptrace syscall completely, even PTRACE_TRACEME (otherwise, this should fail only if the process is already traced or the YAMA policy is set to 2 or 3 - but we don't have YAMA)
  • in the end the above should not prevent the restore - after all, the example-lambda works OK, and I haven't found a significant divergence between the CRIU debug logs from the successful example-lambda and the crashing springboot
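
The first two observations can also be read from /proc/<pid>/status, which exposes NoNewPrivs and Seccomp fields on recent kernels. The following is a minimal sketch, assuming /proc is readable in the container; the function names are hypothetical, and the mode names follow the proc(5) description of the Seccomp field (0 disabled, 1 strict, 2 filter).

```shell
#!/bin/sh
# Sketch: read the no-new-privileges and seccomp state from a
# /proc/<pid>/status file. Function names are hypothetical.

# status_field FILE NAME: prints the value of the "NAME:" line.
status_field() {
  awk -v key="$2:" '$1 == key { print $2 }' "$1"
}

# seccomp_mode_name MODE: maps the numeric Seccomp field to a label.
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# report_sandbox FILE: prints both settings on separate lines.
report_sandbox() {
  echo "NoNewPrivs: $(status_field "$1" NoNewPrivs)"
  echo "Seccomp: $(seccomp_mode_name "$(status_field "$1" Seccomp)")"
}
```

Calling report_sandbox /proc/self/status does not need the prctl or ptrace syscalls, so it may still work where those are blocked.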

I was able to upload coredump to a s3 bucket and download these; the addresses are not resolved (so stacktraces aren't really useful). The process is killed with si_code=SI_KERNEL, indicating that probably the process was in a state that would make kernel access unmapped memory, inconsistently set up signal frame or something like that. Regrettably this dreaded code does not provide any information. I've seen that e.g. in case that RSEQ is not restored correctly, but in this case we have disabled RSEQ on checkpoint. Still, might be an issue?

It's difficult to track what is happening in the process before the segfault. We cannot use strace (one reason is that this interferes with CRIU, but we also don't have permissions to use ptrace). I can't install any BPF-based filters to log stuff in the kernel. We can guess that this happens very soon after the process starts executing, since I can see up to 3 coredumps for criuengine (I think this is the daemonizing hierarchy, and these processes just raise the signal to propagate the exit value - nothing wrong with them).

@Vadym79
Author

Vadym79 commented Jul 13, 2024

@rvansa thanks for your efforts once again. I'm just wondering whether your own example at https://github.com/CRaC/example-lambda will still work after updating the stack from Ubuntu 18 to a newer version (22 or 24) and the Azul JDK from 8 to 21 or 22. I'm not sure whether the problem lies there or in my use of the Spring Boot stack and DynamoDB instead of SQS and so on. Maybe we can check iteratively: first do what I suggested, then I convert the pure Java Lambda to a Spring Boot Lambda with SQS instead of DynamoDB and test.

@rvansa
Member

rvansa commented Jul 17, 2024

Hi, I have tried to update to Ubuntu 22, setting ENV GLIBC_TUNABLES=glibc.pthread.rseq=0, and this regrettably revealed a bug that prevents us from using -XX:CPUFeatures=generic - we'll try to fix it in the upcoming release. With developer builds (not prone to this issue) I was able to restore.
I have noticed that we were not setting -XX:CPUFeatures beforehand in your case. While a difference in CPU features compared to the checkpoint environment should trigger an error message, I wonder if the checkpoint could use some advanced feature that causes a crash like the one above. With my developer build I was able to restore your test OK (well, at least the native part; Java started executing afterRestore hooks...), so I'll get back to this once we have an official build you could try as well.

@rvansa
Member

rvansa commented Aug 7, 2024

Hi @Vadym79 the fix should be in the last release on https://www.azul.com/downloads/#downloads-table-zulu , I've checked azul/zulu-openjdk:21-jdk-crac-latest was updated. Could you try with the latest image?

@Vadym79
Author

Vadym79 commented Aug 7, 2024

Hi @rvansa, thanks for your efforts.

I took the Zulu CRaC build from https://cdn.azul.com/zulu/bin/zulu21.36.19-ca-crac-jdk21.0.4-linux_x64.tar.gz and re-ran the steps. I now get the following error on AWS Lambda during restore...

shm_open: No such file or directory
Error (criu/cr-restore.c:2009): Can't attach to 167: Operation not permitted
pie: 182: Error (criu/pie/restorer.c:561): prctl PR_SET_NO_NEW_PRIVS fails
pie: 182: Error (criu/pie/restorer.c:765): BUG at criu/pie/restorer.c:765

I'll have to look a bit deeper to check whether I did everything right and didn't miss any steps, as my last attempts were months ago....

@rvansa
Member

rvansa commented Aug 8, 2024

The shm_open: No such file or directory is not critical (it would only prevent you from updating env vars and system properties), and the PR_SET_NO_NEW_PRIVS shouldn't be a problem either (it complains because prctl syscall is disabled, so we can't even check if the no-new-privileges is set). The question is whether the process restores.

@Vadym79
Author

Vadym79 commented Aug 8, 2024

I can't clearly tell, as besides the messages provided above I only see
INIT_REPORT Init Duration: 9929.65 ms Phase: init Status: timeout
directly after them... So Lambda can't initialize within the 10-second maximum of the init phase, which I can't adjust. I will probably need to change the log level to debug to see more.

@rvansa
Member

rvansa commented Aug 8, 2024

I've seen those timeouts as well; I thought these can be attributed to image pull to the local node. Second invocation of the lambda worked in my tests (well, to an extent - it responded some message about invalid format of the request...).

@rvansa
Member

rvansa commented Aug 8, 2024

Oh, I wanted to improve the PR_SET_NO_NEW_PRIVS handling and I've noticed that in your case it triggers the

Error (criu/pie/restorer.c:765): BUG at criu/pie/restorer.c:765

Looks like you happen to trigger this on a code path that would fail an assertion and terminate the process - this didn't occur to me before.

@Vadym79
Author

Vadym79 commented Aug 8, 2024

As we have different results, let's verify that we're following the same steps

  1. I used my modified examples (I don't use seccomp and other stuff) and s00 - s04 steps described here https://github.com/Vadym79/AWSLambdaJavaDockerImageWithCrac/blob/master/spring-crac-with-serverless-container/crac-steps.sh

  2. I updated s00_init to use the newest Azul CRaC 21, so it looks like

s00_init() {

#curl -LO https://d1ni2b6xgvw0s0.cloudfront.net/v2.x/dynamodb_local_latest.tar.gz
#tar axf dynamodb_local_latest.tar.gz
#curl -L -o aws-lambda-rie https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/download/v1.15/aws-lambda-rie
#chmod +x aws-lambda-rie

curl -LO  https://cdn.azul.com/zulu/bin/zulu21.36.19-ca-crac-jdk21.0.4-linux_x64.tar.gz
tar axf zulu21.36.19-ca-crac-jdk21.0.4-linux_x64.tar.gz
dojlink zulu21.36.19-ca-crac-jdk21.0.4-linux_x64

}

Do you use the same JDK version as me?

  3. I use an AWS m5.large Cloud9 instance to execute all those steps. Cloud9 was recently deprecated by AWS for new users, but I can still use the same environment.
  4. Local restore on Cloud9 works, and I test the AWS setup through Amazon API Gateway by going to the GET method and putting some id like 0 into the path; see the attachment crac-screenshot

@rvansa
Member

rvansa commented Aug 9, 2024

I can confirm that when adhering to those steps I see the same error. Previously I was running with a local build of CRIU, and I wonder if I missed a (seemingly irrelevant) fix that causes this trouble (that's unlikely since it was integrated in June, but I don't have a better explanation).

To be honest, the demo has become a bit convoluted, especially because of the packing of libjvm.so: IIUC the jlink'ed JDK is not used at all, only what's in the image.

As a workaround for the CRIU issue, I've used one from the previous release: https://cdn.azul.com/zulu/bin/zulu21.34.19-ca-crac-jdk21.0.3-linux_x64.tar.gz

This fails in your resource:

Suppressed: java.nio.file.NoSuchFileException: /tmp/crac/dump4.log
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261) ~[?:?]
at java.base/java.nio.file.Files.newByteChannel(Files.java:379) ~[?:?]
at java.base/java.nio.file.Files.newByteChannel(Files.java:431) ~[?:?]
at java.base/java.nio.file.Files.readAllBytes(Files.java:3268) ~[?:?]
at software.amazonaws.PrimingResource.afterRestore(PrimingResource.java:50) ~[function/:?]

I have commented that out and now it restores - when I've tried the apigateway-aws-proxy from Lambda Test tab I got 404, but processed by the Java code; please put a request as in ./crac-steps.sh post hi.

START RequestId: 201a37fa-bd11-4a85-a468-dc1c2ecf9c66 Version: $LATEST
[main] INFO software.amazonaws.example.product.handler.StreamLambdaHandler - entered generic stream lambda handler
2024-08-09 09:44:57 201a37fa-bd11-4a85-a468-dc1c2ecf9c66 WARN  PageNotFound:1300 - No mapping for POST /path/to/resource
2024-08-09 09:44:57 201a37fa-bd11-4a85-a468-dc1c2ecf9c66 WARN  PageNotFound:452 - No endpoint POST /path/to/resource.
[main] INFO com.amazonaws.serverless.proxy.internal.LambdaContainerHandler - 127.0.0.1 null- null [09/04/2015:12:34:56Z] "POST /path/to/resource HTTP/1.1" 404 - "-" "Custom User Agent String" combined
END RequestId: 201a37fa-bd11-4a85-a468-dc1c2ecf9c66
REPORT RequestId: 201a37fa-bd11-4a85-a468-dc1c2ecf9c66	Duration: 17.74 ms	Billed Duration: 18 ms	Memory Size: 1024 MB	Max Memory Used: 288 MB	

And sorry about the inconvenience with CRIU versions.

@rvansa
Member

rvansa commented Aug 9, 2024

One more thing: please don't forget to use -XX:CPUFeatures:generic (it would be nice if you could sync up your demo).

@Vadym79
Author

Vadym79 commented Aug 9, 2024

Thanks for providing further suggestions. Unfortunately the same error using

https://cdn.azul.com/zulu/bin/zulu21.34.19-ca-crac-jdk21.0.3-linux_x64.tar.gz

crac-screenshot1

-XX:CPUFeatures:generic wasn't recognized as a valid JVM option.....

@rvansa
Member

rvansa commented Aug 9, 2024

I meant using only the criu from the older release, and keep the latest JDK. But apparently you're still using the new one - the code calling PR_SET_NO_NEW_PRIVS is not in the older release at all. Have you updated that in the Dockerfile for restore?

I made a typo: should be -XX:CPUFeatures=generic. (try java -XX:+PrintFlagsFinal -version | grep CPUFeatures)

@Vadym79
Author

Vadym79 commented Aug 9, 2024

Sorry, I completely forgot that I'm using FROM azul/zulu-openjdk:21-jdk-crac-latest as builder in the Dockerfiles and not the locally downloaded and installed Azul CRaC JDK. With the change in Dockerfile-zulu-spring-boot.restore of copying the old criu into $JAVA_HOME/lib, I can now get past restore. I now have other application-specific errors, which I will dig into.

Thanks a lot! I hope to have at least some working demo soon
