Unable to create 3.10.1 cluster #280

Closed
gwolski opened this issue Nov 3, 2024 · 6 comments · Fixed by #281

@gwolski

gwolski commented Nov 3, 2024

I've pulled a clean clone of aws-eda-slurm-cluster.
After running source setup.sh, I'm able to create a 3.11.1 cluster with install.sh --cdk-cmd create.
I then created a new config yml file, changing only the stack name, the cluster version, and my custom AMI (to a 3.10.1 AMI).
Here is the diff of the config files:

diff tsi3fcs-x86_instance_types.yml tsi3-3-10-1-x86_instance_types.yml
11c11
< StackName: tsi3fcs-config
---
> StackName: tsi3-3-10-1-config
42c42
<     Version: 3.11.1
---
>     Version: 3.10.1
46c46
<       CustomAmi: ami-0d68c6538cd916c25  # pcluster-3-11-1-Rocky-8-x86-64-ami-0d0023cdec9d16c99 2024-10-24T03-12-55.412Z
---
>       CustomAmi: ami-03ff401c2420be54f   # pcluster-3-10-1-Rocky-8-10-x86-64-ami-0dae235de369c403c

I am not able to create a 3.10.1 cluster. The error in the CloudWatch log group /aws/lambda/tsi3-3-10-1-config-CreateParallelClusterConfig says the module rpds.rpds cannot be imported:


timestamp message
1730609451371 INIT_START Runtime Version: python:3.12.v36 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:188d9ca2e2714ff5637bd2bbe06ceb81ec3bc408a0f277dab104c14cd814b081
1730609452483 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609452564 INIT_REPORT Init Duration: 1193.71 ms Phase: init Status: error Error Type: Runtime.ImportModuleError
1730609453572 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609453648 INIT_REPORT Init Duration: 1069.94 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609453648 START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609453663 END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609453663 REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1083.74 ms Billed Duration: 1084 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730609520786 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609520874 INIT_REPORT Init Duration: 1070.40 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609520874 START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609520885 END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609520885 REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1081.01 ms Billed Duration: 1082 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730609636990 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609637064 INIT_REPORT Init Duration: 1036.39 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609637064 START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609637077 END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609637077 REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1048.69 ms Billed Duration: 1049 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
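
To pull the relevant fields out of log lines like the ones above, a small parser can help. This is a minimal sketch: the function name `summarize_log_line` and the regular expressions are illustrative assumptions based only on the line shapes quoted in this issue, not an official CloudWatch log format.

```python
import re
from datetime import datetime, timezone

# Patterns inferred from the log lines quoted in this issue (assumption).
RUNTIME_RE = re.compile(r"Runtime Version: (\S+)")
ERROR_TYPE_RE = re.compile(r"Error Type: (\S+)")


def summarize_log_line(line: str) -> dict:
    """Split '<epoch-ms> <message>' and extract the runtime and error type, if present."""
    stamp, _, message = line.partition(" ")
    # Work in integers to avoid float rounding on the millisecond part.
    seconds, millis = divmod(int(stamp), 1000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )
    runtime = RUNTIME_RE.search(message)
    error_type = ERROR_TYPE_RE.search(message)
    return {
        "time_utc": when.isoformat(timespec="milliseconds"),
        "runtime": runtime.group(1) if runtime else None,
        "error_type": error_type.group(1) if error_type else None,
    }


if __name__ == "__main__":
    # Shortened copy of the first log line above.
    print(summarize_log_line("1730609451371 INIT_START Runtime Version: python:3.12.v36"))
```

Applied to the first line of the log, this reports the runtime as python:3.12.v36, which is the detail that matters for the rest of the thread.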

I've tried the latest version of CDK and different Node.js versions (see #279).
I can't tell whether this is a ParallelCluster issue or an aws-eda-slurm-cluster issue. I'd like to see if you can reproduce it.

@gwolski
Author

gwolski commented Nov 3, 2024

I get the same problem if I try to use parallelcluster 3.9.3:

timestamp message
1730626340792 INIT_START Runtime Version: python:3.12.v36 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:188d9ca2e2714ff5637bd2bbe06ceb81ec3bc408a0f277dab104c14cd814b081
1730626341838 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626341904 INIT_REPORT Init Duration: 1112.04 ms Phase: init Status: error Error Type: Runtime.ImportModuleError
1730626342851 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626342920 INIT_REPORT Init Duration: 1002.02 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730626342920 START RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Version: $LATEST
1730626342934 END RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2
1730626342934 REPORT RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Duration: 1014.86 ms Billed Duration: 1015 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730626411011 [ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626411083 INIT_REPORT Init Duration: 1009.77 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730626411083 START RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Version: $LATEST
1730626411096 END RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2
1730626411096 REPORT RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Duration: 1022.26 ms Billed Duration: 1023 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError

I don't understand how python:3.12.v36 comes into play here, or where it is coming from. Is it somehow involved? Looking back to the May time frame, my successful deployment of 3.9.1 used 3.9.v56.
As noted, I was able to deploy a 3.11.1, but that has slurmctld crashing every 24 hours or so as per filed issue: aws/aws-parallelcluster#6529
That deployment successfully used 3.12.v36.

The 3.9.3 deployment uses Rocky 8.9 because parallelcluster build-image in 3.9.3 does not support Rocky 8.10: FSx support for Rocky 8.10 did not become available until 3.10.

@gwolski
Author

gwolski commented Nov 4, 2024

This is beginning to smell like a ParallelCluster issue, if CreateParallelCluster is a ParallelCluster command.
I find this with a bit of googling:
https://stackoverflow.com/questions/76667202/aws-lambda-function-cant-find-module

This suggests that jsonschema imports rpds.rpds. This article is a bit old as well.

I built 3.10.1 and 3.11.1 parallelcluster virtual environments, and I see the same jsonschema version in both:

$ pip freeze | grep jsonschema
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
$

I don't see an rpds package installed in either environment, but I can install it.

I do see this rpds-py package:

$ pip freeze |grep rpds
rpds-py==0.20.1

So I'm not quite seeing how 3.11.1 can work and yet 3.9.3 and 3.10.1 fail.

Let me know if I should file this as an issue in parallelcluster.

@gwolski
Author

gwolski commented Nov 4, 2024

Please also see this commentary about CDK: aws/aws-cdk#26300
and note at the end there is a link to a parallelcluster issue by hanwen-pcluster:
aws/aws-parallelcluster#6460
and one by gmarciani
aws/aws-parallelcluster#6465

It doesn't all quite make sense to me yet.

@cartalla cartalla self-assigned this Nov 4, 2024
cartalla added a commit that referenced this issue Nov 4, 2024
PC 3.11.1 updated the Lambda layer to use Python 3.12.
Previous version required Python 3.9.

Set the Lambda Python runtime based on the ParallelCluster version so that
the application continues to work on older versions of PC.

This bug was introduced in #270.

Related to #270

Resolves #280
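
The version-dependent choice described in this commit message could look roughly like the following. This is a minimal sketch, not the code from the actual commit: the function names and the exact cutoff are assumptions taken from the message above (Python 3.12 for PC 3.11.1, Python 3.9 for earlier versions).

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '3.10.1' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def lambda_runtime_for(pc_version: str) -> str:
    """Pick a Lambda Python runtime identifier for a ParallelCluster version.

    Assumed cutoff: 3.11.1 and later use Python 3.12, earlier versions use Python 3.9.
    """
    if parse_version(pc_version) >= (3, 11, 1):
        return "python3.12"
    return "python3.9"


if __name__ == "__main__":
    for version in ("3.9.3", "3.10.1", "3.11.1"):
        print(version, "->", lambda_runtime_for(version))
```

Comparing parsed integer tuples rather than raw strings avoids treating "3.9.3" as newer than "3.10.1".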
@cartalla
Contributor

cartalla commented Nov 4, 2024

Reproduced. This bug was introduced by #270: I switched to the Python 3.12 Lambda runtime for PC 3.11.1 without maintaining backward compatibility with older PC versions.

Fix incoming.

@gwolski
Author

gwolski commented Nov 5, 2024

With a recent pull, I can confirm that I was able to create a 3.10.1 cluster and successfully run an srun job from the head_node. Thank you for the quick fix.

Sadly, the version of Slurm (23.11.7) supported by 3.10.1 still seems to have some issues with cloud machines, as noted in the release notes for Slurm 23.11.10, which was deployed with 3.11.0. From the 23.11.10 release notes:
"[23.11.10 has] fixes for jobs potentially being stuck when using cloud nodes when some nodes are powered down"
But until issue aws/aws-parallelcluster#6529 gets resolved, I'm going to deploy this release.

@cartalla
Contributor

cartalla commented Nov 5, 2024 via email
