Unable to create 3.10.1 cluster #280

gwolski · 2024-11-03T05:28:17Z

I've pulled a clean clone of aws-eda-slurm-cluster.
After running source setup.sh, I'm able to create a 3.11.1 cluster with install.sh --cdk-cmd create.
I then created a new config YAML file, changing only the stack name, the cluster version, and my custom AMI (now a 3.10.1 AMI).
Here is the diff of the config files:

diff tsi3fcs-x86_instance_types.yml tsi3-3-10-1-x86_instance_types.yml
11c11
< StackName: tsi3fcs-config
---
> StackName: tsi3-3-10-1-config
42c42
<     Version: 3.11.1
---
>     Version: 3.10.1
46c46
<       CustomAmi: ami-0d68c6538cd916c25  # pcluster-3-11-1-Rocky-8-x86-64-ami-0d0023cdec9d16c99 2024-10-24T03-12-55.412Z
---
>       CustomAmi: ami-03ff401c2420be54f   # pcluster-3-10-1-Rocky-8-10-x86-64-ami-0dae235de369c403c

I am not able to create a 3.10.1 cluster. The CloudWatch log group /aws/lambda/tsi3-3-10-1-config-CreateParallelClusterConfig shows that the Lambda is unable to import the module rpds.rpds:

timestamp	message
1730609451371	INIT_START Runtime Version: python:3.12.v36 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:188d9ca2e2714ff5637bd2bbe06ceb81ec3bc408a0f277dab104c14cd814b081
1730609452483	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609452564	INIT_REPORT Init Duration: 1193.71 ms Phase: init Status: error Error Type: Runtime.ImportModuleError
1730609453572	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609453648	INIT_REPORT Init Duration: 1069.94 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609453648	START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609453663	END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609453663	REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1083.74 ms Billed Duration: 1084 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730609520786	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609520874	INIT_REPORT Init Duration: 1070.40 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609520874	START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609520885	END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609520885	REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1081.01 ms Billed Duration: 1082 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730609636990	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730609637064	INIT_REPORT Init Duration: 1036.39 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730609637064	START RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Version: $LATEST
1730609637077	END RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710
1730609637077	REPORT RequestId: fa33bc2e-d8d1-4f6b-b369-81dc3079d710 Duration: 1048.69 ms Billed Duration: 1049 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError

I've tried the latest version of CDK and different nodejs versions (see #279).
I can't tell whether this is a parallelcluster issue or an aws-eda-slurm-cluster issue. I'd like to see if you can reproduce it.

gwolski · 2024-11-03T09:47:48Z

I get the same problem if I try to use parallelcluster 3.9.3:

timestamp	message
1730626340792	INIT_START Runtime Version: python:3.12.v36 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:188d9ca2e2714ff5637bd2bbe06ceb81ec3bc408a0f277dab104c14cd814b081
1730626341838	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626341904	INIT_REPORT Init Duration: 1112.04 ms Phase: init Status: error Error Type: Runtime.ImportModuleError
1730626342851	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626342920	INIT_REPORT Init Duration: 1002.02 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730626342920	START RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Version: $LATEST
1730626342934	END RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2
1730626342934	REPORT RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Duration: 1014.86 ms Billed Duration: 1015 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError
1730626411011	[ERROR] Runtime.ImportModuleError: Unable to import module 'CreateParallelClusterConfig': No module named 'rpds.rpds' Traceback (most recent call last):
1730626411083	INIT_REPORT Init Duration: 1009.77 ms Phase: invoke Status: error Error Type: Runtime.ImportModuleError
1730626411083	START RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Version: $LATEST
1730626411096	END RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2
1730626411096	REPORT RequestId: 37d926c6-aea3-4025-b79c-bbf595e460a2 Duration: 1022.26 ms Billed Duration: 1023 ms Memory Size: 2048 MB Max Memory Used: 80 MB Status: error Error Type: Runtime.ImportModuleError

I don't understand how python:3.12.v36 comes into play here or where it comes from. Is it somehow involved? Looking back to the May time frame, my successful deployment of 3.9.1 was with 3.9v56.
As noted, I was able to deploy 3.11.1, but that has slurmctld crashing every 24 hours or so, as per the filed issue aws/aws-parallelcluster#6529.
That deployment successfully used 3.12.v36.
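
For reference, the python:3.12.v36 string is taken from the INIT_START line of the Lambda log, where Lambda reports the runtime version the function is running on. The following is a minimal sketch for pulling that identifier out of an exported log line so that runs can be compared. The function name parse_runtime and the regular expression are illustrative assumptions, not part of aws-eda-slurm-cluster.

```python
import re

# Matches the runtime identifier in a Lambda INIT_START log line,
# e.g. "python:3.12.v36" -> language "python", version "3.12", build "v36".
INIT_START_RE = re.compile(
    r"INIT_START Runtime Version: (?P<lang>\w+):(?P<version>[\d.]+)\.(?P<build>v\d+)"
)


def parse_runtime(log_line):
    """Return (language, version, build) from an INIT_START line, or None."""
    match = INIT_START_RE.search(log_line)
    if match is None:
        return None
    return match.group("lang"), match.group("version"), match.group("build")


if __name__ == "__main__":
    line = "1730609451371\tINIT_START Runtime Version: python:3.12.v36 Runtime Version ARN: ..."
    print(parse_runtime(line))
```

Lines other than INIT_START (for example START, END, or REPORT) yield None, so the function can be applied to every line of an exported log.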

The 3.9.3 deployment uses Rocky 8.9 because parallelcluster build-image does not support Rocky 8.10 in 3.9.3: the FSx support it needs did not become available for Rocky 8.10 until 3.10.

gwolski · 2024-11-04T09:47:25Z

This is beginning to smell like a parallelcluster issue, if CreateParallelCluster is a parallelcluster command.
I found this with a bit of googling:
https://stackoverflow.com/questions/76667202/aws-lambda-function-cant-find-module

This suggests that jsonschema imports rpds.rpds. The post is a bit old, though.

I built 3.10.1 and 3.11.1 parallelcluster virtual environments, and I see the same jsonschema version in both:

$ pip freeze | grep jsonschema
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
$

I don't see an rpds package installed in either environment, but I can install it.

I do see this rpds-py package:

$ pip freeze |grep rpds
rpds-py==0.20.1

So I'm not quite seeing how 3.11.1 can work and yet 3.9.3 and 3.10.1 fail.

Let me know if I should file this as an issue in parallelcluster.
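
A note on the rpds question above: the rpds-py distribution is what provides the rpds import package, and its core is a compiled extension module named rpds.rpds. A compiled extension built for one Python version is generally not importable from another, which would be consistent with identical pip freeze output on both sides and a failure only under a mismatched runtime. The sketch below is one way to check whether the running interpreter can locate these modules. The helper name can_locate is an assumption made for illustration.

```python
import importlib.util
import sys


def can_locate(module_name):
    """Return True if the running interpreter can locate module_name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package is missing, or when importing the
        # parent fails because its compiled submodule cannot be found.
        return False


if __name__ == "__main__":
    print("Python", ".".join(str(part) for part in sys.version_info[:2]))
    for name in ("jsonschema", "rpds", "rpds.rpds"):
        print(name, "->", "found" if can_locate(name) else "missing")
```

Running this with the same Python version as the Lambda runtime, against the same site-packages that are packaged for the Lambda, should show whether rpds.rpds is usable there.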

gwolski · 2024-11-04T10:21:10Z

Please also see this commentary about CDK: aws/aws-cdk#26300
and note at the end there is a link to a parallelcluster issue by hanwen-pcluster:
aws/aws-parallelcluster#6460
and one by gmarciani
aws/aws-parallelcluster#6465

It doesn't all quite make sense to me yet.

PC 3.11.1 updated the Lambda layer to use Python 3.12. Previous versions required Python 3.9. Set the Lambda Python runtime based on the ParallelCluster version so that the application continues to work on older versions of PC. This bug was introduced in #270. Related to #270. Resolves #280.

cartalla · 2024-11-04T17:53:45Z

Reproduced. This bug was introduced by #270. I switched to the Python 3.12 Lambda runtime for PC 3.11.1 without maintaining backward compatibility with older PC versions.

Fix incoming.
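
The idea behind the fix, choosing the Lambda Python runtime from the ParallelCluster version, can be sketched as follows. This is not the code from the actual fix. The function names are hypothetical, and the 3.11.1 cutoff is taken from the statement in this thread that PC 3.11.1 moved to Python 3.12 while previous versions required Python 3.9. The thread does not say which runtime 3.11.0 needs, so this sketch simply treats it as an older version.

```python
def parse_version(version):
    """Turn a version string such as '3.10.1' into a comparable tuple (3, 10, 1)."""
    return tuple(int(part) for part in version.split("."))


def lambda_python_runtime(pc_version):
    """Pick a Lambda runtime identifier for a given ParallelCluster version.

    Assumption: 3.11.1 and later use Python 3.12, earlier versions use Python 3.9.
    """
    if parse_version(pc_version) >= (3, 11, 1):
        return "python3.12"
    return "python3.9"


if __name__ == "__main__":
    for version in ("3.9.3", "3.10.1", "3.11.1"):
        print(version, "->", lambda_python_runtime(version))
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would rank "3.9.3" above "3.11.1".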

gwolski · 2024-11-05T05:20:11Z

With a recent pull, I can confirm that I was able to create a 3.10.1 cluster and successfully ran an srun job from the head_node. Thank you for the quick fix.

Sadly, the version of Slurm supported by 3.10.1 (23.11.7) still seems to have some issues with cloud machines, as noted in the release notes for Slurm 23.11.10, the version deployed with 3.11.0. From the 23.11.10 release notes:
"[23.11.10 has] fixes for jobs potentially being stuck when using cloud nodes when some nodes are powered down"
But until issue aws/aws-parallelcluster#6529 gets resolved, I'm going to deploy this release.

cartalla · 2024-11-05T18:14:27Z

When you get a minute, what do you think of the following proposal? #283

gwolski mentioned this issue Nov 3, 2024

wiki instructions for Option 2 fix of (3.8.0 ‐ 3.9.3) ParallelCluster Build Image Failing during Installation of Minitar Ruby Gem Dependency aren't quite right. aws/aws-parallelcluster#6530

Closed

cartalla self-assigned this Nov 4, 2024

cartalla mentioned this issue Nov 4, 2024

Fix Lambda python version for previous versions of ParallelCluster. #281

Merged

cartalla linked a pull request Nov 4, 2024 that will close this issue

Fix Lambda python version for previous versions of ParallelCluster. #281

Merged

cartalla closed this as completed in #281 Nov 4, 2024

cartalla closed this as completed in 155193c Nov 4, 2024

cartalla mentioned this issue Nov 5, 2024

nodejs is unsupported version in setup.sh #279

Open
