Unable to create 3.10.1 cluster #280
Comments
I get the same problem if I try to use parallelcluster 3.9.3:
I'm not sure I understand how Python3.12.v36 comes into play here, or where it is coming from. Is it somehow involved? Looking back to the May time frame, my successful deployment of 3.9.1 was with 3.9v56. The 3.9.3 deployment uses Rocky 8.9 because parallelcluster build-image does not support Rocky 8.10 in 3.9.3: FSx support for Rocky 8.10 didn't become available until 3.10. |
This is beginning to smell like a parallelcluster issue, if CreateParallelCluster is a parallelcluster command. This suggests that jsonschema imports rpds.rpds. This article is a bit old as well.
I built 3.10.1 and 3.11.1 parallelcluster virtual environments, and I see the same jsonschema version in both:
$ pip freeze | grep jsonschema
I don't see an rpds package installed in either environment, but I can install it. I do see this rpds-py package:
$ pip freeze | grep rpds
So I'm not quite seeing how 3.11.1 can work and yet 3.9.3 and 3.10.1 fail. Let me know if I should file this as an issue in parallelcluster. |
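The pip freeze checks above can also be done from inside the interpreter, which has the advantage of actually attempting the import that fails in the Lambda. This is only a minimal sketch: the helper names are made up for illustration, and it relies on the fact that the rpds-py distribution installs an importable package named rpds whose compiled submodule is rpds.rpds.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def import_error(module_name):
    """Try to import a module; return None on success, else the error text."""
    try:
        importlib.import_module(module_name)
        return None
    except ImportError as exc:
        return str(exc)


if __name__ == "__main__":
    # Distribution names as they appear in `pip freeze`.
    for dist in ("jsonschema", "rpds-py"):
        print(dist, installed_version(dist))
    # The module the Lambda log complains about.
    print("rpds.rpds import error:", import_error("rpds.rpds"))
```

Running this in each virtual environment with the same Python version as the target runtime should show whether the package is missing or present but not loadable.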
Please also see this commentary about CDK: aws/aws-cdk#26300. It doesn't all quite make sense to me yet. |
Reproduced. This bug was introduced by #270. I updated to the Python 3.12 lambda runtime for PC 3.11.1 without maintaining backward compatibility with older PC versions. Fix incoming. |
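For illustration only, backward compatibility of this kind could be kept by choosing the Lambda runtime from the ParallelCluster version. This is a sketch under assumptions, not the actual fix: the function names are hypothetical, the 3.11.1 threshold is inferred from the comment above, and the legacy runtime string is a placeholder because the thread does not say which runtime older PC versions need.

```python
def parse_version(version):
    """Turn a dotted version such as '3.10.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# From the thread: the Python 3.12 runtime was adopted together with PC 3.11.1.
NEW_RUNTIME_MIN_VERSION = "3.11.1"  # ASSUMPTION: exact cutoff is not stated
NEW_RUNTIME = "python3.12"
# ASSUMPTION: placeholder value; the thread does not name the older runtime.
LEGACY_RUNTIME = "python3.9"


def lambda_runtime_for(pc_version, legacy_runtime=LEGACY_RUNTIME):
    """Pick a Lambda runtime string for a given ParallelCluster version."""
    if parse_version(pc_version) >= parse_version(NEW_RUNTIME_MIN_VERSION):
        return NEW_RUNTIME
    return legacy_runtime


if __name__ == "__main__":
    for version in ("3.9.3", "3.10.1", "3.11.1"):
        print(version, lambda_runtime_for(version))
```

Comparing parsed tuples rather than strings matters here, since "3.9.3" sorts after "3.10.1" as plain text.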
With a recent pull, I can confirm that I was able to create a 3.10.1 cluster and successfully ran an srun job from the head_node. Thank you for the quick fix. Sadly, the version of Slurm (23.11.7) supported by 3.10.1 still seems to have some issues with cloud machines, as noted in the release notes for Slurm 23.11.10, the version deployed with 3.11.0. Release notes from 23.11.10: |
When you get a minute, what do you think of the following proposal?
#283
On Monday, November 4, 2024 at 11:21 PM, Guntram Wolski wrote (Re: Unable to create 3.10.1 cluster, Issue #280):
With a recent pull, I can confirm that I was able to create a 3.10.1 cluster and successfully ran an srun job from the head_node. Thank you for the quick fix.
Sadly, the version of Slurm (23.11.7) supported by 3.10.1 still seems to have some issues with cloud machines, as noted in the release notes for Slurm 23.11.10, the version deployed with 3.11.0. Release notes from 23.11.10:
"[23.11.10 has] fixes for jobs potentially being stuck when using cloud nodes when some nodes are powered down"
But until issue aws/aws-parallelcluster#6529 gets resolved, I'm going to deploy this release.
|
I've pulled a clean clone of aws-eda-slurm-cluster.
After running source setup.sh, I'm able to create a 3.11.1 cluster with install.sh --cdk-cmd create.
I then create a new config yml file, changing only the cluster version and my custom AMI to be a 3.10.1 AMI.
Here is the diff of the config files:
I am not able to create a 3.10.1 cluster. The error message in the CloudWatch log group /aws/lambda/tsi3-3-10-1-config-CreateParallelClusterConfig complains that it is unable to import module rpds.rpds:
I've tried the latest version of CDK and different nodejs versions (see #279).
I can't tell whether this is a parallelcluster issue or an aws-eda-slurm-cluster issue. I would like to see if you can reproduce it.
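One common way a compiled module such as rpds.rpds becomes unimportable is that its extension file was built for a different Python version than the one running it. Whether that is what happened here is an assumption; the sketch below only shows how to test a file name against the suffixes an interpreter accepts. The function name and the example file names are hypothetical.

```python
import importlib.machinery


def loadable_as(filename, module_name, suffixes=None):
    """Return True if `filename` is a valid extension file for `module_name`.

    By default the suffixes of the running interpreter are used; pass an
    explicit list to reason about a different runtime.
    """
    if suffixes is None:
        suffixes = importlib.machinery.EXTENSION_SUFFIXES
    # The import system matches the module name plus one accepted suffix.
    return any(filename == module_name + suffix for suffix in suffixes)


if __name__ == "__main__":
    # Suffixes accepted by the interpreter running this script.
    print(importlib.machinery.EXTENSION_SUFFIXES)
    # Hypothetical file name, checked against the current interpreter.
    print(loadable_as("rpds.cpython-312-x86_64-linux-gnu.so", "rpds"))
```

Listing the files in the packaged rpds directory and checking them this way against the Lambda runtime's Python version would confirm or rule out a version mismatch.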