I am running a snakemake pipeline which has ~2000 tasks to complete, and I am allowing 200 jobs to be in queue at a time (limited by my max spot quota as well).
When I go to the jobs list in PCUI, it stalls a bit, error messages begin to appear, and eventually behind them the list of jobs appears.
Steps to reproduce the issue
Launch a lot of jobs, open the jobs tab in PCUI.
Expected behaviour
Job status errors
To see the jobs list as it appears with fewer jobs in queue.
Actual behaviour
Open Jobs status
Spins a moment
Every 10s or so, an error appears in a red bar. There are a variety:
Error: Expecting property name enclosed in double quotes: line 1176 column 5 (char 23980)
Error: Expecting value: line 1177 column 1 (char 23980)
Error: Expecting property name enclosed in double quotes: line 1176 column 4 (char 23980)
Eventually the jobs list does begin to appear.
Once the jobs list has appeared, it seems to load without error messages for a while.
Job ID Link Error
This is a bug that happens with every job, irrespective of the large number of jobs causing errors I report above. When clicking on a Job status ID from the ID column, I get the following error for every ID:
Error: not enough values to unpack (expected 2, got 1)
New, and only occurring when the large-number-of-jobs behavior is seen, I also get this error:
Error: An error occurred while trying to complete your request. Please try again later. If the problem persists, please contact support for further assistance.
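For context on the first message: "not enough values to unpack (expected 2, got 1)" is the standard Python ValueError raised when a sequence with one element is unpacked into two names. The thread does not say where in PCUI this happens, so the sketch below only illustrates the generic mechanism. The `split_pair` helper and the `JobId=42` input are hypothetical and are not taken from the PCUI code.

```python
def split_pair(text, sep="="):
    """Split 'key=value' into a (key, value) tuple.

    Hypothetical helper: if `sep` is absent, split() returns a single
    element and the two-name unpacking below raises ValueError.
    """
    key, value = text.split(sep, 1)
    return key, value


# Well-formed input unpacks into two values.
print(split_pair("JobId=42"))

# Input without the separator triggers the same kind of error as in the report.
try:
    split_pair("42")
except ValueError as err:
    print(f"Error: {err}")
```

If the job ID link passes a value that lacks an expected delimiter, a pattern like this would fail for every job, which matches the "every ID" symptom. That is an assumption, not a confirmed diagnosis.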
Required info
In order to help us determine the root cause of the issue, please provide the following information:
Region PCUI : us-west-2
AZ of cluster: us-west-2d
version PCUI: public.ecr.aws/pcm/parallelcluster-ui:2024.10.0 (is this it?)
version pcluster: 3.11.1
Additional info
The following information is not required but helpful:
I connect to the pcui from a mac via chrome
If having problems with cluster creation or update
My cluster yaml:
---
Region: us-west-2
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: r7i.2xlarge
  Networking:
    ElasticIp: true
    SubnetId: subnet-pub
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: KEY # must be ed25519 for ubuntu
    AllowedIps: "0.0.0.0/0" # SET THIS TO YOUR DESIRED FILTER
  Dcv:
    Enabled: false
  LocalStorage:
    RootVolume:
      Size: 775
      VolumeType: gp3
      DeleteOnTermination: true
    EphemeralVolume:
      MountDir: /head_root
  CustomActions:
    OnNodeConfigured:
      Script:
        s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh # head and each compute can have different scripts if desired
      Args:
        - us-west-2
        - BUCKET
        - na
        - na
  Iam:
    S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: false
    ScaledownIdletime: 5
    Dns:
      DisableManagedDns: false
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
    - Name: i8
      CapacityType: SPOT
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: r7gb64
          Instances:
            - InstanceType: r7i.2xlarge
          MinCount: 0
          MaxCount: 22
          SpotPrice: 1.2488 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
        - Name: r6gb64
          Instances:
            - InstanceType: r6i.2xlarge
          MinCount: 0
          MaxCount: 22
          SpotPrice: 1.2462 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
      Networking:
        SubnetIds:
          - subnet-012424a948f57e9ee
      CustomActions:
        OnNodeConfigured:
          Script:
            s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
          Args:
            - us-west-2
            - BUCKET
            - na
            - na
      Iam:
        S3Access:
          - BucketName: BUCKET
            EnableWriteAccess: false
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - Name: i128
      CapacityType: SPOT
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: c6gb256
          Instances:
            - InstanceType: c6i.metal
            - InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 22
          SpotPrice: 1.8034 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
        - Name: m6gb512
          Instances:
            - InstanceType: m6i.32xlarge
            - InstanceType: m6i.metal
          MinCount: 0
          MaxCount: 22
          SpotPrice: 2.1581 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
        - Name: r6gb1024r6
          Instances:
            - InstanceType: r6i.metal
            - InstanceType: r6i.32xlarge
          MinCount: 0
          MaxCount: 22
          SpotPrice: 2.0494 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
      Networking:
        SubnetIds:
          - subnet-012424a948f57e9ee
      CustomActions:
        OnNodeConfigured:
          Script:
            s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
          Args:
            - us-west-2
            - BUCKET
            - na
            - na
      Iam:
        S3Access:
          - BucketName: BUCKET
            EnableWriteAccess: false
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - Name: i192
      CapacityType: SPOT
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: c7gb384
          Instances:
            - InstanceType: c7i.48xlarge
            - InstanceType: c7i.metal-48xl
          MinCount: 0
          MaxCount: 22
          SpotPrice: 2.4093 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
        - Name: m7gb768
          Instances:
            - InstanceType: m7i.metal-48xl
            - InstanceType: m7i.48xlarge
          MinCount: 0
          MaxCount: 22
          SpotPrice: 2.5016 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
        - Name: r7gb1536
          Instances:
            - InstanceType: r7i.48xlarge
            - InstanceType: r7i.metal-48xl
          MinCount: 0
          MaxCount: 22
          SpotPrice: 2.209 # Calculated using (median spot price)+1.01.
          Networking:
            PlacementGroup:
              Enabled: false
          Efa:
            Enabled: false
      Networking:
        SubnetIds:
          - subnet-012424a948f57e9ee
      CustomActions:
        OnNodeConfigured:
          Script:
            s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh
          Args:
            - us-west-2
            - BUCKET
            - na
            - na
      Iam:
        S3Access:
          - BucketName: BUCKET
            EnableWriteAccess: false
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Monitoring:
  DetailedMonitoring: false
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 3 # must be 0,1,3,5,7,14,30,60,90...
SharedStorage: # This is the local FS which is expensive but fast, could be swapped for EFS, etc.
  - MountDir: /fsx # The cost of this will be roughly $22.93 per day, so it should not be kept hot unless in active use.
    Name: fsx-daylily-07123j # WARNING, EDIT NAME WILL DEL EXISTING DATA
    StorageType: FsxLustre
    FsxLustreSettings:
      ImportPath: s3://BUCKET/data/
      StorageCapacity: 4800
      DeploymentType: SCRATCH_2
      AutoImportPolicy: NEW_CHANGED_DELETED
      DeletionPolicy: Retain # Set to true to keep the FSX after the cluster is deleted
Tags: # TAGs necessary for per-user/project/job cost tracking
  - Key: aws-parallelcluster-username
    Value: daylily
  - Key: aws-parallelcluster-jobid
    Value: NA
  - Key: aws-parallelcluster-project
    Value: da-us-west-2d-daylily-07123j
  - Key: aws-parallelcluster-clustername
    Value: daylily-07123j
  - Key: aws-parallelcluster-enforce-budget
    Value: enforce
DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 3600
    ComputeNodeBootstrapTimeout: 3600
...
If having problems with custom image creation
n/a
Cluster: a simple cluster with 1 compute resource of 10 dynamic nodes
Region: doesn't matter, used us-east-1
Jobs: submitted a burst of 10000 sleep jobs
Root cause
When the number of jobs in queue is 500+, the output of the SSM command invoked here to retrieve the queue status is truncated by SSM because the output of the squeue command is too long (SSM truncates at 24000 chars).
The truncated output cannot be parsed as JSON here, causing a JSON parsing error: Expecting value: line 1579 column 14 (char 23980).
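The failure mode described above can be reproduced without a cluster. This is a minimal sketch, assuming only the 24000-character limit quoted in this comment. The job fields and the `fake_squeue_json` / `parse_like_ssm` helpers are made up for illustration and are not the PCUI code or the real squeue schema.

```python
import json

SSM_OUTPUT_LIMIT = 24000  # truncation limit quoted in the root-cause analysis


def fake_squeue_json(n_jobs):
    """Build a squeue-like JSON document (hypothetical field names)."""
    jobs = [
        {"job_id": i, "name": f"sleep-{i}", "job_state": "PENDING"}
        for i in range(n_jobs)
    ]
    return json.dumps({"jobs": jobs}, indent=2)


def parse_like_ssm(payload, limit=SSM_OUTPUT_LIMIT):
    """Cut the payload at `limit` characters, then parse what is left."""
    return json.loads(payload[:limit])


small = fake_squeue_json(10)    # well under the limit
large = fake_squeue_json(2000)  # well over the limit

# A short queue survives the round trip untouched.
print(len(parse_like_ssm(small)["jobs"]))

# A long queue is cut mid-document, so the parser fails near the cut.
try:
    parse_like_ssm(large)
except json.JSONDecodeError as err:
    print(f"Error: {err}")
```

The exact message depends on which token the cut lands in, which is consistent with the report showing several different JSON errors.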
We are working on the resolution and will keep you posted here.
gmarciani added a commit to gmarciani/aws-parallelcluster-ui that referenced this issue on Feb 3, 2025