Large numbers of jobs cause slow loading and many error messages in `Job status` tab #376

iamh2o · 2024-11-24T22:18:27Z

Description

PCUI has been amazing. Thank you. My bug:
- I am running a snakemake pipeline which has ~2000 tasks to complete, and I am allowing 200 jobs to be in queue at a time (limited by my max spot quota as well).
- When I go to the jobs list in PCUI, it stalls a bit, error messages begin to appear, and eventually behind them the list of jobs appears.

Steps to reproduce the issue

Launch a lot of jobs, open the jobs tab in PCUI.

Expected behaviour

Job status errors

To see the jobs list as it appears with fewer jobs in queue.

Actual behaviour

Open Jobs status
Spins a moment
Every 10s or so, an error appears in a red bar. There are a variety:

Error: Expecting property name enclosed in double quotes: line 1176 column 5 (char 23980)

Error: Expecting value: line 1177 column 1 (char 23980)

Error: Unterminated string starting at: line 1176 column 5 (char 23973)

Error: Expecting property name enclosed in double quotes: line 1176 column 4 (char 23980)

The jobs list does eventually begin to appear.
Once the jobs list has appeared, it seems to load without error messages for a while.

Job ID Link Error

This is a bug that happens with every job, irrespective of the large number of jobs causing errors I report above. When clicking on a Job status ID from the ID column, I get the following error for every ID:

Error: not enough values to unpack (expected 2, got 1)

New, and only occurring when the large-number-of-jobs behavior is seen, I also get this error:

Error: An error occurred while trying to complete your request. Please try again later. If the problem persists, please contact support for further assistance.
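
For reference, `not enough values to unpack (expected 2, got 1)` is the message Python raises when a two-name assignment receives a sequence with only one item. The sketch below reproduces the message in isolation. It is not a diagnosis of PCUI's code: the function name and the `=`-separated input are hypothetical, chosen only to illustrate the error class.

```python
def split_key_value(text, sep="="):
    """Split 'key=value' into a pair (hypothetical helper, for illustration only)."""
    # If `sep` is absent, split() returns a single-item list and the
    # two-name unpacking below raises ValueError.
    key, value = text.split(sep)
    return key, value


ok_pair = split_key_value("JobId=42")

try:
    split_key_value("JobId")  # no separator present
    message = None
except ValueError as exc:
    message = str(exc)

print(ok_pair)  # ('JobId', '42')
print(message)  # not enough values to unpack (expected 2, got 1)
```

An input that is missing an expected delimiter is one common way to hit this; whether that is what happens behind the Job ID link is an open question here.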

Required info

In order to help us determine the root cause of the issue, please provide the following information:

Region PCUI : us-west-2
AZ of cluster: us-west-2d
version PCUI: public.ecr.aws/pcm/parallelcluster-ui:2024.10.0 (is this it?)
version pcluster: 3.11.1

Additional info

The following information is not required but helpful:

I connect to the pcui from a mac via chrome

If having problems with cluster creation or update

My cluster yaml:

---
Region: us-west-2  
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: r7i.2xlarge
  Networking:
    ElasticIp: true
    SubnetId: subnet-pub 
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: KEY  # must be ed25519 for ubuntu
    AllowedIps: "0.0.0.0/0" # SET THIS TO YOUR DESIRED FILTER
  Dcv:
    Enabled: false
  LocalStorage:
    RootVolume:
      Size: 775
      VolumeType: gp3
      DeleteOnTermination: true
    EphemeralVolume:
      MountDir: /head_root
  CustomActions:
    OnNodeConfigured:
      Script: 
        s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh       # head and each compute can have different scripts if desired
      Args:
      - us-west-2
      - BUCKET
      - na
      - na
  Iam:
    S3Access:
    - BucketName: BUCKET
      EnableWriteAccess: false
    AdditionalIamPolicies:
    - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
    - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: false
    ScaledownIdletime: 5
    Dns:
      DisableManagedDns: false
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
  - Name: i8
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: r7gb64
      Instances:
      - InstanceType: r7i.2xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.2488 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r6gb64
      Instances:
      - InstanceType: r6i.2xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.2462 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: i128
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: c6gb256
      Instances:
      - InstanceType: c6i.metal
      - InstanceType: c6i.32xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.8034 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: m6gb512
      Instances:
      - InstanceType: m6i.32xlarge
      - InstanceType: m6i.metal
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.1581 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r6gb1024r6
      Instances:
      - InstanceType: r6i.metal
      - InstanceType: r6i.32xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.0494 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: i192
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: c7gb384
      Instances:
      - InstanceType: c7i.48xlarge
      - InstanceType: c7i.metal-48xl
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.4093 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: m7gb768
      Instances:
      - InstanceType: m7i.metal-48xl
      - InstanceType: m7i.48xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.5016 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r7gb1536
      Instances:
      - InstanceType: r7i.48xlarge
      - InstanceType: r7i.metal-48xl
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.209 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Monitoring:
  DetailedMonitoring: false
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 3  # must be 0,1,3,5,7,14,30,60,90...
SharedStorage:  # This is the local FS which is expensive but fast, could be swapped for EFS, etc.
- MountDir: /fsx    # The cost of this will be roughly $22.93 per day, so it should not be kept hot unless in active use.
  Name: fsx-daylily-07123j     # WARNING, EDIT NAME WILL DEL EXISTING DATA
  StorageType: FsxLustre
  FsxLustreSettings:
    ImportPath: s3://BUCKET/data/
    StorageCapacity: 4800
    DeploymentType: SCRATCH_2
    AutoImportPolicy: NEW_CHANGED_DELETED
    DeletionPolicy: Retain    # Retain keeps the FSx file system after the cluster is deleted
Tags:  # TAGs necessary for per-user/project/job cost tracking 
- Key: aws-parallelcluster-username
  Value: daylily
- Key: aws-parallelcluster-jobid
  Value: NA
- Key: aws-parallelcluster-project
  Value: da-us-west-2d-daylily-07123j
- Key: aws-parallelcluster-clustername
  Value: daylily-07123j
- Key: aws-parallelcluster-enforce-budget
  Value: enforce
DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 3600
    ComputeNodeBootstrapTimeout: 3600
...

If having problems with custom image creation

n/a

gmarciani · 2025-01-30T18:12:52Z

Hi @iamh2o ,

thank you for your interest in PCUI, and sorry for the late response.

We have tracked the bug fix in our backlog and will keep you posted here once the fix is planned.

Thanks,
Giacomo

gmarciani · 2025-01-30T22:27:34Z

I was able to reproduce the issue.

Reproducer

PCUI 2024.11.0
PC 3.11.1
Cluster: a simple cluster with 1 compute resource of 10 dynamic nodes
Region: doesn't matter, used us-east-1
Jobs: submitted a burst of 10000 sleep jobs
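
A burst like this could be generated with a short script along the following lines. This is a minimal sketch, not the script used for the reproducer above: the helper names and the 600-second sleep are assumptions, and it relies only on `sbatch --wrap`, which wraps a command string in a batch script.

```python
import subprocess


def build_sbatch_commands(n_jobs, sleep_seconds=600):
    """Return one sbatch argv per job; each job only sleeps (duration is an assumption)."""
    return [["sbatch", f"--wrap=sleep {sleep_seconds}"] for _ in range(n_jobs)]


def submit_all(commands):
    """Submit every command; requires sbatch on PATH, e.g. on the cluster head node."""
    for argv in commands:
        subprocess.run(argv, check=True, capture_output=True)


commands = build_sbatch_commands(10000)
# submit_all(commands)  # uncomment when running on a head node
```

The submission call is left commented out so the snippet can be inspected anywhere; only `build_sbatch_commands` runs without a Slurm installation.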

Root cause
When the number of jobs in queue is 500+, the output of the SSM command invoked here to retrieve the queue status is truncated by SSM because the output of the squeue command is too long (SSM truncates at 24000 chars).

The truncated output cannot be parsed as JSON here, causing a JSON parsing error such as: Expecting value: line 1579 column 14 (char 23980).
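
The failure mode can be illustrated outside PCUI with the standard `json` module. The sketch below builds a squeue-like JSON document, cuts it at 24000 characters as described above, and tries to parse the result. The job field names and the helper functions are hypothetical; only the 24000-character limit comes from this thread.

```python
import json

SSM_OUTPUT_LIMIT = 24000  # truncation limit cited in the root cause above


def fake_squeue_json(n_jobs):
    """Build a squeue-like JSON document (field names are illustrative only)."""
    jobs = [
        {"job_id": i, "name": f"sleep-{i}", "job_state": "PENDING"}
        for i in range(n_jobs)
    ]
    return json.dumps({"jobs": jobs}, indent=2)


def parse_truncated(output, limit=SSM_OUTPUT_LIMIT):
    """Cut `output` at `limit` characters, then return (parsed, error_message)."""
    truncated = output[:limit]
    try:
        return json.loads(truncated), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


small, small_err = parse_truncated(fake_squeue_json(10))    # fits under the limit
large, large_err = parse_truncated(fake_squeue_json(2000))  # cut mid-document

print(small_err)  # None
print(large_err)  # a JSONDecodeError message; wording depends on the cut point
```

The exact message depends on where the cut lands (inside a string, after a comma, before a value), which would be consistent with the variety of errors reported in the issue description.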

We are working on the resolution and will keep you posted here.

iamh2o added the bug Something isn't working label Nov 24, 2024

gmarciani added the Backlog Issue has been tracked in team backlog label Jan 30, 2025

gmarciani added a commit to gmarciani/aws-parallelcluster-ui that referenced this issue Feb 3, 2025

[DONOTMERGE] Drafted changes to fix issue causing burst of errors whe…

282db4f

…n queue status is requested for a cluster having 200+ jobs in queue. aws#376
