Slurm not able to start jobs in compute partition #854

Tristan-Kosciuch · 2023-01-24T19:16:22Z

Jobs that I submit to the compute partition get stuck in BeginTime. Once the start time is reached, the job is requeued. There are no other jobs running.

An example job:

sbatch -N1 -p compute --wrap="srun hostname"

My blueprint.yaml:

blueprint_name: slurm-lustre-dvmdostem-v5

vars:
  project_id:  ## GCP project ID ##
  deployment_name: slurm-lustre-dvmdostem-v5
  region: us-central1
  zone: us-central1-c
  lustre_mgs_ip: 10.0.0.218@tcp
  startup_timeouts: 300
  network_name: slurm-gcp-v5-net
  subnetwork_name: slurm-gcp-v5-primary-subnet
  family: dvmdostem-lustre-slurm-22-05-6-ubuntu
  project: spherical-berm-323321
  data_bucket: dvm-dos-tem-outputs
  
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: wcrc-tfstate-9486302

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/pre-existing-vpc

  - id: lustrefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: $(vars.lustre_mgs_ip)
      remote_mount: /exacloud
      local_mount: /mnt/exacloud
      fs_type: lustre

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-highcpu-2
      instance_image:
        family: $(vars.family)
        project: $(vars.project)

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - lustrefs
    # - datafs
    - debug_node_group
    settings:
      partition_name: debug
      enable_placement: false
      is_default: true

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 20
      machine_type: n2-highcpu-96
      instance_image:
        family: $(vars.family)
        project: $(vars.project)
      # set the two following options to true to use spot VMs that are much less expensive but may be stopped at any time
      # preemptible: false
      # enable_spot_vm: false
      # spot_instance_config:
      #   termination_action: "STOP"

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - lustrefs
    # - datafs
    - compute_node_group
    settings:
      partition_name: compute

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - lustrefs
    # - datafs
    - debug_partition
    - compute_partition
    settings:
      machine_type: n2-standard-2
      disk_type: pd-standard
      source_image_family: $(vars.family)
      source_image_project: $(vars.project)
      login_startup_scripts_timeout: $(vars.startup_timeouts)
      controller_startup_scripts_timeout: $(vars.startup_timeouts)
      compute_startup_scripts_timeout: $(vars.startup_timeouts)

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-2
      disk_type: pd-standard
      disable_login_public_ips: false
      source_image_family: $(vars.family)
      source_image_project: $(vars.project)

  - id: hpc_dashboard
    source: modules/monitoring/dashboard
    outputs: [instructions]

Tristan-Kosciuch · 2023-01-24T20:24:22Z

Tristan-Kosciuch
Jan 24, 2023
Author

I got it working. I'm not sure of the exact culprit, as I changed a few things at once. The changes I made were: increasing the controller VM size, restricting the compute partition to the same zone within us-central1 as the controller, reducing the compute VM size, and setting enable_placement: false.

My working blueprint:

blueprint_name: slurm-lustre-dvmdostem-v5

vars:
  project_id:  ## GCP project ID ##
  deployment_name: slurm-lustre-dvmdostem-v5
  region: us-central1
  zone: us-central1-c
  lustre_mgs_ip: 10.0.0.218@tcp
  startup_timeouts: 300
  network_name: slurm-gcp-v5-net
  subnetwork_name: slurm-gcp-v5-primary-subnet
  family: dvmdostem-lustre-slurm-22-05-6-ubuntu
  project: spherical-berm-323321
  data_bucket: dvm-dos-tem-outputs
  
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: wcrc-tfstate-9486302

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/pre-existing-vpc

  - id: lustrefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: $(vars.lustre_mgs_ip)
      remote_mount: /exacloud
      local_mount: /mnt/exacloud
      fs_type: lustre

  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 4
      machine_type: n2-highcpu-2
      instance_image:
        family: $(vars.family)
        project: $(vars.project)

  - id: debug_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - lustrefs
    - debug_node_group
    settings:
      partition_name: debug
      enable_placement: false
      is_default: true
      zone_policy_allow:
      - us-central1-c
      zone_policy_deny:
      - us-central1-f
      - us-central1-b
      - us-central1-a

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 30
      machine_type: n2-highcpu-48
      instance_image:
        family: $(vars.family)
        project: $(vars.project)

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - lustrefs
    - compute_node_group
    settings:
      partition_name: compute
      enable_placement: false
      zone_policy_allow:
      - us-central1-c
      zone_policy_deny:
      - us-central1-f
      - us-central1-b
      - us-central1-a

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - lustrefs
    - debug_partition
    - compute_partition
    settings:
      machine_type: n2-standard-4
      disk_type: pd-standard
      source_image_family: $(vars.family)
      source_image_project: $(vars.project)
      login_startup_scripts_timeout: $(vars.startup_timeouts)
      controller_startup_scripts_timeout: $(vars.startup_timeouts)
      compute_startup_scripts_timeout: $(vars.startup_timeouts)

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-2
      disk_type: pd-standard
      disable_login_public_ips: false
      source_image_family: $(vars.family)
      source_image_project: $(vars.project)

  - id: hpc_dashboard
    source: modules/monitoring/dashboard
    outputs: [instructions]

tpdownes · 2023-01-24T20:51:28Z

tpdownes
Jan 24, 2023
Maintainer

Hi @Tristan-Kosciuch! Thanks for reporting the problem. I think it's worth pursuing this a bit more. The log file that typically contains the most useful information about scaling machines up is /var/log/slurm/resume.log on the controller. Likewise, problems with scaling down typically show up in /var/log/slurm/suspend.log, also on the controller.

As an initial guess, setting enable_placement: false combined with the smaller VM size is probably what helped you. When enable_placement is true, it tells Compute Engine that you want the machines near one another so that network latency is minimized. This request, especially for larger VMs, may run into real-world constraints on the availability of hardware. So you should disable this feature when your applications are either single-node (no connections between nodes of the kind MPI applications make) or your TCP/MPI communication is relatively light and thus not very sensitive to network proximity.

In any case, I suggest you share any errors you see in resume.log from the time of the failed job submission.
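One way to pull the relevant lines out of resume.log is sketched below. It is only an illustration: the marker strings (`ERROR`, `Traceback`, `HttpError`) are assumptions about what a failure looks like in that log, and `find_error_lines` is a hypothetical helper, not part of Slurm or the toolkit.

```python
from pathlib import Path

# Assumed markers of a failure; adjust to whatever your log actually contains.
ERROR_MARKERS = ("ERROR", "Traceback", "HttpError")


def find_error_lines(log_text, markers=ERROR_MARKERS):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Path taken from the comment above; run this on the controller.
    log_path = Path("/var/log/slurm/resume.log")
    if log_path.exists():
        text = log_path.read_text(errors="replace")
        for number, line in find_error_lines(text):
            print(f"{number}: {line}")
    else:
        print(f"{log_path} not found; are you on the controller?")
```

Sharing the matching lines together with their approximate timestamps should be enough to line them up with the failed job submission.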

tpdownes Jan 27, 2023
Maintainer

It's a bit hard to speak to this issue, since it could be crashing for a variety of reasons without an error message.

A couple general points:

- If you are compiling a custom OpenMPI, I would probably do so with the built-in support for the Slurm scheduler rather than SSH. But that does not appear to be the issue if it works in one partition vs the other. You may be doing this already.
- Consider setting bandwidth_tier: tier_1_enabled on your compute partition as that will enable higher bandwidth and lower latency networking.

Tristan-Kosciuch Jan 27, 2023
Author

I've encountered the job stuck in BeginTime error again.

/var/log/slurm/resume.log

googleapiclient.errors.HttpError: <HttpError 503 when requesting https://compute.googleapis.com/compute/v1/projects/spherical-berm-323321/regions/us-central1/instances/bulkInsert?alt=json returned "Region does not currently have sufficient capacity for the requested resources.". Details: "[{'message': 'Region does not currently have sufficient capacity for the requested resources.', 'domain': 'global', 'reason': 'insufficientCapacity'}]">

/var/log/slurm/suspend.log

2023-01-27 18:33:19,112 INFO: epilog suspend slurmlustr-compute-ghpc-1 job_id=9
2023-01-27 18:33:19,150 INFO: epilog suspend slurmlustr-compute-ghpc-0 job_id=8
2023-01-27 18:33:19,537 INFO: suspend slurmlustr-compute-ghpc-[0-1]

I'm running this in us-central1-c using n2-highcpu-96 nodes. My project has a quota of 300 N2 CPUs, which this Slurm job does not reach. I can create an n2-highcpu-96 VM in us-central1-c from the GCP console UI.
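A quick sanity check of the numbers in this comment, as a sketch. The two-node count is read off the `slurmlustr-compute-ghpc-[0-1]` line in suspend.log, and the 96 vCPUs per node is read off the `n2-highcpu-96` machine type name; `error_reason` is a hypothetical helper that only handles the message format pasted above.

```python
import re


def error_reason(message):
    """Extract the 'reason' field from an HttpError message, or None if absent."""
    match = re.search(r"'reason':\s*'([^']+)'", message)
    return match.group(1) if match else None


def vcpus_requested(node_count, vcpus_per_node):
    """Total vCPUs a job needs across all of its nodes."""
    return node_count * vcpus_per_node


# Details string copied from the resume.log entry above.
details = (
    "[{'message': 'Region does not currently have sufficient capacity "
    "for the requested resources.', 'domain': 'global', "
    "'reason': 'insufficientCapacity'}]"
)

quota_n2_cpus = 300                      # project quota stated above
needed = vcpus_requested(2, 96)          # two n2-highcpu-96 nodes

print(error_reason(details))             # insufficientCapacity
print(needed, needed <= quota_n2_cpus)   # 192 True
```

The request fits under the quota and the reported reason is a capacity error, which is consistent with the quota not being the limiting factor here.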

Tristan-Kosciuch Feb 10, 2023
Author

Any insight into the insufficient capacity error? I don't think it's my project quota, and I'm not running massively big Slurm jobs.

tpdownes Feb 10, 2023
Maintainer

Please look at /slurm/scripts/config.yaml and confirm whether the compute partition has enable_placement set to false or true. If it is set to true, the most likely explanation is that there is nowhere in us-central1-c with two n2-highcpu-96 VMs available in a compact placement (i.e. close to each other network-wise).

If it is set to false, then you may have just gotten unlucky with the request by Slurm vs when you tried at the console (which would not use compact placement). You are correct that if you can get the VMs at the console, it's probably not a quota issue.

Things I would consider:

- Do you really need compact placement?
  - If no, then ensure it's disabled in the config.yaml file. (There is a mechanism to update this when you make changes to the blueprint and re-apply, but the YAML file is the source of truth for the API call, so confirm it there.)
  - If yes, you might consider a larger number of smaller VMs, because such a request is more likely to succeed when our cloud resources are fragmented (i.e. there may be four 48-core slots available even if there are not two 96-core slots).
- You might also try n2-standard-96 or allowing more zones.
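The fragmentation argument can be illustrated with a toy model. Everything in it is hypothetical: the list of free cores per host is invented for illustration and says nothing about real Compute Engine capacity, and real placement decisions involve more than core counts.

```python
def vms_that_fit(free_cores_per_host, vm_cores):
    """Count how many VMs of `vm_cores` cores fit, given free cores on each host.

    A VM cannot span hosts, so each host contributes free // vm_cores VMs.
    """
    return sum(free // vm_cores for free in free_cores_per_host)


# Invented example: 240 free cores in total, but scattered across hosts.
free_cores = [48, 64, 48, 80]

print(vms_that_fit(free_cores, 96))  # 0: no single host has 96 free cores
print(vms_that_fit(free_cores, 48))  # 4: one 48-core VM fits on each host
```

In this made-up layout the same total core count can be satisfied with four 48-core VMs but not with two 96-core VMs.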

Tristan-Kosciuch Feb 10, 2023
Author

I have

partitions:
  compute:
    enable_job_exclusive: true
    enable_placement_groups: false

I'll try using the c2 series of VMs and smaller VM size. Thanks
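The check of config.yaml discussed above can also be scripted. This is a rough sketch that assumes the file uses the two-space nested layout shown in the snippet, with the key name `enable_placement_groups` taken from that snippet. `partition_setting` is a hypothetical helper, and a real YAML parser (for example PyYAML's `yaml.safe_load`) would be more robust than this line scan.

```python
def partition_setting(config_text, partition, key):
    """Return the raw value of `key` under partitions.<partition>, or None.

    Assumes two-space indentation: 'partitions:' at column 0, the partition
    name at indent 2, and its settings indented further.
    """
    in_partitions = False
    in_target = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            in_partitions = stripped == "partitions:"
            in_target = False
        elif in_partitions and indent == 2:
            in_target = stripped == f"{partition}:"
        elif in_target and indent > 2:
            name, _, value = stripped.partition(":")
            if name == key:
                return value.strip()
    return None


# Snippet copied from the comment above.
sample = """\
partitions:
  compute:
    enable_job_exclusive: true
    enable_placement_groups: false
"""

print(partition_setting(sample, "compute", "enable_placement_groups"))  # false
```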
