Internal server error when submitting a job with a task that doesn't set MaxCores #3

Open
lupreCSC opened this issue Jun 3, 2024 · 1 comment


lupreCSC commented Jun 3, 2024

Another bug I just ran into: it occurs when a submitted job contains a single task that sets MinCores but not MaxCores, e.g.,

{
    "Name": "heappe_job",
    "ClusterId": 1,
    "ProjectId": 1,
    "FileTransferMethodId": 1,
    "Tasks": [
        {
            "Name": "task_1",
            "MinCores": 1,
            "WalltimeLimit": 600,
            "Priority": 4,
            "StandardOutputFile": "stdout.log",
            "StandardErrorFile": "stderr.log",
            "LogFile": "stdlog",
            "ProgressFile": "progress",
            "ClusterNodeTypeId": 1,
            "CommandTemplateId": 1,
            "TemplateParameterValue": [
                {
                    "CommandParameterIdentifier": "inputParam",
                    "ParameterValue": "testValue"
                }
            ]
        }
    ]
}

Creating the job works, but submitting it to a Slurm cluster via the SubmitJob endpoint results in a 500 error response: "Problem Problem occured! Contact the administrators."

Checking the API logs shows:

INFO  2024-06-03 16:04:26 HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <username> is submitting the job with info Id 117 
ERROR 2024-06-03 16:04:26 HEAppE.RestApi.ExceptionMiddleware - Sequence contains no elements 
System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.PrepareNameOfNodes(ICollection`1 requestedNodeGroups, Int32 nodeCount) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 353
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.SetRequestedResourceNumber(IEnumerable`1 requestedNodeGroups, ICollection`1 requiredNodes, String placementPolicy, IEnumerable`1 paralizationSpecs, Int32 minCores, Int32 maxCores, Int32 coresPerNode) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 262
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.SchedulerDataConvertor.ConvertTaskSpecificationToTask(JobSpecification jobSpecification, TaskSpecification taskSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/SchedulerDataConvertor.cs:line 91
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmDataConvertor.ConvertJobSpecificationToJob(JobSpecification jobSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmDataConvertor.cs:line 185
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmSchedulerAdapter.SubmitJob(Object connectorClient, JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmSchedulerAdapter.cs:line 71
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.RexSchedulerWrapper.SubmitJob(JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/RexSchedulerWrapper.cs:line 61
   at HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic.SubmitJob(Int64 createdJobInfoId, AdaptorUser loggedUser) in /src/BusinessLogicTier/Logic/JobManagement/JobManagementLogic.cs:line 133
   at HEAppE.ServiceTier.JobManagement.JobManagementService.SubmitJob(Int64 createdJobInfoId, String sessionCode) in /src/ServiceTier/JobManagement/JobManagementService.cs:line 49
   at HEAppE.RestApi.Controllers.JobManagementController.SubmitJob(SubmitJobModel model) in /src/RestApi/Controllers/JobManagementController.cs:line 82
   at lambda_method2342(Closure, Object, Object[])
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.SyncActionResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeActionMethodAsync()
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeNextActionFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextResourceFilter>g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeFilterPipelineAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
   at HEAppE.RestApi.ExceptionMiddleware.InvokeAsync(HttpContext context) in /src/RestApi/ExceptionMiddleware.cs:line 72

It appears that in SlurmTaskAdapter.SetRequestedResourceNumber (lines 258/259) the missing MaxCores results in a nodeCount of 0, which causes the conditional
if (requestedNodeGroups?.Count == nodeCount) in SlurmTaskAdapter.PrepareNameOfNodes to evaluate to true (for an empty list of requestedNodeGroups), which in turn causes the exception when First is called on that empty collection in line 353.
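To make the failure path easier to follow, here is a minimal Python model of it. It is not the HEAppE code, which is C#. The node count formula is an assumption, since the actual computation at lines 258/259 is not shown here, and an unset MaxCores is assumed to arrive as 0. The helper `first` is a hypothetical stand-in for `Enumerable.First`.

```python
import math
from typing import Optional, Sequence


def first(items: Sequence[str]) -> str:
    """Stand-in for Enumerable.First(): fails on an empty sequence."""
    for item in items:
        return item
    raise LookupError("Sequence contains no elements")


def compute_node_count(max_cores: Optional[int], cores_per_node: int) -> int:
    # ASSUMPTION: hypothetical formula; an unset MaxCores is treated as 0.
    return math.ceil((max_cores or 0) / cores_per_node)


def prepare_name_of_nodes(
    requested_node_groups: Optional[Sequence[str]], node_count: int
) -> str:
    # Models `requestedNodeGroups?.Count == nodeCount`:
    # a null collection yields null, which never equals an int.
    count = None if requested_node_groups is None else len(requested_node_groups)
    if count == node_count:
        # With an empty list and node_count == 0 this branch is taken
        # and first() fails, as in line 353 of the stack trace.
        return first(requested_node_groups)
    return ""
```

In this model, `prepare_name_of_nodes([], compute_node_count(None, 128))` raises, while a `None` collection or a non-zero node count does not reach `first` for an empty list.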

The following fixes would be advisable:

  • SlurmTaskAdapter.PrepareNameOfNodes should reject a nodeCount argument value of 0
  • The computation of nodeCount in SlurmTaskAdapter.SetRequestedResourceNumber should be robust to maxCores not being set. This probably means that the calling method has to provide a better default value (maybe maxCores = minCores in this case), OR that task creation should fail with a validation error if MaxCores is not set in the request
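The two suggestions above could look roughly like this. This is a Python sketch of the idea only, not a patch for the C# code; the function names `resolve_max_cores` and `require_positive_node_count` are hypothetical, and treating 0 as "not set" is an assumption.

```python
from typing import Optional


def resolve_max_cores(min_cores: int, max_cores: Optional[int]) -> int:
    """Fall back to min_cores when max_cores is unset (None or 0)."""
    if not max_cores:
        # Suggested default: maxCores = minCores
        return min_cores
    if max_cores < min_cores:
        raise ValueError("MaxCores must not be smaller than MinCores")
    return max_cores


def require_positive_node_count(node_count: int) -> int:
    """Reject a node count of 0 before any node names are looked up."""
    if node_count < 1:
        raise ValueError(f"nodeCount must be at least 1, got {node_count}")
    return node_count
```

Either check alone would turn the unhandled exception into a clear error; the default additionally lets a request like the one above go through with MaxCores equal to MinCores.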

Finally, I don't really understand what MinCores and MaxCores actually mean in the request. They suggest that the job can get a variable number of cores between these limits, but why? Is that based on which resources are available on the cluster at the time of submission? This does not appear to be how HPC schedulers usually work, so I find it a bit confusing. There should also probably be a list of which arguments are required and which are optional; this isn't entirely clear to me at the moment. In particular, it is unclear to me how LogFile and ProgressFile differ from StandardOutputFile and why they are apparently mandatory, as they don't seem to be created when the job runs.

lupreCSC changed the title from "Internal server error when submitting a job with a task that doesn't set max_cores" to "Internal server error when submitting a job with a task that doesn't set MaxCores" on Jun 3, 2024

jkonvicka (Contributor) commented Sep 16, 2024

Hi @lupreCSC,
we are looking into it. It will be fixed in the next release.
