Release/1.3.3 (#90)
* Bugfix/squeue last job workaround (#74)

* When squeue -j is given only one job id, and that job id is invalid, the squeue exit status is 1. That causes a WorkflowMgr::SchedulerDown exception, preventing sacct from being run. Rocotorun will refuse to submit any more jobs at that point until the user manually intervenes.
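
The workaround's decision logic can be sketched as follows (a hypothetical illustration; `scheduler_down?`, its arguments, and the error-message pattern are assumptions, not the actual Rocoto code):

```ruby
# Hypothetical sketch of the workaround described above: squeue -j with a
# single, invalid job id exits with status 1 even though the scheduler is
# healthy, so that one case must not be treated as a scheduler outage.
def scheduler_down?(exit_status, stderr, num_jobids_queried)
  return false if exit_status == 0
  # Workaround: a single unknown job id is expected once old jobs age out
  # of squeue; fall through to sacct instead of raising SchedulerDown.
  return false if num_jobids_queried == 1 && stderr =~ /Invalid job id specified/
  true
end
```

With a check like this in place, rocotorun can continue on to sacct for the missing job instead of refusing to submit any more jobs.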

* Do not catch an exception that is never thrown. Remove some commented-out code.

Co-authored-by: Christopher Harrop <[email protected]>
Co-authored-by: samuel.trahan <[email protected]>

* Add support for the <exclusive> and <shared> tags to slurm. The --exclusive option is used for <exclusive>, unless <shared> is also set. (#75)

Co-authored-by: Sam Trahan <[email protected]>
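
The tag-to-option mapping described above can be sketched like this (illustrative only; the real implementation lives in Rocoto's Slurm batch system class):

```ruby
# Sketch of the <exclusive>/<shared> handling described above: the
# --exclusive option is emitted for <exclusive>, unless <shared> is
# also set, in which case no --exclusive option is added.
def slurm_node_sharing_options(exclusive, shared)
  opts = []
  opts << "--exclusive" if exclusive && !shared
  opts
end
```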

* Enable configurable timeouts for batch queue job retrieval commands. (#77)
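
A minimal sketch of the idea (the key name and default value are assumptions; the actual settings live in the user's rocotorc configuration file):

```ruby
# Sketch: look up a batch-system command timeout from the user's
# configuration, falling back to a default when the key is absent.
DEFAULT_JOB_QUEUE_TIMEOUT = 45  # seconds; illustrative default only

def job_queue_timeout(config)
  config.fetch("JobQueueTimeout", DEFAULT_JOB_QUEUE_TIMEOUT)
end
```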

* Feature/faster rocotorun (#78)

* Totally untested changes! Do not calculate list of cycles or tasks until they are required. Special case for checking workflow: the WorkflowSubsetOptions are queried for all_tasks and all_cycles before generating the list of tasks or cycles.

* WorkflowSubsetOptions: accessors need to be public

* Corrections to prior commit:

1. WFMStatOption make_selection needs to be selection() and public due to change to WorkflowSubsetOption
2. Missing a pair of @ in workflowengine.rb
3. Another missing @ in workflowsubsetoptions.rb

* Add the cron script that generates the sacct cache.

* Use cached output from sacct, if present, instead of running sacct. The file is $ROCOTO_SACCT_CACHE, or $HOME/sacct-cache/sacct.txt if that variable is unset.
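
The lookup rule described above can be sketched as follows (hypothetical helpers; Rocoto's actual code and sacct options differ):

```ruby
# Sketch: resolve the sacct cache file from $ROCOTO_SACCT_CACHE, falling
# back to $HOME/sacct-cache/sacct.txt when that variable is unset.
def sacct_cache_path(env)
  env["ROCOTO_SACCT_CACHE"] || File.join(env["HOME"], "sacct-cache", "sacct.txt")
end

# Sketch: prefer the cached output over running sacct when it exists.
def sacct_output(env)
  path = sacct_cache_path(env)
  return File.read(path) if File.exist?(path)
  `sacct -L -P`  # fall back to running sacct directly (options illustrative)
end
```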

* Give a sensible message instead of raising an exception when rocotocheck is run on a job that does not exist

* set.delete, not set.remove

* Fix bug triggered when no sacct cache exists.
Remove unused variable from lsfbatchsystem.

Co-authored-by: samuel.trahan <[email protected]>
Co-authored-by: Christopher Harrop <[email protected]>

* Fix typos in README.md (#79)

* Feature/2.7 compatibility (#81)

* Fix Ruby 2.7 compatibility issues.

* Remove old version of sqlite3-ruby tarfile.

* Replace queries to all Slurm clusters with queries to Slurm federation. (#82)

Avoids issues on Slurm installations that are not configured with multiple clusters.

Co-authored-by: Christopher Harrop <[email protected]>

* Fix bug that caused duplicate job retrievals from the database. (#83)

Co-authored-by: Christopher Harrop <[email protected]>

* Increase DRb timeouts to improve recovery during heavy loads. (#84)

May sometimes result in longer intervals between successful
completion of rocotorun commands, but improves overall resiliency
by allowing slow operations to complete instead of producing
persistent timeouts and failure.

Co-authored-by: Christopher Harrop <[email protected]>

* Remove unnecessary check for Slurm job status information. (#85)

Co-authored-by: Christopher Harrop <[email protected]>

* Add missing config argument during initialization of the moabtorque batch system. (#86)

Co-authored-by: Christopher Harrop <[email protected]>

* Write each workflow's Rocoto logs in its own directory instead of $HOME/.rocoto/log (#87)

Co-authored-by: Christopher Harrop <[email protected]>
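
The new layout (per the release notes, ~/.rocoto/$VERSION/$WORKFLOW_ID/log replaces the shared $HOME/.rocoto/log) can be sketched as follows (helper name is illustrative):

```ruby
# Sketch of the per-workflow log location described above, replacing the
# single shared $HOME/.rocoto/log directory.
def rocoto_log_dir(home, version, workflow_id)
  File.join(home, ".rocoto", version, workflow_id, "log")
end
```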

* Update RELEASE_NOTES.

* Update VERSION

* Bug fixes: `WorkflowSelection.select_tasks` was ignoring `@all_tasks` information, and two related warning messages were confusing. (#88)

Co-authored-by: Samuel Trahan <[email protected]>

* Update RELEASE_NOTES.

Co-authored-by: Samuel Trahan <[email protected]>
Co-authored-by: samuel.trahan <[email protected]>
Co-authored-by: Christopher Harrop <[email protected]>
Co-authored-by: Christopher Harrop <[email protected]>
Co-authored-by: Ben Johnson <[email protected]>
6 people authored Jan 7, 2021
1 parent 356d2e1 commit 8a91063
Showing 33 changed files with 320 additions and 1,015 deletions.
2 changes: 1 addition & 1 deletion INSTALL
@@ -98,7 +98,7 @@ LIBXML2_VERSION="2.9.4"
LIBXML_RUBY_VERSION="3.0.0"
SYSTEMTIMER_VERSION="1.2.3"
SQLITE3_VERSION="autoconf-3120200"
-SQLITE3_RUBY_VERSION="1.3.11"
+SQLITE3_RUBY_VERSION="1.4.2"
OPEN4_VERSION="1.3.4"
RUBYSL_DATE_VERSION="2.0.9"
RUBYSL_PARSEDATE_VERSION="1.0.1"
8 changes: 4 additions & 4 deletions README.md
@@ -1,10 +1,10 @@
# Rocoto Workflow Management System

## Introduction
-Workflow Management is a concept that originated in the 1970's to handle business process management. Workflow management systems were developed to manage complex collections of business processes that need to be carried out in a certain way with complex interdependencies and requirements. Scientific Workflow Management is much newer, and is very much like its business counterpart, except that it is usually data oriented instead of process oriented. That is, scientific workflows are driven by the scientific data that "flows" through them. Scientific workflow tasks are usually triggered by the availability of some kind of input data, and a task's result is usally some kind of data that is fed as input to another task in the workflow. The individual tasks themselves are scientific codes that perform some kind of computation or retreive or store some type of data for a computation. So, whereas a business workflow is comprised of a diverse set of processes that have to be completed in a certain way, sometimes carried out by a machine, sometimes carried out by a human being, a scientific workflow is usually comprised of a set of computations that are driven by the availabiilty of input data.
+Workflow Management is a concept that originated in the 1970's to handle business process management. Workflow management systems were developed to manage complex collections of business processes that need to be carried out in a certain way with complex interdependencies and requirements. Scientific Workflow Management is much newer, and is very much like its business counterpart, except that it is usually data oriented instead of process oriented. That is, scientific workflows are driven by the scientific data that "flows" through them. Scientific workflow tasks are usually triggered by the availability of some kind of input data, and a task's result is usually some kind of data that is fed as input to another task in the workflow. The individual tasks themselves are scientific codes that perform some kind of computation or retrieve or store some type of data for a computation. So, whereas a business workflow is comprised of a diverse set of processes that have to be completed in a certain way, sometimes carried out by a machine, sometimes carried out by a human being, a scientific workflow is usually comprised of a set of computations that are driven by the availability of input data.

## Why Workflow Management?
-The day when a scientist could conduct his or her numerical modeling and simulation research by writing, running, and monitoriing the progress of a modest Fortran code or two, is quickly becoming a distant memory. It is a fact that researchers now often have to make hundreds or thousands of runs of a numerical model to get a single result. In addition, each end-to-end "run" of the model often entails running many different codes for pre- and post-processing in addition to the model itself. And, in some cases, multiple models and their associated pre- and post-processing tasks are coupled together to build a larger, more complex model. The codes that comprise the end-to-end modeling systems often have complex interdependencies that dictate the order in which they can be run. And, in order to run the end-to-end system efficiently, concurrency must be used when dependencies allow it. The problem of scale and complexity is exacerbated by the fact that these codes are usually run on high performance machines that are notoriously difficult for scientists to use, and which tend to exhibit frequent failures. As machines get larger and larger, the failure rate of hardware and software components increases commensurately. Ad-hoc management of the execution of a complex modeling system is often difficult even for a single end-to-end run on a machine that never fails. Multiply that by the thousands of runs needed to perform a scientific experiment, in a hostile computing environment where hardware and facility outages are not uncommon, and you have a very challenging situation. For simulations that must run reliably in realtime, the situtation is almost hopeless. The traditional ad-hoc techniques for automating the execution of modeling systems (e.g. driver scripts, batch job chains or trees) do not provide sufficient fault tolerance for the scale and complexity of current and future workflows, nor are they reusable; each modeling system requires a custom automation system.
+The day when a scientist could conduct his or her numerical modeling and simulation research by writing, running, and monitoring the progress of a modest Fortran code or two, is quickly becoming a distant memory. It is a fact that researchers now often have to make hundreds or thousands of runs of a numerical model to get a single result. In addition, each end-to-end "run" of the model often entails running many different codes for pre- and post-processing in addition to the model itself. And, in some cases, multiple models and their associated pre- and post-processing tasks are coupled together to build a larger, more complex model. The codes that comprise the end-to-end modeling systems often have complex interdependencies that dictate the order in which they can be run. And, in order to run the end-to-end system efficiently, concurrency must be used when dependencies allow it. The problem of scale and complexity is exacerbated by the fact that these codes are usually run on high performance machines that are notoriously difficult for scientists to use, and which tend to exhibit frequent failures. As machines get larger and larger, the failure rate of hardware and software components increases commensurately. Ad-hoc management of the execution of a complex modeling system is often difficult even for a single end-to-end run on a machine that never fails. Multiply that by the thousands of runs needed to perform a scientific experiment, in a hostile computing environment where hardware and facility outages are not uncommon, and you have a very challenging situation. For simulations that must run reliably in realtime, the situation is almost hopeless. The traditional ad-hoc techniques for automating the execution of modeling systems (e.g. driver scripts, batch job chains or trees) do not provide sufficient fault tolerance for the scale and complexity of current and future workflows, nor are they reusable; each modeling system requires a custom automation system.

A Workflow Management System addresses the problems of complexity, scale, reliability, and reusability by providing two things:

@@ -14,7 +14,7 @@ A Workflow Management System addresses the problems of complexity, scale, reliab
## Prerequisites
Depending on how the components of a modeling system are designed and how existing software for running them is designed, some changes may be necessary to make use of a workflow management system. In order to take full advantage of the features offered by a workflow management system, the model system components must be well designed. In particular the following best practices should be followed:

-Each workflow task must correctly check for its successful completion, and must return a non-zero exit status upon failure. An exit status of 0 means success, regardless of what actually happened. No workflow task should contain automation features. Automation is the workflow managmenet system's responsibility. A workflow management system cannot manage tasks or jobs that it is not aware of. Enable reuse of workflow tasks by using principles of modular design to build autonomous model components with well-defined interfaces for input and output that can be run stand-alone. Prefer the construction of small model components that do only one thing. It is easy to combine several small, well-designed, components together to build a larger, more complex workflow task. It is generally much more difficult to divide large, complex, model components into smaller ones to form multiple workflow tasks. Avoid combining serial and parallel processing in the same workflow task unless the serial processing is very short in duration.
+Each workflow task must correctly check for its successful completion, and must return a non-zero exit status upon failure. An exit status of 0 means success, regardless of what actually happened. No workflow task should contain automation features. Automation is the workflow management system's responsibility. A workflow management system cannot manage tasks or jobs that it is not aware of. Enable reuse of workflow tasks by using principles of modular design to build autonomous model components with well-defined interfaces for input and output that can be run stand-alone. Prefer the construction of small model components that do only one thing. It is easy to combine several small, well-designed, components together to build a larger, more complex workflow task. It is generally much more difficult to divide large, complex, model components into smaller ones to form multiple workflow tasks. Avoid combining serial and parallel processing in the same workflow task unless the serial processing is very short in duration.

-## Documenation
+## Documentation
Detailed documentation is provided at http://christopherwharrop.github.io/rocoto/
15 changes: 15 additions & 0 deletions RELEASE_NOTES.md
@@ -1,5 +1,20 @@
# Release Notes

+## New for Version 1.3.3

+* Performance improvements
+* Store configuration in ~/.rocoto/$VERSION/rocotorc instead of ~/.rocoto/rocotorc to allow use of multiple versions of Rocoto.
+* Add capability to control batch system command timeouts from ~/.rocoto/$VERSION/rocotorc file.
+* Store logs in ~/.rocoto/$VERSION/$WORKFLOW_ID/log instead of ~/.rocoto/log to improve logging
+* Increase internal inter-server timeouts to increase resiliency on systems under heavy loads.
+* Fix bugs and deprecation warnings when using Ruby 2.7.x
+* Add support for the <exclusive> and <shared> tags to Slurm
+* Fix bug when using command line options to select all tasks

## New for Version 1.3.2

* Fix bug in Slurm batch system interface that caused UNAVAILABLE states to persist forever.

## New for Version 1.3.1

* Fix XML validation bug that caused sensitivity to ordering of tags
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.3.1
+1.3.3
25 changes: 25 additions & 0 deletions bin/cache-rocoto-sacct-command.sh
@@ -0,0 +1,25 @@
#! /bin/bash --login

if [[ -t 1 ]] ; then
set -x
fi

set -ue

six_days_ago=$( date +%m%d%y -d "6 days ago" )
tgtfile="$HOME/sacct-cache/sacct.txt"
workdir=$( dirname "$tgtfile" )
[[ -d "$workdir" ]] || mkdir "$workdir" || sleep 3
temp=$( mktemp --tmpdir="$workdir" )

set +ue

(
set -ue
sacct -S "$six_days_ago" -L -o "jobid,user%30,jobname%30,partition%20,priority,submit,start,end,ncpus,exitcode,state%12" -P > "$temp" ;
(( $( wc -l < "$temp" ) > 1 )) && /bin/mv -f "$temp" "$tgtfile"
)

if [[ -e "$temp" ]] ; then
rm -f "$temp"
fi
6 changes: 4 additions & 2 deletions lib/wfmstat/statusengine.rb
@@ -92,7 +92,7 @@ def wfmstat
@workflowIOServer=WorkflowMgr::WorkflowIOProxy.new(@dbServer,@config,@options)

# Open the workflow document
-@workflowdoc = WorkflowMgr::WorkflowXMLDoc.new(@options.workflowdoc,@workflowIOServer)
+@workflowdoc = WorkflowMgr::WorkflowXMLDoc.new(@options.workflowdoc,@workflowIOServer,@config)

@workflowdoc.features_supported?

@@ -206,7 +206,7 @@ def checkTasks
@workflowIOServer=WorkflowMgr::WorkflowIOProxy.new(@dbServer,@config,@options)

# Open the workflow document
-@workflowdoc = WorkflowMgr::WorkflowXMLDoc.new(@options.workflowdoc,@workflowIOServer)
+@workflowdoc = WorkflowMgr::WorkflowXMLDoc.new(@options.workflowdoc,@workflowIOServer,@config)

@subset=@options.selection.make_subset(tasks=@workflowdoc.tasks,cycledefs=@workflowdoc.cycledefs,dbServer=@dbServer)

@@ -447,6 +447,8 @@ def print_taskinfo(task)
##########################################
def print_cycleinfo(cycle,cycledefs,task)

+return if task.nil? or task.attributes.nil?

# Make sure the cycle is valid for this task
cycle_is_valid=true
unless task.attributes[:cycledefs].nil?
13 changes: 11 additions & 2 deletions lib/wfmstat/wfmstatoption.rb
@@ -60,8 +60,17 @@ def add_opts(opts)

end # add_opts

-def make_selection()
-@selection=WorkflowMgr::WorkflowDBSelection.new(@all_tasks,@task_options,@cycles,@default_all)
+public

+##########################################
+#
+# selection
+#
+##########################################
+def selection
+if @selection.nil?
+@selection=WorkflowMgr::WorkflowDBSelection.new(@all_tasks,@task_options,@cycles,@default_all)
+end
+return @selection
end

1 change: 0 additions & 1 deletion lib/workflowmgr/bqs.rb
@@ -14,7 +14,6 @@ class BQS

require 'thread/pool'
require 'workflowmgr/workflowdb'
-require 'workflowmgr/sgebatchsystem'
require 'workflowmgr/moabbatchsystem'
require 'workflowmgr/moabtorquebatchsystem'
require 'workflowmgr/torquebatchsystem'
2 changes: 1 addition & 1 deletion lib/workflowmgr/bqsproxy.rb
@@ -58,7 +58,7 @@ def initialize(batchSystem,config,options)
define_method m do |*args|
retries=0
begin
-WorkflowMgr.timeout(90) do
+WorkflowMgr.timeout(150) do
@bqServer.send(m,*args)
end
rescue DRb::DRbConnError
14 changes: 9 additions & 5 deletions lib/workflowmgr/cobaltbatchsystem.rb
@@ -6,6 +6,7 @@
module WorkflowMgr

require 'workflowmgr/batchsystem'
+require 'workflowmgr/utilities'

##########################################
#
@@ -25,7 +26,10 @@ class COBALTBatchSystem < BatchSystem
# initialize
#
#####################################################
-def initialize(cobalt_root=nil)
+def initialize(cobalt_root=nil,config)

+# Get timeouts from the configuration
+@qstat_timeout=config.JobQueueTimeout

# Initialize an empty hash for job queue records
@jobqueue={}
@@ -110,7 +114,7 @@ def status(jobid)
#####################################################
def submit(task)
# Initialize the submit command
-cmd="qsub --debuglog #{ENV['HOME']}/.rocoto/tmp/\\$jobid.log"
+cmd="qsub --debuglog #{ENV['HOME']}/.rocoto/#{WorkflowMgr.version}/tmp/\\$jobid.log"
input="#!/bin/sh\n"

# Add Cobalt batch system options translated from the generic options specification
@@ -175,7 +179,7 @@ def submit(task)

# Get a temporary file name to use as a wrapper and write job spec into it
tfname=Tempfile.new('qsub.in').path.split("/").last
-tf=File.new("#{ENV['HOME']}/.rocoto/tmp/#{tfname}","w")
+tf=File.new("#{ENV['HOME']}/.rocoto/#{WorkflowMgr.version}/tmp/#{tfname}","w")
tf.write(input)
tf.flush()
tf.chmod(0700)
@@ -226,7 +230,7 @@ def refresh_jobqueue
queued_jobs=""
errors=""
exit_status=0
-queued_jobs,errors,exit_status=WorkflowMgr.run4("qstat -l -f -u #{username} ",30)
+queued_jobs,errors,exit_status=WorkflowMgr.run4("qstat -l -f -u #{username} ",@qstat_timeout)

# Raise SchedulerDown if the showq failed
raise WorkflowMgr::SchedulerDown,errors unless exit_status==0
@@ -305,7 +309,7 @@ def refresh_jobacct(jobid)

begin

-joblogfile = "#{ENV['HOME']}/.rocoto/tmp/#{jobid}.log"
+joblogfile = "#{ENV['HOME']}/.rocoto/#{WorkflowMgr.version}/tmp/#{jobid}.log"
return unless File.exists?(joblogfile)
joblog = IO.readlines(joblogfile,nil)[0]

2 changes: 1 addition & 1 deletion lib/workflowmgr/dbproxy.rb
@@ -56,7 +56,7 @@ def initialize(config,options)
retries=0
busy_retries=0.0
begin
-WorkflowMgr.timeout(45) do
+WorkflowMgr.timeout(150) do
@dbServer.send(m,*args)
end
rescue DRb::DRbConnError
9 changes: 6 additions & 3 deletions lib/workflowmgr/launchserver.rb
@@ -51,14 +51,16 @@ def WorkflowMgr.launchServer(server)
Process.waitpid(child_pid)

# Initialize the URI and pid of the server process
-uri=""
+uri_str=""
+encoded_uri=""
server_pid=0

begin

# Read the URI and pid of the server from the read end of the pipe we just created
WorkflowMgr.timeout(10) do
-uri=rd.gets
+uri_str=rd.gets
+uri_str.chomp! unless uri_str.nil?
server_pid=rd.gets
server_pid.chomp! unless server_pid.nil?
rd.close
Expand All @@ -76,7 +78,8 @@ def WorkflowMgr.launchServer(server)
end

# Connect to the server process and return the object being served
-return [ DRbObject.new(nil,uri), Socket::getaddrinfo(Socket.gethostname, nil, nil, Socket::SOCK_STREAM)[0][3], server_pid ]
+encoded_uri=uri_str.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: '')
+return [ DRbObject.new(nil,encoded_uri), Socket::getaddrinfo(Socket.gethostname, nil, nil, Socket::SOCK_STREAM)[0][3], server_pid ]

end
