GitHub - cme/forkargs: Parallelising job management in the style of xargs.

cme / forkargs Public

Notifications You must be signed in to change notification settings
Fork 0
Star 2

Parallelising job management in the style of xargs.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
Jamfile		Jamfile
Makefile		Makefile
README		README
forkargs.c		forkargs.c

Repository files navigation

forkargs
--------

Forkargs behaves a little like xargs, but providing the capability of
executing jobs in parallel, to exploit the available resources of
modern multi-core or multiprocessor machines in a handy, easy-to-use
format.

Important differences from xargs are:

  * Jobs are executed in parallel to exploit multi-core machines

  * Arguments are not batched together into single commands

  * Arguments from input are treated as a single argument per
    line. This makes treatment of filenames with spaces much, much
    easier.

Quick examples:

        # Compress all text files with bzip, executing 4 jobs at once.
        find . -name '*.txt' | forkargs -j4 bzip2 -9

This solves the problem of balancing prallelism with resource
requirements for common vastly-parallel jobs.

For example, it would be possible to perform the above with:

    for x in `find . -name '*.txt'`; do bzip2 -9 "$x" & done

which would execute all the bzip2 commands in parallel; however, this
could lead to an arbitrary number of concurrent processes clamouring
for OS resources (CPU time, memory), reducing system responsiveness
until completion of sufficient tasks.


Options
-------

    -j <slots>

        Specify parallel jobs (cc. 'make -j'). The 'slots' argument
        may be an integer specifying the number of jobs to execute
        locally in parallel, or a slot definition list (see below).
        By default, forkargs uses as many slots as there are available
        processors on the system.

    -k
        Continue on errors.
    -v
        Verbose output: print each command before executing.
    -n
        Do not test remote machines for accessibility before issuing
        commands to them.
    -f <file>
        Read input arguments from a named file rather than from stdin

Environment
-----------

The environment variable FORKARGS_J, if it exists, will provide a
default 'slots' value if no '-j' option is passed.

Remote Execution and Slots
--------------------------

forkargs also supports remote execution of commands using ssh. This
allows forkargs to be used as a job distribution tool for small
clusters. This is most useful in clusters with shared filesystems, but
depending on the nature of the jobs being dispatched, rsync might be
useful for mirroring filesystem state and gathering results.

Each remote job is started as a separate ssh session. This adds
significant latency at the start of the job, and makes it almost
essential to set up ssh keys appropriately to avoid having to enter
passwords. 

(TODO: use a single connection.)

If any remote hosts are to be used, forkargs will initially attempt to
ssh to the host in order to check that it is accessible. If this test
ssh command fails, slots on that remote machine will be flagged as
faulted and no commands will be issued to them.

Remote hosts are specified using a generalisation of the '-j'
option. The argument to the '-j' option defines a list of execution
"slots". The format is a comma-separated list of entries defining some
number of slots, each of which may execute a job in parallel. Each
item may take the forms:

    n
        Defines 'n' (an integer) slots on the local machine. Directly
        comparable to make's '-j'.
    hostname
        Defines a slot on remote host 'hostname' to be spawned by ssh. 
        The hostname can incorporate a username, to be passed directly
        to ssh.
    n * hostname
        Defines 'n' (an integer) identical slots on a remote machine 'hostname'

'hostname' entries can optionally specify a username (to log in as
with ssh) and a directory to change to on the remote machine, in the
common 'user@hostname:directory' format used by rsync, etc.

The following are all equivalent to '-j3':
    -j 1,1,1
    -j 1,2
    -j '3*localhost'

The order of the slots is significant. Available jobs are always
started preferentially on the first slots. This is helpful if the
latency on establishing an ssh connection becomes a constraining
factor, or if machines have different performance
characteristics. This way the faster machines can be used
preferentially to reduce the overall execution time if individual
jobs are long-running.

For example:

    ls '*.wav' | forkargs -j '2,2*colin@willow' \
        sh -c 'lame $1 `basename $1.wav`'

Complex command lines
---------------------

The commands are all passed a single line of the input, as a single
argument. That's handy for a huge number of commands, such as those
shown above. However, sometimes different, or more complicated
processing of the arguments is required. Such cases can be constructed
by using /bin/sh as a wrapper to process arguments, for example:

    find . -name '*.png' \
        | forkargs -j4 sh -c 'convert -scale 64x64 "$1" "$1".thumb.jpg'


TO DO
-----

Use a single SSH connection per remote slot.

Better testing for accessibility of remote machines. Rather than
waiting until all remote machines have been tested before issuing any
command, we could:
  * allow the local slots to start processing immediately
  * before issuing the first command to each remote slot, spawn a test
    process. Once it has been successfully tested, commands may be 
    issued to it.
  * If the local machine can process all the commands before the test
    SSH returns (successfully or unsuccessfully) then forkargs should
    just complete.

When testing remote machine slots, test the working directory too.

Provide local/remote pre- and post-commands.
  * per-host setup and teardown command options eg. to distribute
    inputs, and recover outputs, via eg. rsync.
  * per-slot pre- and post-commands to execute before/after each job.

Combining the above, add an option to sync working directory to remote
working directories, and sync back (assuming results are additive) at
the end. This happens to be precisely my most frequent use-case.