-
Notifications
You must be signed in to change notification settings - Fork 0
cme/forkargs
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
forkargs -------- Forkargs behaves a little like xargs, but providing the capability of executing jobs in parallel, to exploit the available resources of modern multi-core or multiprocessor machines in a handy, easy-to-use format. Important differences from xargs are: * Jobs are executed in parallel to exploit multi-core machines * Arguments are not batched together into single commands * Arguments from input are treated as a single argument per line. This makes treatment of filenames with spaces much, much easier. Quick examples: # Compress all text files with bzip, executing 4 jobs at once. find . -name '*.txt' | forkargs -j4 bzip2 -9 This solves the problem of balancing prallelism with resource requirements for common vastly-parallel jobs. For example, it would be possible to perform the above with: for x in `find . -name '*.txt'`; do bzip2 -9 "$x" & done which would execute all the bzip2 commands in parallel; however, this could lead to an arbitrary number of concurrent processes clamouring for OS resources (CPU time, memory), reducing system responsiveness until completion of sufficient tasks. Options ------- -j <slots> Specify parallel jobs (cc. 'make -j'). The 'slots' argument may be an integer specifying the number of jobs to execute locally in parallel, or a slot definition list (see below). By default, forkargs uses as many slots as there are available processors on the system. -k Continue on errors. -v Verbose output: print each command before executing. -n Do not test remote machines for accessibility before issuing commands to them. -f <file> Read input arguments from a named file rather than from stdin Environment ----------- The environment variable FORKARGS_J, if it exists, will provide a default 'slots' value if no '-j' option is passed. Remote Execution and Slots -------------------------- forkargs also supports remote execution of commands using ssh. This allows forkargs to be used as a job distribution tool for small clusters. This is most useful in clusters with shared filesystems, but depending on the nature of the jobs being dispatched, rsync might be useful for mirroring filesystem state and gathering results. Each remote job is started as a separate ssh session. This adds significant latency at the start of the job, and makes it almost essential to set up ssh keys appropriately to avoid having to enter passwords. (TODO: use a single connection.) If any remote hosts are to be used, forkargs will initially attempt to ssh to the host in order to check that it is accessible. If this test ssh command fails, slots on that remote machine will be flagged as faulted and no commands will be issued to them. Remote hosts are specified using a generalisation of the '-j' option. The argument to the '-j' option defines a list of execution "slots". The format is a comma-separated list of entries defining some number of slots, each of which may execute a job in parallel. Each item may take the forms: n Defines 'n' (an integer) slots on the local machine. Directly comparable to make's '-j'. hostname Defines a slot on remote host 'hostname' to be spawned by ssh. The hostname can incorporate a username, to be passed directly to ssh. n * hostname Defines 'n' (an integer) identical slots on a remote machine 'hostname' 'hostname' entries can optionally specify a username (to log in as with ssh) and a directory to change to on the remote machine, in the common 'user@hostname:directory' format used by rsync, etc. The following are all equivalent to '-j3': -j 1,1,1 -j 1,2 -j '3*localhost' The order of the slots is significant. Available jobs are always started preferentially on the first slots. This is helpful if the latency on establishing an ssh connection becomes a constraining factor, or if machines have different performance characteristics. This way the faster machines can be used preferentially to reduce the overall execution time if individual jobs are long-running. For example: ls '*.wav' | forkargs -j '2,2*colin@willow' \ sh -c 'lame $1 `basename $1.wav`' Complex command lines --------------------- The commands are all passed a single line of the input, as a single argument. That's handy for a huge number of commands, such as those shown above. However, sometimes different, or more complicated processing of the arguments is required. Such cases can be constructed by using /bin/sh as a wrapper to process arguments, for example: find . -name '*.png' \ | forkargs -j4 sh -c 'convert -scale 64x64 "$1" "$1".thumb.jpg' TO DO ----- Use a single SSH connection per remote slot. Better testing for accessibility of remote machines. Rather than waiting until all remote machines have been tested before issuing any command, we could: * allow the local slots to start processing immediately * before issuing the first command to each remote slot, spawn a test process. Once it has been successfully tested, commands may be issued to it. * If the local machine can process all the commands before the test SSH returns (successfully or unsuccessfully) then forkargs should just complete. When testing remote machine slots, test the working directory too. Provide local/remote pre- and post-commands. * per-host setup and teardown command options eg. to distribute inputs, and recover outputs, via eg. rsync. * per-slot pre- and post-commands to execute before/after each job. Combining the above, add an option to sync working directory to remote working directories, and sync back (assuming results are additive) at the end. This happens to be precisely my most frequent use-case.
About
Parallelising job management in the style of xargs.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published