APPLICATIONS:
=============
------------------------
swps3 (release 20080605)
------------------------
swps3 is an optimised version of the striped Smith-Waterman local
alignment algorithm for both the Cell/BE and x86/x86_64 CPUs
with SSE2 instructions. swps3 has been developed at the Computer Science
Department of ETH Zürich.
The project website is: http://www.inf.ethz.ch/personal/sadam/swps3/
We modified the original code in order to use the FastFlow framework.
The new version (swps3_ff) can be compiled this way:
#cd swps3
#make -f Makefile.sse2_ff
Command line parameters are:
swps3_ff [-h] [-s] [-j num] [-i num] [-e num] [-t num] matrix query db
matrix: scoring matrix file
query : query amino acid sequence in FASTA format
db : amino acid sequence data base in FASTA format
-h : print help
-s : run the scalar version (without vectorised SSE instructions)
-j num: start num worker threads
-i num: gap insertion score (default: 12)
-e num: gap extension score (default: 2)
-t num: score limit (default: DBL_MAX)
You can find the Swiss-Prot protein DB in FASTA format here:
http://www.uniprot.org/downloads
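For example, a run with 8 worker threads might look like the following
(the scoring matrix and FASTA file names below are only placeholders):
#./swps3_ff -j 8 blosum62.mat query.fasta uniprot_sprot.fasta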
------------------------
simple_mandelbrot
------------------------
Simple Mandelbrot computes and displays the Mandelbrot set. The application
is intended as a naive example of an embarrassingly parallel application
on a stream. The Mandelbrot set is represented as a matrix of pixels.
The coordinates of the lines of the complex plane are dispatched by a
scheduler thread, which creates a stream and sends its items (lines) to workers
in a round-robin fashion. Workers allocate a row of pixels, compute the
Mandelbrot set for the line, and then send the row of pixels to a collector,
which gathers the lines from all workers and displays them on the screen.
The application comes in different versions: sequential (mandel_seq),
pthread-based (mandel_pt), FastFlow-based (mandel_ff), and FastFlow-based
with FastFlow's memory allocator (mandel_ff_mem_all).
All versions use the same sequential code, which has been manually
copied and pasted.
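The farm structure described above can be sketched with the classic FastFlow
farm API roughly as follows. This is an illustrative sketch, not the mandel_ff
source: the Mandelbrot computation and the display code are omitted.

  #include <cstdlib>
  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  // Emitter: creates the stream, one task (line index) per svc() call.
  struct Emitter: ff_node {
      Emitter(int nlines): next(0), nlines(nlines) {}
      void *svc(void *) {
          if (next >= nlines) return NULL;   // NULL from the emitter closes the stream
          return new int(next++);            // one task per line of the complex plane
      }
      int next, nlines;
  };

  // Worker: computes the row of pixels for the received line (omitted here).
  struct Worker: ff_node {
      void *svc(void *t) {
          // ... compute the Mandelbrot row for line *(int*)t ...
          return t;                          // forward the result to the collector
      }
  };

  // Collector: gathers the rows and would display them on the screen.
  struct Collector: ff_node {
      void *svc(void *t) {
          // ... display or store the received row ...
          delete (int *)t;
          return GO_ON;                      // no output, keep receiving
      }
  };

  int main() {
      const int nworkers = 4, nlines = 800;
      ff_farm<> farm;
      farm.add_emitter(new Emitter(nlines));
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new Worker());
      farm.add_workers(w);
      farm.add_collector(new Collector());
      return (farm.run_and_wait_end() < 0) ? -1 : 0;
  }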
Performance is highly dependent on the graphics subsystem, so if you want to
test performance, please compile all versions with the MANDEL_NO_DISPLAY
environment variable set (see the Makefile). In our tests, the lower the number
of iterations, the higher the FastFlow performance with respect to the
mandel_pt version.
Command line parameters are:
mandel_<ver> size niterations retries nworkers [template-type]
size : size of the pixel matrix
niterations : max number of iterations to compute each point of the set
retries : number of times the experiment is run
nworkers : number of workers used in the computation (threads = nworkers+1)
template-type: 0 forces the farm not to use the Collector, 1 forces it to use it
Default values are: 800 1024 1 2 1
Note that with fewer than 4 cores, the template-type parameter is set to 0
unless otherwise specified on the command line.
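For example, the following command computes an 800x800 set with at most 1024
iterations per point, one run, 4 workers and the Collector enabled:
#./mandel_ff 800 1024 1 4 1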
------------------------
qt_mandelbrot
------------------------
The "QT Mandelbrot" is an interactive application that computes the
Mandelbrot set. It is part of the Trolltech QT examples and it consists
of two classes: RenderThread.cpp, i.e. a QThread subclass that renders the
Mandelbrot set, and MandelbrotWidget.cpp class, i.e a QWidget
subclass that shows the Mandelbrot set on screen and lets the user
zoom and scroll. During the time where the worker thread is
recomputing the fractal to reflect the new zoom factor position, the
main thread simply scales the previously rendered pixmap to provide
immediate feedback. The result does not look as good as what the worker
thread eventually ends up providing, but at least it makes the
application more responsive.
These two instances of the classes are run as QT threads; it shows how
to use a worker thread to perform heavy computations without blocking
the main thread's event loop. The application is multithreaded
(exploiting QT threads) but threads are not used to boost heavy
computations since the whole computation is done within a single
thread, but to decouple two or more activities and to enhance
responsiveness.
Although it is a simple application, a full-fledged parallelization of
the whole application is not trivial. The two threads synchronise with
each other via QT events; to guarantee responsiveness the
MandelbrotWidget thread may start, restart, and abort the
activity of RenderThread. This kind of behaviour, as well as the
integration with other threading libraries, makes the porting to
frameworks such as TBB and OpenMP non-trivial.
The FastFlow-accelerated version simply parallelises the
RenderThread by using a farm accelerator on the outer
loop that traverses the pixmap of the Mandelbrot set. The farm
accelerator is created once, then run and frozen each time an interrupt
signal is raised by MandelbrotWidget.
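The accelerator usage pattern can be sketched roughly as follows, assuming the
classic FastFlow accelerator interface (a farm built in accelerator mode plus
offload/run_then_freeze/wait_freezing). This is an illustration, not the actual
qt_mandelbrot code.

  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  // Worker: would render the scanline identified by the received task.
  struct Worker: ff_node {
      void *svc(void *t) {
          // ... render the scanline *(int*)t into the pixmap ...
          delete (int *)t;
          return GO_ON;                     // nothing to forward, keep receiving
      }
  };

  int main() {
      const int nworkers = 4, nlines = 600;
      ff_farm<> farm(true);                 // true = accelerator mode (input channel)
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new Worker());
      farm.add_workers(w);
      farm.run_then_freeze();               // start the accelerator
      for (int y = 0; y < nlines; ++y)
          farm.offload(new int(y));         // one task per line of the pixmap
      farm.offload((void *)FF_EOS);         // end-of-stream
      farm.wait_freezing();                 // wait for the workers, keep them frozen
      // ... the farm could then be thawed and reused for the next zoom level ...
      return 0;
  }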
------------------------
nqueens
------------------------
The N-queens problem consists in placing N queens on an NxN
chessboard such that no queen can attack another.
The proposed solution is based on a heavily optimised sequential C program
written by Jeff Somers (the original version is implemented in the file nq.c)
and uses the FastFlow farm skeleton without the Collector module.
The FastFlow version produces a stream of independent tasks corresponding
to the initial placement of a number of queens on the board (that number
corresponds to the command line argument 'depth'). The placement of the
remaining queens, starting from the initial placement received as a task
from the input stream, is handled by one of the worker threads.
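The decomposition described above can be sketched as follows. This is a minimal
illustration of the task-generation idea (Task, safe and gen are hypothetical
names), not the actual nq_ff_acc source, and it omits the FastFlow farm
plumbing.

  #include <cstdio>
  #include <vector>

  // One task of the stream: a valid placement of the first 'depth' queens.
  struct Task { std::vector<int> cols; };   // cols[r] = column of the queen in row r

  // True if a queen can be placed in the next row at column 'col'.
  static bool safe(const std::vector<int>& cols, int col) {
      int row = cols.size();
      for (int r = 0; r < row; ++r)
          if (cols[r] == col || r - cols[r] == row - col || r + cols[r] == row + col)
              return false;
      return true;
  }

  // Enumerate every valid placement of the first 'depth' queens on an n x n board.
  static void gen(std::vector<int>& cols, int depth, int n, std::vector<Task>& out) {
      if ((int)cols.size() == depth) { out.push_back(Task{cols}); return; }
      for (int c = 0; c < n; ++c)
          if (safe(cols, c)) { cols.push_back(c); gen(cols, depth, n, out); cols.pop_back(); }
  }

  int main() {
      const int n = 8, depth = 2;
      std::vector<Task> tasks;
      std::vector<int> cols;
      gen(cols, depth, n, tasks);
      // Each task would be sent to a farm worker, which completes the remaining
      // n-depth rows with the optimised sequential solver.
      std::printf("%zu initial placements for n=%d, depth=%d\n", tasks.size(), n, depth);
      return 0;
  }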
Performance:
Machine: Dual quad-core Xeon E5420 @2.5GHz (8-cores)
^^^^^^^
Command: nq_ff_acc <board-size> 8 4
board-size        # solutions    seq. time   FastFlow Acc.   Speedup
                                   (nq)       (nq_ff_acc)
18x18             666,090,624       6:52         1:06         6.24
19x19           4,968,057,848      53:28         8:26         6.34
20x20          39,029,188,884    7:16:27      1:08:56         6.52
21x21         314,666,222,712   ~2.7days      9:48:28         6.69
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
Command: nq_ff_acc <board-size> 16 4
18x18             666,090,624       5:53           34         10.4
19x19           4,968,057,848      44:56         4:23         10.2
20x20          39,029,188,884    6:07:21        35:41         10.3
21x21         314,666,222,712   ~2.2days      5:07:19         10.3
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Command: nq_ff_acc <board-size> 24 4
18x18             666,090,624       5:32           18
19x19           4,968,057,848      43:06         1:13
20x20          39,029,188,884    5:50:32        17:46
21x21         314,666,222,712        ---      2:31:58          --
Command line parameters of the FastFlow version are:
nq_ff_acc <width of board> [<num-workers> <depth>]
width of board: problem size
num-workers : number of worker threads used in the computation (default: 2)
depth : initial number of queens to place (default: 1)
------------------------
cholesky
------------------------
The cholesky application performs the Cholesky decomposition on a stream
of complex Hermitian matrices.
Cholesky factorization, or Cholesky decomposition, is a way to decompose a
Hermitian, positive-definite matrix into the product of a lower
triangular matrix and its conjugate transpose.
If a complex matrix A is Hermitian and positive definite, then A can be
decomposed as A = LL*, where L is a lower triangular matrix with strictly
positive diagonal entries, and L* denotes the conjugate transpose of L.
The Cholesky decomposition is mainly used for the numerical solution of
linear equations Ax = b. We can solve Ax = b by first computing the Cholesky
decomposition A = LL*, then solving Ly = b for y, and finally solving L*x = y
for x.
There are various methods for calculating the Cholesky decomposition.
We have implemented two of them. The first one is the standard version,
explained in depth at http://en.wikipedia.org/wiki/Cholesky_decomposition.
The second one is the block version, which considers the matrix as divided
into sub-matrices: the algorithm proceeds by working on these sub-matrices
instead of single elements. A detailed explanation of the block variant of the
Cholesky decomposition can be found in "Programming Matrix Algorithms-by-Blocks
for Thread-Level Parallelism" by Gregorio Quintana-Ortí et al.
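As a reference, here is a minimal sketch of the standard (element-wise)
algorithm, shown for a real symmetric positive-definite matrix for brevity;
the application itself works on complex Hermitian matrices, where the
transpose becomes the conjugate transpose. This is an illustration, not the
application's source code.

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Standard Cholesky factorization A = L*L^T for a real SPD matrix stored
  // row-major in a flat array; L is lower triangular.
  static void cholesky(const std::vector<double>& A, std::vector<double>& L, int n) {
      for (int j = 0; j < n; ++j) {
          double d = A[j*n + j];
          for (int k = 0; k < j; ++k) d -= L[j*n + k] * L[j*n + k];
          L[j*n + j] = std::sqrt(d);                       // diagonal entry
          for (int i = j + 1; i < n; ++i) {
              double s = A[i*n + j];
              for (int k = 0; k < j; ++k) s -= L[i*n + k] * L[j*n + k];
              L[i*n + j] = s / L[j*n + j];                 // below-diagonal entry
          }
      }
  }

  int main() {
      const int n = 3;
      std::vector<double> A = {  4, 12, -16,
                                12, 37, -43,
                               -16, -43, 98 };             // classic SPD test matrix
      std::vector<double> L(n*n, 0.0);
      cholesky(A, L, n);                                   // expected rows of L: 2 0 0 / 6 1 0 / -8 5 3
      for (int i = 0; i < n; ++i) {
          for (int j = 0; j < n; ++j) std::printf("%8.3f ", L[i*n + j]);
          std::printf("\n");
      }
      return 0;
  }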
We implemented a farm parallelization of both versions. The emitter
distributes the tasks to the workers. Each task is represented by a pointer
to an input matrix A and a pointer to a result matrix L. There is no collector
filter: each worker reads its input and writes its result directly to main
memory.
.......to be completed.....
------------------------
matmul
------------------------
Simple NxN integer matrix multiplication (A=BxC).
The proposed FastFlow version splits the computation into NxN tasks, one for
each element of the result matrix. Better performance could be obtained by
splitting the computation into N tasks, allowing each worker thread to compute
an entire row of the result matrix.
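The row-based decomposition suggested above can be sketched as follows. This is
plain sequential code with a hypothetical compute_row helper (matrix names here
follow the usual C = A x B convention); in the farm version each row index
would be a task sent to a worker.

  #include <vector>

  // One task = one row index: a worker fills one entire row of the result.
  static void compute_row(int i, int n,
                          const std::vector<long>& A,
                          const std::vector<long>& B,
                          std::vector<long>& C) {
      for (int j = 0; j < n; ++j) {
          long sum = 0;
          for (int k = 0; k < n; ++k)
              sum += A[i*n + k] * B[k*n + j];
          C[i*n + j] = sum;
      }
  }

  int main() {
      const int n = 4;
      std::vector<long> A(n*n, 1), B(n*n, 2), C(n*n, 0);
      for (int i = 0; i < n; ++i)     // done sequentially here; in the farm version
          compute_row(i, n, A, B, C); // each row index would be dispatched to a worker
      return (C[0] == 2L*n) ? 0 : 1;  // every entry of C should equal 2*n
  }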
Performance for N=1024.
Command: ff_matmul <nworkers>
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
Sequential time 10780.9 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.4.3)
4 2982.24 2767.48
8 1495.87 1455.73
12 1496.01 1492.87
16 1495.74 1433.31
Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^
Sequential time 22324.1 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.3.2)
4 12173.5 15286.4
8 11116.2 13725.2
12 9844.62 8607.82
16 5890.27 7621.7
20 5518.41 7490.09
24 5516.41 6420.61
28 5194.5 6327.92
32 5161.4 6043.53
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Sequential time 21967.9 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.3.2)
4 10448.3 8892.13
8 6111.95 5138.76
12 4530.39 4336.8
16 4771.28 4145.57
20 4621.81 4284.28
24 5318.03 4346.22
------------------------
pbzip2
------------------------
PBZIP2 is a parallel implementation, by Jeff Gilchrist, of the bzip2
block-sorting file compressor; it uses pthreads and achieves near-linear
speedup on SMP machines. You can find PBZIP2 at http://compression.ca/pbzip2/.
We ported the PBZIP2 code to the FastFlow framework; the source code
of the new version can be found in pbzip2_ff.cpp.
Our main objective in rewriting the application was to show how to rewrite
a pthread-based task-farm application using FastFlow's task-farm schema.
A detailed description of the PBZIP2 algorithm can be found in the paper:
"Parallel Data Compression with BZIP2" by Jeff Gilchrist available at
http://gilchrist.ca/jeff/papers/Parallel_BZIP2.pdf
Note that the main focus of the FastFlow porting is not improving the
performance of Gilchrist's PBZIP2, which is already very efficient and
exhibits nearly optimal speedup. The FastFlow porting rather aims to
demonstrate that:
- FastFlow makes it possible to parallelise the problem by
exploiting high-level parallel patterns instead of a hand-tuned
design, with almost no performance penalty and with a gain in productivity.
- FastFlow's non-blocking synchronisations exhibit reasonable
performance compared with traditional blocking synchronisations even in
worst-case scenarios, such as coarse-grained computations.
Regarding this latter point, imagine a farm schema where you need the collector
filter in order to perform some post-processing (e.g. writing data to files,
buffering tasks to maintain data ordering, etc.). In this case, the
non-blocking collector thread might have nothing to compute for long
periods of time because the worker threads are slow in producing output tasks.
It would then be better to stop the collector thread, letting it wait to be
awoken by one of the workers as soon as there are tasks to post-process.
In these situations, non-blocking behaviour might waste CPU cycles,
preventing other threads from doing something useful (this happens
mainly when there are more threads than cores).
We showed that by carefully using FastFlow mechanisms you can obtain almost
the same linear speedup as the hand-tuned mutex-based version of the same
code, even in the non-optimal case of a farm schema with a collector filter
and coarse-grained computation.
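As an illustration of this kind of post-processing, the following minimal
sketch (not the pbzip2_ff source; Block and OrderingCollector are hypothetical
names) shows a collector that buffers out-of-order blocks and releases them
strictly in sequence order.

  #include <cstdio>
  #include <map>

  // A compressed block tagged with its position in the original file.
  struct Block { long seq; /* compressed data would go here */ };

  // Collector filter: blocks may arrive from the workers in any order; they
  // are buffered and emitted strictly by sequence number.
  struct OrderingCollector {
      long next = 0;                        // next sequence number to emit
      std::map<long, Block> pending;        // out-of-order blocks, keyed by seq

      void receive(const Block& b) {
          pending[b.seq] = b;
          std::map<long, Block>::iterator it;
          while ((it = pending.find(next)) != pending.end()) {  // flush in-order blocks
              std::printf("writing block %ld\n", it->second.seq);
              pending.erase(it);
              ++next;
          }
      }
  };

  int main() {
      OrderingCollector c;
      long arrival[] = {1, 0, 3, 2};        // blocks arriving out of order
      for (long s : arrival) c.receive(Block{s});
      return 0;
  }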
Performance (*)
Commands used:
pbzip2_ff -k -v -f linux-2.6.32.8.tar (Compression)
pbzip2_ff -k -v -f -d linux-2.6.32.8.tar.bz2 (Decompression)
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
bzip2(v.1.0.3) pbzip2 pbzip2_ff
Compression 64.25 7.083 7.064
Decompression 14.67 1.8060 1.8728
Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^^
bzip2(v.1.0.5) pbzip2 pbzip2_ff
Compression 61.13 3.446733 3.432033
Decompression 13.02 1.311154 1.8728
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
bzip2(v.1.0.5) pbzip2(v.1.1.1) pbzip2_ff
Compression 69.15 4.334679 4.354582
Decompression 16.92 1.453813 1.931218
(*) We executed 10 runs, removed the minimum and maximum values obtained, and
then computed the mean.