APPLICATIONS:
=============
------------------------
swps3 (release 20080605)
------------------------
swps3 is an optimised version of the striped Smith-Waterman local
alignment algorithm for both the Cell/BE and x86/x86_64 CPUs
with SSE2 instructions. swps3 has been developed at the Computer Science
Department of ETH Zürich.
The project website is: http://www.inf.ethz.ch/personal/sadam/swps3/
We modified the original code in order to use the FastFlow framework.
The new version (swps3_ff) can be compiled this way:
#cd swps3
#make -f Makefile.sse2_ff
Command line parameters are:
swps3_ff [-h] [-s] [-j num] [-i num] [-e num] [-t num] matrix query db
matrix: scoring matrix file
query : query amino acid sequence in FASTA format
db : amino acid sequence data base in FASTA format
-h : print help
-s : run the scalar version (without vectorised SSE instructions)
-j num: start num worker threads
-i num: gap insertion score (default: 12)
-e num: gap extension score (default: 2)
-t num: score limit (default: DBL_MAX)
You can find the Swiss-Prot protein DB in FASTA format here:
http://www.uniprot.org/downloads
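For example, a run with 8 worker threads might look like the following
(the scoring matrix and FASTA file names below are only placeholders):
#./swps3_ff -j 8 blosum62.mat query.fasta uniprot_sprot.fasta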
------------------------
simple_mandelbrot
------------------------
Simple Mandelbrot computes and displays the Mandelbrot set. The application
is intended as a naive example of an embarrassingly parallel application
on a stream. The Mandelbrot set is represented as a matrix of pixels.
The coordinates of the lines of the complex plane are dispatched by a
scheduler thread, which creates a stream and sends its items (lines) to workers
in a round-robin fashion. Workers allocate a row of pixels, compute the
Mandelbrot set for the line, and then send the row of pixels to a collector,
which gathers the lines from all workers and displays them on the screen.
The application comes in different versions: sequential (mandel_seq),
pthread-based (mandel_pt), FastFlow-based (mandel_ff), and FastFlow-based
with FastFlow's memory allocator (mandel_ff_mem_all).
All versions use the same sequential code, which has been manually
copied and pasted.
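The farm structure described above can be sketched with the classic FastFlow
farm API roughly as follows. This is an illustrative sketch, not the mandel_ff
source: the Mandelbrot computation and the display code are omitted.

  #include <cstdlib>
  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  // Emitter: creates the stream, one task (line index) per svc() call.
  struct Emitter: ff_node {
      Emitter(int nlines): next(0), nlines(nlines) {}
      void *svc(void *) {
          if (next >= nlines) return NULL;   // NULL from the emitter closes the stream
          return new int(next++);            // one task per line of the complex plane
      }
      int next, nlines;
  };

  // Worker: computes the row of pixels for the received line (omitted here).
  struct Worker: ff_node {
      void *svc(void *t) {
          // ... compute the Mandelbrot row for line *(int*)t ...
          return t;                          // forward the result to the collector
      }
  };

  // Collector: gathers the rows and would display them on the screen.
  struct Collector: ff_node {
      void *svc(void *t) {
          // ... display or store the received row ...
          delete (int *)t;
          return GO_ON;                      // no output, keep receiving
      }
  };

  int main() {
      const int nworkers = 4, nlines = 800;
      ff_farm<> farm;
      farm.add_emitter(new Emitter(nlines));
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new Worker());
      farm.add_workers(w);
      farm.add_collector(new Collector());
      return (farm.run_and_wait_end() < 0) ? -1 : 0;
  }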
Performance is highly dependent on the graphics subsystem, so if you want to
test performance, please compile all versions with the MANDEL_NO_DISPLAY
environment variable set (see the Makefile). In our tests, the lower the number
of iterations, the higher the FastFlow performance with respect to the
mandel_pt version.
Command line parameters are:
mandel_<ver> size niterations retries nworkers [template-type]
size : size of the pixel matrix
niterations : max number of iterations to compute each point of the set
retries : number of times the experiment is run
nworkers : number of workers used in the computation (threads = nworkers+1)
template-type: 0 forces the farm not to use the Collector, 1 forces it to use it
Default values are: 800 1024 1 2 1
Note that with fewer than 4 cores, the template-type parameter is set to 0
unless otherwise specified on the command line.
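For example, the following command computes an 800x800 set with at most 1024
iterations per point, one run, 4 workers and the Collector enabled:
#./mandel_ff 800 1024 1 4 1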
------------------------
qt_mandelbrot
------------------------
The "QT Mandelbrot" is an interactive application that computes the
Mandelbrot set. It is part of the Trolltech QT examples and it consists
of two classes: RenderThread.cpp, i.e. a QThread subclass that renders the
Mandelbrot set, and MandelbrotWidget.cpp class, i.e a QWidget
subclass that shows the Mandelbrot set on screen and lets the user
zoom and scroll. During the time where the worker thread is
recomputing the fractal to reflect the new zoom factor position, the
main thread simply scales the previously rendered pixmap to provide
immediate feedback. The result does not look as good as what the worker
thread eventually ends up providing, but at least it makes the
application more responsive.
These two instances of the classes are run as QT threads; it shows how
to use a worker thread to perform heavy computations without blocking
the main thread's event loop. The application is multithreaded
(exploiting QT threads) but threads are not used to boost heavy
computations since the whole computation is done within a single
thread, but to decouple two or more activities and to enhance
responsiveness.
Although it is a simple application, a full-fledged parallelization of
the whole application is not trivial. The two threads synchronise with
each other via QT events; to guarantee responsiveness the
MandelbrotWidget thread may start, restart, and abort the
activity of RenderThread. This kind of behaviour, as well as the
integration with other threading libraries, makes the porting to
frameworks such as TBB and OpenMP non-trivial.
The FastFlow-accelerated version simply parallelises the
RenderThread by using a farm accelerator on the outer
loop that traverses the pixmap of the Mandelbrot set. The farm
accelerator is created once, then run and frozen each time an interrupt
signal is raised by MandelbrotWidget.
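The accelerator usage pattern can be sketched roughly as follows, assuming the
classic FastFlow accelerator interface (a farm built in accelerator mode plus
offload/run_then_freeze/wait_freezing). This is an illustration, not the actual
qt_mandelbrot code.

  #include <vector>
  #include <ff/farm.hpp>
  using namespace ff;

  // Worker: would render the scanline identified by the received task.
  struct Worker: ff_node {
      void *svc(void *t) {
          // ... render the scanline *(int*)t into the pixmap ...
          delete (int *)t;
          return GO_ON;                     // nothing to forward, keep receiving
      }
  };

  int main() {
      const int nworkers = 4, nlines = 600;
      ff_farm<> farm(true);                 // true = accelerator mode (input channel)
      std::vector<ff_node *> w;
      for (int i = 0; i < nworkers; ++i) w.push_back(new Worker());
      farm.add_workers(w);
      farm.run_then_freeze();               // start the accelerator
      for (int y = 0; y < nlines; ++y)
          farm.offload(new int(y));         // one task per line of the pixmap
      farm.offload((void *)FF_EOS);         // end-of-stream
      farm.wait_freezing();                 // wait for the workers, keep them frozen
      // ... the farm could then be thawed and reused for the next zoom level ...
      return 0;
  }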
------------------------
nqueens
------------------------
The N-queens problem consists in placing N queens on an NxN
chessboard such that no queen can attack another.
The proposed solution is based on a heavily optimised sequential C program
written by Jeff Somers (the original version is implemented in the file nq.c)
and uses the FastFlow farm skeleton without the Collector module.
The FastFlow version produces a stream of independent tasks corresponding
to the initial placement of a number of queens on the board (that number
corresponds to the command line argument 'depth'). The placement of the
remaining queens, starting from the initial placement received as a task
from the input stream, is handled by one of the worker threads.
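The decomposition described above can be sketched as follows. This is a minimal
illustration of the task-generation idea (Task, safe and gen are hypothetical
names), not the actual nq_ff_acc source, and it omits the FastFlow farm
plumbing.

  #include <cstdio>
  #include <vector>

  // One task of the stream: a valid placement of the first 'depth' queens.
  struct Task { std::vector<int> cols; };   // cols[r] = column of the queen in row r

  // True if a queen can be placed in the next row at column 'col'.
  static bool safe(const std::vector<int>& cols, int col) {
      int row = cols.size();
      for (int r = 0; r < row; ++r)
          if (cols[r] == col || r - cols[r] == row - col || r + cols[r] == row + col)
              return false;
      return true;
  }

  // Enumerate every valid placement of the first 'depth' queens on an n x n board.
  static void gen(std::vector<int>& cols, int depth, int n, std::vector<Task>& out) {
      if ((int)cols.size() == depth) { out.push_back(Task{cols}); return; }
      for (int c = 0; c < n; ++c)
          if (safe(cols, c)) { cols.push_back(c); gen(cols, depth, n, out); cols.pop_back(); }
  }

  int main() {
      const int n = 8, depth = 2;
      std::vector<Task> tasks;
      std::vector<int> cols;
      gen(cols, depth, n, tasks);
      // Each task would be sent to a farm worker, which completes the remaining
      // n-depth rows with the optimised sequential solver.
      std::printf("%zu initial placements for n=%d, depth=%d\n", tasks.size(), n, depth);
      return 0;
  }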
Performance:
Machine: Dual quad-core Xeon E5420 @2.5GHz (8-cores)
^^^^^^^
Command: nq_ff_acc <board-size> 8 4
board-size        # solutions    seq. time   FastFlow Acc.   Speedup
                                   (nq)       (nq_ff_acc)
18x18             666,090,624       6:52         1:06         6.24
19x19           4,968,057,848      53:28         8:26         6.34
20x20          39,029,188,884    7:16:27      1:08:56         6.52
21x21         314,666,222,712   ~2.7days      9:48:28         6.69
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
Command: nq_ff_acc <board-size> 16 4
18x18             666,090,624       5:53           34         10.4
19x19           4,968,057,848      44:56         4:23         10.2
20x20          39,029,188,884    6:07:21        35:41         10.3
21x21         314,666,222,712   ~2.2days      5:07:19         10.3
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Command: nq_ff_acc <board-size> 24 4
18x18             666,090,624       5:32           18
19x19           4,968,057,848      43:06         1:13
20x20          39,029,188,884    5:50:32        17:46
21x21         314,666,222,712        ---      2:31:58          --
Command line parameters of the FastFlow version are:
nq_ff_acc <width of board> [<num-workers> <depth>]
width of board: problem size
num-workers : number of worker threads used in the computation (default: 2)
depth : initial number of queens to place (default: 1)
------------------------
cholesky
------------------------
The cholesky application performs the Cholesky decomposition on a stream
of complex Hermitian matrices.
Cholesky factorization, or Cholesky decomposition, is a way to decompose a
Hermitian, positive-definite matrix into the product of a lower
triangular matrix and its conjugate transpose.
If a complex matrix A is Hermitian and positive definite, then A can be
decomposed as A = LL*, where L is a lower triangular matrix with strictly
positive diagonal entries, and L* denotes the conjugate transpose of L.
The Cholesky decomposition is mainly used for the numerical solution of
linear equations Ax = b. We can solve Ax = b by first computing the Cholesky
decomposition A = LL*, then solving Ly = b for y, and finally solving L*x = y
for x.
There are various methods for calculating the Cholesky decomposition.
We have implemented two of them. The first one is the standard version,
explained in depth at http://en.wikipedia.org/wiki/Cholesky_decomposition.
The second one is the block version, which considers the matrix as divided
into sub-matrices: the algorithm proceeds by working on these sub-matrices
instead of single elements. A detailed explanation of the block variant of the
Cholesky decomposition can be found in "Programming Matrix Algorithms-by-Blocks
for Thread-Level Parallelism" by Gregorio Quintana-Ortí et al.
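As a reference, here is a minimal sketch of the standard (element-wise)
algorithm, shown for a real symmetric positive-definite matrix for brevity;
the application itself works on complex Hermitian matrices, where the
transpose becomes the conjugate transpose. This is an illustration, not the
application's source code.

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Standard Cholesky factorization A = L*L^T for a real SPD matrix stored
  // row-major in a flat array; L is lower triangular.
  static void cholesky(const std::vector<double>& A, std::vector<double>& L, int n) {
      for (int j = 0; j < n; ++j) {
          double d = A[j*n + j];
          for (int k = 0; k < j; ++k) d -= L[j*n + k] * L[j*n + k];
          L[j*n + j] = std::sqrt(d);                       // diagonal entry
          for (int i = j + 1; i < n; ++i) {
              double s = A[i*n + j];
              for (int k = 0; k < j; ++k) s -= L[i*n + k] * L[j*n + k];
              L[i*n + j] = s / L[j*n + j];                 // below-diagonal entry
          }
      }
  }

  int main() {
      const int n = 3;
      std::vector<double> A = {  4, 12, -16,
                                12, 37, -43,
                               -16, -43, 98 };             // classic SPD test matrix
      std::vector<double> L(n*n, 0.0);
      cholesky(A, L, n);                                   // expected rows of L: 2 0 0 / 6 1 0 / -8 5 3
      for (int i = 0; i < n; ++i) {
          for (int j = 0; j < n; ++j) std::printf("%8.3f ", L[i*n + j]);
          std::printf("\n");
      }
      return 0;
  }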
We implemented a farm parallelization of both versions. The emitter
distributes the tasks to the workers. Each task is represented by a pointer
to an input matrix A and a pointer to a result matrix L. There is no collector
filter: each worker reads its input and writes its result directly to main
memory.
.......to be completed.....
------------------------
matmul
------------------------
Simple NxN integer matrix multiplication (A=BxC).
The proposed FastFlow version splits the computation into NxN tasks, one for
each element of the result matrix. Better performance could be obtained by
splitting the computation into N tasks, allowing each worker thread to compute
an entire row of the result matrix.
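The row-based decomposition suggested above can be sketched as follows. This is
plain sequential code with a hypothetical compute_row helper (matrix names here
follow the usual C = A x B convention); in the farm version each row index
would be a task sent to a worker.

  #include <vector>

  // One task = one row index: a worker fills one entire row of the result.
  static void compute_row(int i, int n,
                          const std::vector<long>& A,
                          const std::vector<long>& B,
                          std::vector<long>& C) {
      for (int j = 0; j < n; ++j) {
          long sum = 0;
          for (int k = 0; k < n; ++k)
              sum += A[i*n + k] * B[k*n + j];
          C[i*n + j] = sum;
      }
  }

  int main() {
      const int n = 4;
      std::vector<long> A(n*n, 1), B(n*n, 2), C(n*n, 0);
      for (int i = 0; i < n; ++i)     // done sequentially here; in the farm version
          compute_row(i, n, A, B, C); // each row index would be dispatched to a worker
      return (C[0] == 2L*n) ? 0 : 1;  // every entry of C should equal 2*n
  }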
Performance for N=1024.
Command: ff_matmul <nworkers>
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
Sequential time 10780.9 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.4.3)
4 2982.24 2767.48
8 1495.87 1455.73
12 1496.01 1492.87
16 1495.74 1433.31
Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^
Sequential time 22324.1 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.3.2)
4 12173.5 15286.4
8 11116.2 13725.2
12 9844.62 8607.82
16 5890.27 7621.7
20 5518.41 7490.09
24 5516.41 6420.61
28 5194.5 6327.92
32 5161.4 6043.53
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
Sequential time 21967.9 (ms)
<nworkers> FastFlow time (ms) OpenMP time (ms) (gcc 4.3.2)
4 10448.3 8892.13
8 6111.95 5138.76
12 4530.39 4336.8
16 4771.28 4145.57
20 4621.81 4284.28
24 5318.03 4346.22
------------------------
pbzip2
------------------------
PBZIP2 is a parallel implementation, by Jeff Gilchrist, of the bzip2
block-sorting file compressor; it uses pthreads and achieves near-linear
speedup on SMP machines. You can find PBZIP2 at http://compression.ca/pbzip2/.
We ported the PBZIP2 code to the FastFlow framework; the source code
of the new version can be found in pbzip2_ff.cpp.
Our main objective in rewriting the application was to show how to rewrite
a pthread-based task-farm application using FastFlow's task-farm schema.
A detailed description of the PBZIP2 algorithm can be found in the paper:
"Parallel Data Compression with BZIP2" by Jeff Gilchrist available at
http://gilchrist.ca/jeff/papers/Parallel_BZIP2.pdf
Note that the main focus of the FastFlow porting is not improving the
performance of Gilchrist's PBZIP2, which is already very efficient and
exhibits nearly optimal speedup. The FastFlow porting rather aims to
demonstrate that:
- FastFlow makes it possible to parallelise the problem by
exploiting high-level parallel patterns instead of a hand-tuned
design, with almost no performance penalty and with a gain in productivity.
- FastFlow's non-blocking synchronisations exhibit reasonable
performance compared with traditional blocking synchronisations even in
worst-case scenarios, such as coarse-grained computations.
Regarding this latter point, imagine a farm schema where you need the collector
filter in order to perform some post-processing (e.g. writing data to files,
buffering tasks to maintain data ordering, etc.). In this case, the
non-blocking collector thread might have nothing to compute for long
periods of time because the worker threads are slow in producing output tasks.
It would then be better to stop the collector thread, letting it wait to be
awoken by one of the workers as soon as there are tasks to post-process.
In these situations, non-blocking behaviour might waste CPU cycles,
preventing other threads from doing something useful (this happens
mainly when there are more threads than cores).
We showed that by carefully using FastFlow mechanisms you can obtain almost
the same linear speedup as the hand-tuned mutex-based version of the same
code, even in the non-optimal case of a farm schema with a collector filter
and coarse-grained computation.
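As an illustration of this kind of post-processing, the following minimal
sketch (not the pbzip2_ff source; Block and OrderingCollector are hypothetical
names) shows a collector that buffers out-of-order blocks and releases them
strictly in sequence order.

  #include <cstdio>
  #include <map>

  // A compressed block tagged with its position in the original file.
  struct Block { long seq; /* compressed data would go here */ };

  // Collector filter: blocks may arrive from the workers in any order; they
  // are buffered and emitted strictly by sequence number.
  struct OrderingCollector {
      long next = 0;                        // next sequence number to emit
      std::map<long, Block> pending;        // out-of-order blocks, keyed by seq

      void receive(const Block& b) {
          pending[b.seq] = b;
          std::map<long, Block>::iterator it;
          while ((it = pending.find(next)) != pending.end()) {  // flush in-order blocks
              std::printf("writing block %ld\n", it->second.seq);
              pending.erase(it);
              ++next;
          }
      }
  };

  int main() {
      OrderingCollector c;
      long arrival[] = {1, 0, 3, 2};        // blocks arriving out of order
      for (long s : arrival) c.receive(Block{s});
      return 0;
  }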
Performance (*)
Commands used:
pbzip2_ff -k -v -f linux-2.6.32.8.tar (Compression)
pbzip2_ff -k -v -f -d linux-2.6.32.8.tar.bz2 (Decompression)
Machine: Dual quad-core Xeon E5520 Nehalem @2.26GHz (8-cores 16-contexts)
^^^^^^^
bzip2(v.1.0.3) pbzip2 pbzip2_ff
Compression 64.25 7.083 7.064
Decompression 14.67 1.8060 1.8728
Machine: 32-core AMD machine (4 CPUs, 8 cores per CPU)
^^^^^^^^
bzip2(v.1.0.5) pbzip2 pbzip2_ff
Compression 61.13 3.446733 3.432033
Decompression 13.02 1.311154 1.8728
Machine: Dual 12-core Opteron 6174 @2.2GHz (2 CPUs, 12 cores each)
^^^^^^^
bzip2(v.1.0.5) pbzip2(v.1.1.1) pbzip2_ff
Compression 69.15 4.334679 4.354582
Decompression 16.92 1.453813 1.931218
(*) We executed 10 runs, removed the minimum and maximum values obtained, and
then computed the mean.