Speedup(P) = Time on 1 processor / Time on P processors
Steps to parallelize: Break up a problem into many pieces of work -> Assign work to workers
(1) Decomposition: Create at least enough tasks to keep all execution units on a machine busy. This is largely the programmer's responsibility; compilers are still not good at automatically parallelizing code.
Amdahl's law: If a fraction S of the program is inherently sequential, the maximum speedup with parallel execution is 1/S.
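As a quick numeric sketch (the 10% sequential fraction below is a made-up example), the full form of Amdahl's law, speedup(P) = 1 / (S + (1-S)/P), shows how the 1/S ceiling emerges as P grows:

#include <cstdio>

/* Amdahl's law with sequential fraction S and P processors:
   speedup(P) = 1 / (S + (1 - S) / P), which approaches 1/S as P grows. */
double amdahl_speedup(double S, int P) {
    return 1.0 / (S + (1.0 - S) / P);
}

int main() {
    double S = 0.1;                       /* assume 10% of the program is sequential */
    const int Ps[] = {2, 8, 64, 1024};
    for (int P : Ps)
        std::printf("P = %4d  speedup = %.2f\n", P, amdahl_speedup(S, P));
    std::printf("upper bound 1/S = %.2f\n", 1.0 / S);   /* = 10x here */
}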
(2) Assignment: Many languages and runtimes take responsibility for the assignment.
For example, in SPMD code the programmer only decomposes the work into loop iterations, and the system manages the assignment of iterations to program instances. As another example, ISPC can also assign work at the thread level using tasks.
(3) Orchestration: Communication between workers, adding synchronization, scheduling tasks.
(4) Mapping to hardware: Mapping workers (threads) to execution units. The programmer usually does not have to concern themselves with this step.
The OS maps threads to hardware execution contexts on a CPU. The compiler maps program instances to vector instruction lanes. The hardware maps CUDA thread blocks to GPU cores.
Mapping decisions: Placing related threads on the same processor maximizes locality and data sharing and minimizes the cost of synchronization. Placing unrelated threads on the same processor (e.g., one bandwidth-limited and one compute-limited) may utilize the machine more efficiently.
2D-grid based solver example
Perform Gauss-Seidel sweeps (each update averages a cell with its four neighbors) over the grid until convergence.
/* Pseudocode */
const int n;                  /* Grid size */
float* A;                     /* Assume grid of (n+2) x (n+2) elements; A[i,j] denotes element (i,j) */
void solve(float* A) {
    float diff, prev;
    bool done = false;
    while (!done) {
        diff = 0.f;
        for (int i=1; i<=n; i++) {        /* Sweep over the n x n interior cells */
            for (int j=1; j<=n; j++) {
                prev = A[i,j];
                A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i,j+1] +
                                 A[i-1,j] + A[i+1,j]);
                diff += fabs(A[i,j] - prev);
            }
        }
        if (diff/(n*n) < TOLERANCE)
            done = true;
    }
}
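A minimal runnable C++ version of the pseudocode above (a sketch: the grid is stored row-major, so the A[i,j] shorthand becomes A[i*(n+2)+j], and the TOLERANCE value is an assumed placeholder):

#include <cmath>
#include <vector>

const float TOLERANCE = 1e-4f;   /* assumed convergence threshold */

/* Sequential Gauss-Seidel sweeps over the n x n interior of an
   (n+2) x (n+2) grid stored row-major: A[i,j] == A[i*(n+2)+j]. */
void solve(std::vector<float>& A, int n) {
    bool done = false;
    while (!done) {
        float diff = 0.f;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                int idx = i * (n + 2) + j;
                float prev = A[idx];
                /* Average of the cell and its four neighbors */
                A[idx] = 0.2f * (A[idx] + A[idx - 1] + A[idx + 1] +
                                 A[idx - (n + 2)] + A[idx + (n + 2)]);
                diff += std::fabs(A[idx] - prev);
            }
        }
        if (diff / (n * n) < TOLERANCE)
            done = true;
    }
}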
Decomposition for 2D-grid based solver
Extracting parallelism is difficult: each element depends on the element to its left, and each row depends on the row above it, both of which are updated in the same sweep.
There is independent work along the diagonals. A possible strategy is to break each diagonal into tasks and update them in parallel. However, there is not much parallelism at the beginning and end of the computation.
The algorithm can be changed if domain knowledge is available. The new approach colors the grid like a checkerboard and updates all red cells in parallel, then all black cells in parallel. The new algorithm converges to approximately the same solution.
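A sketch of one red sweep under this reordering, assuming cells are colored so that red means (i + j) is even; the stride-2 inner loop is how "red cells in the row" expands in the pseudocode further below:

#include <cmath>
#include <vector>

/* One "red" sweep over a row-major (n+2) x (n+2) grid: update only cells
   where (i + j) is even (assumed coloring). Every neighbor a red cell reads
   is black, so the red updates are mutually independent and could all run
   in parallel. Returns the accumulated diff. */
float red_sweep(std::vector<float>& A, int n) {
    float diff = 0.f;
    for (int i = 1; i <= n; i++) {
        int jStart = (i % 2 == 0) ? 2 : 1;       /* pick the red cells in this row */
        for (int j = jStart; j <= n; j += 2) {
            int idx = i * (n + 2) + j;
            float prev = A[idx];
            A[idx] = 0.2f * (A[idx] + A[idx - 1] + A[idx + 1] +
                             A[idx - (n + 2)] + A[idx + (n + 2)]);
            diff += std::fabs(A[idx] - prev);
        }
    }
    return diff;
}

A black sweep is identical with the parity flipped; the solver alternates the two.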
Assignment for 2D-grid based solver
A blocked assignment may require less communication with other processors: for example, P2 only needs two rows from other processors (one boundary row each from P1 and P3). With an interleaved assignment, P2 needs six rows from other processors.
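A sketch of the two assignments in terms of row indices (threadId and NUM_PROCESSORS are the names used in the pseudocode further below):

/* Blocked assignment: thread threadId owns one contiguous band of rows, so per
   sweep it only needs one boundary row from each of its two neighbors. */
void sweep_blocked(int n, int threadId, int NUM_PROCESSORS) {
    int rowsPerThread = n / NUM_PROCESSORS;      /* assume n divides evenly */
    int firstRow = 1 + threadId * rowsPerThread;
    int lastRow  = firstRow + rowsPerThread - 1;
    for (int i = firstRow; i <= lastRow; i++) { /* update row i */ }
}

/* Interleaved assignment: thread threadId owns rows threadId+1, threadId+1+P, ...
   Both neighbor rows of every owned row belong to other threads, so many more
   rows must be obtained from other processors each sweep. */
void sweep_interleaved(int n, int threadId, int NUM_PROCESSORS) {
    for (int i = 1 + threadId; i <= n; i += NUM_PROCESSORS) { /* update row i */ }
}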
Data-parallel model: Perform the same operation on each element of an array. Programming in NumPy is a good example.
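In C++ terms, the same idea is an element-wise map (a tiny sketch; the arithmetic is arbitrary):

#include <algorithm>
#include <vector>

void data_parallel_example() {
    std::vector<float> x(1024, 2.0f), y(1024);
    /* The same side-effect-free operation is applied to every element
       independently, like the NumPy expression y = x * x + 1. */
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return v * v + 1.0f; });
}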
Stream programming model: Elements in a stream are processed independently by "kernels" which are side-effect-free functions.
const int N = 1024;
stream<float> input(N);      /* Input stream of N elements */
stream<float> output(N);     /* Output stream */
stream<float> tmp(N);        /* Intermediate stream */
foo(input, tmp);             /* Kernel foo applied independently to each element of input */
bar(tmp, output);            /* Kernel bar applied independently to each element of tmp */
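A C++ sketch of what the kernels could look like if each one is applied independently per element (the bodies of foo and bar are made-up placeholders):

/* Side-effect-free kernels: each output element depends only on the
   corresponding input element, so elements can be processed in any order
   or in parallel. */
float foo(float x) { return 2.0f * x; }        /* hypothetical kernel body */
float bar(float x) { return x + 1.0f; }        /* hypothetical kernel body */

void run_stream_program(const float* input, float* output, int N) {
    for (int i = 0; i < N; i++) {
        float tmp = foo(input[i]);     /* per-element intermediate value; the tmp  */
        output[i] = bar(tmp);          /* stream need not be fully materialized    */
    }
}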
Gather operation: gather(R1, R0, mem_base) gathers values from buffer mem_base into vector register R1 according to the indices specified in R0.
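In scalar terms the gather is just an indexed-load loop (a sketch; WIDTH stands for the number of vector lanes and is an assumed name):

const int WIDTH = 8;   /* assumed number of vector lanes */

/* gather(R1, R0, mem_base): for each lane i, load the element of mem_base
   whose index is held in lane i of R0 into lane i of R1. */
void gather(float R1[], const int R0[], const float* mem_base) {
    for (int i = 0; i < WIDTH; i++)
        R1[i] = mem_base[R0[i]];
}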
Shared address space pseudocode for 2D-grid based solver
The pseudocode shows only the updates of red cells. Rows are assigned to threads in blocked fashion. barrier() blocks until the expected number of threads arrive.
int n;                              /* Grid size */
bool done = false;                  /* Shared among all threads */
float diff = 0.0;                   /* Shared among all threads */
LOCK myLock;
BARRIER myBarrier;
float* A = allocate(n+2, n+2);      /* Shared grid */
/* Function solve() is run in all threads */
void solve(float* A) {
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);   /* First row owned by this thread */
    int myMax = myMin + (n / NUM_PROCESSORS);          /* Blocked assignment of rows */
    while (!done) {
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (i = myMin to myMax) {
            for (j = red cells in the row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i,j+1] +
                                 A[i-1,j] + A[i+1,j]);
                lock(myLock);
                diff += abs(A[i,j] - prev);
                unlock(myLock);
            }
        }
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);   /* All threads see the same answer */
    }
}
However, every update to an (i, j) element requires acquiring the lock to modify diff. This is expensive, and only one thread can hold the lock at a time, so the updates serialize. To improve performance, accumulate into a per-thread myDiff variable, then perform the reduction into the global diff once at the end of the thread's iteration.
/* The rest of the code is unchanged */
while (!done) {
    float myDiff = 0.f;                /* Per-thread accumulator */
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (i = myMin to myMax) {
        for (j = red cells in the row) {
            float prev = A[i,j];
            A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i,j+1] +
                             A[i-1,j] + A[i+1,j]);
            myDiff += abs(A[i,j] - prev);
        }
    }
    lock(myLock);                      /* Lock is acquired only once per thread per sweep */
    diff += myDiff;
    unlock(myLock);
    /* The rest of the code is unchanged */
}
Barrier: All computation by all threads before the barrier must complete before any computation in any thread after the barrier begins. There are three barriers in the example code.
First barrier: ensures that no thread resets diff while other threads are still adding to it.
Second barrier: ensures all threads have contributed to diff before the convergence check runs.
Third barrier: ensures every thread has finished the check before any thread moves on to the next iteration and resets diff.
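For reference, a runnable C++ sketch of the same three-barrier scheme using std::thread, std::mutex, and C++20 std::barrier (the grid size, thread count, and tolerance are assumed values; only the red sweep is shown, mirroring the pseudocode):

#include <atomic>
#include <barrier>
#include <cmath>
#include <mutex>
#include <thread>
#include <vector>

const int   NUM_PROCESSORS = 4;      /* assumed thread count */
const int   n = 256;                 /* assumed grid size (multiple of NUM_PROCESSORS) */
const float TOLERANCE = 1e-4f;       /* assumed convergence threshold */

std::vector<float> A((n + 2) * (n + 2), 0.f);   /* shared grid, row-major */
std::atomic<bool>  done{false};
float              diff = 0.f;                  /* shared accumulator */
std::mutex         myLock;
std::barrier<>     myBarrier(NUM_PROCESSORS);

void solve(int threadId) {
    int myMin = 1 + threadId * (n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS) - 1;     /* inclusive last row of the block */
    while (!done) {
        float myDiff = 0.f;
        if (threadId == 0) diff = 0.f;                /* reset once per sweep */
        myBarrier.arrive_and_wait();                  /* barrier 1 */
        for (int i = myMin; i <= myMax; i++) {
            for (int j = (i % 2 == 0) ? 2 : 1; j <= n; j += 2) {   /* red cells only */
                int idx = i * (n + 2) + j;
                float prev = A[idx];
                A[idx] = 0.2f * (A[idx] + A[idx - 1] + A[idx + 1] +
                                 A[idx - (n + 2)] + A[idx + (n + 2)]);
                myDiff += std::fabs(A[idx] - prev);
            }
        }
        {
            std::lock_guard<std::mutex> guard(myLock);   /* lock taken once per sweep */
            diff += myDiff;
        }
        myBarrier.arrive_and_wait();                  /* barrier 2 */
        if (diff / (n * n) < TOLERANCE)
            done = true;
        myBarrier.arrive_and_wait();                  /* barrier 3 */
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < NUM_PROCESSORS; t++)
        threads.emplace_back(solve, t);
    for (auto& th : threads) th.join();
}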
Shared address space pseudocode with one barrier only
Store a diff array of size 3: one accumulator for the previous iteration, one for the current iteration, and one being prepared for the next iteration. Three copies are needed so that a thread that races ahead into the next iteration never zeroes or accumulates into the slot a slower thread is still reading.
int n;                              /* Grid size */
bool done = false;
LOCK myLock;
BARRIER myBarrier;
float diff[3];                      /* Global array of diff accumulators */
float* A = allocate(n+2, n+2);
/* Function solve() is run in all threads */
void solve(float* A) {
    float myDiff;                   /* Thread-local variable */
    int index = 0;                  /* Thread-local variable */
    diff[0] = 0.0f;
    barrier(myBarrier, NUM_PROCESSORS);       /* One-time initialization barrier */
    while (!done) {
        myDiff = 0.f;
        /*
         * Perform accumulation into myDiff (as before)
         */
        lock(myLock);
        diff[index] += myDiff;
        unlock(myLock);
        diff[(index+1) % 3] = 0.0f;           /* Reset the accumulator for the next iteration */
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff[index]/(n*n) < TOLERANCE)
            break;
        index = (index + 1) % 3;
    }
}