11 - Fine-grained sync and lock-free programming
Wrong lock implementation: the load-test-store sequence below is not atomic, so two processors can both observe the lock as free and acquire it at the same time.

lock:    ld   R0, mem[addr]    // Load lock value
         cmp  R0, #0           // Test: is the lock free?
         bnz  lock             // If nonzero, lock is held: retry
         st   mem[addr], #1    // Store 1 to take the lock (race window here)

unlock:  st   mem[addr], #0    // Release by storing 0
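
The same bug, sketched in C++ (illustrative names, not from the notes): the check and the store are separate operations, so two threads can both read 0 before either writes 1.

void bad_lock(volatile int* lock) {
    while (*lock != 0);   // load + test: not atomic with the store below
    *lock = 1;            // race window: another thread may have stored 1 already
}

void bad_unlock(volatile int* lock) {
    *lock = 0;
}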

Test-and-set lock implementation

ts   R0, mem[addr]  // Atomically: load mem[addr] into R0;
                    // if R0 was 0, store 1 to mem[addr].
                    // R0 holds the old value.

If the test-and-set returns a nonzero value, the lock was already held, so issue it again. Spinning until it returns zero is equivalent to acquiring the lock.

lock:    ts   R0, mem[addr]   // Atomic test-and-set
         bnz  lock            // Nonzero means the lock was held: retry

unlock:  st   mem[addr], #0   // Release by storing 0
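
For comparison, a minimal C++11 sketch of the same test-and-set spinlock using std::atomic_flag, whose test_and_set() atomically sets the flag and returns its previous value (function names here are illustrative):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void ts_lock() {
    // Spin while test_and_set returns true (lock already held)
    while (lock_flag.test_and_set(std::memory_order_acquire));
}

void ts_unlock() {
    lock_flag.clear(std::memory_order_release);   // release by clearing the flag
}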

Test-and-set with coherence traffic

Processor 1 runs the ts instruction, which is a write, so it invalidates the line in every other cache. The other processors keep looping on ts until they acquire the lock.

Because ts is a write and its cached copy has been invalidated, the next processor must broadcast BusRdX to bring the address into its cache in exclusive state, which forces processor 1 to invalidate its own copy. The ts still fails, because the lock has already been acquired by processor 1.

This invalidation repeats each time another waiting processor broadcasts BusRdX to run its own ts instruction.

Even though processor 1 holds the lock, it no longer has the cache line. It must broadcast BusRdX yet again to regain the line just to release the lock.

Test-and-test-and-set lock implementation

An additional test (a plain read) before the test-and-set instruction. Since the extra test is only a read, the waiting processors just load the line into the SHARED state, so there is no interconnect traffic while the lock is held. Traffic starts again only after the lock is released.

Slightly higher latency than test-and-set in the uncontended case, but far less interconnect traffic while the lock is held. However, when the lock is released, all waiting processors try to acquire it at the same time, which generates a burst of traffic.

void Lock(int* lock) {
    while (1) {
        // Test: spin with plain reads while the lock is held (no bus traffic)
        while (*lock != 0);

        // Test-and-set: test_and_set(lock) atomically sets *lock to 1
        // and returns the old value
        if (test_and_set(lock) == 0)
            return;
    }
}

void Unlock(int* lock) {
    *lock = 0;
}
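
A compilable C++11 sketch of the same test-and-test-and-set pattern with std::atomic<int>; exchange plays the role of test-and-set (names are illustrative):

#include <atomic>

void ttas_lock(std::atomic<int>& lock) {
    while (true) {
        // Test: read-only spin keeps the line in SHARED state while the lock is held
        while (lock.load(std::memory_order_relaxed) != 0);

        // Test-and-set: exchange atomically writes 1 and returns the old value
        if (lock.exchange(1, std::memory_order_acquire) == 0)
            return;
    }
}

void ttas_unlock(std::atomic<int>& lock) {
    lock.store(0, std::memory_order_release);
}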

Ticket lock

When acquiring the lock, each thread atomically takes an integer ticket. The only write is the one that takes the ticket; waiting for its turn is done with reads alone, so actually acquiring the lock requires no further write.

struct lock {
    int next_ticket;   // next ticket number to hand out
    int now_serving;   // ticket number currently allowed to enter
};

void Lock(lock* l) {
    // Take a ticket with one atomic write, then spin with reads only
    int my_ticket = atomic_increment(&l->next_ticket);
    while (my_ticket != l->now_serving);
}

void Unlock(lock* l) {
    // Only the lock holder writes now_serving
    l->now_serving++;
}
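
The same ticket lock, sketched with C++11 atomics (assuming atomic_increment above means fetch-and-increment, i.e. it returns the value before the increment; names are illustrative):

#include <atomic>

struct TicketLock {
    std::atomic<int> next_ticket{0};
    std::atomic<int> now_serving{0};
};

void ticket_lock(TicketLock& l) {
    // One atomic write to take a ticket; fetch_add returns the previous value
    int my_ticket = l.next_ticket.fetch_add(1, std::memory_order_relaxed);

    // Wait with reads only until it is our turn
    while (l.now_serving.load(std::memory_order_acquire) != my_ticket);
}

void ticket_unlock(TicketLock& l) {
    // Only the holder writes now_serving; publish the next ticket with release ordering
    l.now_serving.store(l.now_serving.load(std::memory_order_relaxed) + 1,
                        std::memory_order_release);
}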

CUDA atomic compare-and-swap

Atomic compare-and-swap function in CUDA. It generalizes test-and-set: with compare = 0 and val = 1 it behaves exactly like test-and-set.

int atomicCAS(int* addr, int compare, int val) {
    // The whole body executes as a single atomic operation
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;   // always returns the previous value
}

How do you implement atomic_min? Atomicity is important because between loading the current value and storing back the minimum, another thread may have modified the memory.

The implementation below retries with a CAS loop, so the update is applied correctly even when other threads modify the memory concurrently.

int atomic_min(int* addr, int x) {
    int old = *addr;
    int new_val = min(old, x);
    // Retry until the value at addr has not changed between our read and the CAS
    while (atomicCAS(addr, old, new_val) != old) {
        old = *addr;
        new_val = min(old, x);
    }
    return old;   // value observed just before the successful update
}
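
The same CAS-retry pattern on the CPU side, sketched with C++11 atomics (illustrative, not from the notes); compare_exchange_strong reloads the current value into old whenever the swap fails, so the minimum is recomputed with fresh data on every retry:

#include <atomic>
#include <algorithm>

int atomic_min_cpp(std::atomic<int>& a, int x) {
    int old = a.load();
    // On failure, old is updated to the value another thread wrote,
    // and std::min(old, x) is re-evaluated on the next attempt
    while (!a.compare_exchange_strong(old, std::min(old, x)));
    return old;   // value observed just before the successful update
}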

Implementing a lock with compare-and-swap.

// Simple
typedef int Lock;

void lock(Lock* l) {
    // Spin until the CAS succeeds in changing *l from 0 (free) to 1 (held)
    while (atomicCAS(l, 0, 1) == 1);
}

void unlock(Lock* l) {
    *l = 0;
}

// More efficient: spin with reads first (test-and-test-and-set style)
void lock(Lock* l) {
    while (1) {
        while (*l == 1);
        if (atomicCAS(l, 0, 1) == 0)
            return;
    }
}

C++11 atomic<T>: provides atomic read, write, and read-modify-write of entire C++ objects, and provides memory ordering semantics for the operations before and after each atomic operation (default: sequential consistency).
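
A small sketch of std::atomic<T> in use (variable and function names are illustrative); every operation is sequentially consistent unless a weaker std::memory_order is passed explicitly:

#include <atomic>

std::atomic<int> counter{0};
std::atomic<bool> done{false};

void producer() {
    counter.fetch_add(1);   // atomic read-modify-write, seq_cst by default
    done.store(true);       // atomic write
}

void consumer() {
    while (!done.load());   // atomic read: spin until the producer is finished
    int v = counter.load(); // sees the increment, thanks to the default ordering
    (void)v;
}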

Example of a sorted linked list, used in the following demonstrations of simultaneous operations.

struct Node {
	int value;
	Node* next;
};
struct List {
	Node* head;
};

void insert(List* list, int value) {
    Node* n = new Node;
    n->value = value;

    // Assume insert before head is handled

    Node* prev = list->head;
    Node* cur = list->head->next;

    // Walk until the first node whose value is greater than the new value
    while (cur) {
        if (cur->value > value)
            break;

        prev = cur;
        cur = cur->next;
    }

    // Splice the new node in between prev and cur
    n->next = cur;
    prev->next = n;
}

// "delete" is a reserved word in C++, so the function is named remove here
void remove(List* list, int value) {
    // Assume deleting the first node is handled

    Node* prev = list->head;
    Node* cur = list->head->next;

    while (cur) {
        if (cur->value == value) {
            // Unlink cur, then free it
            prev->next = cur->next;
            delete cur;
            return;
        }

        prev = cur;
        cur = cur->next;
    }
}