-
Notifications
You must be signed in to change notification settings - Fork 0
/
predatordetection.tex
176 lines (139 loc) · 13.4 KB
/
predatordetection.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
\label{sec:detection}
\subsection{Overview}
\label{sec:overview}
\Predator{} follows the same idea of detecting false sharing, described in Section~\ref{sec:detectionidea}. We compute cache invalidations based only on memory accesses history of each cache line, and only report those instances that may affect performance.
\Predator{} is based on the similar {\it basic observation}: if a thread writes a cache line after other threads have accessed the same cache line, this write operation causes at least a cache invalidation. Drawing from this observation, \Predator{} tracks cache invalidations of all cache lines and ranks the severity of performance degradation according to the number of cache invalidations.
% by keeping track of accesses from different threads.
To track memory accesses, \Predator{} relies on the compiler instrumentation. The design tradeoff of choosing compiler instrumentation, instead of other mechanisms, has been discussed in Section~\ref{sec:instrumentationtradeoff}. A compiler can easily identify read or write accesses. However, a compiler does not know how and when those instructions are being executed, since that depends on a specific execution, input, and runtime environment.
Therefore, \Predator{} combines a runtime system with compiler instrumentation to track cache invalidations: the compiler instruments memory accesses so the runtime system is notified when an access is executed (see Section~\ref{sec:compiler}), and the runtime system is responsible for collecting and analyzing actual memory accesses to detect and report false sharing (see Section~\ref{sec:runtime}).
\subsection{Compiler Instrumentation}
\label{sec:compiler}
\Predator{} relies on LLVM to perform instrumentation at the intermediate representation level~\cite{llvm}.
It traverses all functions one by one and searches for memory accesses to those global and heap variables. For each memory access, \Predator{} instruments a function call to invoke the runtime system, passing with the address, access type and unit of this memory access (how many bytes). \Predator{} currently omits accesses to stack variables by default because stack variables are normally used for thread local storage and therefore do not normally introduce false sharing. However, instrumentation on stack variables
can always be turned on when necessary.
The instrumentation pass is placed at the very end of the LLVM optimization passes so that only those memory accesses surviving all previous LLVM optimization passes are instrumented. This technique is similar to the one used by AddressSanitizer~\cite{AddressSanitizer}.
\subsection{Runtime System}
\label{sec:runtime}
\Predator{}'s runtime system collects every memory access by handling those functions calls inserted during the compiler instrumentation phase. It analyzes possible cache invalidations based on the basic observation discussed in Section~\ref{sec:overview}. Finally, it precisely reports any performance-degrading false sharing problems it finds. For global variables involved in false sharing, \Predator{} reports their name, address and size; for heap
objects, \Predator{} reports the callsite stack for their allocations, address and size. In addition, \Predator{} provides word granularity access information for those cache lines involved in false sharing: how many reads or writes are issued by which thread. This information can further help users diagnose and fix false sharing instances.
\subsubsection{Tracking Cache Invalidations}
\Predator{} only reports those global variables or heap objects on cache lines with a large number of cache invalidations, thus possibly affecting performance of applications. To track cache invalidations, \Predator{} maintains a two-entries-cache-history table for each cache line. In this table, each entry has two fields: thread ID and access type (read or write). Thread ID is used to identify the origin of each access. As stated earlier, only accesses from different threads (with a different thread ID) can cause a cache invalidation.
For every new access to a cache line $L$, \Predator{} checks $L$'s history table $T$ to decide whether the current access leads to a cache invalidation. As described in Section~\ref{}, only write accesses can cause cache invalidations and read accesses only create a copy of data in the cache of the current core that the current thread is running on. Also, it is noticed that table $T$ only has two status: full and not full. There is no ``empty'' status since every cache invalidation should replace its table with the current write access, setting the first entry to the current access (with its thread ID and write access type).
\begin{itemize}
\item
For a read access $R$,
\begin{itemize}
\item
If $T$ is full, there is no need to record this read access.
\item
If $T$ is not full and another existing entry has a different thread
ID, then \Predator{} records this $R$ (and its thread) by adding a new entry to the table.
\end{itemize}
\item
For a write access $W$,
\begin{itemize}
\item
If $T$ is full, then $W$ can cause a cache invalidation since at least one of two existing accesses are issued by a different thread (one thread can only occupy one entry).
After recording this invalidation, \Predator{} updates the existing entry with $W$(and its thread).
\item
If $T$ is not full,
\Predator{} checks whether $W$ and the existing entry has the same thread ID. If so, $W$ cannot cause a cache invalidation and there is no need to do anything. Otherwise, \Predator{} identifies a possible cache invalidation on this line: it increments the number of cache invalidations and updates the existing entry with the current $W$ access.
\end{itemize}
\end{itemize}
\subsubsection{Reporting False Sharing}
\label{sec:predatorcallsite}
\Predator{} reports false sharing precisely and accurately. Accurately means \Predator{} only reports those false sharing instances with a large number of cache invalidations, which may possibly cause performance problems. \Predator{} also differentiate actual false sharing from true sharing, since true sharing can also induce a large number of cache invalidations.
\Predator{} employs the following mechanisms to achieve this target, as well as reducing the performance overhead.
\begin{itemize}
\item
In order to accurately differentiate those false sharing problems with true sharing problems, \Predator{} tracks word-level accesses for those cache lines involved in a big number of cache invalidations, which has been discussed in Section~\ref{sec:accuratedetect}.
\item
\predator{} relies on \texttt{backtrace()} function in the \texttt{glibc} library to obtain the whole callsite stack, which is slower but much more robust to be used than the ways used in \SheriffDetect{}. Thus, it can report the callsite stack for those heap objects.
\item
For every access, \Predator{} needs to lookup the corresponding cache line's metadata. Because this operation is so frequent, at every access, lookups need to be very efficient. Like AddressSanitizer~\cite{AddressSanitizer} and other systems~\cite{Valgrind, qinzhao}, \Predator{} uses a shadow memory mechanism to store metadata for every piece of application data. Thus, \Predator{} can compute and locate corresponding metadata directly via address arithmetic.
\item
In order to support shadow memory, \Predator{} uses a pre-defined starting address and fixed size for its heap. It also contains a custom memory allocator, which is described in Section~\ref{sec:customheap}. However, using this custom memory allocator also implies that false sharing caused by a memory allocator cannot be detected by \Predator{}: two threads allocate heap objects from the same cache line concurrently. But this should not be a serious problem since all modern memory allocators, like Hoard, already avoid this kind of false sharing and we should always use this kind of memory allocator.
\end{itemize}
\subsection{Optimizations}
\label{optimization}
Tracking every memory access can be extremely expensive, thus \Predator{} utilizes the following mechanisms to further reduce performance and memory overhead.
\subsubsection{Threshold-Based Tracking Mechanism}
\label{sec:thresholdtracking}
\Predator{} aims to detect false sharing that significantly degrades performance. Since cache invalidations are the root cause of performance degradation and only writes
can possibly cause cache invalidations,
cache lines with a small number of writes are never be a target with a significant performance impact.
For this reason, \Predator{} only tracks cache invalidations
once the number of writes to a cache line crosses a
pre-defined threshold, which we refer to as the {\it Tracking-Threshold}.
Before this threshold is reached, \Predator{} only tracks the number of writes on a cache line while skipping tracking reads. This mechanism reduces performance and memory overhead
at the same time.
In the current implementation, \Predator{} maintains two arrays in shadow memory:
{\it CacheWrites} tracks the number of memory writes on every cache line, and
{\it CacheTracking} tracks detailed information
for each cache line once the number of writes on a cache line exceeds the {\it Tracking-Threshold}.
If the threshold is not reached, there is no need to check the corresponding {\it CacheTracking}.
Figure~\ref{fig:algorithm} illustrates the detailed mechanism.
\begin{figure}[!t]
\begin{lstlisting}[style=tt]
void HandleAccess(unsigned long addr, bool isWrite) {
unsigned long cacheIndex=addr>>CACHELINE_SIZE_SHIFTS;
cachetrack *track=NULL;
if(CacheWrites[cacheIndex]<TRACKING_THRESHOLD) {
if(isWrite) {
if(ATOMIC_INCR(&CacheWrites[cacheIndex])
==TRACKING_THRESHOLD-1) {
track=allocCacheTrack();
ATOMIC_CAS(&CacheTracking[cacheIndex],0,track));
}
}
}
else {
track=CacheTracking[index]);
if(track){
// Track cache invalidations and detailed accesses
track->handleAccess(addr, isWrite);
}
}
}
\end{lstlisting}
\caption{Pseudo-code of handling an access in \Predator{}.\label{fig:algorithm}}
\end{figure}
To avoid expensive lock operations, \Predator{} uses atomic instruction to increment the {\it CacheWrites} counter for each cache line.
When the number of writes of a cache line reaches the predefined threshold,
it allocates space to track detailed cache invalidations and word-level information. \Predator{} also
uses an atomic compare-and-swap to set the cache tracking address for this cache line in the shadow mapping.
After {\it CacheWrites} on a cache line reaches the {\it Tracking-Threshold}, all read and write accesses on this cache line are tracked.
\subsubsection{Selective Compiler Instrumentation}
\label{sec:selectinstrumentation}
\Predator{} relies on compiler instrumentation to provide memory access information to the runtime system
and detects false sharing based on the sequences of memory accesses on every cache line.
The performance overhead of a specific program is always proportional to the degree of instrumentation: more instrumentation generally indicates more performance overhead.
Thus, \Predator{} provides a flexible framework to instrument programs depending on the performance requirements of the user.
Currently, \Predator{} only instruments once for each type of memory access on each address
to the same basic block.
This selective instrumentation does not normally affect the effectiveness of detection.
Because \Predator{} aims to detect false sharing cases with a large number of cache invalidations,
less tracking of accesses inside a basic block can induce fewer cache invalidations, but it should not affect the overall behavior of cache invalidations.
% detection will not cause performance problem.
To improve performance further,
\Predator{} can be easily extended to support more flexible instrumentation as follows:
\begin{itemize}
\item
\Predator{} could selectively instrument both reads and writes or only writes.
Instrumenting only writes reduces overhead while detecting write-write false sharing,
as \Sheriff{} does.
\item
\Predator{} can be set to instrument or skip specific code or data.
For example, the user could provide a black-list so that given modules,
functions or variables are not instrumented.
Conversely, the user could provide a white-list so that only specified functions or variables are instrumented.
\end{itemize}
\subsubsection{Sampling Mechanism}
\label{sec:sample}
As described in Section~\ref{sec:thresholdtracking}, when the number of writes on a cache line is larger than {\it Tracking-Threshold}, every
access must be tracked in detail: we have to track word-level
information, update the number of accesses and possible cache invalidations, and update the cache access history table
of this cache line. When a cache line is involved in false or true sharing, updating those counters can exacerbate the performance impact of sharing: not only is there an cache invalidation on this application's cache line, but there is also at least another cache invalidation caused by updating the metadata of this corresponding cache line.
To further reduce the performance overhead, \Predator{} only samples the first specified number of accesses during each sampling interval.
Currently, \Predator{} maintains an access counter for each cache line
and only tracks the first $10,000$ accesses out of every 1 million accesses on a cache line, with 1\% sampling rate. Section~\ref{sec:predatorsensitivity} further evaluates the effect of different sampling rates on performance and effectiveness.