Lieven Eeckhout et al. @ Ghent University, Belgium & University of Texas at Austin
Current computer architecture research relies heavily on architectural simulation to obtain insight into the cycle-level behavior of modern microarchitectures. Unfortunately, such architectural simulations are extremely time-consuming. Sampling is an often-used technique to reduce the total simulation time. This is achieved by selecting a limited number of samples from a complete benchmark execution. One important issue with sampling, however, is the unknown hardware state at the beginning of each sample. Several approaches have been proposed to address this problem by warming up the hardware state before each sample. This paper presents the boundary line reuse latency (BLRL) which is an accurate and efficient warmup strategy. BLRL considers reuse latencies (between memory references to the same memory location) that cross the boundary line between the pre-sample and the sample to compute the warmup that is required for each sample. This guarantees a nearly perfect warmup state at the beginning of a sample. Our experimental results obtained using detailed processor simulation of SPEC CPU2000 benchmarks show that BLRL significantly outperforms the previously proposed memory reference reuse latency (MRRL) warmup strategy. BLRL achieves a warmup that is only half the warmup for MRRL on average for the same level of accuracy.
Current microarchitectural research relies heavily on cycle-level architectural simulations that model the execution of a benchmark on a microprocessor. Cycle-level simulations model a microarchitecture at a fairly detailed level. The price paid for such detailed simulations obviously is simulation speed. Simulating a full benchmark execution can take days or even weeks to complete. If we take into account that during microarchitectural research a multitude of design alternatives need to be evaluated, we easily end up with months or even years of simulation. As such, detailed simulation of full benchmark executions is infeasible.
Several approaches have been proposed in the recent literature to address this problem. One particular approach is sampled simulation, in which a selected number of execution intervals, called samples, are simulated from a complete benchmark execution. Since the number of samples and their sizes are limited, significant simulation speedups are obtained. However, there is one particular issue that needs to be dealt with, namely the cold-start problem, which refers to the unknown hardware state at the beginning of each sample. An attractive solution to the cold-start problem is to simulate a number of instructions from the pre-sample without computing performance metrics. The purpose is to warm up large hardware structures so that the hardware state at the beginning of the sample closely approximates the state that detailed simulation of the full benchmark would produce. Because the microarchitectural state can have an extremely long history (e.g. in large caches), the warmup phase needs to be proportionally long. Since warm simulation can be a significant part of the total sampled simulation time, it is important to study efficient but accurate warmup strategies. Reducing the warmup length can yield significant simulation speedups.
This paper presents the boundary line reuse latency (BLRL) as a highly accurate and efficient warmup strategy. BLRL uses the reuse latencies (between two memory references accessing the same memory location) that cross the boundary line between the pre-sample and the sample to determine the warmup length per sample. By doing so, a nearly perfect warmup state is guaranteed at the beginning of each sample. We also compare BLRL with the previously proposed state-of-the-art memory reference reuse latency (MRRL) warmup strategy and conclude that BLRL significantly outperforms MRRL. Our experimental results using SPEC CPU2000 benchmarks show that BLRL achieves about half the warmup time of MRRL for the same level of accuracy to estimate the average number of cycles per instruction (CPI). This paper extends the paper published in [1] by (i) considering detailed processor simulation instead of cache simulation and (ii) comparing BLRL with MRRL.
The original paper considered cache simulation and did not provide a comparison with the existing state-of-the-art. The remainder of this paper is organized as follows. The next section gives an introduction to sampled processor simulation, after which we discuss existing warmup strategies in Section 3. Section 4 proposes our new warmup strategy, called BLRL. Section 5 details our experimental setup. In Section 6 we present and discuss our results. Finally, we conclude in Section 7.
In sampled processor simulation, a number of samples are chosen from a complete benchmark execution (Figure 1). The instructions between two samples are part of the pre-sample. Sampled simulation uses only the instructions in the sample to report performance results; instructions in the pre-sample are not considered.
There are basically two issues with sampling. The first issue is the selection of representative samples. The problem is to select samples in such a way that the sampled execution is an accurate picture of the complete execution of the program. As such, it is important not to limit the selection of samples to the initialization phase of the program execution. This is a manifestation of the more general observation that a program goes through various phases of execution and that the sampling should reflect this notion. In other words, samples should be chosen in such a way that all major phases are represented in the sampled execution. Several approaches have been described in the recent literature to select such samples: random sampling by Conte et al. [2], profile-driven sampling by scaling the basic block execution counts by Dubey and Nair [3], by selecting basic blocks with representative context information using the R-metric by Iyengar et al. [4, 5], periodic selection as done in SMARTS [6], selection based on clustering similarly behaving intervals as done by Lafage and Seznec [7] as well as in SimPoint [8, 9, 10].
The second issue is the correct hardware state at the beginning of each sample. This is well known in the literature as the cold-start problem. At the beginning of a sample, the correct hardware state is unknown since the instructions preceding the sample are not simulated during sampled processor simulation. Several techniques have been proposed in the literature to address this important issue (see the next section for a detailed discussion). Most of these use a number of instructions preceding the sample to warm up the hardware state before each sample. Under such a warmup strategy, sampled simulation consists of three steps (Figure 1). The first step is cold simulation in which the program execution is fast-forwarded, i.e. functional simulation without updating microarchitectural state. In case of trace-driven simulation, the instructions under cold simulation can even be discarded from the trace, i.e. need not be stored on disk. The second step is warm simulation which updates the microarchitectural state. This is typically done for large hardware structures such as caches, translation lookaside buffers and branch predictors. Under warm simulation, no performance metrics are calculated. It is important to note that the warm simulation phase can be very long since the microarchitectural state can have an extremely long history. The third step is hot simulation which includes detailed processor simulation while computing performance metrics, e.g. calculating cache and branch predictor miss rates, number of instructions retired per cycle and so on. These three steps are repeated for each sample.
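The three-step loop described above can be sketched as a small planner that splits an instruction stream into cold, warm and hot intervals. This is only an illustration: the function name `plan_phases`, the fixed per-sample warmup length and the half-open `(start, end)` interval representation are assumptions made here, not part of the paper.

```python
from typing import List, Sequence, Tuple

Phase = Tuple[str, int, int]  # (mode, first instruction, one past last instruction)


def plan_phases(sample_starts: Sequence[int], sample_len: int,
                warmup_len: int) -> List[Phase]:
    """Split the instruction stream into cold, warm and hot intervals.

    Assumes sample_starts is sorted and that samples do not overlap.
    """
    phases: List[Phase] = []
    position = 0  # first instruction not yet assigned to a phase
    for start in sample_starts:
        # Warm simulation covers the last warmup_len instructions of the
        # pre-sample, but never reaches back into the previous sample.
        warm_start = max(position, start - warmup_len)
        if warm_start > position:
            phases.append(("cold", position, warm_start))  # fast-forward only
        if start > warm_start:
            phases.append(("warm", warm_start, start))  # update caches/predictors
        phases.append(("hot", start, start + sample_len))  # detailed, measured
        position = start + sample_len
    return phases


if __name__ == "__main__":
    for phase in plan_phases([100, 300], sample_len=10, warmup_len=50):
        print(phase)
```

A per-sample warmup length, as computed by the strategies discussed later, would replace the single `warmup_len` argument.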
Obviously, cold simulation is faster than warm simulation and warm simulation is faster than hot simulation. Austin et al. [11] report simulation speeds for the various simulation tools in the SimpleScalar ToolSet. They report that sim-fast, which corresponds to cold simulation, attains a simulation speed of 7 million instructions per second (MIPS). Warm simulation, which is a combination of sim-bpred and sim-cache, attains approximately 3 MIPS. Hot simulation is the slowest way of simulation with a speed of 0.3 MIPS. Owing to the fact that sampled execution only simulates a small fraction (typically <3%) of the complete program execution in full detail at the cycle level (under hot simulation), the total simulation time under sampled simulation is largely determined by the simulation speed under cold and warm simulation. As such, shortening the total time spent under warm simulation, i.e. trading warm simulation for cold simulation, can yield significant simulation speedups. To clarify this, we first compute the total simulation time as a function of the number of instructions under cold, warm and hot simulation. The total simulation time T_exec for execution-driven simulation is proportional to
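The expression itself is not reproduced in this copy. Assuming that the time spent in each mode is the fraction of instructions divided by the simulation speed quoted above (7, 3 and 0.3 MIPS), a consistent form would be

$$
T_{exec} \propto \frac{f_c}{7} + \frac{f_w}{3} + \frac{f_h}{0.3}
$$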
where f_c, f_w and f_h are the fractions of instructions under cold, warm and hot simulation respectively. In practice, the value for f_h can be >2%; we obtained this number from the SimPoint data (http://www.cs.ucsd.edu/~calder/simpoint) [8, 9, 10]. If we consider the fact that a 1M instruction sample requires 10–20M warm simulation instructions on average to guarantee accurate warmup (as will be demonstrated in this paper), f_w then ranges from 20 to 40%. As a result, shortening the warm simulation fraction f_w by 50% (the average reduction obtained according to our results if an appropriate warmup strategy is chosen) can decrease the total simulation time by 8–15%. For trace-driven simulation, the total simulation time T_trace is proportional to
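The expression is again missing from this copy. Assuming that instructions under cold simulation are discarded from the trace and therefore cost no simulation time, while warm and hot simulation keep the speeds quoted earlier, a consistent form would be

$$
T_{trace} \propto \frac{f_w}{3} + \frac{f_h}{0.3}
$$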
Reducing the warm simulation phase by 50% reduces the total simulation time by 33–50%. Because of these significant simulation time reductions, it is important to study efficient but accurate warmup techniques.
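The effect of a shorter warmup on total simulation time can be explored with a small model. The sketch below is an assumption-laden illustration rather than the authors' calculation: it takes time per mode to be the instruction fraction divided by the quoted speed (7, 3 and 0.3 MIPS), and treats cold instructions as free for trace-driven simulation. The function names are hypothetical.

```python
COLD_MIPS = 7.0   # sim-fast speed quoted in the text
WARM_MIPS = 3.0   # sim-bpred + sim-cache speed quoted in the text
HOT_MIPS = 0.3    # detailed simulation speed quoted in the text


def relative_time(f_warm: float, f_hot: float, trace_driven: bool = False) -> float:
    """Relative simulation time for the given warm and hot fractions.

    Assumption: time per mode is fraction / speed; for trace-driven
    simulation the cold instructions are dropped and cost nothing.
    """
    f_cold = 1.0 - f_warm - f_hot
    if min(f_cold, f_warm, f_hot) < 0.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    cold_cost = 0.0 if trace_driven else f_cold / COLD_MIPS
    return cold_cost + f_warm / WARM_MIPS + f_hot / HOT_MIPS


def reduction_from_halving_warmup(f_warm: float, f_hot: float,
                                  trace_driven: bool = False) -> float:
    """Fraction of simulation time saved when f_warm is cut in half."""
    before = relative_time(f_warm, f_hot, trace_driven)
    after = relative_time(f_warm / 2.0, f_hot, trace_driven)
    return (before - after) / before


if __name__ == "__main__":
    for f_warm in (0.2, 0.4):
        saved = reduction_from_halving_warmup(f_warm, 0.02)
        print(f"f_w = {f_warm:.0%}: execution-driven time reduced by {saved:.1%}")
```

With f_h = 0.02 and f_w = 0.2, this model gives a reduction of roughly 8% for execution-driven simulation, in line with the lower end of the range quoted above. The exact percentages depend on the assumed speeds and fractions, so the model is only indicative.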
This section gives a detailed description of previously proposed warmup strategies.
- The cold or no warmup scheme [12, 13] assumes an empty cache at the beginning of each sample. Obviously, this scheme will overestimate the cache miss rate. However, the bias can be small for large samples.
- Another option is to checkpoint [14] or to store the hardware state at the beginning of each sample and impose this state during sampled simulation. This approach yields a perfectly warmed up hardware state. However, the storage needed to store these checkpoints can explode in case of many samples. In addition, the hardware state needs to be stored for each specific hardware configuration. For example, for each cache and branch predictor configuration a checkpoint needs to be made. Obviously, the latter constraint implies that the complete program execution needs to be simulated for these various hardware structures.
- Stitch [13] approximates the hardware state at the beginning of a sample with the hardware state at the end of the previous sample.
- The prime-xx% method [13] assumes an empty hardware state at the beginning of each sample and uses xx% of each sample to warm up the cache. Actual simulation then starts after these xx% instructions. The warmup scheme prime-50% is also called half in the literature.
- A combination of the two previous approaches was proposed by Conte et al. [2]: the hardware state at the beginning of each sample is the state at the end of the previous sample plus warming up using a fraction of the sample.
- Another approach proposed by Kessler et al. [13, 15] is to assume an empty cache at the beginning of each sample and to estimate which cold-start misses would have missed if the cache state at the beginning of the sample was known.
- Nguyen et al. [16] use W instructions to warm up the cache which is calculated as follows: W = (C/L)/(m · r), where C is the cache capacity, L is the line size, m is the cache miss rate and r is the memory reference ratio. The problem with this approach is that the cache miss rate m is unknown; this is exactly what we are trying to approximate through sampling.
- No-state-loss (NSL) [17] scans the pre-sample and records the latest reference to each unique memory location. These references are subsequently used to warm up the caches. NSL guarantees perfect warmup for caches with least-recently used replacement. This approach was proposed in the context of sampled cache simulation; however, extending this approach to a warmup strategy for sampled processor simulation (with, e.g. branch predictor warmup) is not easily done.
- Minimal subset evaluation (MSE) proposed by Haskins and Skadron [18] determines the warmup length as follows. First, the user specifies the desired probability that the cache state at the beginning of the sample under warmup equals the cache state under perfect warmup. Second, the MSE formulas are used to determine how many unique references are required during warmup. Third, using a memory reference profile of the pre-sample it is calculated where exactly in the pre-sample the warmup should be started in order to cover these unique references.
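Two of the strategies listed above are simple enough to sketch in a few lines: the warmup length formula of Nguyen et al. and the pre-sample scan of NSL. The code is a rough illustration only. The function names, the `(instruction_index, address)` trace representation and the example parameters are assumptions made here, and the NSL part covers only the selection of references, not the cache warmup itself.

```python
from typing import Dict, Iterable, List, Tuple

Reference = Tuple[int, int]  # (instruction index, memory address)


def nguyen_warmup_length(cache_bytes: int, line_bytes: int,
                         miss_rate: float, mem_ref_ratio: float) -> float:
    """W = (C / L) / (m * r): instructions needed to touch every cache line.

    miss_rate (m) must be guessed up front, which is the weakness
    pointed out in the text.
    """
    if miss_rate <= 0.0 or mem_ref_ratio <= 0.0:
        raise ValueError("miss_rate and mem_ref_ratio must be positive")
    return (cache_bytes / line_bytes) / (miss_rate * mem_ref_ratio)


def nsl_warmup_references(pre_sample: Iterable[Reference]) -> List[Reference]:
    """Keep only the latest pre-sample reference to each unique address.

    The result is returned in program order so it can be replayed into
    a cache model before the sample starts.
    """
    latest: Dict[int, int] = {}
    for index, address in pre_sample:
        latest[address] = index  # later references overwrite earlier ones
    return sorted((index, address) for address, index in latest.items())


if __name__ == "__main__":
    # Hypothetical 32 KB cache with 32-byte lines, 5% miss rate,
    # 0.3 memory references per instruction.
    print(round(nguyen_warmup_length(32 * 1024, 32, 0.05, 0.3)))
    print(nsl_warmup_references([(0, 0xA0), (1, 0xB0), (2, 0xA0), (3, 0xC0)]))
```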
The problem with most of these methods, except for NSL and MSE, is that they do not guarantee a (nearly) perfect hardware state at the beginning of each sample. In the following subsection we will discuss one warmup method in more detail that alleviates this problem and is considered as the current state-of-the-art in efficient and accurate warmup strategies, namely MRRL [19]. MRRL is a continuation of the work by Haskins and Skadron on MSE [18]. As stated in the introduction, we will compare our newly proposed BLRL with MRRL in more detail in Section 6. We do not compare BLRL with NSL and MSE because extending NSL to processor simulation is non-trivial and MSE was superseded by MRRL.
Haskins and Skadron [19] propose MRRL for accurately warming up the hardware state at the beginning of each sample. As suggested, MRRL refers to the number of instructions between consecutive references to the same memory location, i.e. the number of instructions between a reference to address A and the next reference to A. For their purpose, they divide the pre-sample–sample pair into N_B non-overlapping buckets each containing L_B contiguous instructions; in other words, the total pre-sample–sample pair consists of N_B · L_B instructions (Figure 2). The buckets receive an index from 0 to N_B − 1 in which index 0 is the first bucket in the pre-sample. The first N_B,P buckets constitute the pre-sample and the remaining N_B,S buckets constitute the sample; obviously, N_B = N_B,P + N_B,S.
The MRRL warmup strategy also maintains N_B counters c_i (0 ≤ i < N_B), which are used to build the histogram of MRRLs. Through profiling, the MRRL is calculated for each reference and the associated counter is updated accordingly. For example, for a bucket size L_B = 10,000 (as used by Haskins and Skadron [19]), an MRRL of 124,534 increments counter c_12. When the complete pre-sample–sample pair has been profiled, the MRRL histogram p_i, 0 ≤ i < N_B, is computed by dividing each bucket counter by the total number of references counted in the pre-sample–sample pair, i.e. p_i = c_i / (\sum_{j=0}^{N_B − 1} c_j). As such, p_i = Prob[i · L_B ≤ MRRL ≤ (i + 1) · L_B − 1]. Not surprisingly, the largest values of p_i are observed for small values of i, owing to the temporal locality of computer program address streams. Using the histogram p_i, Haskins and Skadron calculate the bucket corresponding to a given percentile K%, i.e. the bucket k for which \sum_{m=0}^{k−1} p_m < K% and \sum_{m=0}^k p_m ≥ K%. This means that of all the references in the current pre-sample–sample pair, K% have a reuse latency that is smaller than k · L_B. As such, Haskins and Skadron define these k buckets as their warmup buckets. In other words, warm simulation is started k · L_B instructions before the sample.
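The MRRL profiling step described above can be sketched as follows. This is a minimal illustration, not the MRRL software itself: the trace is assumed to be a list of `(instruction_index, address)` pairs covering one pre-sample–sample pair, the function names are hypothetical, and the warmup length follows the k · L_B rule as stated in the text.

```python
from typing import Dict, List, Sequence, Tuple

Reference = Tuple[int, int]  # (instruction index within the pair, memory address)


def percentile_bucket(counts: Sequence[int], percentile: float) -> int:
    """Smallest k whose cumulative count reaches the given percentile (0..1)."""
    total = sum(counts)
    if total == 0:
        return 0
    threshold = percentile * total
    cumulative = 0
    for k, count in enumerate(counts):
        cumulative += count
        if cumulative >= threshold:
            return k
    return len(counts) - 1


def mrrl_warmup_length(trace: Sequence[Reference], num_buckets: int,
                       bucket_len: int, percentile: float) -> int:
    """Warmup length in instructions for one pre-sample-sample pair."""
    counts: List[int] = [0] * num_buckets
    last_seen: Dict[int, int] = {}
    for index, address in trace:
        if address in last_seen:
            latency = index - last_seen[address]  # reuse latency in instructions
            # e.g. latency 124,534 with bucket_len 10,000 lands in bucket 12
            counts[min(latency // bucket_len, num_buckets - 1)] += 1
        last_seen[address] = index
    k = percentile_bucket(counts, percentile)
    # The text starts warm simulation k * L_B instructions before the sample.
    return k * bucket_len


if __name__ == "__main__":
    toy_trace = [(0, 1), (5, 1), (10, 2), (25, 2), (30, 1)]
    print(mrrl_warmup_length(toy_trace, num_buckets=4, bucket_len=10, percentile=0.9))
```

Note that the histogram mixes reuse latencies from the pre-sample and the sample, which is the property the next paragraphs identify as a weakness.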
An important disadvantage of MRRL is that a mismatch between the MRRL behavior in the pre-sample and in the sample can result in a suboptimal warmup strategy, in which the warmup is either too short to be accurate or too long for the attained level of accuracy. For example, if the reuse latencies are generally larger in the sample than in the pre-sample–sample pair, the warmup will be too short and, as a consequence, the accuracy might be poor. Conversely, if reuse latencies are generally shorter in the sample than in the pre-sample–sample pair, the warmup will be too long for the attained level of accuracy. One way of solving this problem is to choose a large enough percentile K%. The result is that the warmup will be longer than needed for the attained accuracy.
BLRL is quite different from MRRL although it is also based on reuse latencies. In BLRL, the sample is scanned for reuse latencies that cross the pre-sample–sample boundary line, i.e. a memory location is referenced in the pre-sample and the next reference to the same memory location is in the sample. For each of these cross boundary line reuse latencies, the pre-sample reuse latency is calculated. This is done by subtracting the distance in the sample from the MRRL. For example, if instruction i has a cross boundary line reuse latency x, the pre-sample reuse latency then is x − (i − N_B,P · L_B) (Figure 3). A histogram is built up using these pre-sample reuse latencies. As is the case for MRRL, BLRL uses N_B,P buckets of size L_B to limit the size of the histogram. This histogram is then normalized to the number of reuse latencies crossing the pre-sample–sample boundary line. The required warmup length is then computed to include a given percentile K% of all reuse latencies that cross the pre-sample–sample boundary line.
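The BLRL computation described above can be sketched in the same style. This is an illustrative reading of the text, not the authors' implementation: the trace format, the function name and the choice to round the warmup up to whole buckets (so that the selected bucket is fully covered) are assumptions made here.

```python
from typing import Dict, List, Sequence, Tuple

Reference = Tuple[int, int]  # (instruction index within the pair, memory address)


def blrl_warmup_length(trace: Sequence[Reference], pre_sample_buckets: int,
                       bucket_len: int, percentile: float) -> int:
    """Warmup length (instructions before the sample) under BLRL."""
    boundary = pre_sample_buckets * bucket_len  # first instruction of the sample
    counts: List[int] = [0] * pre_sample_buckets
    last_seen: Dict[int, int] = {}
    for index, address in trace:
        previous = last_seen.get(address)
        # Only reuses that start in the pre-sample and end in the sample count.
        if index >= boundary and previous is not None and previous < boundary:
            reuse_latency = index - previous
            # Pre-sample reuse latency: x - (i - N_B,P * L_B)
            pre_sample_latency = reuse_latency - (index - boundary)
            bucket = min(pre_sample_latency // bucket_len, pre_sample_buckets - 1)
            counts[bucket] += 1
        last_seen[address] = index

    total = sum(counts)
    if total == 0:
        return 0  # nothing crosses the boundary, so no warmup is needed
    threshold = percentile * total
    cumulative = 0
    for k, count in enumerate(counts):
        cumulative += count
        if cumulative >= threshold:
            # Assumption: cover bucket k completely, capped at the pre-sample.
            return min((k + 1) * bucket_len, boundary)
    return boundary


if __name__ == "__main__":
    toy_trace = [(2, 1), (12, 2), (28, 3), (31, 3), (33, 2), (35, 1), (36, 3)]
    print(blrl_warmup_length(toy_trace, pre_sample_buckets=3, bucket_len=10,
                             percentile=0.6))
```

In the toy trace, the second reference to address 3 inside the sample (instruction 36) is ignored because its previous reference is already in the sample.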
There are three key differences between BLRL and MRRL. First, BLRL considers reuse latencies for memory references originating from instructions in the sample only whereas MRRL considers reuse latencies for memory references originating from instructions in both the pre-sample and sample. Second, BLRL considers only reuse latencies that cross the pre-sample–sample boundary line; MRRL considers all reuse latencies. Third, in contrast to MRRL which uses the reuse latency to update the histogram, BLRL uses the pre-sample reuse latency.
We expect BLRL to be highly accurate and efficient since it tracks the individual cross boundary line reuse latencies. These cross boundary reuse latencies in fact point to the memory locations that need to be warmed up. There is, however, one potential scenario in which BLRL will attain poor performance. Consider the case that the number of cross boundary line reuse latencies is relatively small compared with the size of the sample and that these reuse latencies have a very long pre-sample reuse latency. This will result in a long warmup; however, it will not contribute to the attained accuracy since the number of cross boundary line reuse latencies is small. As such, the warmup will be too long for the given level of accuracy. However, we expect this scenario to be rare. This is supported by the experimental results from Section 6 which show that BLRL is both more accurate and leads to shorter warmup than MRRL.
For the evaluation we use 10 SPEC CPU2000 integer benchmarks (http://www.spec.org; Table 1). The binaries, which were compiled and optimized for the Alpha 21264 processor, are taken from the SimpleScalar website (http://www.simplescalar.com). All measurements presented in this paper are obtained using the MRRL software (http://www.cs.virginia.edu/~jwh6q/mrrl-web/), which in turn is based on the SimpleScalar software [20]. The baseline processor simulation model is given in Table 2.
In this paper we consider a sample size of 1M instructions. This sample size is in the range of sample sizes that are likely to benefit the most from efficient warmup strategies. Larger sample sizes, e.g. 100M instruction samples, do not need warmup. No warmup, i.e. only cold simulation during the pre-sample, is sufficient to faithfully estimate the performance for 100M instruction samples. Smaller sample sizes on the other hand, e.g. 1000 and 10,000 instruction samples as used in SMARTS [6], require thousands of samples to obtain accurate performance predictions. In such sampling scenarios, the pre-sample sizes are generally smaller than the observed reuse latencies. An example scenario for SMARTS uses 3000 periodically chosen 1000 instruction samples from a 100B instruction program execution. As such, the pre-sample size is ∼30,000 instructions on average. The reuse latencies that we observe in this study often exceed 30,000 instructions. As such, full warmup simulation of caches and branch predictors during each pre-sample, as is done in SMARTS [6], is a practical solution for small sample sizes. Shortening this warmup could help, but the benefit of doing it is probably limited. Note that a 1M instruction sample is also the one chosen in [19] for evaluating MRRL.
We consider 50 samples (each containing 1M instructions). We select a sample for every 100M instructions. These samples were taken from the beginning of the program execution to limit the simulation time while evaluating the various warmup strategies with varying percentiles K%. Taking samples deeper down the program execution would have been too time-consuming given the large fast forwarding needed. However, we believe this does not affect the conclusions from this paper, since the warmup strategies that are evaluated in this paper can be applied to any collection of samples. Once a set of samples is provided, either warmup strategy can be applied to it.
We quantify the performance of a warmup strategy using two metrics: accuracy and warmup length. The warmup length is defined as the number of instructions under warm simulation. The accuracy is quantified as follows. We first measure the CPI for each sample under full warmup, i.e. by assuming warm simulation during the complete pre-sample. We then compute the CPI for each sample under a given warmup strategy. Using these two CPI numbers we compute the CPI prediction error on a per-sample basis. This is done as follows:
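The formula itself is not reproduced in this copy. One definition consistent with the surrounding text, assuming the error is taken as an absolute relative difference with respect to full warmup, would be

$$
\text{CPI prediction error} = \frac{\left| CPI_{short} - CPI_{full} \right|}{CPI_{full}}
$$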
where CPI_short and CPI_full are the CPI under shortened and full warmup respectively. As such, we obtain 50 CPI prediction errors. We subsequently compute the average per-sample CPI prediction error, μ_error. We use average per-sample CPI prediction errors instead of aggregate CPI prediction errors (obtained by comparing the overall CPI under shortened warmup versus full warmup) because the latter approach might hide inaccuracies in particular samples behind the aggregate CPI numbers. For example, a positive CPI prediction error in one sample can be compensated for by a negative CPI prediction error in another sample. Using the average per-sample CPI prediction error, μ_error, alleviates this problem.
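The difference between the per-sample and the aggregate error can be made concrete with a small example. The sketch assumes the per-sample error is an absolute relative difference and that all samples have the same length, so the aggregate CPI is the mean of the per-sample CPIs. The function names and the CPI values are hypothetical.

```python
from typing import Sequence


def mean_per_sample_error(cpi_short: Sequence[float],
                          cpi_full: Sequence[float]) -> float:
    """Average of |CPI_short - CPI_full| / CPI_full over all samples."""
    errors = [abs(s - f) / f for s, f in zip(cpi_short, cpi_full)]
    return sum(errors) / len(errors)


def aggregate_error(cpi_short: Sequence[float],
                    cpi_full: Sequence[float]) -> float:
    """Relative error of the overall CPI, assuming equally sized samples."""
    overall_short = sum(cpi_short) / len(cpi_short)
    overall_full = sum(cpi_full) / len(cpi_full)
    return abs(overall_short - overall_full) / overall_full


if __name__ == "__main__":
    # One sample overestimates CPI by 10%, the other underestimates it by 10%.
    short = [1.1, 0.9]
    full = [1.0, 1.0]
    print(f"per-sample: {mean_per_sample_error(short, full):.3f}")
    print(f"aggregate:  {aggregate_error(short, full):.3f}")
```

In this example the two per-sample errors cancel in the aggregate figure, while the per-sample average still reports an error of about 10%.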
In its rightmost column, Table 1 shows μ_error under the no-warmup strategy, i.e. no warm simulation during the pre-sample. These data show that warmup is clearly needed to address the cold-start problem. Note that this error comes on top of the sampling error. Indeed, the overall CPI error is (theoretically) the sum of the sampling error and the error due to inaccurate warmup. Perelman et al. [9] report average CPI sampling errors ranging from 2 to 4%. These errors are due to sampling inaccuracies only, since they assume perfect warmup in their experiments. As such, the additional error due to the cold-start problem should be small enough not to increase the overall CPI error too much.
In addition to benchmark-specific information, we also report averages over all benchmarks. More specifically, we average the warmup length and the average per-sample CPI prediction error, μ_error, over all benchmarks. We use the arithmetic average for the following reasons. For the warmup length, the arithmetic average is directly proportional to the total simulation time spent in warmup when simulating the complete benchmark suite. For the CPI prediction error, μ_error, the arithmetic average penalizes large inaccuracies more than the geometric average would. For example, if one particular benchmark has a larger error than the other benchmarks, the arithmetic average error will be larger than the geometric average error. This makes sense for our purpose since we want the prediction errors for all benchmarks to be low.
Besides benchmark-specific information, we also give averages over all benchmarks. More specifically, we average the warmup length and the per-sample CPI prediction error, μ_error, over all benchmarks. This is done with the arithmetic mean, for the following reasons. For the warmup length, the arithmetic mean is proportional to the total time spent on warmup when simulating the complete benchmark suite. For the CPI prediction error, μ_error, the arithmetic mean penalizes large inaccuracies more heavily than the geometric mean. For example, if one particular benchmark has a larger error than the others, the arithmetic mean error will be larger than the geometric mean error. This suits our purpose, because we want the prediction error to be low for every benchmark.
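A minimal sketch of why the arithmetic mean is the stricter choice. The error values are hypothetical and the helper names are illustrative. By the AM-GM inequality the arithmetic mean of positive values is never below the geometric mean, and the gap widens when one value is an outlier.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Computed via logarithms; requires strictly positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark mu_error values (in percent); one benchmark is an outlier.
errors = [0.2, 0.2, 0.2, 3.0]

am = arithmetic_mean(errors)
gm = geometric_mean(errors)
print(f"arithmetic mean = {am:.3f}%, geometric mean = {gm:.3f}%")

# For warmup length, the arithmetic mean times the number of benchmarks
# equals the total warmup work for the suite (hypothetical lengths below).
warmup_lengths = [100, 250, 400]  # millions of instructions
total = arithmetic_mean(warmup_lengths) * len(warmup_lengths)
```

Here the single outlier pulls the arithmetic mean well above the geometric mean, so a suite-wide average computed this way cannot look good while one benchmark is badly mispredicted.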
Table 3 shows the results of comparing BLRL and MRRL for different values of the percentile K%. For BLRL, we use K = 85%, K = 90% and K = 95%; for MRRL, we use K = 99.5% and 99.9%. To compare the performance of a warmup strategy, we take both the warmup length and the CPI prediction error into account. We observe that BLRL performs significantly better than MRRL on average. For example, compare BLRL-90% with MRRL-99.9%: BLRL-90% attains a higher accuracy than MRRL-99.9% (μ_error = 0.30% versus 0.43%, respectively) with a shorter warmup length (589M versus 896M instructions, respectively). In other words, the error is reduced by 30% while the warmup is 34% shorter. Likewise, when comparing BLRL-85% with MRRL-99.9%, we observe that BLRL attains the same accuracy as MRRL with a warmup length that is 49% shorter (453M versus 896M instructions, respectively). Note that although the error rate reductions seem significant in relative terms, they are not that significant in absolute terms. Haskins and Skadron [19] showed, based on statistical tests, that the per-sample CPI numbers obtained through MRRL are not statistically significantly different from the per-sample CPI numbers obtained through full warmup. The accuracy improvement demonstrated here for BLRL therefore has to be seen within this margin of error. We conclude that BLRL achieves a level of accuracy similar to MRRL, but with a significantly shorter warmup length: the warmup length under BLRL is nearly half the warmup length under MRRL.
Table 3 presents the comparison of BLRL and MRRL, carried out for different values of the percentile K%. For BLRL we use K = 85%, K = 90% and K = 95%; for MRRL we use K = 99.5% and 99.9%. To compare warmup strategies, we consider both warmup length and CPI prediction error. We observe that BLRL performs considerably better than MRRL. For example, comparing BLRL-90% with MRRL-99.9%: BLRL-90% is more accurate than MRRL-99.9% (μ_error of 0.30% and 0.43%, respectively) and has a shorter warmup (589M and 896M instructions, respectively). In other words, the error drops by 30% and the warmup length drops by 34%. Comparing BLRL-85% with MRRL-99.9%, we observe that BLRL is about as accurate as MRRL with a warmup that is 49% shorter (453M and 896M, respectively). Note that although the reduction in error rate is notable in relative terms, it is less so in absolute terms. Based on statistical tests, [19] showed that the per-sample CPI values obtained with MRRL are statistically indistinguishable from those obtained with full warmup. The accuracy improvement shown here for BLRL therefore lies within this margin of error. We conclude that BLRL reaches an accuracy similar to MRRL with a much shorter warmup, roughly half as long.
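The relative reductions quoted above follow directly from the Table 3 averages cited in the text. A quick check, with a hypothetical helper name:

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# Averages quoted in the text (mu_error in percent, warmup in millions of instructions).
error_reduction = relative_reduction(0.43, 0.30)      # BLRL-90% vs MRRL-99.9%
warmup_reduction_90 = relative_reduction(896, 589)    # BLRL-90% vs MRRL-99.9%
warmup_reduction_85 = relative_reduction(896, 453)    # BLRL-85% vs MRRL-99.9%

print(f"error reduction:            {error_reduction:.1%}")
print(f"warmup reduction (K = 90%): {warmup_reduction_90:.1%}")
print(f"warmup reduction (K = 85%): {warmup_reduction_85:.1%}")
```

The three values round to the 30%, 34% and 49% stated in the paragraph. The absolute error gap, by contrast, is only 0.13 percentage points, which is why the text stresses the warmup saving rather than the accuracy gain.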
Table 4 compares BLRL with a fixed percentile K_BLRL = 90% versus MRRL with a variable percentile K_MRRL% for 1M-instruction samples on a per-benchmark basis. The motivation for this analysis is to quantify how much shorter the warmup is for BLRL than for MRRL at the same level of accuracy. For this table, we ran a large number of experiments with varying percentiles K_MRRL% for MRRL. For each benchmark, a different percentile K_MRRL% is chosen so that the CPI prediction error for MRRL is close to that for BLRL. In its two rightmost columns, Table 4 presents the reduction in CPI prediction error (in percentage points) and in warmup length (in millions of instructions). Positive values indicate that BLRL yields better accuracy and shorter warmup than MRRL. These data show that for 7 out of the 10 benchmarks, BLRL attains smaller errors and shorter warmup than MRRL. For four benchmarks, the reduction in warmup length over MRRL is very large: bzip2 (39%), gcc (59%), twolf (59%) and mcf (93%). For one benchmark, namely vpr, MRRL attains a smaller error and a shorter warmup than BLRL. For two benchmarks (crafty and eon), BLRL and MRRL are comparable.
Table 4 compares, per benchmark and for 1M-instruction samples, BLRL at a fixed percentile K = 90% with MRRL at a varying K. The purpose of this analysis is to determine how much shorter the BLRL warmup is when its accuracy is similar to that of MRRL. For this table we ran many experiments with different values of K% for MRRL. For each benchmark a different MRRL K% is chosen so that the CPI error of MRRL is close to that of BLRL. The two rightmost columns of Table 4 give the reduction in CPI prediction error (in percentage points) and in warmup length (in millions of instructions). Positive values mean that BLRL is more accurate and needs less warmup than MRRL. The data show that for 7 of the 10 benchmarks BLRL has a smaller error and a shorter warmup than MRRL. For 4 benchmarks the warmup reduction of BLRL over MRRL is large: bzip2 (39%), gcc (59%), twolf (59%) and mcf (93%). For one benchmark, vpr, MRRL has a smaller error and a shorter warmup than BLRL. For two benchmarks (crafty and eon), BLRL and MRRL are similar.
Another way of looking at the performance of warmup strategies is to plot the average CPI prediction error, μ_error, versus warmup length. Figure 4 shows such a graph for four benchmarks: twolf, gcc, bzip2 and vpr. The different points on each curve correspond to different values of the percentiles K_BLRL% and K_MRRL%. As expected, increasing percentiles K% correspond to increasing warmup lengths and decreasing CPI prediction errors. This graph shows that for twolf, gcc and bzip2, BLRL is significantly better than MRRL. We observed similar graphs for most of the other benchmarks. For vpr, on the other hand, MRRL seems to outperform BLRL.
Another way to look at the performance of warmup strategies is to plot the average CPI prediction error, μ_error, against warmup length. Figure 4 does this for 4 benchmarks: twolf, gcc, bzip2 and vpr. The points on each curve correspond to different values of the percentiles K_BLRL% and K_MRRL%. Clearly, a larger percentile K% means a longer warmup and a lower CPI prediction error. The figure shows that for twolf, gcc and bzip2, BLRL is clearly better than MRRL. We observed similar graphs for most of the other benchmarks. On vpr, MRRL appears to do better than BLRL.
In order to understand where the (small) CPI prediction errors come from, we have performed an error analysis. For this purpose we have collected various metrics under BLRL-85% and MRRL-99.9%: the L1 I-cache miss rate, the L1 D-cache miss rate, the unified L2-cache miss rate and the branch misprediction rate (Table 5). The error rates shown in this table are absolute error rates which are computed as follows:
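To understand where the (small) CPI prediction errors come from, we carried out an error analysis. For this we collected several metrics under BLRL-85% and MRRL-99.9%: the L1 I-cache miss rate, the L1 D-cache miss rate, the unified L2-cache miss rate and the branch misprediction rate (Table 5). The error rates given in this table are absolute errors, computed as follows: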
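Consistent with the description above and the symbol definitions that follow, the absolute error is presumably the magnitude of the difference between the two measurements (the name absolute_error is an assumption):

$$\text{absolute\_error} = \left| M_{\text{short}} - M_{\text{full}} \right|$$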
where M_short and M_full denote a metric M under shortened and full warmup, respectively. We use this absolute error rate instead of the relative error rate since the miss rates are small numbers; a relative error rate would magnify small differences between small numbers without conveying anything useful. We conclude from Table 5 that the error rates for the L1 I and D caches are zero in nearly all cases. For the branch misprediction rates we observe higher error rates, but these errors are still below 0.11%. The highest error rates are observed for the L2-cache miss rates. For example, for bzip2 the error rates are 2.94% and 2.28% for MRRL and BLRL, respectively; for vortex the error rates are 0.86% and 0.84% for MRRL and BLRL, respectively. Note that the higher L2-cache error rates for these two benchmarks result in higher error rates in overall CPI (Table 3).
Here M_short and M_full denote the metric M under shortened and full warmup, respectively. We use the absolute error rate rather than the relative error rate because miss rates are small numbers, and a relative error rate would amplify small differences without practical meaning. From Table 5 we conclude that the error rates for the L1 I and D caches are zero in almost all cases. For the branch misprediction rate we observe higher error rates, but these are still below 0.11%. The highest error rates occur for the L2-cache miss rates. For example, for bzip2 the error rates for MRRL and BLRL are 2.94% and 2.28%, respectively; for vortex they are 0.86% and 0.84%, respectively. Note that for these two benchmarks the higher L2-cache error rates also lead to higher error rates in overall CPI (Table 3).
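The argument for absolute rather than relative errors can be made concrete with a small sketch. The miss rates below are hypothetical, and the function names are illustrative.

```python
def absolute_error(m_short, m_full):
    """Difference in percentage points between shortened and full warmup."""
    return abs(m_short - m_full)


def relative_error(m_short, m_full):
    """Same difference, expressed as a fraction of the full-warmup value."""
    return abs(m_short - m_full) / m_full


# Hypothetical L1 miss rates, in percent of accesses.
m_full = 0.02
m_short = 0.03

abs_err = absolute_error(m_short, m_full)
rel_err = relative_error(m_short, m_full)
print(f"absolute error: {abs_err:.2f} percentage points")
print(f"relative error: {rel_err:.0%}")
```

A gap of one hundredth of a percentage point is negligible for performance, yet the relative error labels it as a deviation of about 50%. The absolute figure tracks the actual effect on the simulated behavior more closely.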
Architectural simulation is an essential tool for microarchitectural research to obtain insight into the cycle-level behavior of current microprocessors. Unfortunately, these architectural simulations are extremely time-consuming, especially if industry-standard benchmarks need to be simulated to completion. Sampled simulation is an often-used solution to drastically reduce the total simulation time. In sampled simulation, a well-chosen set of samples is selected such that they give an accurate picture of the complete benchmark execution.
Architectural simulation is essential to microarchitecture research, as it provides insight into the cycle-level behavior of microprocessors. Unfortunately, these simulations are very time-consuming, especially when industry-standard benchmarks have to be simulated to completion. Sampled simulation is a commonly used solution that greatly reduces simulation time. In sampled simulation, a well-chosen set of samples represents the behavior of the complete benchmark execution accurately.
An important problem with sampling, however, is the unknown hardware state at the beginning of each sample. To accurately estimate this hardware state, researchers have proposed various warmup strategies. Warmup simulates additional instructions from the pre-sample without computing performance metrics; this is particularly useful for large hardware structures such as caches and branch predictors. Since warm simulation has a significant impact on the overall sampled simulation time, it is important to study efficient but accurate warmup strategies. In this paper we proposed BLRL, which uses reuse latencies (between memory references to the same memory location) that cross the boundary line between the pre-sample and the sample. BLRL uses a percentile (e.g., 90%) of these reuse latencies to calculate the warmup length per sample. This paper also compared BLRL with the previously proposed MRRL. Our experimental results using SPEC CPU2000 and detailed processor simulation showed that BLRL outperforms MRRL significantly. BLRL achieves a warmup that is nearly half the size, on average, of the warmup under MRRL for the same level of accuracy.
An important problem with sampling, however, is the unknown hardware state at the start of each sample. To estimate this state accurately, researchers have proposed various warmup strategies. Warmup simulates instructions from the pre-sample without computing performance metrics, which is especially useful for large hardware structures such as caches and branch predictors. Because warm simulation has a significant impact on total sampled simulation time, studying efficient yet accurate warmup strategies is important. In this paper we proposed BLRL, which uses the reuse latencies (between memory references to the same memory location) that cross the boundary line between pre-sample and sample. BLRL uses a percentile (e.g., 90%) of these reuse latencies to compute the warmup length of each sample. The paper also compared BLRL with the previously proposed MRRL. Our experimental results with SPEC CPU2000 and detailed processor simulation show that BLRL clearly outperforms MRRL. At an accuracy similar to MRRL, BLRL shortens the warmup by nearly half.
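The idea summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the trace format (pairs of instruction index and address), the function name, and the choice to measure each boundary-crossing reuse by its distance back from the boundary are all assumptions made for this example.

```python
import math


def blrl_warmup_length(refs, boundary, k=0.90):
    """Estimate the warmup length (in instructions) for one sample.

    refs:     (instruction_index, address) pairs in execution order,
              covering the pre-sample and the sample (hypothetical format).
    boundary: instruction index at which the sample starts.
    k:        fraction of boundary-crossing reuses the warmup must cover.
    """
    last_seen = {}   # address -> index of its most recent reference
    distances = []   # how far before the boundary each crossing reuse reaches
    for index, address in refs:
        previous = last_seen.get(address)
        # A reuse crosses the boundary when the earlier reference lies in
        # the pre-sample and the current one lies in the sample.
        if index >= boundary and previous is not None and previous < boundary:
            distances.append(boundary - previous)
        last_seen[address] = index
    if not distances:
        return 0
    distances.sort()
    # Smallest warmup that covers at least a fraction k of the crossing reuses.
    rank = max(1, math.ceil(k * len(distances)))
    return distances[rank - 1]


# Tiny hypothetical trace; the sample starts at instruction 100.
refs = [(10, "A"), (60, "B"), (95, "C"), (99, "D"),
        (100, "D"), (101, "C"), (102, "B"), (103, "A"), (104, "D"), (105, "E")]

print(blrl_warmup_length(refs, boundary=100, k=0.75))  # 40
print(blrl_warmup_length(refs, boundary=100, k=1.00))  # 90
```

In this trace the crossing reuses reach 1, 5, 40 and 90 instructions back from the boundary. Covering 75% of them needs 40 instructions of warmup, while covering all of them needs 90, which shows how a lower percentile trades a little accuracy for a much shorter warmup.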