Flush cache #246
base: master
Conversation
…ar one where the memory clock would always be seen as not-equal due to a rounding error
…goes to framework time instead of benchmark time
…goes to framework time instead of benchmark time
…ck on gpu-architecture compiler option, added gpu-architecture auto-adding to CuPy
…d tests for this function, removed setting --gpu-architecture for CuPy as it is already set internally
Added a few (small) suggestions.
@@ -85,7 +85,7 @@ Steps without :bash:`sudo` access (e.g. on a cluster):
- /path/to/directory
* [Optional] both Mamba and Miniconda can be automatically activated via :bash:`~/.bashrc`. Do not forget to add these (usually provided at the end of the installation).
* Exit the shell and re-enter to make sure Conda is available. :bash:`cd` to the kernel tuner directory.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linux, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
def allocate_ndarray(self, array):
    return hip.hipMalloc(array.nbytes)
Don't you need to store the allocated memory?
def allocate_ndarray(self, array):
    return hip.hipMalloc(array.nbytes)
def allocate_ndarray(self, array):
    alloc = hip.hipMalloc(array.nbytes)
    self.allocations.append(alloc)
    return alloc
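The suggestion above can be sketched as a small stand-alone class. This is a hypothetical simplification, not the PR's actual backend: the allocator callable stands in for `hip.hipMalloc`, and the method takes a byte count rather than an ndarray.

```python
class AllocationTracker:
    """Minimal sketch of the reviewer's suggestion: record every device
    allocation so it can be freed later instead of leaking the handle."""

    def __init__(self, malloc):
        self.malloc = malloc        # backend allocator, e.g. hip.hipMalloc
        self.allocations = []       # every handle we have given out

    def allocate(self, nbytes):
        alloc = self.malloc(nbytes)     # allocate device memory
        self.allocations.append(alloc)  # keep a reference for cleanup
        return alloc

    def free_all(self, free):
        # release everything we allocated, e.g. with hip.hipFree
        for alloc in self.allocations:
            free(alloc)
        self.allocations.clear()
```

Keeping the handles in a list is what makes a later `free_all` (or a destructor) possible; without it, the pointer returned by the backend is lost as soon as the caller drops it.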
# get the number of registers per thread used in this kernel
num_regs = cuda.cuFuncGetAttribute(cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, self.func)
assert num_regs[0] == 0, f"Retrieving number of registers per thread unsuccessful: code {num_regs[0]}"
Would it make sense to move this code to a helper function?
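Such a helper could look like the following sketch. It is written generically over the query callable; the assumption (matching the assert in the snippet above) is that the real call, `cuda.cuFuncGetAttribute` from NVIDIA's cuda-python bindings, returns an `(error_code, value)` pair.

```python
def get_func_attribute(query, attribute, func):
    """Query one kernel attribute and raise on a nonzero error code.

    Sketch of a possible helper: `query` is assumed to behave like
    cuda.cuFuncGetAttribute, returning an (error_code, value) tuple.
    """
    err, value = query(attribute, func)
    if err != 0:
        raise RuntimeError(f"retrieving kernel attribute failed: code {err}")
    return value
```

Centralizing the error check this way replaces the repeated assert-on-tuple pattern with a single place that turns CUDA error codes into exceptions.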
def benchmark_default(self, func, gpu_args, threads, grid, result):
    """Benchmark one kernel execution at a time"""
def flush_cache(self):
    """This special function can be called to flush the L2 cache."""
I would suggest changing the comment to:
"""This special function can be called to flush the L2 cache."""
"""Flush the L2 cache by overwriting it with zeros."""
I am surprised that this works at all, I thought that memset just touched the device memory.
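The host-side setup for this flush strategy can be sketched as follows: allocate a zero buffer whose size equals the L2 capacity, so that writing it to device memory overwrites (at least) the full cache. The `cache_size_L2` value here is a made-up placeholder; on a real device it would come from the device properties.

```python
import numpy as np

# Hypothetical L2 size for illustration; in the PR this comes from the
# device properties (self.dev.cache_size_L2).
cache_size_L2 = 4 * 1024 * 1024
flush_type = np.uint8

# One element per byte of L2, so the array covers the whole cache capacity.
size = cache_size_L2 // flush_type(0).itemsize
flush_array = np.zeros(size, order='F', dtype=flush_type)
```

Copying (or memsetting) a buffer of this size to the device forces the cache to be filled with the buffer's contents, evicting whatever the previous kernel run left behind.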
@@ -577,9 +595,12 @@ def compile_and_benchmark(self, kernel_source, gpu_args, params, kernel_options,

# benchmark
if func:
    # setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
# setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
# Setting the NVML parameters takes a non-negligible amount of time. By setting them
# here, this time is added to the framework time rather than to benchmark time.
@@ -0,0 +1,16 @@
from kernel_tuner.observers.observer import BenchmarkObserver

class RegisterObserver(BenchmarkObserver):
I like this new observer, but adding it seems outside the scope of this PR which is about flushing the L2 cache.
    highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
    return subset_cc[highest_cc_index]
# if all else fails, return the default 52
return '52'
return '52'
return valid_cc[0]
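The selection logic under discussion, with the suggested fallback, can be sketched as a simplified stand-alone function. This is an illustrative simplification of the PR's code: compute capabilities are assumed to be plain strings like `'70'`, whereas the snippet above works on tuples.

```python
def select_gpu_architecture(valid_cc, compute_capability):
    """Pick the highest supported architecture not exceeding the device's.

    Simplified sketch: valid_cc is assumed to be a list of
    compute-capability strings, e.g. ['52', '60', '70', '80'].
    """
    candidates = [cc for cc in valid_cc if int(cc) <= int(compute_capability)]
    if candidates:
        return max(candidates, key=int)
    # fall back to the oldest supported architecture, as suggested,
    # rather than a hard-coded default
    return valid_cc[0]
```

Returning `valid_cc[0]` instead of a hard-coded `'52'` keeps the fallback correct even if the list of supported architectures changes.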
…nted by CuPy, and attempt free of previous allocation after checking if flush is possible
… added interfacing for flushing L2 and recopying arguments
Quality Gate passed
@@ -367,7 +369,7 @@ def benchmark_default(self, func, gpu_args, threads, grid, result, flush_cache=T
]

self.dev.synchronize()
for _ in range(self.iterations):
for i in range(self.iterations):
`i` doesn't seem to be used below. The for-loop on line 377 even defines its own `i`.
self.flush_array = np.zeros((self.dev.cache_size_L2 // t(0).itemsize), order='F').astype(t)
self.flush_type = np.uint8
size = (self.dev.cache_size_L2 // self.flush_type(0).itemsize)
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
@@ -47,7 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
self.devprops = dev.attributes
self.cc = dev.compute_capability
self.max_threads = self.devprops["MaxThreadsPerBlock"]
Also cast this to `int` for consistency?
This pull request adds the ability to flush the L2 cache between iterations on the GPU backends.
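The overall shape of the change can be sketched as a benchmarking loop with an optional flush step between iterations. This is a hypothetical outline, not the PR's actual backend code: `run_kernel` and `flush_cache` stand in for the backend's kernel launch and L2-flush routines.

```python
def benchmark_with_flush(run_kernel, flush_cache, iterations=7, flush=True):
    """Sketch of benchmarking with an optional L2 flush between iterations.

    run_kernel is assumed to execute the kernel once and return its time;
    flush_cache overwrites the L2 cache so each run starts cold.
    """
    times = []
    for _ in range(iterations):
        if flush:
            flush_cache()  # evict the previous run's data from L2
        times.append(run_kernel())
    return times
```

Flushing between iterations makes each measured run start from a cold cache, so timings are not flattered by data the previous iteration left resident in L2.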