Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush cache #246

Draft
wants to merge 30 commits into
base: master
Choose a base branch
from
Draft

Flush cache #246

wants to merge 30 commits into from

Conversation

fjwillemsen
Copy link
Collaborator

This pull request adds the ability to flush the L2 cache between iterations on the GPU backends.

…ar one where the memory clock would always be seen as not-equal due to a rounding error
…goes to framework time instead of benchmark time
…goes to framework time instead of benchmark time
…ck on gpu-architecture compiler option, added gpu-architecture auto-adding to CuPy
…d tests for this function, removed setting --gpu-architecture for CuPy as it is already set internally
Copy link
Collaborator

@csbnw csbnw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few (small) suggestions.

@@ -85,7 +85,7 @@ Steps without :bash:`sudo` access (e.g. on a cluster):
- /path/to/directory
* [Optional] both Mamba and Miniconda can be automatically activated via :bash:`~/.bashrc`. Do not forget to add these (usually provided at the end of the installation).
* Exit the shell and re-enter to make sure Conda is available. :bash:`cd` to the kernel tuner directory.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linux, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.

Comment on lines +89 to +90
def allocate_ndarray(self, array):
return hip.hipMalloc(array.nbytes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to store the allocated memory?

Suggested change
def allocate_ndarray(self, array):
return hip.hipMalloc(array.nbytes)
def allocate_ndarray(self, array):
alloc = hip.hipMalloc(array.nbytes)
self.allocations.append(alloc)
return alloc

Comment on lines +205 to +207
# get the number of registers per thread used in this kernel
num_regs = cuda.cuFuncGetAttribute(cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, self.func)
assert num_regs[0] == 0, f"Retrieving number of registers per thread unsuccesful: code {num_regs[0]}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move this code to a helper function?

def benchmark_default(self, func, gpu_args, threads, grid, result):
"""Benchmark one kernel execution at a time"""
def flush_cache(self):
"""This special function can be called to flush the L2 cache."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest changing the comment to:

Suggested change
"""This special function can be called to flush the L2 cache."""
"""Flush the L2 cache by overwriting it with zeros."""

I am surprised that this works at all, I thought that memset just touched the device memory.

@@ -577,9 +595,12 @@ def compile_and_benchmark(self, kernel_source, gpu_args, params, kernel_options,

# benchmark
if func:
# setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# setting the NVML parameters here avoids this time from leaking into the benchmark time, ends up in framework time instead
# Setting the NVML parameters takes a non neglibible amount of time. By setting them
# here, this time is added to the framework time rather than to benchmark time.

@@ -0,0 +1,16 @@
from kernel_tuner.observers.observer import BenchmarkObserver

class RegisterObserver(BenchmarkObserver):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new observer, but adding it seems outside the scope of this PR which is about flushing the L2 cache.

highest_cc_index = max([i for i, cc in enumerate(subset_cc) if int(cc[1]) <= int(compute_capability[1])])
return subset_cc[highest_cc_index]
# if all else fails, return the default 52
return '52'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return '52'
return valid_cc[0]

Copy link

sonarqubecloud bot commented Mar 7, 2024

Quality Gate Passed Quality Gate passed

Issues
19 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Copy link

Quality Gate Passed Quality Gate passed

Issues
20 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@@ -367,7 +369,7 @@ def benchmark_default(self, func, gpu_args, threads, grid, result, flush_cache=T
]

self.dev.synchronize()
for _ in range(self.iterations):
for i in range(self.iterations):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i doesn't seem to be used below. the for-loop on line 377 even defines its own i.

self.flush_array = np.zeros((self.dev.cache_size_L2 // t(0).itemsize), order='F').astype(t)
self.flush_type = np.uint8
size = (self.dev.cache_size_L2 // self.flush_type(0).itemsize)
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# self.flush_array = np.zeros((size), order='F', dtype=self.flush_type)

@@ -47,7 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
self.devprops = dev.attributes
self.cc = dev.compute_capability
self.max_threads = self.devprops["MaxThreadsPerBlock"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cast this to int for consistency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants