Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush cache #246

Draft
wants to merge 30 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
81a68a4
Added RegisterObserver with common interface among backends
fjwillemsen Feb 8, 2024
943b3c4
Added test for RegisterObserver, added clause in case of mocktest
fjwillemsen Feb 8, 2024
1681730
Added useful error message in case Register Observer is not supported
fjwillemsen Feb 9, 2024
f153945
Added tests for Register Observer for OpenCL and HIP backends
fjwillemsen Feb 9, 2024
7bd7c2b
Added instruction for pointing cache directory elsewhere
fjwillemsen Feb 12, 2024
9dea137
Non-argument streams are now correctly passed in the CuPy and NVCUDA …
fjwillemsen Feb 12, 2024
df54145
Fixed several issues pertaining to the setting of clocks, in particul…
fjwillemsen Feb 15, 2024
4cc4a13
Time spent setting NVML parameters (clock & memory frequency, power) …
fjwillemsen Feb 15, 2024
e309bc1
Time spent setting NVML parameters (clock & memory frequency, power) …
fjwillemsen Feb 15, 2024
d6aac8b
Removed redundant print statement
fjwillemsen Feb 15, 2024
a020791
Added L2 cache size property to CUDA backends
fjwillemsen Feb 28, 2024
6e6e5fb
Added specification to CUPY compiler options
fjwillemsen Feb 28, 2024
f15338f
Added L2 cache size property to OpenCL, HIP and mocked PyCUDA backends
fjwillemsen Feb 28, 2024
00ac419
Added function to check for compute capability validity, improved che…
fjwillemsen Feb 28, 2024
55ab074
Added a flush kernel to clear the L2 cache between runs
fjwillemsen Feb 28, 2024
e106bae
Added a flush kernel to clear the L2 cache between runs
fjwillemsen Feb 28, 2024
0cb5e3a
Made function for scaling the compute capability to a valid one, adde…
fjwillemsen Feb 29, 2024
b682506
Applied suggestions from comments by @csbnw
fjwillemsen Mar 1, 2024
da907b1
Removed redundant comments / printing
fjwillemsen Mar 1, 2024
2396bdf
Added L2 cache size information to backends
fjwillemsen Mar 1, 2024
eced775
Added L2 flush kernel
fjwillemsen Mar 1, 2024
143889f
Switched to new attempt for flushing L2 using memset
fjwillemsen Mar 1, 2024
651eea7
Added implementation of allocate numpy array function
fjwillemsen Mar 1, 2024
7d8d48f
Added new flush L2 cache method using memset
fjwillemsen Mar 2, 2024
9911f4c
Added a standard method for freeing memory from the GPU
fjwillemsen Mar 4, 2024
47c2cca
Circumvented an issue where list.remove(val) was not properly impleme…
fjwillemsen Mar 4, 2024
157ca41
Added the ability to recopy array arguments with every kernel launch,…
fjwillemsen Mar 7, 2024
98afa60
Renamed to for clarity, added check
fjwillemsen Mar 7, 2024
cfecdc5
Improved getting L2 cache size
fjwillemsen Apr 12, 2024
108e14c
Small improvements to flushing arrays
fjwillemsen Apr 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CONTRIBUTING.rst
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ Steps without :bash:`sudo` access (e.g. on a cluster):
- /path/to/directory
* [Optional] both Mamba and Miniconda can be automatically activated via :bash:`~/.bashrc`. Do not forget to add these (usually provided at the end of the installation).
* Exit the shell and re-enter to make sure Conda is available. :bash:`cd` to the kernel tuner directory.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linu, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.
* [Optional] if you have limited user folder space, the Pip cache can be pointed elsewhere with the environment variable :bash:`PIP_CACHE_DIR`. The cache location can be checked with :bash:`pip cache dir`. On Linux, to point the entire :bash:`~/.cache` default elsewhere, use the :bash:`XDG_CACHE_HOME` environment variable.

* [Optional] update Conda if available before continuing: :bash:`conda update -n base -c conda-forge conda`.
#. Setup a virtual environment: :bash:`conda create --name kerneltuner python=3.11` (or whatever Python version and environment name you prefer).
#. Activate the virtual environment: :bash:`conda activate kerneltuner`.
Expand Down
11 changes: 11 additions & 0 deletions kernel_tuner/backends/backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
from __future__ import print_function

from abc import ABC, abstractmethod
from numpy import ndarray


class Backend(ABC):
Expand Down Expand Up @@ -65,6 +66,16 @@ class GPUBackend(Backend):
def __init__(self, device, iterations, compiler_options, observers):
pass

@abstractmethod
def allocate_ndarray(self, array: ndarray) -> any:
"""This method must allocate on the GPU a buffer for a given np.ndarray and return the pointer."""
pass

@abstractmethod
def free_mem(self, pointer):
"""This method must free on the GPU a buffer for a given pointer."""
pass

@abstractmethod
def copy_constant_memory_args(self, cmem_args):
"""This method must implement the allocation and copy of constant memory to the GPU."""
Expand Down
21 changes: 19 additions & 2 deletions kernel_tuner/backends/cupy.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"""This module contains all Cupy specific kernel_tuner functions."""
from __future__ import print_function
from warnings import warn

import numpy as np

Expand Down Expand Up @@ -46,6 +47,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
self.devprops = dev.attributes
self.cc = dev.compute_capability
self.max_threads = self.devprops["MaxThreadsPerBlock"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cast this to int for consistency?

self.cache_size_L2 = int(self.devprops["L2CacheSize"])

self.iterations = iterations
self.current_module = None
Expand Down Expand Up @@ -82,6 +84,18 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
self.env = env
self.name = env["device_name"]

def allocate_ndarray(self, array):
alloc = cp.array(array)
self.allocations.append(alloc)
return alloc

def free_mem(self, pointer):
# iteratively comparing is required as comparing with list.remove is not properly implemented
to_remove = [i for i, alloc in enumerate(self.allocations) if cp.array_equal(alloc, pointer)]
assert len(to_remove) == 1
self.allocations.pop(to_remove[0])
del pointer # CuPy uses Python reference counter to free upon disuse

def ready_argument_list(self, arguments):
"""Ready argument list to be passed to the kernel, allocates gpu mem.

Expand All @@ -97,8 +111,7 @@ def ready_argument_list(self, arguments):
for arg in arguments:
# if arg i is a numpy array copy to device
if isinstance(arg, np.ndarray):
alloc = cp.array(arg)
self.allocations.append(alloc)
alloc = self.allocate_ndarray(arg)
gpu_args.append(alloc)
# if not a numpy array, just pass argument along
else:
Expand All @@ -124,6 +137,7 @@ def compile(self, kernel_instance):
compiler_options = self.compiler_options
if not any(["-std=" in opt for opt in self.compiler_options]):
compiler_options = ["--std=c++11"] + self.compiler_options
# CuPy already sets the --gpu-architecture by itself, as per https://github.com/cupy/cupy/blob/main/cupy/cuda/compiler.py#L145

options = tuple(compiler_options)

Expand All @@ -132,6 +146,7 @@ def compile(self, kernel_instance):
)

self.func = self.current_module.get_function(kernel_name)
self.num_regs = self.func.num_regs
return self.func

def start_event(self):
Expand Down Expand Up @@ -197,6 +212,8 @@ def run_kernel(self, func, gpu_args, threads, grid, stream=None):
of the grid
:type grid: tuple(int, int)
"""
if stream is None:
stream = self.stream
func(grid, threads, gpu_args, stream=stream, shared_mem=self.smem_size)

def memset(self, allocation, value, size):
Expand Down
8 changes: 7 additions & 1 deletion kernel_tuner/backends/hip.py
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None

self.name = self.hipProps._name.decode('utf-8')
self.max_threads = self.hipProps.maxThreadsPerBlock
self.cache_size_L2 = int(self.hipProps.l2CacheSize)
self.device = device
self.compiler_options = compiler_options or []
self.iterations = iterations
Expand All @@ -85,6 +86,11 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
for obs in self.observers:
obs.register_device(self)

def allocate_ndarray(self, array):
return hip.hipMalloc(array.nbytes)
Comment on lines +89 to +90
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to store the allocated memory?

Suggested change
def allocate_ndarray(self, array):
return hip.hipMalloc(array.nbytes)
def allocate_ndarray(self, array):
alloc = hip.hipMalloc(array.nbytes)
self.allocations.append(alloc)
return alloc


def free_mem(self, pointer):
raise NotImplementedError("PyHIP currently does not have a free function")

def ready_argument_list(self, arguments):
"""Ready argument list to be passed to the HIP function.
Expand All @@ -106,7 +112,7 @@ def ready_argument_list(self, arguments):
# Allocate space on device for array and convert to ctypes
if isinstance(arg, np.ndarray):
if dtype_str in dtype_map.keys():
device_ptr = hip.hipMalloc(arg.nbytes)
device_ptr = self.allocate_ndarray(arg)
data_ctypes = arg.ctypes.data_as(ctypes.POINTER(dtype_map[dtype_str]))
hip.hipMemcpy_htod(device_ptr, data_ctypes, arg.nbytes)
# may be part of run_kernel, return allocations here instead
Expand Down
46 changes: 34 additions & 12 deletions kernel_tuner/backends/nvcuda.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
"""This module contains all NVIDIA cuda-python specific kernel_tuner functions."""
from warnings import warn

import numpy as np

from kernel_tuner.backends.backend import GPUBackend
from kernel_tuner.observers.nvcuda import CudaRuntimeObserver
from kernel_tuner.util import SkippableFailure, cuda_error_check
from kernel_tuner.util import SkippableFailure, cuda_error_check, to_valid_nvrtc_gpu_arch_cc

# embedded in try block to be able to generate documentation
# and run tests without cuda-python installed
Expand Down Expand Up @@ -66,6 +68,11 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
cudart.cudaDeviceAttr.cudaDevAttrMaxThreadsPerBlock, device
)
cuda_error_check(err)
err, self.cache_size_L2 = cudart.cudaDeviceGetAttribute(
cudart.cudaDeviceAttr.cudaDevAttrL2CacheSize, device
)
cuda_error_check(err)
self.cache_size_L2 = int(self.cache_size_L2)
self.cc = f"{major}{minor}"
self.iterations = iterations
self.current_module = None
Expand Down Expand Up @@ -107,9 +114,19 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None

def __del__(self):
for device_memory in self.allocations:
if isinstance(device_memory, cuda.CUdeviceptr):
err = cuda.cuMemFree(device_memory)
cuda_error_check(err)
self.free_mem(device_memory)

def allocate_ndarray(self, array):
err, device_memory = cuda.cuMemAlloc(array.nbytes)
cuda_error_check(err)
self.allocations.append(device_memory)
return device_memory

def free_mem(self, pointer):
assert isinstance(pointer, cuda.CUdeviceptr)
self.allocations.remove(pointer)
err = cuda.cuMemFree(pointer)
cuda_error_check(err)

def ready_argument_list(self, arguments):
"""Ready argument list to be passed to the kernel, allocates gpu mem.
Expand All @@ -126,9 +143,7 @@ def ready_argument_list(self, arguments):
for arg in arguments:
# if arg is a numpy array copy it to device
if isinstance(arg, np.ndarray):
err, device_memory = cuda.cuMemAlloc(arg.nbytes)
cuda_error_check(err)
self.allocations.append(device_memory)
device_memory = self.allocate_ndarray(arg)
gpu_args.append(device_memory)
self.memcpy_htod(device_memory, arg)
# if not array, just pass along
Expand Down Expand Up @@ -161,12 +176,12 @@ def compile(self, kernel_instance):
compiler_options.append(b"--std=c++11")
if not any(["--std=" in opt for opt in self.compiler_options]):
self.compiler_options.append("--std=c++11")
if not any([b"--gpu-architecture=" in opt for opt in compiler_options]):
if not any([b"--gpu-architecture=" in opt or b"-arch" in opt for opt in compiler_options]):
compiler_options.append(
f"--gpu-architecture=compute_{self.cc}".encode("UTF-8")
f"--gpu-architecture=compute_{to_valid_nvrtc_gpu_arch_cc(self.cc)}".encode("UTF-8")
)
if not any(["--gpu-architecture=" in opt for opt in self.compiler_options]):
self.compiler_options.append(f"--gpu-architecture=compute_{self.cc}")
if not any(["--gpu-architecture=" in opt or "-arch" in opt for opt in self.compiler_options]):
self.compiler_options.append(f"--gpu-architecture=compute_{to_valid_nvrtc_gpu_arch_cc(self.cc)}")

err, program = nvrtc.nvrtcCreateProgram(
str.encode(kernel_string), b"CUDAProgram", 0, [], []
Expand All @@ -192,6 +207,11 @@ def compile(self, kernel_instance):
)
cuda_error_check(err)

# get the number of registers per thread used in this kernel
num_regs = cuda.cuFuncGetAttribute(cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_NUM_REGS, self.func)
assert num_regs[0] == 0, f"Retrieving number of registers per thread unsuccesful: code {num_regs[0]}"
Comment on lines +210 to +212
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move this code to a helper function?

self.num_regs = num_regs[1]

except RuntimeError as re:
_, n = nvrtc.nvrtcGetProgramLogSize(program)
log = b" " * n
Expand Down Expand Up @@ -273,6 +293,8 @@ def run_kernel(self, func, gpu_args, threads, grid, stream=None):
of the grid
:type grid: tuple(int, int)
"""
if stream is None:
stream = self.stream
arg_types = list()
for arg in gpu_args:
if isinstance(arg, cuda.CUdeviceptr):
Expand Down Expand Up @@ -309,7 +331,7 @@ def memset(allocation, value, size):
:type size: int

"""
err = cudart.cudaMemset(allocation, value, size)
err = cudart.cudaMemset(allocation.__init__(), value, size)
cuda_error_check(err)

@staticmethod
Expand Down
19 changes: 12 additions & 7 deletions kernel_tuner/backends/opencl.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,10 @@ def __init__(
self.max_threads = self.ctx.devices[0].get_info(
cl.device_info.MAX_WORK_GROUP_SIZE
)
# TODO the L2 cache size request fails
# self.cache_size_L2 = self.ctx.devices[0].get_info(
# cl.device_affinity_domain.L2_CACHE
# )
self.compiler_options = compiler_options or []

# observer stuff
Expand All @@ -68,6 +72,13 @@ def __init__(
self.env = env
self.name = dev.name

def allocate_ndarray(self, array):
return cl.Buffer(self.ctx, self.mf.READ_WRITE | self.mf.COPY_HOST_PTR, hostbuf=array)

def free_mem(self, pointer):
assert isinstance(pointer, cl.Buffer)
pointer.release()

def ready_argument_list(self, arguments):
"""Ready argument list to be passed to the kernel, allocates gpu mem.

Expand All @@ -83,13 +94,7 @@ def ready_argument_list(self, arguments):
for arg in arguments:
# if arg i is a numpy array copy to device
if isinstance(arg, np.ndarray):
gpu_args.append(
cl.Buffer(
self.ctx,
self.mf.READ_WRITE | self.mf.COPY_HOST_PTR,
hostbuf=arg,
)
)
gpu_args.append(self.allocate_ndarray(arg))
# if not an array, just pass argument along
else:
gpu_args.append(arg)
Expand Down
18 changes: 15 additions & 3 deletions kernel_tuner/backends/pycuda.py
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,7 @@ def _finish_up():
str(k): v for (k, v) in self.context.get_device().get_attributes().items()
}
self.max_threads = devprops["MAX_THREADS_PER_BLOCK"]
self.cache_size_L2 = int(devprops["L2_CACHE_SIZE"])
cc = str(devprops.get("COMPUTE_CAPABILITY_MAJOR", "0")) + str(
devprops.get("COMPUTE_CAPABILITY_MINOR", "0")
)
Expand Down Expand Up @@ -151,7 +152,17 @@ def __del__(self):
for gpu_mem in self.allocations:
# if needed for when using mocks during testing
if hasattr(gpu_mem, "free"):
gpu_mem.free()
self.free_mem(gpu_mem)

def allocate_ndarray(self, array):
alloc = drv.mem_alloc(array.nbytes)
self.allocations.append(alloc)
return alloc

def free_mem(self, pointer):
assert hasattr(pointer, "free")
self.allocations.remove(pointer)
pointer.free()

def ready_argument_list(self, arguments):
"""Ready argument list to be passed to the kernel, allocates gpu mem.
Expand All @@ -168,8 +179,7 @@ def ready_argument_list(self, arguments):
for arg in arguments:
# if arg i is a numpy array copy to device
if isinstance(arg, np.ndarray):
alloc = drv.mem_alloc(arg.nbytes)
self.allocations.append(alloc)
alloc = self.allocate_ndarray(arg)
gpu_args.append(alloc)
drv.memcpy_htod(gpu_args[-1], arg)
elif isinstance(arg, torch.Tensor):
Expand Down Expand Up @@ -218,6 +228,8 @@ def compile(self, kernel_instance):
)

self.func = self.current_module.get_function(kernel_name)
if not isinstance(self.func, str):
self.num_regs = self.func.num_regs
return self.func
except drv.CompileError as e:
if "uses too much shared data" in e.stderr:
Expand Down
Loading
Loading