Buffers aren't released #208

Open
lprc opened this issue Nov 23, 2023 · 1 comment

lprc commented Nov 23, 2023

I tried the example code from the README, which works fine, but the memory is apparently not released after the script has finished. nvidia-smi shows that memory usage increases every time the script is rerun in the same REPL session.

I tried releasing the buffers manually, e.g. with cl.release!(a_buff), but to no avail.

This is what I ran:

using OpenCL

const sum_kernel = "
    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
"
a = rand(Float32, 50_000)
b = rand(Float32, 50_000)

device, ctx, queue = cl.create_compute_context()

a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buff = cl.Buffer(Float32, ctx, :w, length(a))

p = cl.Program(ctx, source=sum_kernel) |> cl.build!
k = cl.Kernel(p, "sum")

queue(k, size(a), nothing, a_buff, b_buff, c_buff)

r = cl.read(queue, c_buff)

# these 3 lines have no effect apparently
cl.release!(a_buff)
cl.release!(b_buff)
cl.release!(c_buff)

Did I do something wrong? Or is there a bug?

I'm using Julia 1.9.4. If I can provide any further information, let me know.

Edit: Output of clinfo is probably useful...

clinfo output

Number of platforms                               1
Platform Name                                   NVIDIA CUDA
Platform Vendor                                 NVIDIA Corporation
Platform Version                                OpenCL 3.0 CUDA 12.3.68
Platform Profile                                FULL_PROFILE
Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
Platform Extensions with Version                cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                               cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                               cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                               cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                               cl_khr_fp64                                                      0x400000 (1.0.0)
                                               cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                               cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                               cl_khr_icd                                                       0x400000 (1.0.0)
                                               cl_khr_gl_sharing                                                0x400000 (1.0.0)
                                               cl_nv_compiler_options                                           0x400000 (1.0.0)
                                               cl_nv_device_attribute_query                                     0x400000 (1.0.0)
                                               cl_nv_pragma_unroll                                              0x400000 (1.0.0)
                                               cl_nv_copy_opts                                                  0x400000 (1.0.0)
                                               cl_khr_gl_event                                                  0x400000 (1.0.0)
                                               cl_nv_create_buffer                                              0x400000 (1.0.0)
                                               cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                               cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                               cl_nv_kernel_attribute                                           0x400000 (1.0.0)
                                               cl_khr_device_uuid                                               0x400000 (1.0.0)
                                               cl_khr_pci_bus_info                                              0x400000 (1.0.0)
                                               cl_khr_external_semaphore                                          0x9000 (0.9.0)
                                               cl_khr_external_memory                                             0x9000 (0.9.0)
                                               cl_khr_external_semaphore_opaque_fd                                0x9000 (0.9.0)
                                               cl_khr_external_memory_opaque_fd                                   0x9000 (0.9.0)
Platform Numeric Version                        0xc00000 (3.0.0)
Platform Extensions function suffix             NV
Platform Host timer resolution                  0ns
Platform External memory handle types           Opaque FD
Platform External semaphore import types        Opaque FD
Platform External semaphore export types        Opaque FD

Platform Name                                   NVIDIA CUDA
Number of devices                                 1
Device Name                                     Quadro T2000 with Max-Q Design
Device Vendor                                   NVIDIA Corporation
Device Vendor ID                                0x10de
Device Version                                  OpenCL 3.0 CUDA
Device UUID                                     dbe81e65-625c-31d5-048c-f6413159664e
Driver UUID                                     dbe81e65-625c-31d5-048c-f6413159664e
Valid Device LUID                               No
Device LUID                                     6d69-637300000000
Device Node Mask                                0
Device Numeric Version                          0xc00000 (3.0.0)
Driver Version                                  545.23.06
Device OpenCL C Version                         OpenCL C 1.2 
Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                               OpenCL C                                                         0x401000 (1.1.0)
                                               OpenCL C                                                         0x402000 (1.2.0)
                                               OpenCL C                                                         0xc00000 (3.0.0)
Device OpenCL C features                        __opencl_c_fp64                                                  0xc00000 (3.0.0)
                                               __opencl_c_images                                                0xc00000 (3.0.0)
                                               __opencl_c_int64                                                 0xc00000 (3.0.0)
                                               __opencl_c_3d_image_writes                                       0xc00000 (3.0.0)
Latest conformance test passed                  v2022-10-05-00
Device Type                                     GPU
Device Topology (NV)                            PCI-E, 0000:01:00.0
Device PCI bus info (KHR)                       PCI-E, 0000:01:00.0
Device Profile                                  FULL_PROFILE
Device Available                                Yes
Compiler Available                              Yes
Linker Available                                Yes
Max compute units                               16
Max clock frequency                             1530MHz
Compute Capability (NV)                         7.5
Device Partition                                (core)
 Max number of sub-devices                     1
 Supported partition types                     None
 Supported affinity domains                    (n/a)
Max work item dimensions                        3
Max work item sizes                             1024x1024x64
Max work group size                             1024
Preferred work group size multiple (device)     32
Preferred work group size multiple (kernel)     32
Warp size (NV)                                  32
Max sub-groups per work group                   0
Preferred / native vector sizes                 
 char                                                 1 / 1       
 short                                                1 / 1       
 int                                                  1 / 1       
 long                                                 1 / 1       
 half                                                 0 / 0        (n/a)
 float                                                1 / 1       
 double                                               1 / 1        (cl_khr_fp64)
Half-precision Floating-point support           (n/a)
Single-precision Floating-point support         (core)
 Denormals                                     Yes
 Infinity and NANs                             Yes
 Round to nearest                              Yes
 Round to zero                                 Yes
 Round to infinity                             Yes
 IEEE754-2008 fused multiply-add               Yes
 Support is emulated in software               No
 Correctly-rounded divide and sqrt operations  Yes
Double-precision Floating-point support         (cl_khr_fp64)
 Denormals                                     Yes
 Infinity and NANs                             Yes
 Round to nearest                              Yes
 Round to zero                                 Yes
 Round to infinity                             Yes
 IEEE754-2008 fused multiply-add               Yes
 Support is emulated in software               No
Address bits                                    64, Little-Endian
External memory handle types                    Opaque FD
External semaphore import types                 Opaque FD
External semaphore export types                 Opaque FD
Global memory size                              4096196608 (3.815GiB)
Error Correction support                        No
Max memory allocation                           1024049152 (976.6MiB)
Unified memory for Host and Device              No
Integrated memory (NV)                          No
Shared Virtual Memory (SVM) capabilities        (core)
 Coarse-grained buffer sharing                 Yes
 Fine-grained buffer sharing                   No
 Fine-grained system sharing                   No
 Atomics                                       No
Minimum alignment for any data type             128 bytes
Alignment of base address                       4096 bits (512 bytes)
Preferred alignment for atomics                 
 SVM                                           0 bytes
 Global                                        0 bytes
 Local                                         0 bytes
Atomic memory capabilities                      relaxed, work-group scope
Atomic fence capabilities                       relaxed, acquire/release, work-group scope
Max size for global variable                    0
Preferred total size of global vars             0
Global Memory cache type                        Read/Write
Global Memory cache size                        524288 (512KiB)
Global Memory cache line size                   128 bytes
Image support                                   Yes
 Max number of samplers per kernel             32
 Max size for 1D images from buffer            268435456 pixels
 Max 1D or 2D image array size                 2048 images
 Base address alignment for 2D image buffers   0 bytes
 Pitch alignment for 2D image buffers          0 pixels
 Max 2D image size                             32768x32768 pixels
 Max 3D image size                             16384x16384x16384 pixels
 Max number of read image args                 256
 Max number of write image args                32
 Max number of read/write image args           0
Pipe support                                    No
Max number of pipe args                         0
Max active pipe reservations                    0
Max pipe packet size                            0
Local memory type                               Local
Local memory size                               49152 (48KiB)
Registers per block (NV)                        65536
Max number of constant args                     9
Max constant buffer size                        65536 (64KiB)
Generic address space support                   No
Max size of kernel argument                     32764 (32KiB)
Queue properties (on host)                      
 Out-of-order execution                        Yes
 Profiling                                     Yes
Device enqueue capabilities                     (n/a)
Queue properties (on device)                    
 Out-of-order execution                        No
 Profiling                                     No
 Preferred size                                0
 Max size                                      0
Max queues on device                            0
Max events on device                            0
Prefer user sync for interop                    No
Profiling timer resolution                      1000ns
Execution capabilities                          
 Run OpenCL kernels                            Yes
 Run native kernels                            No
 Non-uniform work-groups                       No
 Work-group collective functions               No
 Sub-group independent forward progress        No
 Kernel execution timeout (NV)                 Yes
 Concurrent copy and kernel execution (NV)     Yes
   Number of async copy engines                3
 IL version                                    (n/a)
 ILs with version                              (n/a)
printf() buffer size                            1048576 (1024KiB)
Built-in kernels                                (n/a)
Built-in kernels with version                   (n/a)
Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
Device Extensions with Version                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                               cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                               cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                               cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                               cl_khr_fp64                                                      0x400000 (1.0.0)
                                               cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                               cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                               cl_khr_icd                                                       0x400000 (1.0.0)
                                               cl_khr_gl_sharing                                                0x400000 (1.0.0)
                                               cl_nv_compiler_options                                           0x400000 (1.0.0)
                                               cl_nv_device_attribute_query                                     0x400000 (1.0.0)
                                               cl_nv_pragma_unroll                                              0x400000 (1.0.0)
                                               cl_nv_copy_opts                                                  0x400000 (1.0.0)
                                               cl_khr_gl_event                                                  0x400000 (1.0.0)
                                               cl_nv_create_buffer                                              0x400000 (1.0.0)
                                               cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                               cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                               cl_nv_kernel_attribute                                           0x400000 (1.0.0)
                                               cl_khr_device_uuid                                               0x400000 (1.0.0)
                                               cl_khr_pci_bus_info                                              0x400000 (1.0.0)
                                               cl_khr_external_semaphore                                          0x9000 (0.9.0)
                                               cl_khr_external_memory                                             0x9000 (0.9.0)
                                               cl_khr_external_semaphore_opaque_fd                                0x9000 (0.9.0)
                                               cl_khr_external_memory_opaque_fd                                   0x9000 (0.9.0)

NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
clCreateContext(NULL, ...) [default]            No platform
clCreateContext(NULL, ...) [other]              Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

lprc commented Nov 24, 2023

I just realized that the memory is released in another snippet of mine, one where the code lives inside a function. So is it actually expected that the example above does not release its buffers, since a_buff etc. live in global scope and are therefore destroyed only when the REPL is shut down? (If so, you can close this issue, but maybe add a hint to the README below the example.)

Still, I wonder why calling cl.release! didn't do anything.

I also realized that the memory is not released directly after my function returns, but only when I either a) rerun the function or b) explicitly run GC.gc() afterwards.
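
For anyone reproducing this, here is a minimal sketch of the general Julia behaviour I think I am seeing, without any GPU involved. It assumes the buffers are freed by a finalizer, which is my guess and not something I have checked in the OpenCL.jl source. FakeBuffer and LIVE_BYTES are made-up names for illustration only, and the sketch says nothing about what cl.release! itself does.

```julia
# Hypothetical stand-in for a device buffer. FakeBuffer and LIVE_BYTES are
# illustration-only names and are not part of OpenCL.jl.
const LIVE_BYTES = Ref(0)

mutable struct FakeBuffer
    nbytes::Int
    function FakeBuffer(nbytes::Integer)
        buf = new(nbytes)
        LIVE_BYTES[] += nbytes
        # Runs when the GC collects the object, or when finalize(buf) is called.
        finalizer(b -> (LIVE_BYTES[] -= b.nbytes), buf)
        return buf
    end
end

# Case 1: a global binding keeps the object reachable, so even a full
# collection cannot run its finalizer.
global_buf = FakeBuffer(200_000)
GC.gc()
println("global still bound: ", LIVE_BYTES[], " bytes live")

# finalize runs the registered finalizer immediately, without waiting for the GC.
finalize(global_buf)
println("after finalize:     ", LIVE_BYTES[], " bytes live")

# Case 2: a function-local object becomes unreachable when the function
# returns, but when its finalizer runs is up to the GC (and the compiler).
function run_once()
    buf = FakeBuffer(200_000)
    return buf.nbytes
end

run_once()
println("after return:       ", LIVE_BYTES[], " bytes live")
GC.gc()  # request a full collection
println("after GC.gc():      ", LIVE_BYTES[], " bytes live")
```

The Julia GC only tracks host memory, so a small wrapper object holding a large device allocation creates little pressure to collect. That would be consistent with the memory only going away after GC.gc() or after enough new allocations from a rerun. I have left out expected numbers for case 2 because the timing there is not guaranteed.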
