
Towards Beagle+OpenCL on a Raspberry Pi #157

Open

Anaphory opened this issue Feb 10, 2021 · 4 comments

@Anaphory commented Feb 10, 2021

This is less a BEAGLE issue, and more a post for your information. Maybe it can be tested, completed, and put somewhere useful.

I got OpenCL, BEAGLE, and BEAST2 to work together on a Raspberry Pi, in theory. Practical use still suffers from the fact that the tests and BEAST runs seem to exceed the limits of the tiny GPU – maybe there's a way to mitigate that.
In this post I may be forgetting steps that I took during the trial-and-error process of getting to this point. I have a second Pi with a clean environment coming in a few days; I'll try to follow my own instructions on it and update them where necessary.

Operating Environment

I have a Raspberry Pi 3 Model B V1.2 with an out-of-the-box Raspbian 10 (buster).
Shared libraries go into /usr/local/lib, so I work with

LD_LIBRARY_PATH=/usr/local/lib/

throughout.

OpenCL

Software:

sudo apt-get install clinfo ocl-icd-opencl-dev opencl-headers

There is a partial OpenCL implementation for the Raspberry Pi: VC4CL (Embedded Profile, targeting OpenCL 1.2 as the last version that can be supported completely). I compiled and installed it according to the instructions in a separate Git repository; building VC4CL with

cmake -DBUILD_TESTING=ON -DBUILD_ICD=ON ..

and then sudo make install TestVC4CL went through without a hitch. (I say that – at some point, closing Chromium on the Pi made the difference between the compilation succeeding and being killed for lack of memory.) I did not compile the hello_world.cl test. I don't know whether the ICD is necessary, but at least it didn't hurt, so I activated it as instructed using

echo /usr/local/lib/libVC4CL.so | sudo tee -a /etc/OpenCL/vendors/VC4CL.icd

The outcome seems to be a properly installed OpenCL. (I may be running too many commands with superuser rights in this post, but superuser permissions are needed for running on the GPU, so all hope is lost anyway. The sudo here is definitely necessary to access /dev/mem.)

$ sudo clinfo 
Number of platforms                               1
  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
  Platform Vendor                                 doe300
  Platform Version                                OpenCL 1.2 VC4CL 0.4.9999 (842d444)
  Platform Profile                                EMBEDDED_PROFILE
  Platform Extensions                             cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters
  Platform Extensions function suffix             VC4CL

  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices                                 1
  Device Name                                     VideoCore IV GPU
  Device Vendor                                   Broadcom
  Device Vendor ID                                0x14e4
  Device Version                                  OpenCL 1.2 VC4CL 0.4.9999 (842d444)
  Driver Version                                  0.4.9999
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  EMBEDDED_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             300MHz
  Core Temperature (Altera)                       62 C
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             12x12x12
  Max work group size                             12
  Preferred work group size multiple              1
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                               16 / 16      
    int                                                 16 / 16      
    long                                                 0 / 0       
    half                                                 0 / 0        (n/a)
    float                                               16 / 16      
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              79691776 (76MiB)
  Error Correction support                        No
  Max memory allocation                           79691776 (76MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             64 bytes
  Alignment of base address                       512 bits (64 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        32768 (32KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   No
  Local memory type                               Global
  Local memory size                               79691776 (76MiB)
  Max number of constant args                     32
  Max constant buffer size                        79691776 (76MiB)
  Max size of kernel argument                     256
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    IL version                                    SPIR-V_1.5 SPIR_1.2
    SPIR versions                                 1.2
  printf() buffer size                            0
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  OpenCL for the Raspberry Pi VideoCore IV GPU
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [VC4CL]
  clCreateContext(NULL, ...) [default]            Success [VC4CL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2
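
The two limits that will matter below – the maximum work group size and the per-dimension maximum work item sizes – can also be read directly through the standard OpenCL host API, independently of clinfo. A minimal sketch (plain OpenCL 1.2 calls, nothing VC4CL-specific; the file name query_limits.cpp is just my choice, compile with g++ query_limits.cpp -lOpenCL):

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    size_t maxGroup;     // upper bound on work items per work group
    size_t maxItems[3];  // per-dimension upper bounds
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxGroup), &maxGroup, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(maxItems), maxItems, NULL);

    // On the VideoCore IV this prints 12 and 12x12x12.
    printf("max work group size: %zu\n", maxGroup);
    printf("max work item sizes: %zux%zux%zu\n",
           maxItems[0], maxItems[1], maxItems[2]);
    return 0;
}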

BEAGLE

The first time I tried compiling BEAGLE, the tests complained about the missing SSE instruction set, so I configured BEAGLE with

./configure --disable-sse

and compiled it as instructed. make check fails with

FAIL: tinytest
============================================================================
Testsuite summary for libhmsbeagle 3.2.0
============================================================================
# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See examples/tinytest/test-suite.log
Please report to [email protected]
============================================================================

due to CL_INVALID_WORK_GROUP_SIZE from file <GPUInterfaceOpenCL.cpp>, line 584. I added a debug output there to check the work sizes the code wants to use, and it shows that the global work size array has entries [256, 16, 1] – given that clinfo reported “Max work item sizes 12x12x12; Max work group size 12” for the GPU, the failure is not surprising, and I tried to continue.
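
For reference, the check that fails here is the one clEnqueueNDRangeKernel itself performs against the device limits. A small stand-alone helper (my sketch, not BEAGLE code) makes the constraint explicit:

#include <CL/cl.h>

// My sketch, not BEAGLE code: would clEnqueueNDRangeKernel accept this
// local work size on this device?
bool workGroupFits(cl_device_id device, const size_t local[3]) {
    size_t maxGroup;
    size_t maxItems[3];
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxGroup), &maxGroup, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(maxItems), maxItems, NULL);
    for (int d = 0; d < 3; d++)
        if (local[d] > maxItems[d])  // per-dimension limit, 12 here
            return false;
    // the whole group must also fit under the group limit, also 12 here
    return local[0] * local[1] * local[2] <= maxGroup;
}

The local work size BEAGLE uses for this kernel turns out to be [16, 16, 1] (see the debug output below), i.e. 256 work items per group, against a limit of 12.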

I ran

sudo LD_LIBRARY_PATH=/usr/local/lib java -Djava.library.path="/usr/local/lib" -jar ~/.beast/2.6/BEAST/lib/launcher.jar -beagle -beagle_GPU beast.xml

on a random BEAST2 XML I had lying around, not chosen to be particularly small or anything, and it failed with an OpenCL error: CL_OUT_OF_RESOURCES from file <GPUInterfaceOpenCL.cpp>, line 787. (Running it without -beagle_GPU works, contrary to what I wrote in #156, and is a factor of 2.5 faster than BEAST without BEAGLE, so that's quite good already. It's still about a factor of 8 slower than the un-BEAGLEd BEAST on my generic laptop, but maybe it's useful for someone beyond teaching in the long run. After all, a Pi costs less than one eighth of a laptop…)
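
The CL_OUT_OF_RESOURCES is at least consistent with the 76 MiB of GPU memory that clinfo reports above. To fail more gracefully, the relevant limits could be checked before allocating – again a sketch of mine, not BEAGLE's actual code:

#include <CL/cl.h>

// My sketch, not BEAGLE's actual code: check an intended buffer size
// against the device limits instead of waiting for the allocation to fail.
bool allocationFits(cl_device_id device, size_t bytes) {
    cl_ulong globalMem;
    cl_ulong maxAlloc;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMem), &globalMem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(maxAlloc), &maxAlloc, NULL);
    // On the VideoCore both limits equal the whole GPU memory split
    // (76 MiB here), so a few large buffers exhaust it quickly.
    return bytes <= maxAlloc && bytes <= globalMem;
}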

@Anaphory (Author)

Increasing GPU memory

I have activated CPPFLAGS="-DBEAGLE_DEBUG_FLOW -DBEAGLE_DEBUG_VALUES" and added the requested memory size to the AllocateMemory debug output:

diff --git a/libhmsbeagle/GPU/BeagleGPUImpl.hpp b/libhmsbeagle/GPU/BeagleGPUImpl.hpp
index b4703f3..e16473c 100644
--- a/libhmsbeagle/GPU/BeagleGPUImpl.hpp
+++ b/libhmsbeagle/GPU/BeagleGPUImpl.hpp
@@ -1958,7 +1958,7 @@ int BeagleGPUImpl<BEAGLE_GPU_GENERIC>::updateTransitionMatricesWithModelCategori
             fprintf(stderr, "dMatrices[probabilityIndices[%d]]  (hDQ = %1.5e, eL = %1.5e) =\n", i,hDistanceQueue[i], edgeLengths[i]);
             gpu->PrintfDeviceVector(dMatrices[probabilityIndices[i]], kMatrixSize * kCategoryCount, r);
             for(int j=0; j<kCategoryCount; j++)
-                fprintf(stderr, " %1.5f",categoryRates[j]);
+                fprintf(stderr, " %1.5f",hCategoryRates[j]);
             fprintf(stderr,"\n");
         }
     #endif
diff --git a/libhmsbeagle/GPU/GPUInterfaceOpenCL.cpp b/libhmsbeagle/GPU/GPUInterfaceOpenCL.cpp
index 0cf2bcb..1d29162 100644
--- a/libhmsbeagle/GPU/GPUInterfaceOpenCL.cpp
+++ b/libhmsbeagle/GPU/GPUInterfaceOpenCL.cpp
@@ -777,7 +777,7 @@ void GPUInterface::UnmapMemory(GPUPtr dPtr, void* hPtr) {
 
 GPUPtr GPUInterface::AllocateMemory(size_t memSize) {
 #ifdef BEAGLE_DEBUG_FLOW
-    fprintf(stderr,"\t\t\tEntering GPUInterface::AllocateMemory\n");
+    fprintf(stderr,"\t\t\tEntering GPUInterface::AllocateMemory for mem size %zu\n", memSize);
 #endif
     
     GPUPtr data;

and it showed me that the CL_OUT_OF_RESOURCES I encountered before is easily fixed by increasing the GPU memory on the Pi to 256 MiB instead of the 76 MiB above (raspi-config has an option for the GPU memory split). After that, BEAST runs into the same issue as the tinytest, work sizes being too large (bigger than 12):

	Entering BeagleGPUImpl::updateTransitionMatrices
			Entering GPUInterface::MemcpyHostToDevice
			Leaving  GPUInterface::MemcpyHostToDevice
			Entering GPUInterface::MemcpyHostToDevice
			Leaving  GPUInterface::MemcpyHostToDevice
		Entering KernelLauncher::GetTransitionProbabilitiesSquare
			Entering GPUInterface::LaunchKernel
localWorkSize[0]  = 16
globalWorkSize[0] = 6016
localWorkSize[1]  = 16
globalWorkSize[1] = 16
localWorkSize[2]  = 1
globalWorkSize[2] = 1
local = 12


OpenCL error: CL_INVALID_WORK_GROUP_SIZE from file <GPUInterfaceOpenCL.cpp>, line 584.

I don't really understand how the work size limits, the global work size, and the local work size interact (as I said, I'm terribly new to all this low-level programming). Can we make BEAGLE work with “Max work item sizes 12x12x12; Max work group size 12”?
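
Edit: I have at least looked up the formal OpenCL 1.2 rules now. The global work size is the total number of work items per dimension; the local work size says how many of them form one work group; the device limits constrain only the local size. Concretely, clEnqueueNDRangeKernel requires that each localWorkSize[d] stays within the per-dimension maximum (12 here), that the product of the three stays within the maximum work group size (also 12 here), and that each globalWorkSize[d] is evenly divisible by localWorkSize[d]. For the failing launch above:

localWorkSize = (16, 16, 1): 16 * 16 * 1 = 256 work items per group, but at most 12 are allowed -> CL_INVALID_WORK_GROUP_SIZE

A shape like (4, 2, 1) would pass the API check (product 8 <= 12; 4 divides 6016, 2 divides 16), but presumably the kernels themselves would still have to be adapted to the smaller groups.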

@msuchard (Member)

@Anaphory -- great work getting beagle-cpu to compile on the Raspberry Pi; I'm going to try it out on my son's system as soon as I find the time. beagle-gpu contains a large amount of logic specializing the work-plan for different devices (the most obvious are for specific FPGAs and Apple's [old] dedicated GPUs).

You are welcome to provide a pull request with updates to that logic (and the kernels) to support 12 x 12 x 12 max work-plans for the VideoCore.

A word of caution, however -- almost every kernel is currently hard-coded for local-work-sizes that are "mod 16" since 16 is the magic memory coalescence size for most GPUs and is also super-convenient for 4 x 4 nucleotide models. I don't suspect, a priori, that you'll get much performance gain from the VideoCore.
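
If someone does attempt it: after a program is built, the runtime reports the largest work group each compiled kernel can actually launch with on a given device, which is where a port would have to start. A generic sketch (not the actual beagle-gpu launch code):

#include <CL/cl.h>
#include <cstdio>

// Generic sketch, not the actual beagle-gpu launch code: query the largest
// work group this compiled kernel can run with on this device. On the
// VideoCore this will be at most 12, so the hard-coded 16x16 launch shapes
// would have to shrink before anything else.
void printKernelLimit(cl_kernel kernel, cl_device_id device) {
    size_t kernelMax;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelMax), &kernelMax, NULL);
    printf("largest valid work group for this kernel: %zu\n", kernelMax);
}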

@Anaphory (Author)

As I said, I don't really know anything about programming this close to hardware. I have no idea what work sizes and their limits actually mean, how BEAGLE works in general, or what a GPU kernel needs to do. So I have absolutely no clue how to start working on such a pull request.

I have just put my question about modifying work sizes to the VC4CL developers. Maybe you can check over there to see whether I horribly misrepresented the issue?

@doe300 commented Feb 12, 2021

Hey, https://github.com/doe300/VC4CL maintainer here.

As I already stated in doe300/VC4CL#101, work-group sizes of more than 12 are currently not possible, for hardware/implementation reasons.

Also, I have to agree that you should not have too high expectations regarding performance on the Raspberry Pi GPU.

From a short browse through the repository I could not find the OpenCL kernel sources. If someone could point me in that direction, I could try to see whether there is something that can be done to add support.
