Description
This is less a BEAGLE issue, and more a post for your information. Maybe it can be tested, completed, and put somewhere useful.
I got OpenCL, BEAGLE and BEAST2 to work together on a Raspberry Pi, in theory. In practice, the tests and BEAST runs seem to exceed the limits of the tiny GPU; maybe there's a way to mitigate that.
In this post, I may be forgetting steps I took during the trial and error that got me to this point. I have a second Pi with a clean environment coming in a few days; I'll try to follow my own instructions and update them where necessary afterwards.
Operating Environment
I have a Raspberry Pi 3 model B V1.2, with an out-of-the-box Raspbian 10 (buster).
Shared libraries go into /usr/local/lib, so I work with LD_LIBRARY_PATH=/usr/local/lib/ throughout.
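For anyone following along, one way to avoid repeating the variable on every command is to set it once per shell session. This is only a minimal sketch: the helper name prepend_lib_path is made up for illustration, and the only value taken from this post is /usr/local/lib. Note that sudo typically resets the environment, which is presumably why the BEAST command further down passes LD_LIBRARY_PATH explicitly after sudo.

```shell
#!/bin/sh
# Hypothetical helper (not part of BEAGLE or VC4CL): put a directory at the
# front of LD_LIBRARY_PATH unless it is already listed there.
prepend_lib_path() {
    dir="${1%/}"    # drop one trailing slash so /usr/local/lib/ and /usr/local/lib compare equal
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*|*":$dir/:"*)
            ;;      # already present, nothing to do
        *)
            LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

prepend_lib_path /usr/local/lib
```

Calling the helper a second time with the same directory leaves the variable unchanged, so it should be safe to put in a shell startup file.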
OpenCL
Software:
sudo apt-get install clinfo ocl-icd-opencl-dev opencl-headers
There is a partial OpenCL implementation for the Raspberry Pi (Embedded Profile, targeting OpenCL 1.2 as the last version that can be completely supported). I compiled and installed it according to the instructions in a separate Git repository. Building VC4CL with the settings cmake -DBUILD_TESTING=ON -DBUILD_ICD=ON .. and then sudo make install TestVC4CL went through without a hitch. (I say that, but at some point closing Chromium on the Pi made the difference between the compilation succeeding and being killed for lack of memory.) I did not compile the hello_word.cl test. I don't know whether the ICD is necessary, but at least it didn't hurt, so I activated it as instructed using
echo /usr/local/lib/libVC4CL.so | sudo tee -a /etc/OpenCL/vendors/VC4CL.icd
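Because tee -a appends, running that command more than once leaves duplicate lines in the .icd file, so it may be worth checking what ended up in there. The following is a rough sketch of such a check, not something from the VC4CL instructions; the function name check_icd_file is invented here, and it assumes every entry is an absolute path as in this post (an ICD file may also name a bare library that the dynamic loader resolves, which this check would wrongly report as missing).

```shell
#!/bin/sh
# Hypothetical helper: list every library named in an ICD file and say
# whether that path exists. Returns 1 if any entry is missing, 2 if the
# ICD file itself cannot be read.
check_icd_file() {
    icd="$1"
    rc=0
    [ -r "$icd" ] || { echo "cannot read $icd"; return 2; }
    while IFS= read -r lib || [ -n "$lib" ]; do
        [ -n "$lib" ] || continue          # skip blank lines
        if [ -e "$lib" ]; then
            echo "ok      $lib"
        else
            echo "missing $lib"
            rc=1
        fi
    done < "$icd"
    return "$rc"
}
```

On the setup described here it would be called as check_icd_file /etc/OpenCL/vendors/VC4CL.icd.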
The outcome is a properly installed OpenCL, it seems. (I may be running too many commands with superuser rights in this post, but superuser permissions are needed for running on the GPU, so all hope is lost anyway. The sudo here is definitely necessary to access /dev/mem.)
$ sudo clinfo
Number of platforms 1
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Platform Vendor doe300
Platform Version OpenCL 1.2 VC4CL 0.4.9999 (842d444)
Platform Profile EMBEDDED_PROFILE
Platform Extensions cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters
Platform Extensions function suffix VC4CL
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices 1
Device Name VideoCore IV GPU
Device Vendor Broadcom
Device Vendor ID 0x14e4
Device Version OpenCL 1.2 VC4CL 0.4.9999 (842d444)
Driver Version 0.4.9999
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile EMBEDDED_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 1
Max clock frequency 300MHz
Core Temperature (Altera) 62 C
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 12x12x12
Max work group size 12
Preferred work group size multiple 1
Preferred / native vector sizes
char 16 / 16
short 16 / 16
int 16 / 16
long 0 / 0
half 0 / 0 (n/a)
float 16 / 16
double 0 / 0 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero Yes
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 79691776 (76MiB)
Error Correction support No
Max memory allocation 79691776 (76MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 64 bytes
Alignment of base address 512 bits (64 bytes)
Global Memory cache type Read/Write
Global Memory cache size 32768 (32KiB)
Global Memory cache line size 64 bytes
Image support No
Local memory type Global
Local memory size 79691776 (76MiB)
Max number of constant args 32
Max constant buffer size 79691776 (76MiB)
Max size of kernel argument 256
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
IL version SPIR-V_1.5 SPIR_1.2
SPIR versions 1.2
printf() buffer size 0
Built-in kernels (n/a)
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) OpenCL for the Raspberry Pi VideoCore IV GPU
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [VC4CL]
clCreateContext(NULL, ...) [default] Success [VC4CL]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name OpenCL for the Raspberry Pi VideoCore IV GPU
Device Name VideoCore IV GPU
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.12
ICD loader Profile OpenCL 2.2
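The two numbers in that output that matter most for what follows are the maximum work group size and the global memory size. As a convenience, they can be pulled out of the clinfo text with a small filter. This is just a sketch under the assumption that the labels are spelled exactly as in the output above; the function name clinfo_number is made up.

```shell
#!/bin/sh
# Hypothetical helper: read clinfo text on stdin and print the first number
# that follows the given label at the start of a line (leading blanks ignored).
clinfo_number() {
    awk -v label="$1" '
        { sub(/^[ \t]+/, "") }              # strip leading indentation
        index($0, label) == 1 {             # line starts with the label
            rest = substr($0, length(label) + 1)
            if (match(rest, /[0-9]+/)) {
                print substr(rest, RSTART, RLENGTH)
                exit
            }
        }'
}
```

For example, sudo clinfo | clinfo_number 'Max work group size' should print the device limit on a setup like the one above, and the label 'Global memory size' gives the size in bytes.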
BEAGLE
The first time I tried compiling BEAGLE, the tests complained about the missing SSE instruction set, so I configured BEAGLE as
./configure --disable-sse
and compiled it as instructed. make check fails with
FAIL: tinytest
============================================================================
Testsuite summary for libhmsbeagle 3.2.0
============================================================================
# TOTAL: 1
# PASS: 0
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See examples/tinytest/test-suite.log
Please report to [email protected]
============================================================================
due to CL_INVALID_WORK_GROUP_SIZE from file <GPUInterfaceOpenCL.cpp>, line 584. I put a debug output there to check the work group size the code wants to use, and it shows the global work size array has entries [256, 16, 1]. clinfo said that the GPU has “Max work item sizes 12x12x12; Max work group size 12”, so that's not surprising, and I tried to continue.
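To make the mismatch concrete: in OpenCL, a kernel launch with an explicit local work size is rejected with CL_INVALID_WORK_GROUP_SIZE when the number of work-items in one work-group (the product of the local dimensions) exceeds the device's maximum work group size. The debug output above only shows the global work size, so the local size BEAGLE actually passes is not known from this post; the dimensions used below are purely illustrative. The helper name fits_work_group is invented for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: does a work-group of the given dimensions fit a
# device limit? Usage: fits_work_group MAX_GROUP_SIZE DIM [DIM ...]
# Succeeds (exit status 0) only if the product of the DIMs is <= the limit.
fits_work_group() {
    max="$1"; shift
    total=1
    for d in "$@"; do
        total=$((total * d))
    done
    echo "work-group of $total items, device limit $max"
    [ "$total" -le "$max" ]
}
```

With the limit of 12 reported by clinfo, fits_work_group 12 16 1 1 fails while fits_work_group 12 4 3 1 succeeds, which suggests that any block size chosen with a typical desktop GPU in mind would need to shrink considerably for this device.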
I ran
sudo LD_LIBRARY_PATH=/usr/local/lib java -Djava.library.path="/usr/local/lib" -jar ~/.beast/2.6/BEAST/lib/launcher.jar -beagle -beagle_GPU beast.xml
on a random BEAST2 XML I had lying around, not particularly chosen to be small or anything, and it failed with an OpenCL error: CL_OUT_OF_RESOURCES from file <GPUInterfaceOpenCL.cpp>, line 787.
(Running it without -beagle_GPU works, unlike what I wrote in #156, and is a factor of 2.5 faster than BEAST without BEAGLE, so that's quite good already. It's still about a factor of 8 slower than the un-BEAGLEd BEAST on my generic laptop, but maybe in the long run it's useful to someone for more than teaching. After all, a Pi is less than one eighth the cost of a laptop…)