Towards Beagle+OpenCL on a Raspberry Pi

*This is less a BEAGLE issue, and more a post for your information. Maybe it can be tested, completed, and put somewhere useful.*

I got OpenCL, BEAGLE and BEAST2 to work together on a Raspberry Pi in theory. The practical use still suffers from the fact that tests and Beast runs seems to go beyond the limits of the tiny GPU – maybe there's a way to mitigate that.
In this post, I may be forgetting steps that I took in my process of trial-and-error getting to the point I got to. I have a second Pi with a clean environment coming in a few days, I'll try to follow my own instructions and update them where necessary afterwards.

# Operating Environment
I have a Raspberry Pi 3 model B V1.2, with an out-of-the-box Raspbian 10 (buster).
Shared libraries go into `/usr/local/lib`, so I work with
```
LD_LIBRARY_PATH=/usr/local/lib/
```
throughout.

# OpenCL
Software:
```
sudo apt-get install clinfo ocl-icd-opencl-dev opencl-headers
```
There is a partial (Embedded Profile, targeting OpenCL version 1.2 as the last one that can be completely supported) [OpenCL implementation for the Raspberry Pi](https://github.com/doe300/VC4CL). I compiled and installed it according to the instructions in [a separate Git repository](https://github.com/aaguiar96/VC4CL-Install), but building VC4CL with settings `cmake -DBUILD_TESTING=ON -DBUILD_ICD=ON ..`, and then `sudo make install TestVC4CL` went through without a hitch (I say that – at some point, closing the Chromium on the Pi made the difference between the compilation succeding and being killed for lack of memory.) I did not compile the `hello_word.cl` test. I don't know whether the ICD is necessary, but at least it didn't hurt, so I activated it as instructed using
```
echo /usr/local/lib/libVC4CL.so | sudo tee -a /etc/OpenCL/vendors/VC4CL.icd
```
The outcome is a properly installed OpenCL, it seems. (I may be running too many commands with super user rights in this post, but [super user permissions are needed for running on the GPU](https://github.com/doe300/VC4CL#security-considerations), so all hope is lost anyway. The `sudo` here is definitely necessary to access `/dev/mem`.)
```
$ sudo clinfo 
Number of platforms                               1
  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
  Platform Vendor                                 doe300
  Platform Version                                OpenCL 1.2 VC4CL 0.4.9999 (842d444)
  Platform Profile                                EMBEDDED_PROFILE
  Platform Extensions                             cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters
  Platform Extensions function suffix             VC4CL

  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices                                 1
  Device Name                                     VideoCore IV GPU
  Device Vendor                                   Broadcom
  Device Vendor ID                                0x14e4
  Device Version                                  OpenCL 1.2 VC4CL 0.4.9999 (842d444)
  Driver Version                                  0.4.9999
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  EMBEDDED_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             300MHz
  Core Temperature (Altera)                       62 C
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             12x12x12
  Max work group size                             12
  Preferred work group size multiple              1
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                               16 / 16      
    int                                                 16 / 16      
    long                                                 0 / 0       
    half                                                 0 / 0        (n/a)
    float                                               16 / 16      
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              79691776 (76MiB)
  Error Correction support                        No
  Max memory allocation                           79691776 (76MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             64 bytes
  Alignment of base address                       512 bits (64 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        32768 (32KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   No
  Local memory type                               Global
  Local memory size                               79691776 (76MiB)
  Max number of constant args                     32
  Max constant buffer size                        79691776 (76MiB)
  Max size of kernel argument                     256
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    IL version                                    SPIR-V_1.5 SPIR_1.2
    SPIR versions                                 1.2
  printf() buffer size                            0
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_int16 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_khr_il_program cl_khr_spir cl_khr_create_command_queue cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_khr_extended_versioning cl_khr_spirv_no_integer_wrap_decoration cl_vc4cl_performance_counters

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  OpenCL for the Raspberry Pi VideoCore IV GPU
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [VC4CL]
  clCreateContext(NULL, ...) [default]            Success [VC4CL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 OpenCL for the Raspberry Pi VideoCore IV GPU
    Device Name                                   VideoCore IV GPU

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2
```

# Beagle
First time I tried compiling BEAGLE, the tests complained about the missing SSE instruction set, so I configured BEAGLE as
```
./configure --disable-sse
```
and compiled it as instructed. `make check` fails with
```
FAIL: tinytest
============================================================================
Testsuite summary for libhmsbeagle 3.2.0
============================================================================
# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See examples/tinytest/test-suite.log
Please report to beagle-dev@googlegroups.com
============================================================================
```
due to `CL_INVALID_WORK_GROUP_SIZE from file <GPUInterfaceOpenCL.cpp>, line 584`. I put a debug output there to check the work group size the code wants to use, and it shows the global work size array has entries [256, 16, 1] – `clinfo` said that the GPU has “Max work item sizes 12x12x12; Max work group size 12” so that's not surprising and I tried to continue.

I ran 
```
sudo LD_LIBRARY_PATH=/usr/local/lib java -Djava.library.path="/usr/local/lib" -jar ~/.beast/2.6/BEAST/lib/launcher.jar -beagle -beagle_GPU beast.xml
```
on a random BEAST2 XML I had lying around, not particularly chosen to be small or anything, and it failed with an `OpenCL error: CL_OUT_OF_RESOURCES from file <GPUInterfaceOpenCL.cpp>, line 787.` (Running it without `-beagle_GPU` works, different from what I wrote in #156, and is a factor of 2.5 faster than BEAST without BEAGLE, so that's quite good already. It's still about a factor of 8 slower than the un-BEAGLEd BEAST on my generic laptop, but maybe it's useful for someone beyond teaching in the long run. After all, a Pi is less than one eighth the cost of a laptop…)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Towards Beagle+OpenCL on a Raspberry Pi #157

Operating Environment

OpenCL

Beagle

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Towards Beagle+OpenCL on a Raspberry Pi #157

Description

Operating Environment

OpenCL

Beagle

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions