
Find simplex parallelisation


The find_simplex function of compGeometeR is written partly in R, so unlike other functions that are almost entirely written in C, it can start to run slowly when dealing with hundreds of millions of test points.

Fortunately, processing all these test points is what is known as an embarrassingly parallel problem: because each test point can be processed individually, without reference to any of the other test points, it is a relatively easy task to break the test points into separate chunks and process them in parallel, a coding process called parallelisation.
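To make the chunking idea concrete before looking at the full script, here is a minimal sketch using splitIndices() from the base parallel package, which divides a set of indices into roughly equal chunks (the script below computes equivalent chunk boundaries by hand):

library(parallel)

# Divide 10 point indices among 4 workers; returns a list of
# roughly equal-length index vectors, one per worker
splitIndices(10, 4)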

R packages such as foreach and doParallel make this process straightforward, and the following script puts the idea into practice:

#-------------------------------------------------------------------------------

# R version 3.5.3 (2019-03-11) -- "Great Truth"
# Copyright (C) 2019 The R Foundation for Statistical Computing
# Platform: x86_64-w64-mingw32/x64 (64-bit)

#-------------------------------------------------------------------------------

library(compGeometeR) # version 1.0
library(doParallel) # version 1.0.15
library(foreach) # version 1.5.0
library(tictoc) # version 1.0

#-------------------------------------------------------------------------------

set.seed(0) # To reproduce the same results

# Create random data, alpha-complex, and grid of coordinates to check
x = runif(100)
y = runif(100)
p = data.frame(x,y)
a_complex = alpha_complex(points = p, alpha = 0.05)
point_grid = grid_coordinates(mins=c(0,0), maxs=c(1,1), spacing=0.0001)
X = unique(point_grid[,1])
Y = unique(point_grid[,2])

#-------------------------------------------------------------------------------

# Serial processing
tic()
grid_a_complex_serial = find_simplex(a_complex, point_grid)
toc()

#-------------------------------------------------------------------------------

# Parallel processing
tic()
chunks = 4
chunk_size = ceiling(nrow(point_grid) / chunks)
cl = makePSOCKcluster(chunks)
registerDoParallel(cl)
grid_a_complex_parallel = foreach(i = 1:chunks, .combine = c) %dopar% {
  library(compGeometeR) # each worker needs the package loaded
  # Chunk i covers rows ((i - 1) * chunk_size) + 1 to i * chunk_size,
  # with the final chunk clipped to the number of rows available
  start = ((i - 1) * chunk_size) + 1
  end = min(i * chunk_size, nrow(point_grid))
  chunk_data = point_grid[start:end, ]
  rownames(chunk_data) = 1:nrow(chunk_data)
  find_simplex(a_complex, chunk_data)
}
stopCluster(cl)
toc()

#-------------------------------------------------------------------------------

Here we are dealing with 100,020,001 test points, and when processed in serial on my laptop this takes 119.7 seconds. But on a quad-core laptop, by splitting the processing into 4 chunks, we can reduce the processing time to 71.81 seconds.

The usual logic applies with parallelisation: due to the overhead of setting up and resolving the parallel processes, it will not always result in improved performance. You can see this principle here, in that using 4 processes did not take a quarter (25%) of the time, but around 60%, so closer to a half than a quarter; and with small numbers of test points the parallel version will actually take longer, and require you to take more time to write more complex code! So I would suggest that parallelisation is really only worth trying if you cannot cope with the processing times resulting from the serial use of find_simplex.
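As a quick back-of-the-envelope check, the speed-up and parallel efficiency can be computed directly from the two timings reported above:

serial_time = 119.7    # seconds, from the serial run
parallel_time = 71.81  # seconds, from the 4-process run
serial_time / parallel_time       # speed-up of ~1.67x, well short of the ideal 4x
(serial_time / parallel_time) / 4 # parallel efficiency of ~0.42 per process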

That said, we did achieve a notable speed-up here with just 4 parallel processes when checking 100,020,001 points, so hopefully this example will help those who need to process points faster than can be done in serial. And of course, given the problem is embarrassingly parallel, there is certainly the potential to scale up significantly on high-performance computing infrastructure with many processors if a vast number of points needs to be processed.
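If you do move to a machine with more cores, the only change needed is the number of chunks. As a sketch (not part of the original script), the cluster can be sized from the available hardware rather than hard-coded, with one core left free so the machine stays responsive:

# Size the cluster from the hardware rather than hard-coding 4 chunks
chunks = max(parallel::detectCores() - 1, 1)
chunk_size = ceiling(nrow(point_grid) / chunks)
cl = makePSOCKcluster(chunks)
registerDoParallel(cl)
# ... the same foreach() loop as above ...
stopCluster(cl)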
