
Faster Alternative for HIP over a Huge Spatial Area #35

Open
adwiputra opened this issue Jun 22, 2020 · 6 comments

Comments

@adwiputra

Hi,

I have been trying to implement Halton Iterative Partitioning using the sdraw function over a 17 million ha SpatialPolygon. I specified the desired number of samples (n = 1000), and the process has been running for two weeks now. I need a quicker workaround to obtain spatially balanced samples, so I am considering dividing the area into 250 grid cells and running SDraw in each cell with a smaller sample size (n = 4) to reach the same total of 1000.
Will this workaround compromise the spatial balance that would otherwise be obtained from running HIP on the whole area? Is it safe to assume the result will still be spatially balanced, even if not ideally so?
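For reference, here is roughly what I have in mind (a rough sketch only; study_area, the file path, and the grid dimensions are placeholders, and it assumes the sp, raster, rgeos, and SDraw packages):

library(sp)
library(raster)
library(SDraw)

# Hypothetical frame: the 17 million ha study area
study_area <- shapefile("study_area.shp")

# Coarse grid of about 250 cells over the frame, clipped to the polygon
grd <- raster(extent(study_area), nrows = 25, ncols = 10,
              crs = proj4string(study_area))
grd[] <- 1:ncell(grd)
grd_polys <- rasterToPolygons(grd)
grd_polys <- raster::intersect(grd_polys, study_area)  # needs rgeos

# Draw a small sample (n = 4) in every cell and pool the coordinates
samples <- lapply(seq_len(length(grd_polys)), function(i) {
  sdraw(grd_polys[i, ], n = 4, type = "HIP")
})
all_pts <- do.call(rbind, lapply(samples, coordinates))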

Thanks so much!

@tmcd82070
Owner

We may have a subtle bug in hip.polygon. We are seeing some strangely long run times, and it should not take that long.

For polygons, HIP and BAS are functionally equivalent. samp <- sdraw(x, 1000, type="BAS") should work in a few seconds.
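
For example (x stands in for your SpatialPolygons* frame):

library(SDraw)
samp <- sdraw(x, 1000, type="BAS")
plot(x)
points(samp, pch = 16, cex = 0.3)  # quick visual check of the spatial spread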

@adwiputra
Author

Thanks for your prompt response, @tmcd82070!

Just a quick clarification question: I'm also having a similar issue with hip.point, where I tried sampling 300 out of 14 million points. Would you recommend the same alternative, i.e., bas.point instead of hip.point?

@tmcd82070
Owner

14 million points is a bit big. There is no guarantee that either HIP or BAS will work on that many, but tentatively yes, BAS should have a better chance of finishing than HIP on that many points. I am assuming your points are in a regular lattice.

You may end up needing to roll your own routine, i.e., look at the code for bas.point. The longest part of the calculation is computing the minimum distance between any two points. If you already know that distance, you can bypass the call to dist() and assign the minimum distance yourself. The rest of the code should execute pretty quickly.

@tmcd82070
Owner

@adwiputra : Here is a modified version of bas.point with the changes I alluded to in the previous comments. If you go this way, let me know whether it works (there may still be issues with 14 million points).

I've simply commented out the few lines that compute d. The new parameter d is the minimum distance between any two points; x and n are the same as in bas.point.

I still think there may be an issue with creating 14 million little polygons, so perhaps the polygon route is your only option at this point.

bas.bigPointFrame <- function (x, n, d) 
{
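    # x: a SpatialPoints* frame; n: desired sample size;
    # d: the known minimum distance between any two points, supplied so the
    #    expensive stats::dist() computation below can be skipped.
    # Requires the sp package and SDraw's bas.polygon().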
    if (!(inherits(x, "SpatialPoints"))) 
        stop("Must call bas.point with a SpatialPoints* object.")
    N <- length(x)
    if (n > N) {
        warning("Sample size greater than frame size. Census taken.")
        n <- N
    }
    pts <- coordinates(x)
    rnames.x <- row.names(x)
    # d <- min(stats::dist(pts))
    # if (d < (.Machine$double.eps * 10)) {
    #    stop("Minimum distance between points is zero: are there duplicate coordinates?")
    # }
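    # Convert the minimum spacing to a half-width so that the square "pixels"
    # constructed around neighboring points cannot overlap.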
    d <- d/(2 * sqrt(2))
    ll.x <- pts[, 1] - d
    ll.y <- pts[, 2] - d
    ur.x <- pts[, 1] + d
    ur.y <- pts[, 2] + d
    pixels <- vector("list", nrow(pts))
    for (i in 1:nrow(pts)) {
        Sr1 <- Polygon(cbind(c(ll.x[i], ll.x[i], ur.x[i], ur.x[i], 
            ll.x[i]), c(ll.y[i], ur.y[i], ur.y[i], ll.y[i], ll.y[i])))
        pixels[[i]] <- Polygons(list(Sr1), rnames.x[i])
    }
    pixels <- SpatialPolygons(pixels, 1:nrow(pts), proj4string = CRS(proj4string(x)))
    if (inherits(x, "SpatialPointsDataFrame")) {
        df <- data.frame(data.frame(x), row.names = row.names(pixels))
    }
    else {
        df <- data.frame(pts, row.names = row.names(pixels))
    }
    pixels <- SpatialPolygonsDataFrame(pixels, data = df)
    samp <- NULL
    crs.obj <- CRS(x@proj4string@projargs)
    cord.names <- dimnames(bbox(x))[[1]]
    df.col.names <- c("sampleID", "geometryID", names(df))
    df.col.names <- df.col.names[!(df.col.names %in% c(cord.names, 
        "optional"))]
    n.iter <- n
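    # Repeatedly draw a BAS sample of the pixel polygons (slightly oversampled),
    # drop duplicated pixels, and continue until n distinct points are selected.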
    repeat {
        samp2 <- bas.polygon(pixels, round(n.iter * (1 + n.iter/N)))
        bas.bbox <- attr(samp2, "bas.bbox")
        ran.start <- attr(samp2, "random.start")
        cords.df <- samp2@data
        cords <- cords.df[, cord.names]
        samp2 <- SpatialPointsDataFrame(cords, data = cords.df, 
            proj4string = crs.obj)
        if (length(samp) > 0) {
            samp.cords <- rbind(coordinates(samp), cords)
            samp.df <- rbind(data.frame(samp)[, df.col.names, 
                drop = FALSE], cords.df[, df.col.names, drop = FALSE])
        }
        else {
            samp.cords <- cords
            samp.df <- cords.df[, df.col.names, drop = FALSE]
            m <- ran.start
        }
        dups <- duplicated(samp.df$geometryID)
        samp <- SpatialPointsDataFrame(samp.cords[!dups, ], data = samp.df[!dups, 
            ], proj4string = crs.obj)
        if (length(samp) >= n) {
            samp <- samp[1:n, ]
            break
        }
        n.iter <- n - length(samp)
        pixels <- pixels[!(row.names(pixels) %in% samp$geometryID), 
            ]
    }
    samp@data$sampleID <- 1:nrow(samp)
    attr(samp, "frame") <- deparse(substitute(x))
    attr(samp, "frame.type") <- "point"
    attr(samp, "sample.type") <- "BAS"
    attr(samp, "random.start") <- m
    attr(samp, "bas.bbox") <- bas.bbox
    samp
}
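
If your lattice spacing is known, the call would look something like this (hypothetical names; pts is your SpatialPoints* frame on a regular 100 m lattice, so d = 100):

# d is the lattice spacing, i.e., the minimum distance between any two points
samp <- bas.bigPointFrame(pts, n = 300, d = 100)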

@adwiputra
Author

Thanks! My first attempt to run it failed after using 20 GB of RAM; it returned a memory-limit error. I'm moving the process to a computer with more RAM and will keep you updated on how things go.

@adwiputra
Author

A similar thing occurred after one hour of running with 128 GB of RAM. Interesting. I guess 14 million points is not affordable in the current processing framework. Anyway, thanks for your kind help @tmcd82070!
