
Faster Alternative for HIP over a Huge Spatial Area #35

Open
adwiputra opened this issue Jun 22, 2020 · 6 comments

Comments

@adwiputra

Hi,

I have been trying to implement Halton Iterative Partitioning using the sdraw function over a 17 million ha SpatialPolygon. I specified the desired number of samples (n = 1000), and the process has been running for two weeks now. I need a quicker workaround to obtain spatially balanced samples, so I am considering dividing the area into 250 grid cells and running SDraw in each cell with a smaller sample size (n = 4) to reach the same total of 1000.
Will this workaround compromise the spatial balance that would otherwise be obtained from running HIP on the whole area? Is it safe to assume the result will still be spatially balanced, even if not ideally so?
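For reference, here is roughly what I have in mind (a rough sketch only; study_area, the file path, and the grid dimensions are placeholders, and it assumes the sp, raster, rgeos, and SDraw packages):

library(sp)
library(raster)
library(SDraw)

# Hypothetical frame: the 17 million ha study area
study_area <- shapefile("study_area.shp")

# Coarse grid of about 250 cells over the frame, clipped to the polygon
grd <- raster(extent(study_area), nrows = 25, ncols = 10,
              crs = proj4string(study_area))
grd[] <- 1:ncell(grd)
grd_polys <- rasterToPolygons(grd)
grd_polys <- raster::intersect(grd_polys, study_area)  # needs rgeos

# Draw a small sample (n = 4) in every cell and pool the coordinates
samples <- lapply(seq_len(length(grd_polys)), function(i) {
  sdraw(grd_polys[i, ], n = 4, type = "HIP")
})
all_pts <- do.call(rbind, lapply(samples, coordinates))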

Thanks so much!

@tmcd82070
Owner

We may have a subtle bug in hip.polygon. We are seeing some strangely long run times, and it should not take that long.

For polygons, HIP and BAS are functionally equivalent. samp <- sdraw(x, 1000, type="BAS") should work in a few seconds.
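
For example (x stands in for your SpatialPolygons* frame):

library(SDraw)
samp <- sdraw(x, 1000, type="BAS")
plot(x)
points(samp, pch = 16, cex = 0.3)  # quick visual check of the spatial spread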

@adwiputra
Author

Thanks for your prompt response, @tmcd82070!

Just a quick clarification question: I'm also having a similar issue with hip.point, where I tried sampling 300 out of 14 million points. Would you recommend the same alternative, i.e., bas.point instead of hip.point?

@tmcd82070
Owner

14 million points is a bit big. There is no guarantee that either HIP or BAS will work on that many, but tentatively yes, BAS should have a better chance of finishing than HIP on that many points. I am assuming your points are in a regular lattice.

You may end up needing to roll your own routine, i.e., look at the code for bas.point. The longest part of the calculation is computing the minimum distance between any two points. If you already know that distance, you can bypass the call to dist() and assign the minimum distance yourself. The rest of the code should execute pretty quickly.

@tmcd82070
Owner

@adwiputra : Here is a modified version of bas.point with the changes I alluded to in the previous comments. If you go this way, let me know whether it works (there may still be issues with 14 million points).

I've simply commented out the few lines that compute d. The new parameter d is the minimum distance between any two points; x and n are the same as in bas.point.

I still think there may be an issue with creating 14 million little polygons, so perhaps the polygon route is your only option at this point.

bas.bigPointFrame <- function (x, n, d) 
{
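    # x: a SpatialPoints* frame; n: desired sample size;
    # d: the known minimum distance between any two points, supplied so the
    #    expensive stats::dist() computation below can be skipped.
    # Requires the sp package and SDraw's bas.polygon().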
    if (!(inherits(x, "SpatialPoints"))) 
        stop("Must call bas.point with a SpatialPoints* object.")
    N <- length(x)
    if (n > N) {
        warning("Sample size greater than frame size. Census taken.")
        n <- N
    }
    pts <- coordinates(x)
    rnames.x <- row.names(x)
    # d <- min(stats::dist(pts))
    # if (d < (.Machine$double.eps * 10)) {
    #    stop("Minimum distance between points is zero: are there duplicate coordinates?")
    # }
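    # Convert the minimum spacing to a half-width so that the square "pixels"
    # constructed around neighboring points cannot overlap.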
    d <- d/(2 * sqrt(2))
    ll.x <- pts[, 1] - d
    ll.y <- pts[, 2] - d
    ur.x <- pts[, 1] + d
    ur.y <- pts[, 2] + d
    pixels <- vector("list", nrow(pts))
    for (i in 1:nrow(pts)) {
        Sr1 <- Polygon(cbind(c(ll.x[i], ll.x[i], ur.x[i], ur.x[i], 
            ll.x[i]), c(ll.y[i], ur.y[i], ur.y[i], ll.y[i], ll.y[i])))
        pixels[[i]] <- Polygons(list(Sr1), rnames.x[i])
    }
    pixels <- SpatialPolygons(pixels, 1:nrow(pts), proj4string = CRS(proj4string(x)))
    if (inherits(x, "SpatialPointsDataFrame")) {
        df <- data.frame(data.frame(x), row.names = row.names(pixels))
    }
    else {
        df <- data.frame(pts, row.names = row.names(pixels))
    }
    pixels <- SpatialPolygonsDataFrame(pixels, data = df)
    samp <- NULL
    crs.obj <- CRS(x@proj4string@projargs)
    cord.names <- dimnames(bbox(x))[[1]]
    df.col.names <- c("sampleID", "geometryID", names(df))
    df.col.names <- df.col.names[!(df.col.names %in% c(cord.names, 
        "optional"))]
    n.iter <- n
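    # Repeatedly draw a BAS sample of the pixel polygons (slightly oversampled),
    # drop duplicated pixels, and continue until n distinct points are selected.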
    repeat {
        samp2 <- bas.polygon(pixels, round(n.iter * (1 + n.iter/N)))
        bas.bbox <- attr(samp2, "bas.bbox")
        ran.start <- attr(samp2, "random.start")
        cords.df <- samp2@data
        cords <- cords.df[, cord.names]
        samp2 <- SpatialPointsDataFrame(cords, data = cords.df, 
            proj4string = crs.obj)
        if (length(samp) > 0) {
            samp.cords <- rbind(coordinates(samp), cords)
            samp.df <- rbind(data.frame(samp)[, df.col.names, 
                drop = FALSE], cords.df[, df.col.names, drop = FALSE])
        }
        else {
            samp.cords <- cords
            samp.df <- cords.df[, df.col.names, drop = FALSE]
            m <- ran.start
        }
        dups <- duplicated(samp.df$geometryID)
        samp <- SpatialPointsDataFrame(samp.cords[!dups, ], data = samp.df[!dups, 
            ], proj4string = crs.obj)
        if (length(samp) >= n) {
            samp <- samp[1:n, ]
            break
        }
        n.iter <- n - length(samp)
        pixels <- pixels[!(row.names(pixels) %in% samp$geometryID), 
            ]
    }
    samp@data$sampleID <- 1:nrow(samp)
    attr(samp, "frame") <- deparse(substitute(x))
    attr(samp, "frame.type") <- "point"
    attr(samp, "sample.type") <- "BAS"
    attr(samp, "random.start") <- m
    attr(samp, "bas.bbox") <- bas.bbox
    samp
}
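
If your lattice spacing is known, the call would look something like this (hypothetical names; pts is your SpatialPoints* frame on a regular 100 m lattice, so d = 100):

# d is the lattice spacing, i.e., the minimum distance between any two points
samp <- bas.bigPointFrame(pts, n = 300, d = 100)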

@adwiputra
Author

Thanks! My first attempt to run it failed after using 20 GB of RAM; it returned a memory-limit error. I'm moving the process to a computer with more RAM and will keep you updated on how things go.

@adwiputra
Author

A similar thing occurred after one hour of running with 128 GB of RAM. Interesting. I guess 14 million points is not affordable in the current processing framework. Anyway, thanks for your kind help @tmcd82070!
