Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_by #49

Open
mdsumner opened this issue Aug 31, 2017 · 4 comments
Open

group_by #49

mdsumner opened this issue Aug 31, 2017 · 4 comments

Comments

@mdsumner
Copy link
Collaborator

mdsumner commented Aug 31, 2017

Discussed with @tremenyi the lazy approach of tidync %>% hyper_filter should also allow group_by to accumulate grouping variables. We'll need

  • ability to accumulate axis-groupings or variable groupings
  • ability to ungroup

Maybe we can use tbl_df?

This works (fixed just now in #50 ):

f <- "/rdsi/PUBLIC/raad/data/ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc"
 
library(tidync)
library(dplyr)
tidync(f) %>% hyper_filter(time = index < 10) %>% 
  hyper_tbl_cube() %>% group_by(time, lon) %>% 
  summarize(sst = mean(sst))

It's pretty slow though, it's actually better to just do this

tidync(f) %>% hyper_filter(time = index < 10) %>% 
  hyper_tibble() %>% group_by(time, lon) %>% 
  summarize(sst = mean(sst))

which shows we are on the right track I guess. You can't use tbl_cube to group_by data variables, it's only by dimension. (You can group by more than one dimension). It depends which dimension are grouped by which approach is faster, and it would depend on other stuff tool.

@jimjam-slam
Copy link

I was actually wondering recently whether there'd be any performance benefit to have hyper_tibble accept group_by on nest_by arguments and directly return grouped or nested tibbles. With nested tibbles, in particular, I found that nesting takes up a pretty big chunk of time (600ms, or around half the time of hyper_tibble, in the case of my ~2M cell NetCDF):

screenshot 2018-05-26 12 29 19

Maybe that's just how long it takes; I'm not too familiar with nest. But if hyper_tibble is running off the whole slab into memory in whatever order and then nest then has to move chunks of it around in order to build a list-column, I can't help but wonder if there'd be a significant performance benefit to just delivering it that way from disk in the first place 🙂

@mdsumner
Copy link
Collaborator Author

Ooh I see what you are saying, this is a good idea thanks! For a regular raster, you only need one parent row too. I will explore this

@mdsumner
Copy link
Collaborator Author

Here's a direct group_by only for sf in-dev, totally WIP

f <- raadtools::sshfiles()$fullname[1]
f <- raadtools::sstfiles()$fullname[1]

x <- tidync::tidync(f)
p <- spex::polygonize(raster::aggregate(raster::raster(f) %/% 5, fact = 10)) %>% 
  group_by(Daily.sea.surface.temperature) %>% summarize()
p <- st_cast(p)
#p <- spex::polygonize(raster::raster(extent(raster::raster(f)), nrows = 50, ncols = 80))
#p$Daily.sea.surface.temperature <- NULL
# res <-  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < -30) %>% 
#   group_by(p) %>%   summarize(value = mean(sla, na.rm = T))
res <-  x %>% hyper_filter(lon = lon > 147, lat = lat < -30) %>% 
  group_by(p) %>%   summarize(value = mean(sst, na.rm = T))
p$value <- NA
p$value[match(res$tidync_group_, 1:nrow(p))] <- res$value
plot(p["value"], reset = F)
maps::map("world2", add = T)

@mdsumner
Copy link
Collaborator Author

Her'es the code for this, ripped out to simplify the current build

#' Dplyr verbs for tidync
#' 
#' Very experimental
#' @param x tidync object
#' @param .x tidync object
#' @param .data tidync object
#' @param ... passed into dplyr engine
#' @param add add to existing groups
#' @param shape experimental tidync
#' @rdname dplyr-verbs
#' @examples 
#' \dontrun{
#' #x <- tidync(raadtools::sshfiles()$fullname[1])
#' #  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < - 30) %>% 
#' # group_by(longitude) %>%   summarize(adt = mean(adt, na.rm = TRUE))
#' #  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < - 30) %>% 
#' #group_by(longitude) %>%   summarize_all(mean, na.rm = TRUE)
#' }
tbl_vars.tidync <- function(x) {
  c(x$variable$name[x$variable$active], 
    x$dimension$name[x$dimension$active])
}
groups.tidync <- function(x) {
  x$groups
}
#' @rdname dplyr-verbs
summarise.tidync <- function(.data, ...) {
  if (inherits(.data$groups, "sf")) {
    ## 1. create a raster (transform sf if needed)
    
    ns <- 1:2
    ax <- active_axis_transforms(.data)[ns]
    X <- ax[[1]][[names(ax)[1]]][ax[[1]][["selected"]]]
    Y <- ax[[2]][[names(ax)[2]]][ax[[2]][["selected"]]]
    ## 1a. obtain nominal sf axes from group_by  TODO ^^
    ex <- c(range(X) + c(-1, 1) * diff(X[1:2])/2, 
            range(Y) + c(-1, 1) * diff(Y[1:2])/2)
    
    r <- raster::raster(raster::extent(ex), 
                   ncols = sum(ax[[1]][["selected"]]), nrows = sum(ax[[2]][["selected"]]))
    
    ## 2. fasterize groups to it
    .data$groups$ID <- 1:nrow(.data$groups)
    #rcell <- fasterize::fasterize(.data$groups, r, "ID")
    cells <- tabularaster::cellnumbers(r, .data$groups)
    cells <- dplyr::filter(cells, !is.na(.data$cell_))
    ht <- hyper_tibble(.data, na.rm = FALSE)
  #  browser()
    ht[["tidync_group_"]] <- NA
    
    ht[["tidync_cell_"]] <- raster::cellFromXY(r, as.matrix(ht[names(ax)]))
    
    ht[["tidync_group_"]][match(cells$cell_, ht$tidync_cell_)] <- cells$object_
    ## 3. run actual group_by with sf-ID
            ## danger, see the flip here ////!!!
    #ht[["tidync_group_"]] <- values(flip(rcell, "y"))
    ## could use setdiff here with cellnumbers ...
    ht <- dplyr::filter(ht, !is.na(.data$tidync_group_))
   return( ht %>% group_by(.data$tidync_group_) %>% summarise(...))
  }
  hyper_tibble(.data) %>% group_by(!!!.data$groups) %>% summarise(...)
}
#' @importFrom dplyr group_by ungroup summarise
#' @rdname dplyr-verbs
group_by.tidync <- function(.x, ..., add = FALSE, shape = NULL) {
  if (add) stop('groupings cannot be added to')
  groups <- rlang::quos(...)
  x$groupshape <- FALSE
  if(!is.null(shape)) {
    if (length(groups) > 0) warning("'shape' is set, so bare grouping names ignored")
    groups <- shape
    x$groupshape <- TRUE
  }  
  .x$groups <- groups
  .x
}
#' @rdname dplyr-verbs
ungroup.tidync <- function(x, ...) {
  x$groups <- NULL
  x
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants