group_by #49

mdsumner · 2017-08-31T12:42:50Z

Discussed with @tremenyi the lazy approach of tidync %>% hyper_filter should also allow group_by to accumulate grouping variables. We'll need

ability to accumulate axis-groupings or variable groupings
ability to ungroup

Maybe we can use tbl_df?

This works (fixed just now in #50 ):

f <- "/rdsi/PUBLIC/raad/data/ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc"
 
library(tidync)
library(dplyr)
tidync(f) %>% hyper_filter(time = index < 10) %>% 
  hyper_tbl_cube() %>% group_by(time, lon) %>% 
  summarize(sst = mean(sst))

It's pretty slow though, it's actually better to just do this

tidync(f) %>% hyper_filter(time = index < 10) %>% 
  hyper_tibble() %>% group_by(time, lon) %>% 
  summarize(sst = mean(sst))

which shows we are on the right track I guess. You can't use tbl_cube to group_by data variables, it's only by dimension. (You can group by more than one dimension). It depends which dimension are grouped by which approach is faster, and it would depend on other stuff tool.

The text was updated successfully, but these errors were encountered:

jimjam-slam · 2018-05-26T02:33:57Z

I was actually wondering recently whether there'd be any performance benefit to have hyper_tibble accept group_by on nest_by arguments and directly return grouped or nested tibbles. With nested tibbles, in particular, I found that nesting takes up a pretty big chunk of time (600ms, or around half the time of hyper_tibble, in the case of my ~2M cell NetCDF):

Maybe that's just how long it takes; I'm not too familiar with nest. But if hyper_tibble is running off the whole slab into memory in whatever order and then nest then has to move chunks of it around in order to build a list-column, I can't help but wonder if there'd be a significant performance benefit to just delivering it that way from disk in the first place 🙂

mdsumner · 2018-05-26T04:15:17Z

Ooh I see what you are saying, this is a good idea thanks! For a regular raster, you only need one parent row too. I will explore this

mdsumner · 2018-08-28T15:44:51Z

Here's a direct group_by only for sf in-dev, totally WIP

f <- raadtools::sshfiles()$fullname[1]
f <- raadtools::sstfiles()$fullname[1]

x <- tidync::tidync(f)
p <- spex::polygonize(raster::aggregate(raster::raster(f) %/% 5, fact = 10)) %>% 
  group_by(Daily.sea.surface.temperature) %>% summarize()
p <- st_cast(p)
#p <- spex::polygonize(raster::raster(extent(raster::raster(f)), nrows = 50, ncols = 80))
#p$Daily.sea.surface.temperature <- NULL
# res <-  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < -30) %>% 
#   group_by(p) %>%   summarize(value = mean(sla, na.rm = T))
res <-  x %>% hyper_filter(lon = lon > 147, lat = lat < -30) %>% 
  group_by(p) %>%   summarize(value = mean(sst, na.rm = T))
p$value <- NA
p$value[match(res$tidync_group_, 1:nrow(p))] <- res$value
plot(p["value"], reset = F)
maps::map("world2", add = T)

mdsumner · 2019-02-28T03:44:55Z

Her'es the code for this, ripped out to simplify the current build

#' Dplyr verbs for tidync
#' 
#' Very experimental
#' @param x tidync object
#' @param .x tidync object
#' @param .data tidync object
#' @param ... passed into dplyr engine
#' @param add add to existing groups
#' @param shape experimental tidync
#' @rdname dplyr-verbs
#' @examples 
#' \dontrun{
#' #x <- tidync(raadtools::sshfiles()$fullname[1])
#' #  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < - 30) %>% 
#' # group_by(longitude) %>%   summarize(adt = mean(adt, na.rm = TRUE))
#' #  x %>% hyper_filter(longitude = longitude > 147, latitude = latitude < - 30) %>% 
#' #group_by(longitude) %>%   summarize_all(mean, na.rm = TRUE)
#' }
tbl_vars.tidync <- function(x) {
  c(x$variable$name[x$variable$active], 
    x$dimension$name[x$dimension$active])
}
groups.tidync <- function(x) {
  x$groups
}
#' @rdname dplyr-verbs
summarise.tidync <- function(.data, ...) {
  if (inherits(.data$groups, "sf")) {
    ## 1. create a raster (transform sf if needed)
    
    ns <- 1:2
    ax <- active_axis_transforms(.data)[ns]
    X <- ax[[1]][[names(ax)[1]]][ax[[1]][["selected"]]]
    Y <- ax[[2]][[names(ax)[2]]][ax[[2]][["selected"]]]
    ## 1a. obtain nominal sf axes from group_by  TODO ^^
    ex <- c(range(X) + c(-1, 1) * diff(X[1:2])/2, 
            range(Y) + c(-1, 1) * diff(Y[1:2])/2)
    
    r <- raster::raster(raster::extent(ex), 
                   ncols = sum(ax[[1]][["selected"]]), nrows = sum(ax[[2]][["selected"]]))
    
    ## 2. fasterize groups to it
    .data$groups$ID <- 1:nrow(.data$groups)
    #rcell <- fasterize::fasterize(.data$groups, r, "ID")
    cells <- tabularaster::cellnumbers(r, .data$groups)
    cells <- dplyr::filter(cells, !is.na(.data$cell_))
    ht <- hyper_tibble(.data, na.rm = FALSE)
  #  browser()
    ht[["tidync_group_"]] <- NA
    
    ht[["tidync_cell_"]] <- raster::cellFromXY(r, as.matrix(ht[names(ax)]))
    
    ht[["tidync_group_"]][match(cells$cell_, ht$tidync_cell_)] <- cells$object_
    ## 3. run actual group_by with sf-ID
            ## danger, see the flip here ////!!!
    #ht[["tidync_group_"]] <- values(flip(rcell, "y"))
    ## could use setdiff here with cellnumbers ...
    ht <- dplyr::filter(ht, !is.na(.data$tidync_group_))
   return( ht %>% group_by(.data$tidync_group_) %>% summarise(...))
  }
  hyper_tibble(.data) %>% group_by(!!!.data$groups) %>% summarise(...)
}
#' @importFrom dplyr group_by ungroup summarise
#' @rdname dplyr-verbs
group_by.tidync <- function(.x, ..., add = FALSE, shape = NULL) {
  if (add) stop('groupings cannot be added to')
  groups <- rlang::quos(...)
  x$groupshape <- FALSE
  if(!is.null(shape)) {
    if (length(groups) > 0) warning("'shape' is set, so bare grouping names ignored")
    groups <- shape
    x$groupshape <- TRUE
  }  
  .x$groups <- groups
  .x
}
#' @rdname dplyr-verbs
ungroup.tidync <- function(x, ...) {
  x$groups <- NULL
  x
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

group_by #49

group_by #49

mdsumner commented Aug 31, 2017 •

edited

Loading

jimjam-slam commented May 26, 2018

mdsumner commented May 26, 2018

mdsumner commented Aug 28, 2018

mdsumner commented Feb 28, 2019

group_by #49

group_by #49

Comments

mdsumner commented Aug 31, 2017 • edited Loading

jimjam-slam commented May 26, 2018

mdsumner commented May 26, 2018

mdsumner commented Aug 28, 2018

mdsumner commented Feb 28, 2019

mdsumner commented Aug 31, 2017 •

edited

Loading