Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transpose(dt) allows to return list without promoting elements to maxtype #5805

Merged
merged 27 commits into from
Mar 16, 2024

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Dec 6, 2023

Closes #5639

Transpose now also supports transposing list columns.

  • Add feature
  • Docs
  • Tests (including fill)
  • News
  • Handle Factors consistently (do not copy levels)

@tdhock
Copy link
Member

tdhock commented Dec 7, 2023

Hi Ben thanks for sharing, looks interesting. I have never used transpose, so I was wondering what is the typical use case? with lists? The man page says

     This is particularly useful in tasks that require splitting a
     character column and assigning each part to a separate column.
     This operation is quite common enough that a function tstrsplit
     is exported.

I have used tstrsplit, so I guess I understand the use case with character, but I wonder if you could add an illustrative example, which would show how transpose is different / more flexible / useful with lists ?

@ben-schwen
Copy link
Member Author

ben-schwen commented Dec 7, 2023

The main use cases I have in my mind is when you receive a rowwise format or want to lapply some function rowwise. Not promoting to maximum type but returning lists would be favorable there to avoid additional casting.

dt = data.table(name=c("Anna", "Bob"), age=c(30,20))
fun = function(x) sprintf("Hi %d year old %s", x[[2]], x[[1]])
transpose(dt, list.cols = TRUE)[, lapply(.SD, fun)]
#                     V1                 V2
#                 <char>             <char>
# 1: Hi 30 year old Anna Hi 20 year old Bob
transpose(dt, list.cols = FALSE)[, lapply(.SD, fun)]
# Error in sprintf("Hi %d year old %s", x[[2]], x[[1]]) : 
#   invalid format '%d'; use format %s for character objects

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.50%. Comparing base (8de09b2) to head (d0d8285).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5805   +/-   ##
=======================================
  Coverage   97.50%   97.50%           
=======================================
  Files          80       80           
  Lines       14876    14884    +8     
=======================================
+ Hits        14505    14513    +8     
  Misses        371      371           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ben-schwen ben-schwen marked this pull request as ready for review December 10, 2023 20:26
@MichaelChirico
Copy link
Member

Hi Ben thanks for sharing, looks interesting. I have never used transpose, so I was wondering what is the typical use case? with lists? The man page says

     This is particularly useful in tasks that require splitting a
     character column and assigning each part to a separate column.
     This operation is quite common enough that a function tstrsplit
     is exported.

I have used tstrsplit, so I guess I understand the use case with character, but I wonder if you could add an illustrative example, which would show how transpose is different / more flexible / useful with lists ?

transpose() is the workhorse behind tstrsplit, tstrsplit is "just" strsplit()|>transpose().

but there are many other times we wind up with a list() of rows that we want as columns instead. base R code looks like lapply(seq_along(l), \(j) sapply(l, `[[`, j)).

transpose() is not only much cleaner but much faster and handles type conversion and fills ragged data

NEWS.md Outdated Show resolved Hide resolved
@tdhock
Copy link
Member

tdhock commented Dec 11, 2023

transpose() is the workhorse behind tstrsplit, tstrsplit is "just" strsplit()|>transpose().

but there are many other times we wind up with a list() of rows that we want as columns instead. base R code looks like lapply(seq_along(l), \(j) sapply(l, `[[`, j)).

transpose() is not only much cleaner but much faster and handles type conversion and fills ragged data

Can an example of this usage (and new usage) be please added to transpose.Rd in this PR?
(it would help me at least)

@ben-schwen
Copy link
Member Author

ben-schwen commented Dec 15, 2023

FWIW here are benchmarks version vs base solution

l = list(sample(1e5), sample(letters, 1e5, TRUE))
dt = bench::mark(
  base = lapply(seq(length(l[[1]])), function(x) lapply(l, `[[`, x)),
  dt = transpose(l, list.cols=TRUE),
  iterations = 100
)
setDT(dt)
dt[, .(expression, min, median, "max" = lapply(time, max), `itr/sec`, n_itr, n_gc), .I]
#        I   expression          min       median    max   itr/sec n_itr  n_gc
#    <int> <bench_expr> <bench_time> <bench_time> <list>     <num> <int> <num>
# 1:     1         base      270.3ms      333.9ms  769ms  2.773482   100   241
# 2:     2           dt       15.5ms       23.3ms  315ms 29.969829   100    29

@jangorecki
Copy link
Member

jangorecki commented Dec 15, 2023

@ben-schwen base was run only two times while PR was run 16 times. Therefore I would look for a "max" statistic (which is missing) rather than min or median (presented in your post). You may find this post relevant: jangorecki/rollbench#1 (comment)
I would go with, ideally, system.time() + scaled up data, or eventually microbenchmark.

man/transpose.Rd Outdated Show resolved Hide resolved
@jangorecki jangorecki added this to the 1.16.0 milestone Dec 15, 2023
@tdhock
Copy link
Member

tdhock commented Dec 18, 2023

if you want to see how time/memory vary with N, please try using atime tdhock/atime#15

image

above shows that both base R and dt are linear time and memory, different by constant factors.

image

above shows the data size N which both can handle, given 1 second time limit or 1000kb memory limit. (dt is faster by constant factors but uses a constant factor more memory)

NEWS.md Outdated Show resolved Hide resolved
man/transpose.Rd Outdated Show resolved Hide resolved
src/transpose.c Outdated Show resolved Hide resolved
src/transpose.c Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member

list.cols=TRUE also works for data.table (/frame) input, though it's not clear from the current PR if that's intentional.

As I mentioned in the updated NEWS, transposing a table to get it in row-major form is indeed quite common & useful. This is especially useful for I/O with non-column-major data structures (as is common for streaming applications which receive one row at a time). As noted in the linked issue, the workarounds were quite tedious before, after this PR, we can quite easily just transpose(<row-major>) |> lapply(unlist) |> setDT() to get a nice data.table out of something we receive, and transpose(<column-major>, list.cols=TRUE) to convert an R table into something to pass along.

In short, this feature is quite great and something I "didn't know I needed" all along!

@tdhock
Copy link
Member

tdhock commented Mar 16, 2024

would be great to add vignette documentation for this use case "As noted in the linked issue, the workarounds were quite tedious before, after this PR, we can quite easily just transpose() |> lapply(unlist) |> setDT() to get a nice data.table out of something we receive, and transpose(, list.cols=TRUE) to convert an R table into something to pass along"

@MichaelChirico
Copy link
Member

would be great to add vignette documentation for this use case "As noted in the linked issue, the workarounds were quite tedious before, after this PR, we can quite easily just transpose() |> lapply(unlist) |> setDT() to get a nice data.table out of something we receive, and transpose(, list.cols=TRUE) to convert an R table into something to pass along"

Do you have a specific vignette in mind? I don't see any mention of transpose() in the current vignettes. Maybe what we really need is a datatable-utils vignette demonstrating transpose() as well as fcase(), fifelse(), ... odds-and-end-functions we have

@MichaelChirico MichaelChirico merged commit 821c8f9 into master Mar 16, 2024
5 checks passed
@MichaelChirico MichaelChirico deleted the transpose_returnList branch March 16, 2024 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

data.table::transpose applied to lists changes element types
4 participants