gather() and spread() Explained By gt

January 24, 2019 by Hiroaki Yutani

This is episode 0 of my long adventure toward multi-spread and multi-gather (this is the homework I was given at the tidyverse developer day…). This post might seem to introduce semantics different from tidyr’s current ones, but that’s probably just because my idea is still vague. So, I’d really appreciate any feedback!

tl;dr

I now think gather() and spread() are about

  1. grouping and
  2. enframe()ing and deframe()ing within each group

Do you get what I mean? Let me explain step by step.

What does gt teach us?

A while ago, the gt package, Richard Iannone’s great work-in-progress, was made public.

The gt package is wonderful, especially in that it makes us rethink the possible semantics of columns. I mean, not all columns are equal. No, I’m not saying anything new; this is what you already know from spread() and gather().

spread()ed data explained

Take a look at this example data, a simpler version of the one in ?gather:

library(tibble)
library(gt)

set.seed(1)
# example in ?gather
stocks <- tibble(
  time = as.Date('2009-01-01') + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)

stocks
#> # A tibble: 3 x 4
#>   time            X      Y     Z
#>   <date>      <dbl>  <dbl> <dbl>
#> 1 2009-01-01 -0.626  3.19   1.95
#> 2 2009-01-02  0.184  0.659  2.95
#> 3 2009-01-03 -0.836 -1.64   2.30

Here, X, Y, and Z are the prices of stocks X, Y, and Z. Of course, we can gather() these columns, as this is the very example for that, but we can also bundle them using tab_spanner():

gt(stocks) %>%
  tab_spanner("price", vars(X, Y, Z))

             price
time         X            Y            Z
2009-01-01   -0.6264538    3.1905616   1.949716
2009-01-02    0.1836433    0.6590155   2.953299
2009-01-03   -0.8356286   -1.6409368   2.303125

Yet another option is to specify groupname_col. We can roughly think of each row as a group, with time as the grouping variable:

gt(stocks, groupname_col = "time")

X            Y            Z
2009-01-01
-0.6264538    3.1905616   1.949716
2009-01-02
 0.1836433    0.6590155   2.953299
2009-01-03
-0.8356286   -1.6409368   2.303125

gather()ed data explained

Let’s see the gathered version next. Here’s the data:

stocksm <- stocks %>%
  tidyr::gather("name", "value", X:Z)

stocksm
#> # A tibble: 9 x 3
#>   time       name   value
#>   <date>     <chr>  <dbl>
#> 1 2009-01-01 X     -0.626
#> 2 2009-01-02 X      0.184
#> 3 2009-01-03 X     -0.836
#> 4 2009-01-01 Y      3.19 
#> 5 2009-01-02 Y      0.659
#> 6 2009-01-03 Y     -1.64 
#> 7 2009-01-01 Z      1.95 
#> 8 2009-01-02 Z      2.95 
#> 9 2009-01-03 Z      2.30

This can be represented in a similar way. This time, a group doesn’t consist of a single row but of all the rows that share the same grouping value. The grouping itself is the same as above.

stocksm %>%
  gt(groupname_col = "time")

name   value
2009-01-01
X      -0.6264538
Y       3.1905616
Z       1.9497162
2009-01-02
X       0.1836433
Y       0.6590155
Z       2.9532988
2009-01-03
X      -0.8356286
Y      -1.6409368
Z       2.3031254

You can see that the only difference is the rotation within each group. So, in theory, this can be implemented as grouping + rotating.

Do it yourself by enframe() and deframe()

Before getting into the implementations, let me briefly explain two tibble functions, enframe() and deframe(). They convert a named vector to and from a two-column data.frame.

library(tibble)

x <- 1:3
names(x) <- c("foo", "bar", "baz")

enframe(x)
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <int>
#> 1 foo       1
#> 2 bar       2
#> 3 baz       3
deframe(enframe(x))
#> foo bar baz 
#>   1   2   3
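
As a small aside, the pair is not tied to the default column names. Here is a minimal sketch using the name and value arguments of enframe(); the column names key and val are made up for illustration:

```r
library(tibble)

x <- c(foo = 1L, bar = 2L, baz = 3L)

# `name` and `value` set the column names of the resulting two-column tibble
d <- enframe(x, name = "key", value = "val")

# deframe() takes the first column as names and the second as values,
# whatever the columns happen to be called
y <- deframe(d)
```

So deframe() only cares about column positions, which is why it can undo enframe() regardless of the names chosen.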

gather()

First, nest the data by time.

d <- dplyr::group_nest(stocks, time)
d
#> # A tibble: 3 x 2
#>   time       data            
#>   <date>     <list>          
#> 1 2009-01-01 <tibble [1 x 3]>
#> 2 2009-01-02 <tibble [1 x 3]>
#> 3 2009-01-03 <tibble [1 x 3]>

Then, coerce the columns of the 1-row data.frames to vectors. (In practice, we should check whether the elements are all coercible.)

d$data <- purrr::map(d$data, ~ vctrs::vec_c(!!! .))
d
#> # A tibble: 3 x 2
#>   time       data     
#>   <date>     <list>   
#> 1 2009-01-01 <dbl [3]>
#> 2 2009-01-02 <dbl [3]>
#> 3 2009-01-03 <dbl [3]>
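
The coercibility check mentioned above could look roughly like this. It is only a sketch: is_coercible() is a hypothetical helper of mine, not a function of tidyr or vctrs, and it simply asks vctrs::vec_ptype_common() whether the columns share a common type.

```r
library(tibble)

# Hypothetical helper (not part of any package): TRUE if all columns of `df`
# have a common type according to vctrs, FALSE if vctrs refuses to combine them.
is_coercible <- function(df) {
  tryCatch(
    {
      vctrs::vec_ptype_common(!!!as.list(df))
      TRUE
    },
    error = function(e) FALSE
  )
}

ok  <- is_coercible(tibble(X = 1, Y = 2L))   # double and integer can be combined
bad <- is_coercible(tibble(X = 1, Y = "a"))  # double and character cannot
```

Note that vctrs is stricter than base R's c() here: it does not silently turn numbers into strings.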

Lastly, enframe() the vectors and unnest the whole data.

d$data <- purrr::map(d$data, enframe)
d
#> # A tibble: 3 x 2
#>   time       data            
#>   <date>     <list>          
#> 1 2009-01-01 <tibble [3 x 2]>
#> 2 2009-01-02 <tibble [3 x 2]>
#> 3 2009-01-03 <tibble [3 x 2]>
tidyr::unnest(d, data)
#> # A tibble: 9 x 3
#>   time       name   value
#>   <date>     <chr>  <dbl>
#> 1 2009-01-01 X     -0.626
#> 2 2009-01-01 Y      3.19 
#> 3 2009-01-01 Z      1.95 
#> 4 2009-01-02 X      0.184
#> 5 2009-01-02 Y      0.659
#> 6 2009-01-02 Z      2.95 
#> 7 2009-01-03 X     -0.836
#> 8 2009-01-03 Y     -1.64 
#> 9 2009-01-03 Z      2.30

Done.
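
For reference, the three steps can be wrapped into one function. This is just a sketch under my own assumptions: gather_sketch() is a made-up name, it gathers every non-grouping column, and it assumes each group has exactly one row and that all gathered columns are coercible to a common type. I use small made-up numbers instead of the stocks data so the result is easy to check by eye.

```r
library(tibble)

# Hypothetical helper: gather all non-grouping columns into name/value pairs.
gather_sketch <- function(tbl, ...) {
  # 1. group: one nested 1-row tibble per group
  d <- dplyr::group_nest(tbl, ...)
  # 2. rotate: columns -> named vector -> two-column tibble
  d$data <- purrr::map(d$data, function(df) {
    enframe(vctrs::vec_c(!!!as.list(df)))
  })
  # 3. unnest back into a flat tibble
  tidyr::unnest(d, data)
}

stocks_small <- tibble(
  time = as.Date("2009-01-01") + 0:1,
  X = c(1, 2),
  Y = c(3, 4)
)

gathered <- gather_sketch(stocks_small, time)
gathered
```

The rows come out ordered by group first (time, then name), unlike tidyr::gather(), which orders by the gathered column first.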

spread()

The first step is the same as for gather(): just nest the data by time.

d <- dplyr::group_nest(stocksm, time)
d
#> # A tibble: 3 x 2
#>   time       data            
#>   <date>     <list>          
#> 1 2009-01-01 <tibble [3 x 2]>
#> 2 2009-01-02 <tibble [3 x 2]>
#> 3 2009-01-03 <tibble [3 x 2]>

Then, deframe() the data.frames. (In practice, we have to fill in the missing rows to ensure that all data.frames have the same variables.)

d$data <- purrr::map(d$data, deframe)
d
#> # A tibble: 3 x 2
#>   time       data     
#>   <date>     <list>   
#> 1 2009-01-01 <dbl [3]>
#> 2 2009-01-02 <dbl [3]>
#> 3 2009-01-03 <dbl [3]>
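
One possible way to fill in the missing rows mentioned above is tidyr::complete(), applied before nesting. This is a sketch with made-up data (stocksm_missing is not from the example above) in which the Y row for the second date is absent:

```r
library(tibble)

# Made-up gathered data: 2009-01-02 has no row for Y
stocksm_missing <- tibble(
  time  = as.Date("2009-01-01") + c(0, 0, 1),
  name  = c("X", "Y", "X"),
  value = c(1, 2, 3)
)

# complete() adds a row for every combination of time and name,
# filling the columns it cannot know (here, value) with NA
filled <- tidyr::complete(stocksm_missing, time, name)
filled
```

After this, every group has the same set of names, so each deframe()d vector has the same elements.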

Then, convert the vectors to data.frames.

d$data <- purrr::map(d$data, ~ tibble::tibble(!!! .))
d
#> # A tibble: 3 x 2
#>   time       data            
#>   <date>     <list>          
#> 1 2009-01-01 <tibble [1 x 3]>
#> 2 2009-01-02 <tibble [1 x 3]>
#> 3 2009-01-03 <tibble [1 x 3]>

Lastly, unnest the whole data.

tidyr::unnest(d, data)
#> # A tibble: 3 x 4
#>   time            X      Y     Z
#>   <date>      <dbl>  <dbl> <dbl>
#> 1 2009-01-01 -0.626  3.19   1.95
#> 2 2009-01-02  0.184  0.659  2.95
#> 3 2009-01-03 -0.836 -1.64   2.30

Done.
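
Again, the steps can be wrapped into one function. This is a sketch under my own assumptions: spread_sketch() is a made-up name, the non-grouping columns are assumed to be exactly a name column followed by a value column, and every group is assumed to contain the same set of names. The data are small made-up numbers so the result is easy to check.

```r
library(tibble)

# Hypothetical helper: spread name/value pairs into one column per name.
spread_sketch <- function(tbl, ...) {
  # 1. group: one nested two-column tibble per group
  d <- dplyr::group_nest(tbl, ...)
  # 2. rotate: two-column tibble -> named vector -> 1-row tibble
  d$data <- purrr::map(d$data, function(df) {
    tibble(!!!as.list(deframe(df)))
  })
  # 3. unnest back into a flat tibble
  tidyr::unnest(d, data)
}

stocksm_small <- tibble(
  time  = as.Date("2009-01-01") + c(0, 0, 1, 1),
  name  = c("X", "Y", "X", "Y"),
  value = c(1, 3, 2, 4)
)

spread_out <- spread_sketch(stocksm_small, time)
spread_out
```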

What’s next?

I’m not sure… I roughly believe this can be extended to multi-gather and multi-spread (groups can have multiple vectors and data.frames), but I have yet to see how different from (or similar to) tidyr’s current semantics this is. Again, any feedback is welcome!