This is episode 0 of my long adventure toward multi-spread and multi-gather (this is the homework I got at the tidyverse developer day…). This post might seem to introduce semantics different from tidyr’s current ones, but that’s probably just because my idea is still vague. So, I’d really appreciate any feedback!
tl;dr
I now think gather() and spread() are about:
1. grouping, and
2. enframe()ing and deframe()ing within each group
Do you get what I mean? Let me explain step by step.
The gt package is wonderful, especially in that it makes us rethink the possible semantics of columns. I mean, not all columns are equal. No, I’m not saying anything new; this is what you already know from spread() and gather().
spread()ed data explained
Take a look at this example data, a simpler version of the one in ?gather:
library(tibble)
library(gt)
set.seed(1)
# example in ?gather
stocks <- tibble(
  time = as.Date('2009-01-01') + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)
stocks
# A tibble: 3 × 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.626 3.19 1.95
2 2009-01-02 0.184 0.659 2.95
3 2009-01-03 -0.836 -1.64 2.30
Here, X, Y, and Z are the prices of stocks X, Y, and Z. Of course, we can gather() these columns, since this is the very example for it, but we can also bundle them using tab_spanner():
gt(stocks) %>% tab_spanner("price", c(X, Y, Z))
time        price
            X           Y           Z
2009-01-01  -0.6264538  3.1905616   1.949716
2009-01-02  0.1836433   0.6590155   2.953299
2009-01-03  -0.8356286  -1.6409368  2.303125
Yet another option is to specify groupname_col. Roughly speaking, we treat each row as a group, with time as the grouping variable:
gt(stocks, groupname_col = "time")
            X           Y           Z
2009-01-01
            -0.6264538  3.1905616   1.949716
2009-01-02
            0.1836433   0.6590155   2.953299
2009-01-03
            -0.8356286  -1.6409368  2.303125
gather()ed data explained
Let’s see the gathered version next. Here’s the data:
# A tibble: 9 × 3
time name value
<date> <chr> <dbl>
1 2009-01-01 X -0.626
2 2009-01-02 X 0.184
3 2009-01-03 X -0.836
4 2009-01-01 Y 3.19
5 2009-01-02 Y 0.659
6 2009-01-03 Y -1.64
7 2009-01-01 Z 1.95
8 2009-01-02 Z 2.95
9 2009-01-03 Z 2.30
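The code that creates stocksm is not shown here. As a minimal sketch, assuming stocksm is simply the gather()ed form of stocks with the key and value columns named name and value (this is my guess from the printed output, not necessarily the original code), it could be built like this:

```r
library(tibble)

set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)

# Assumption: stocksm is the gathered form of stocks, with the key column
# named "name" and the value column named "value".
stocksm <- tidyr::gather(stocks, "name", "value", X:Z)
stocksm
```

tidyr::pivot_longer(stocks, X:Z) would give the same three columns, but with the rows ordered by time rather than by name.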
This can be represented in a similar way. This time, a group consists not of a single row but of all the rows that share the same grouping value. The grouping itself is the same as above.
stocksm %>% gt(groupname_col = "time")
name  value
2009-01-01
X     -0.6264538
Y     3.1905616
Z     1.9497162
2009-01-02
X     0.1836433
Y     0.6590155
Z     2.9532988
2009-01-03
X     -0.8356286
Y     -1.6409368
Z     2.3031254
You can see the only difference is the rotation. So, theoretically, this can be implemented as grouping + rotating.
Do it yourself by enframe() and deframe()
Before getting into the implementation, let me briefly explain two tibble functions, enframe() and deframe(). They convert a vector to and from a two-column data.frame.
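To make the round trip concrete, here is a small sketch. The vector x and its values are made up for illustration only:

```r
library(tibble)

# A named numeric vector (hypothetical values, for illustration only)
x <- c(X = 1, Y = 2, Z = 3)

# enframe(): named vector -> two-column tibble with columns name and value
df <- enframe(x)
df

# deframe(): two-column tibble -> named vector
deframe(df)

# The round trip recovers the original vector
identical(deframe(df), x)
```

So the names of the vector become the name column, and the values become the value column, which is exactly the shape of a gather()ed group.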
# A tibble: 3 × 2
time data
<date> <list>
1 2009-01-01 <dbl [3]>
2 2009-01-02 <dbl [3]>
3 2009-01-03 <dbl [3]>
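The code that built the nested tibble d shown above does not appear here. As a rough sketch for the gather() direction, assuming d comes from nesting stocks by time and then collapsing each one-row data.frame into a named vector (unlist() is my stand-in for that second step, not necessarily what was originally used):

```r
library(tibble)

set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)

# Step 1: nest by time; each element of d$data is a 1 x 3 tibble (X, Y, Z)
d <- dplyr::group_nest(stocks, time)

# Step 2 (assumption): collapse each one-row tibble into a named numeric
# vector, so that enframe() can be applied to it afterwards
d$data <- purrr::map(d$data, unlist)
d
```

Each element of d$data is now a vector named X, Y, and Z, which is the input enframe() expects.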
Lastly, enframe() each vector and unnest the result.
d$data <- purrr::map(d$data, enframe)
d
# A tibble: 3 × 2
time data
<date> <list>
1 2009-01-01 <tibble [3 × 2]>
2 2009-01-02 <tibble [3 × 2]>
3 2009-01-03 <tibble [3 × 2]>
tidyr::unnest(d, cols = c(data))
# A tibble: 9 × 3
time name value
<date> <chr> <dbl>
1 2009-01-01 X -0.626
2 2009-01-01 Y 3.19
3 2009-01-01 Z 1.95
4 2009-01-02 X 0.184
5 2009-01-02 Y 0.659
6 2009-01-02 Z 2.95
7 2009-01-03 X -0.836
8 2009-01-03 Y -1.64
9 2009-01-03 Z 2.30
Done.
spread()
The first step is the same as for gather(): nest the data by time.
d <- dplyr::group_nest(stocksm, time)
d
# A tibble: 3 × 2
time data
<date> <list<tibble[,2]>>
1 2009-01-01 [3 × 2]
2 2009-01-02 [3 × 2]
3 2009-01-03 [3 × 2]
Then, deframe() the data.frames. (In practice, we have to fill the missing rows to ensure all data.frames have the same variables.)
d$data <- purrr::map(d$data, deframe)
d
# A tibble: 3 × 2
time data
<date> <list>
1 2009-01-01 <dbl [3]>
2 2009-01-02 <dbl [3]>
3 2009-01-03 <dbl [3]>
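The next output shows each named vector turned into a one-row tibble, but the code for that step is not shown. A minimal sketch, assuming tibble::as_tibble_row() is an acceptable way to do it (the original may have used something else) and that stocksm is the gather()ed form of stocks:

```r
library(tibble)

set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)
# Assumption: stocksm is the gathered form of stocks
stocksm <- tidyr::gather(stocks, "name", "value", X:Z)

# Nest by time, then deframe() each 3 x 2 tibble into a named vector
d <- dplyr::group_nest(stocksm, time)
d$data <- purrr::map(d$data, deframe)

# Assumption: convert each named vector into a 1 x 3 tibble so that
# unnest() can spread the names out into columns
d$data <- purrr::map(d$data, as_tibble_row)
d
```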
# A tibble: 3 × 2
time data
<date> <list>
1 2009-01-01 <tibble [1 × 3]>
2 2009-01-02 <tibble [1 × 3]>
3 2009-01-03 <tibble [1 × 3]>
Lastly, unnest the whole data.
tidyr::unnest(d, cols = c(data))
# A tibble: 3 × 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.626 3.19 1.95
2 2009-01-02 0.184 0.659 2.95
3 2009-01-03 -0.836 -1.64 2.30
Done.
What’s next?
I’m not sure… I roughly believe this can be extended to multi-gather and multi-spread (groups can have multiple vectors and data.frames), but I have yet to see how different (or similar) this is from tidyr’s current semantics. Again, any feedback is welcome!