This is episode 0 of my long adventure toward multi-spread and multi-gather (this is the homework I got at the tidyverse developer day…). This post might seem to introduce semantics different from tidyr’s current ones, but that’s probably just because my idea is still vague. So, I’d really appreciate any feedback!
I now think gather() and spread() are about enframe()ing and deframe()ing within each group
Do you get what I mean? Let me explain step by step.
A while ago, the gt package, Richard Iannone’s great work in progress, was made public.
The gt package is wonderful, especially in that it makes us rethink the possible semantics of columns. I mean, not all columns are equal. No, I’m not saying anything new; this is what you already know from spread() and gather().
spread()ed data explained
Take a look at this example data, a simpler version of the one in ?gather:
library(tibble)
library(gt)
set.seed(1)
# example in ?gather
stocks <- tibble(
  time = as.Date('2009-01-01') + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)
stocks
# A tibble: 3 x 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.626 3.19 1.95
2 2009-01-02 0.184 0.659 2.95
3 2009-01-03 -0.836 -1.64 2.30
Here, X, Y, and Z are the prices of stocks X, Y, and Z. Of course, we can gather() these columns, as this is the very example for that, but we can also bundle them using tab_spanner():
gt(stocks) %>%
  tab_spanner("price", vars(X, Y, Z))
| time | price | | |
|---|---|---|---|
| | X | Y | Z |
| 2009-01-01 | -0.6264538 | 3.1905616 | 1.949716 |
| 2009-01-02 | 0.1836433 | 0.6590155 | 2.953299 |
| 2009-01-03 | -0.8356286 | -1.6409368 | 2.303125 |
Yet another option is to specify groupname_col. Roughly speaking, we treat each row as a group, with time as the grouping variable:
gt(stocks, groupname_col = "time")
X | Y | Z |
---|---|---|
2009-01-01 | ||
-0.6264538 | 3.1905616 | 1.949716 |
2009-01-02 | ||
0.1836433 | 0.6590155 | 2.953299 |
2009-01-03 | ||
-0.8356286 | -1.6409368 | 2.303125 |
gather()ed data explained
Let’s see the gathered version next. Here’s the data:
stocksm <- stocks %>%
  tidyr::gather("name", "value", X:Z)
stocksm
# A tibble: 9 x 3
time name value
<date> <chr> <dbl>
1 2009-01-01 X -0.626
2 2009-01-02 X 0.184
3 2009-01-03 X -0.836
4 2009-01-01 Y 3.19
5 2009-01-02 Y 0.659
6 2009-01-03 Y -1.64
7 2009-01-01 Z 1.95
8 2009-01-02 Z 2.95
9 2009-01-03 Z 2.30
This can be represented in a similar way. This time, a group doesn’t consist of a single row, but of all the rows that share the same grouping value. The grouping variable itself, time, is the same as above.
stocksm %>%
  gt(groupname_col = "time")
name | value |
---|---|
2009-01-01 | |
X | -0.6264538 |
Y | 3.1905616 |
Z | 1.9497162 |
2009-01-02 | |
X | 0.1836433 |
Y | 0.6590155 |
Z | 2.9532988 |
2009-01-03 | |
X | -0.8356286 |
Y | -1.6409368 |
Z | 2.3031254 |
You can see that the only difference is the rotation within each group. So, theoretically, gather() and spread() can be implemented as grouping + rotating.
enframe() and deframe()
Before getting into the implementations, let me briefly explain two of tibble’s functions, enframe() and deframe(). They convert a vector to and from a two-column data.frame.
# A tibble: 3 x 2
name value
<chr> <int>
1 foo 1
2 bar 2
3 baz 3
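The call that produced the output above is not shown. As a minimal sketch, the following would yield a tibble of the same shape, assuming a named integer vector `x` (an illustrative input, not necessarily the original one), and then converts it back with deframe():

```r
library(tibble)

# A named integer vector (names and values assumed from the output above)
x <- c(foo = 1L, bar = 2L, baz = 3L)

# enframe(): vector -> two-column tibble with columns `name` and `value`
df <- enframe(x)
df

# deframe(): two-column tibble -> named vector, which round-trips back to x
deframe(df)
```

The two functions are inverses of each other here, which is what makes them usable as the "rotation" step in both directions.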
gather()
First, nest the data by time.
d <- dplyr::group_nest(stocks, time)
d
# A tibble: 3 x 2
time data
<date> <list<tibble[,3]>>
1 2009-01-01 [1 × 3]
2 2009-01-02 [1 × 3]
3 2009-01-03 [1 × 3]
Then, coerce the columns of the 1-row data.frames to vectors. (In practice, we should check that the elements can all be coerced to a common type.)
# A tibble: 3 x 2
time data
<date> <list>
1 2009-01-01 <dbl [3]>
2 2009-01-02 <dbl [3]>
3 2009-01-03 <dbl [3]>
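The code for this coercion step is not shown above. One possible way to do it, as a sketch rather than the definitive implementation, is base R's unlist(), which flattens a 1-row data.frame into a named vector. The type check before it is my own assumption about what "coercible" should mean here:

```r
library(tibble)

# Rebuild the example data so this block runs on its own
set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)
d <- dplyr::group_nest(stocks, time)

# Assumed check: all value columns share one type, because unlist()
# would otherwise silently coerce them to a common type.
types <- vapply(stocks[c("X", "Y", "Z")], typeof, character(1))
stopifnot(length(unique(types)) == 1)

# Each element of d$data is a 1-row tibble; unlist() turns it into
# a named vector c(X = ..., Y = ..., Z = ...).
d$data <- purrr::map(d$data, unlist)
d
```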
Lastly, enframe() the vectors and unnest the whole data.
d$data <- purrr::map(d$data, enframe)
d
# A tibble: 3 x 2
time data
<date> <list>
1 2009-01-01 <tibble [3 × 2]>
2 2009-01-02 <tibble [3 × 2]>
3 2009-01-03 <tibble [3 × 2]>
tidyr::unnest(d, cols = data)
# A tibble: 9 x 3
time name value
<date> <chr> <dbl>
1 2009-01-01 X -0.626
2 2009-01-01 Y 3.19
3 2009-01-01 Z 1.95
4 2009-01-02 X 0.184
5 2009-01-02 Y 0.659
6 2009-01-02 Z 2.95
7 2009-01-03 X -0.836
8 2009-01-03 Y -1.64
9 2009-01-03 Z 2.30
Done.
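The steps above can be collected into one function. This is only a sketch: `gather_by_enframe()` is a hypothetical helper name, not part of tidyr, and it assumes each group is exactly one row whose remaining columns share a type.

```r
library(tibble)

# Hypothetical helper: gather as nest -> flatten -> enframe() -> unnest
gather_by_enframe <- function(df, group) {
  # 1. Nest by the grouping column; the rest goes into the list-column `data`
  d <- dplyr::group_nest(df, {{ group }})
  # 2. Flatten each 1-row tibble to a named vector, then enframe() it
  d$data <- purrr::map(d$data, ~ enframe(unlist(.x)))
  # 3. Unnest back into a flat tibble
  tidyr::unnest(d, cols = data)
}

set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)

out <- gather_by_enframe(stocks, time)
out
```

Note that the rows come out ordered by time first, whereas tidyr::gather() orders them by the gathered column name first.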
spread()
The first step is the same as for gather(): just nest the data by time.
d <- dplyr::group_nest(stocksm, time)
d
# A tibble: 3 x 2
time data
<date> <list<tibble[,2]>>
1 2009-01-01 [3 × 2]
2 2009-01-02 [3 × 2]
3 2009-01-03 [3 × 2]
Then, deframe() the data.frames. (In practice, we have to fill in the missing rows to ensure all data.frames have the same variables.)
d$data <- purrr::map(d$data, deframe)
d
# A tibble: 3 x 2
time data
<date> <list>
1 2009-01-01 <dbl [3]>
2 2009-01-02 <dbl [3]>
3 2009-01-03 <dbl [3]>
Then, convert the vectors to data.frames.
# A tibble: 3 x 2
time data
<date> <list>
1 2009-01-01 <tibble [1 × 3]>
2 2009-01-02 <tibble [1 × 3]>
3 2009-01-03 <tibble [1 × 3]>
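The code for this conversion is not shown above. A minimal sketch of one way to turn a named vector into a 1-row tibble, using as.list() followed by as_tibble() (my choice of approach, not necessarily the original one):

```r
library(tibble)

# Stand-in for one group's deframe()d vector
# (values rounded from the first date in the example)
v <- c(X = -0.626, Y = 3.19, Z = 1.95)

# as.list() gives a named list of length-1 elements, which
# as_tibble() reads as one row with one column per name.
row <- as_tibble(as.list(v))
row

# Applied to the list-column from the previous step, this would read:
# d$data <- purrr::map(d$data, ~ as_tibble(as.list(.x)))
```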
Lastly, unnest the whole data.
tidyr::unnest(d, cols = data)
# A tibble: 3 x 4
time X Y Z
<date> <dbl> <dbl> <dbl>
1 2009-01-01 -0.626 3.19 1.95
2 2009-01-02 0.184 0.659 2.95
3 2009-01-03 -0.836 -1.64 2.30
Done.
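Put together, the spread() side might look like the sketch below. `spread_by_deframe()` is a hypothetical helper, not part of tidyr, and it assumes the key and value columns are named `name` and `value` as in stocksm. For the "fill in the missing rows" step mentioned earlier, it uses tidyr::complete(), which is one option among several.

```r
library(tibble)

# Hypothetical helper: spread as complete -> nest -> deframe() -> 1-row tibble -> unnest
spread_by_deframe <- function(df, group) {
  # Fill in missing group/name combinations with NA so that every group
  # ends up with the same set of names
  df <- tidyr::complete(df, {{ group }}, name)
  d <- dplyr::group_nest(df, {{ group }})
  # deframe() uses the first two columns (name, value) of each nested tibble
  d$data <- purrr::map(d$data, ~ as_tibble(as.list(deframe(.x))))
  tidyr::unnest(d, cols = data)
}

set.seed(1)
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3, 0, 1),
  Y = rnorm(3, 0, 2),
  Z = rnorm(3, 0, 4)
)
stocksm <- tidyr::gather(stocks, "name", "value", X:Z)

# Complete input: gives back the wide layout of the original stocks
wide <- spread_by_deframe(stocksm, time)
wide

# Input with one row dropped: the missing cell is filled with NA
wide_na <- spread_by_deframe(stocksm[-1, ], time)
wide_na
```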
I’m not sure… I roughly believe this can be extended to multi-gather and multi-spread (groups can have multiple vectors and data.frames), but I have yet to see how different from (or similar to) tidyr’s current semantics this is. Again, any feedback is welcome!
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".