Enhancing gather() and spread() by Using “Bundled” data.frames

Last month, I tried to explain gather() and spread() by gt package (https://yutani.rbind.io/post/gather-and-spread-explained-by-gt/). But, after I implemented experimental multi-gather() and multi-spread(), I realized that I need a bit different way of explanation… So, please forget the post, and read this with fresh eyes!

Wait, what is multi-`gather()` and multi-`spread()`??

In short, the current gather() and spread() have a limitation; they can gather into or spread from only one column at once. So, if we want to handle multiple columns, we need to coerce them to one column before actually gathering or spreading.

This is especially problematic when the columns have different types. For example, date column is unexpectedly converted to integers with the following code:

library(tibble)
library(tidyr)

# a bit different version of https://github.com/tidyverse/tidyr/issues/149#issue-124411755
d <- tribble(
  ~place, ~censor,                  ~date, ~status,
    "g1",    "c1",  as.Date("2019-02-01"),   "ok",
    "g1",    "c2",  as.Date("2019-02-01"),  "bad",
    "g1",    "c3",  as.Date("2019-02-01"),   "ok",
    "g2",    "c1",  as.Date("2019-02-01"),  "bad",
    "g2",    "c2",  as.Date("2019-02-02"),   "ok"
)

d %>%
  gather(key = element, value = value, date, status) %>%
  unite(thing, place, element, remove = TRUE) %>%
  spread(thing, value, convert = TRUE)

Warning: attributes are not identical across measure variables;
they will be dropped

# A tibble: 3 × 5
  censor g1_date g1_status g2_date g2_status
  <chr>    <int> <chr>       <int> <chr>    
1 c1       17928 ok          17928 bad      
2 c2       17928 bad         17929 ok       
3 c3       17928 ok             NA <NA>

Here, we need better spread() and gather(), which can handle multiple columns. For more discussions, you can read the following issues:

In this post, I’m trying to explain an approach to solve this by using “bundled” data.frames, which is originally proposed by Kirill Müller.

“Bundled” data.frames

For convenience, I use a new term “bundle” for separating some of the columns of a data.frame to another data.frame, and assigning the new data.frame to a column, and “unbundle” for the opposite operation.

For example, “bundling X, Y, and Z” means converting this

id	X	Y	Z
1	0.1	a	TRUE
2	0.2	b	FALSE
3	0.3	c	TRUE

to something like this:

id	foo
id	X	Y	Z
1	0.1	a	TRUE
2	0.2	b	FALSE
3	0.3	c	TRUE

You might wonder if this is really possible without dangerous hacks. But, with tibble package (2D columns are supported now), this is as easy as:

tibble(
  id = 1:3,
  foo = tibble(
    X = 1:3 * 0.1,
    Y = letters[1:3],
    Z = c(TRUE, FALSE, TRUE)
  )
)

# A tibble: 3 × 2
     id foo$X $Y    $Z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE

For more information about data.frame columns, please see Advanced R.

An experimental package for this

I created a package for bundling, tiedr. Since this is just an experiment, I don’t seriously introduce this. But, for convenience, let me use this package in this post because, otherwise, the code would be a bit long and hard to read…

https://github.com/yutannihilation/tiedr

I need four functions from this package, bundle(), unbundle(), gather_bundles(), and spread_bundles(). gather_bundles() and spread_bundles() are some kind of the variants of gather() and spread(), so probably you can guess the usages. Here, I just explain about the first two functions briefly.

`bundle()`

bundle() bundles columns. It takes data, and the specifications of bundles in the form of new_col1 = c(col1, col2, ...), new_col2 = c(col3, col4, ...), ....

library(tiedr)

d <- tibble(id = 1:3, X = 1:3 * 0.1, Y = letters[1:3], Z = c(TRUE, FALSE, TRUE))

bundle(d, foo = X:Z)

# A tibble: 3 × 2
     id foo$X $Y    $Z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE

bundle() also can rename the sub-columns at the same time.

bundle(d, foo = c(x = X, y = Y, z = Z))

# A tibble: 3 × 2
     id foo$x $y    $z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE

`unbundle()`

unbundle() unbundles columns. This operation is almost the opposite of what bundle() does; one difference is that this adds the names of the bundle as prefixes in order to avoid name collisions. In case the prefix is not needed, we can use sep = NULL.

d %>%
  bundle(foo = X:Z) %>% 
  unbundle(foo)

# A tibble: 3 × 4
     id foo_X foo_Y foo_Z
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE

Expose hidden structures in colnames as bundles

One of the meaningful usage of bundled data.frame is to express the structure of a data. Suppose we have this data (from tidyverse/tidyr#150):

d <- tribble(
  ~Race,~Female_LoTR,~Male_LoTR,~Female_TT,~Male_TT,~Female_RoTK,~Male_RoTK,
  "Elf",        1229,       971,       331,     513,         183,       510,
  "Hobbit",       14,      3644,         0,    2463,           2,      2673,
  "Man",           0,      1995,       401,    3589,         268,      2459
)

Race	Female_LoTR	Male_LoTR	Female_TT	Male_TT	Female_RoTK	Male_RoTK
Elf	1229	971	331	513	183	510
Hobbit	14	3644	0	2463	2	2673
Man	0	1995	401	3589	268	2459

In this data, the prefixes Female_ and Male_ represent the column groups. Thus, as Kirill Müller suggests in the comment, these columns can be bundled (with the sub-columns renamed) to:

Race	Female			Male
Race	LoTR	TT	RoTK	LoTR	TT	RoTK
Elf	1229	331	183	971	513	510
Hobbit	14	0	2	3644	2463	2673
Man	0	401	268	1995	3589	2459

With bundle() we can write this as:

d_bundled <- d %>% 
  bundle(
    Female = c(LoTR = Female_LoTR, TT = Female_TT, RoTK = Female_RoTK),
    Male   = c(LoTR = Male_LoTR,   TT = Male_TT,   RoTK = Male_RoTK)
  )

d_bundled

# A tibble: 3 × 3
  Race   Female$LoTR   $TT $RoTK Male$LoTR   $TT $RoTK
  <chr>        <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 Elf           1229   331   183       971   513   510
2 Hobbit          14     0     2      3644  2463  2673
3 Man              0   401   268      1995  3589  2459

`gather()` the bundles

Remember gather() strips colnames and convert it to a column. We can do this operation for bundled data.frames in the same manner. But, unlike gather() for flat data.frames, we don’t need to specify a colname for values, because the contents in bundles already have their colnames.

Let’s gather Female and Male bundles into key column.

d_gathered <- d_bundled %>%
  gather_bundles(Female, Male, .key = "key")

d_gathered

# A tibble: 6 × 5
  Race   key     LoTR    TT  RoTK
  <chr>  <chr>  <dbl> <dbl> <dbl>
1 Elf    Female  1229   331   183
2 Hobbit Female    14     0     2
3 Man    Female     0   401   268
4 Elf    Male     971   513   510
5 Hobbit Male    3644  2463  2673
6 Man    Male    1995  3589  2459

Race	key	LoTR	TT	RoTK
Elf	Female	1229	331	183
Hobbit	Female	14	0	2
Man	Female	0	401	268
Elf	Male	971	513	510
Hobbit	Male	3644	2463	2673
Man	Male	1995	3589	2459

Now we have all parts for implementing multi-gather(). I did bundling by manual, but we can have a helper function to find the common prefixes and bundle them automatically. So, multi-gather() will be something like:

d %>%
  auto_bundle(-Race) %>% 
  gather_bundles()

# A tibble: 6 × 5
  Race   key     LoTR    TT  RoTK
  <chr>  <chr>  <dbl> <dbl> <dbl>
1 Elf    Female  1229   331   183
2 Hobbit Female    14     0     2
3 Man    Female     0   401   268
4 Elf    Male     971   513   510
5 Hobbit Male    3644  2463  2673
6 Man    Male    1995  3589  2459

`spread()` to the bundles

As we already saw it’s possible to gather() multiple bundles, now it’s obvious that we can spread() multiple columns into multiple bundles vice versa. So, let me skip the details here.

We can multi-spread():

d_bundled_again <- d_gathered %>%
  spread_bundles(key, LoTR:RoTK)

d_bundled_again

# A tibble: 3 × 3
  Race   Female$LoTR   $TT $RoTK Male$LoTR   $TT $RoTK
  <chr>        <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 Elf           1229   331   183       971   513   510
2 Hobbit          14     0     2      3644  2463  2673
3 Man              0   401   268      1995  3589  2459

Then, unbundle() flattens the bundles to prefixes.

d_bundled_again %>%
  unbundle(-Race)

# A tibble: 3 × 7
  Race   Female_LoTR Female_TT Female_RoTK Male_LoTR Male_TT Male_RoTK
  <chr>        <dbl>     <dbl>       <dbl>     <dbl>   <dbl>     <dbl>
1 Elf           1229       331         183       971     513       510
2 Hobbit          14         0           2      3644    2463      2673
3 Man              0       401         268      1995    3589      2459

It’s done. By combining these two steps, multi-spread() will be something like this:

d_gathered %>%
  spread_bundles(key, LoTR:RoTK) %>% 
  unbundle(-Race)

Considerations

As I described above, multi-gather() doesn’t need the column name for value. On the other hand, usual gather() needs a new colname. Because, while it needs a name to become a column, an atomic column doesn’t have inner names.

Similarly, usual spread() can be considered as a special version of multi-spread(). Consider the case when we multi-spread()ing one column:

# an example in ?tidyr::spread
df <- tibble(x = c("a", "b"), y = c(3, 4), z = c(5, 6))

spread_bundles(df, key = x, y, simplify = FALSE)

# A tibble: 2 × 3
      z   a$y   b$y
  <dbl> <dbl> <dbl>
1     5     3    NA
2     6    NA     4

Since y is the only one column in the data, we can simplify these 1-column data.frames to vectors:

spread_bundles(df, key = x, y, simplify = TRUE)

# A tibble: 2 × 3
      z     a     b
  <dbl> <dbl> <dbl>
1     5     3    NA
2     6    NA     4

This is usual spread().

I’m yet to see if we can improve the current spread() and gather() to handle these differences transparently…

Future plans

Probably, this post is too much about the implementational details. I need to think about the interfaces before proposing this on tidyr’s repo.

Any suggestions or feedbacks are welcome!

Wait, what is multi-gather() and multi-spread()??