Enhancing gather() and spread() by Using “Bundled” data.frames

tidyr
tidy data
Author

Hiroaki Yutani

Published

February 3, 2019

Last month, I tried to explain gather() and spread() by gt package (https://yutani.rbind.io/post/gather-and-spread-explained-by-gt/). But, after I implemented experimental multi-gather() and multi-spread(), I realized that I need a bit different way of explanation… So, please forget the post, and read this with fresh eyes!

Wait, what is multi-gather() and multi-spread()??

In short, the current gather() and spread() have a limitation; they can gather into or spread from only one column at once. So, if we want to handle multiple columns, we need to coerce them to one column before actually gathering or spreading.

This is especially problematic when the columns have different types. For example, date column is unexpectedly converted to integers with the following code:

library(tibble)
library(tidyr)

# a bit different version of https://github.com/tidyverse/tidyr/issues/149#issue-124411755
d <- tribble(
  ~place, ~censor,                  ~date, ~status,
    "g1",    "c1",  as.Date("2019-02-01"),   "ok",
    "g1",    "c2",  as.Date("2019-02-01"),  "bad",
    "g1",    "c3",  as.Date("2019-02-01"),   "ok",
    "g2",    "c1",  as.Date("2019-02-01"),  "bad",
    "g2",    "c2",  as.Date("2019-02-02"),   "ok"
)

d %>%
  gather(key = element, value = value, date, status) %>%
  unite(thing, place, element, remove = TRUE) %>%
  spread(thing, value, convert = TRUE)
Warning: attributes are not identical across measure variables;
they will be dropped
# A tibble: 3 × 5
  censor g1_date g1_status g2_date g2_status
  <chr>    <int> <chr>       <int> <chr>    
1 c1       17928 ok          17928 bad      
2 c2       17928 bad         17929 ok       
3 c3       17928 ok             NA <NA>     

Here, we need better spread() and gather(), which can handle multiple columns. For more discussions, you can read the following issues:

In this post, I’m trying to explain an approach to solve this by using “bundled” data.frames, which is originally proposed by Kirill Müller.

“Bundled” data.frames

For convenience, I use a new term “bundle” for separating some of the columns of a data.frame to another data.frame, and assigning the new data.frame to a column, and “unbundle” for the opposite operation.

For example, “bundling X, Y, and Z” means converting this

id X Y Z
1 0.1 a TRUE
2 0.2 b FALSE
3 0.3 c TRUE

to something like this:

id foo
X Y Z
1 0.1 a TRUE
2 0.2 b FALSE
3 0.3 c TRUE

You might wonder if this is really possible without dangerous hacks. But, with tibble package (2D columns are supported now), this is as easy as:

tibble(
  id = 1:3,
  foo = tibble(
    X = 1:3 * 0.1,
    Y = letters[1:3],
    Z = c(TRUE, FALSE, TRUE)
  )
)
# A tibble: 3 × 2
     id foo$X $Y    $Z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE 

For more information about data.frame columns, please see Advanced R.

An experimental package for this

I created a package for bundling, tiedr. Since this is just an experiment, I don’t seriously introduce this. But, for convenience, let me use this package in this post because, otherwise, the code would be a bit long and hard to read…

https://github.com/yutannihilation/tiedr

I need four functions from this package, bundle(), unbundle(), gather_bundles(), and spread_bundles(). gather_bundles() and spread_bundles() are some kind of the variants of gather() and spread(), so probably you can guess the usages. Here, I just explain about the first two functions briefly.

bundle()

bundle() bundles columns. It takes data, and the specifications of bundles in the form of new_col1 = c(col1, col2, ...), new_col2 = c(col3, col4, ...), ....

library(tiedr)

d <- tibble(id = 1:3, X = 1:3 * 0.1, Y = letters[1:3], Z = c(TRUE, FALSE, TRUE))

bundle(d, foo = X:Z)
# A tibble: 3 × 2
     id foo$X $Y    $Z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE 

bundle() also can rename the sub-columns at the same time.

bundle(d, foo = c(x = X, y = Y, z = Z))
# A tibble: 3 × 2
     id foo$x $y    $z   
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE 

unbundle()

unbundle() unbundles columns. This operation is almost the opposite of what bundle() does; one difference is that this adds the names of the bundle as prefixes in order to avoid name collisions. In case the prefix is not needed, we can use sep = NULL.

d %>%
  bundle(foo = X:Z) %>% 
  unbundle(foo)
# A tibble: 3 × 4
     id foo_X foo_Y foo_Z
  <int> <dbl> <chr> <lgl>
1     1   0.1 a     TRUE 
2     2   0.2 b     FALSE
3     3   0.3 c     TRUE 

Expose hidden structures in colnames as bundles

One of the meaningful usage of bundled data.frame is to express the structure of a data. Suppose we have this data (from tidyverse/tidyr#150):

d <- tribble(
  ~Race,~Female_LoTR,~Male_LoTR,~Female_TT,~Male_TT,~Female_RoTK,~Male_RoTK,
  "Elf",        1229,       971,       331,     513,         183,       510,
  "Hobbit",       14,      3644,         0,    2463,           2,      2673,
  "Man",           0,      1995,       401,    3589,         268,      2459
)
Race Female_LoTR Male_LoTR Female_TT Male_TT Female_RoTK Male_RoTK
Elf 1229 971 331 513 183 510
Hobbit 14 3644 0 2463 2 2673
Man 0 1995 401 3589 268 2459

In this data, the prefixes Female_ and Male_ represent the column groups. Thus, as Kirill Müller suggests in the comment, these columns can be bundled (with the sub-columns renamed) to:

Race Female Male
LoTR TT RoTK LoTR TT RoTK
Elf 1229 331 183 971 513 510
Hobbit 14 0 2 3644 2463 2673
Man 0 401 268 1995 3589 2459

With bundle() we can write this as:

d_bundled <- d %>% 
  bundle(
    Female = c(LoTR = Female_LoTR, TT = Female_TT, RoTK = Female_RoTK),
    Male   = c(LoTR = Male_LoTR,   TT = Male_TT,   RoTK = Male_RoTK)
  )

d_bundled
# A tibble: 3 × 3
  Race   Female$LoTR   $TT $RoTK Male$LoTR   $TT $RoTK
  <chr>        <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 Elf           1229   331   183       971   513   510
2 Hobbit          14     0     2      3644  2463  2673
3 Man              0   401   268      1995  3589  2459

gather() the bundles

Remember gather() strips colnames and convert it to a column. We can do this operation for bundled data.frames in the same manner. But, unlike gather() for flat data.frames, we don’t need to specify a colname for values, because the contents in bundles already have their colnames.

Let’s gather Female and Male bundles into key column.

d_gathered <- d_bundled %>%
  gather_bundles(Female, Male, .key = "key")

d_gathered
# A tibble: 6 × 5
  Race   key     LoTR    TT  RoTK
  <chr>  <chr>  <dbl> <dbl> <dbl>
1 Elf    Female  1229   331   183
2 Hobbit Female    14     0     2
3 Man    Female     0   401   268
4 Elf    Male     971   513   510
5 Hobbit Male    3644  2463  2673
6 Man    Male    1995  3589  2459
Race key LoTR TT RoTK
Elf Female 1229 331 183
Hobbit Female 14 0 2
Man Female 0 401 268
Elf Male 971 513 510
Hobbit Male 3644 2463 2673
Man Male 1995 3589 2459

Now we have all parts for implementing multi-gather(). I did bundling by manual, but we can have a helper function to find the common prefixes and bundle them automatically. So, multi-gather() will be something like:

d %>%
  auto_bundle(-Race) %>% 
  gather_bundles()
# A tibble: 6 × 5
  Race   key     LoTR    TT  RoTK
  <chr>  <chr>  <dbl> <dbl> <dbl>
1 Elf    Female  1229   331   183
2 Hobbit Female    14     0     2
3 Man    Female     0   401   268
4 Elf    Male     971   513   510
5 Hobbit Male    3644  2463  2673
6 Man    Male    1995  3589  2459

spread() to the bundles

As we already saw it’s possible to gather() multiple bundles, now it’s obvious that we can spread() multiple columns into multiple bundles vice versa. So, let me skip the details here.

We can multi-spread():

d_bundled_again <- d_gathered %>%
  spread_bundles(key, LoTR:RoTK)

d_bundled_again
# A tibble: 3 × 3
  Race   Female$LoTR   $TT $RoTK Male$LoTR   $TT $RoTK
  <chr>        <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1 Elf           1229   331   183       971   513   510
2 Hobbit          14     0     2      3644  2463  2673
3 Man              0   401   268      1995  3589  2459

Then, unbundle() flattens the bundles to prefixes.

d_bundled_again %>%
  unbundle(-Race)
# A tibble: 3 × 7
  Race   Female_LoTR Female_TT Female_RoTK Male_LoTR Male_TT Male_RoTK
  <chr>        <dbl>     <dbl>       <dbl>     <dbl>   <dbl>     <dbl>
1 Elf           1229       331         183       971     513       510
2 Hobbit          14         0           2      3644    2463      2673
3 Man              0       401         268      1995    3589      2459

It’s done. By combining these two steps, multi-spread() will be something like this:

d_gathered %>%
  spread_bundles(key, LoTR:RoTK) %>% 
  unbundle(-Race)

Considerations

As I described above, multi-gather() doesn’t need the column name for value. On the other hand, usual gather() needs a new colname. Because, while it needs a name to become a column, an atomic column doesn’t have inner names.

Similarly, usual spread() can be considered as a special version of multi-spread(). Consider the case when we multi-spread()ing one column:

# an example in ?tidyr::spread
df <- tibble(x = c("a", "b"), y = c(3, 4), z = c(5, 6))

spread_bundles(df, key = x, y, simplify = FALSE)
# A tibble: 2 × 3
      z   a$y   b$y
  <dbl> <dbl> <dbl>
1     5     3    NA
2     6    NA     4

Since y is the only one column in the data, we can simplify these 1-column data.frames to vectors:

spread_bundles(df, key = x, y, simplify = TRUE)
# A tibble: 2 × 3
      z     a     b
  <dbl> <dbl> <dbl>
1     5     3    NA
2     6    NA     4

This is usual spread().

I’m yet to see if we can improve the current spread() and gather() to handle these differences transparently…

Future plans

Probably, this post is too much about the implementational details. I need to think about the interfaces before proposing this on tidyr’s repo.

Any suggestions or feedbacks are welcome!