dplyr Doesn't Provide Full Support For S4 (For Now?)

February 12, 2018 by Hiroaki Yutani

I’ve seen sooo many (duplicate) issues on this topic opened on dplyr’s repo and lubridate’s repo.

For example, you cannot filter() intervals. Consider the data below:

library(lubridate)
#> Loading required package: methods
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(dplyr, warn.conflicts = FALSE)

# some methods of tibble() won't work for intervals, so use data.frame() here
d <- data.frame(
  i     = interval(ymd(10000101) + years(1:3 * 1000), ymd(10000102) + years(1:3 * 1000)),
  value = 1:3
)

d
#>                                i value
#> 1 2000-01-01 UTC--2000-01-02 UTC     1
#> 2 3000-01-01 UTC--3000-01-02 UTC     2
#> 3 4000-01-01 UTC--4000-01-02 UTC     3

Let’s select the second row with filter().

d %>% 
  filter(value == 2L)
#> Warning in format.data.frame(x, digits = digits, na.encode = FALSE):
#> corrupt data frame: columns will be truncated or padded with NAs
#>                                i value
#> 1 2000-01-01 UTC--2000-01-02 UTC     2

Hmm, can you see that something is wrong with the result? In case you haven’t noticed yet, comparing it with the result of base subset() may help:

subset(d, value == 2L)
#>                                i value
#> 2 3000-01-01 UTC--3000-01-02 UTC     2

As you can see, the result should be 3000-01-01 UTC--3000-01-02 UTC, whereas filter() returns 2000-01-01 UTC--2000-01-02 UTC. Why? This is related to the structure of the Interval class. Let’s examine it with str():

str(d$i)
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#>   ..@ .Data: num [1:3] 86400 86400 86400
#>   ..@ start: POSIXct[1:3], format: "2000-01-01" ...
#>   ..@ tzone: chr "UTC"

Interval objects consist of 3 slots, 2 of which (.Data and start) have the same length. This means that, to subset this vector, we need to subset .Data and start in the same way. But…

d %>% 
  filter(value == 2L) %>%
  str()
#> 'data.frame':	1 obs. of  2 variables:
#>  $ i    :Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num 86400
#>   .. ..@ start: POSIXct, format: "2000-01-01" ...
#>   .. ..@ tzone: chr "UTC"
#>  $ value: int 2

Notice that filter() subset .Data but failed to subset start, which still begins at 2000-01-01. In contrast, subset() properly subsets the slot as well:

str(subset(d, value == 2L))
#> 'data.frame':	1 obs. of  2 variables:
#>  $ i    :Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num 86400
#>   .. ..@ start: POSIXct, format: "3000-01-01"
#>   .. ..@ tzone: chr "UTC"
#>  $ value: int 2

This is because subset() dispatches the proper S4 method for Interval, while dplyr fails to handle S4.
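To see the S4 method at work outside a data frame, here is a small sketch that subsets the interval vector directly with `[`. The variable names `i` and `i2` are just for illustration; the point is that both same-length slots are subset together.

```r
library(lubridate)

# the same intervals as in the data above
i <- interval(ymd(10000101) + years(1:3 * 1000), ymd(10000102) + years(1:3 * 1000))

# `[` on an Interval goes through lubridate's S4 method,
# so .Data and start are subset in the same way
i2 <- i[2]

length(i2@.Data)
#> [1] 1
length(i2@start)
#> [1] 1

# the start of the selected interval is the 3000-01-01 one
int_start(i2)
```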

Of course, the maintainers are aware of this issue and seem to plan to address it in the long-awaited package, vctrs.
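Until then, here are two ways to sidestep the problem. This is only a sketch of my own, not something recommended by the maintainers: either subset rows with base indexing (which, like subset(), subsets the Interval properly), or keep the endpoints as plain POSIXct columns and build the interval after filter(). The names `d2`, `start`, `end`, `res`, and `i_filtered` are made up for this example.

```r
library(lubridate)
library(dplyr, warn.conflicts = FALSE)

d <- data.frame(
  i     = interval(ymd(10000101) + years(1:3 * 1000), ymd(10000102) + years(1:3 * 1000)),
  value = 1:3
)

# Option 1: base row indexing instead of filter()
r <- d[d$value == 2L, ]

# Option 2: keep the endpoints as plain POSIXct columns,
# filter them with dplyr, then build the interval afterwards
d2 <- data.frame(
  start = ymd(c("2000-01-01", "3000-01-01", "4000-01-01"), tz = "UTC"),
  end   = ymd(c("2000-01-02", "3000-01-02", "4000-01-02"), tz = "UTC"),
  value = 1:3
)

res <- d2 %>%
  filter(value == 2L)

# constructed outside of dplyr, so no S4 column ever passes through filter()
i_filtered <- interval(res$start, res$end)
```

Option 2 costs an extra column, but the data frame never holds an S4 column while dplyr is working on it.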

So, apparently, the content of this post won’t stay useful over time. But, for now, I feel this temporary “known issue” should be well-known, at least among those who suffer from it.

I hope this post will be outdated soon!