Performance of fill() with labelled data after group_by() #658

@etiennebacher

This problem is very similar to tidyverse/tidyr#520. When there is a large number of groups, fill() is much slower on a labelled column than on a plain numeric column (a median of about 2 s versus about 42 ms in the benchmark below).

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(haven)

set.seed(2)
n <- 1e4
my_sample <- sample(c(1:10, NA), n, replace = TRUE)
df <- tibble(
  group = sample(paste("id", 1:(n/4)), n, replace = TRUE),
  num = my_sample,
  lab = haven::labelled(my_sample)
) %>% 
  group_by(group)

bench::mark(
  num = fill(df, num, .direction = "updown"),
  lab = fill(df, lab, .direction = "updown"),
  check = FALSE
)[1:4]
#> # A tibble: 2 x 4
#>   expression      min   median `itr/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl>
#> 1 num         37.09ms  41.65ms    23.3  
#> 2 lab           1.99s    1.99s     0.502
#> Warning message:
#> Some expressions had a GC in every iteration; so filtering is disabled. 

Note that the two columns take a similar amount of time when the data is not grouped:

set.seed(2)
n <- 1e4
my_sample <- sample(c(1:10, NA), n, replace = TRUE)
df <- tibble(
  num = my_sample,
  lab = haven::labelled(my_sample)
) 

bench::mark(
  num = fill(df, num, .direction = "updown"),
  lab = fill(df, lab, .direction = "updown"),
  check = FALSE
)[1:4]
#> # A tibble: 2 x 4
#>   expression      min   median `itr/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl>
#> 1 num          22.4ms   24.4ms      39.4
#> 2 lab          26.1ms   32.9ms      29.7

These benchmarks were run with the development versions of tidyr, dplyr and haven.
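
A possible workaround, sketched here as an assumption and not something benchmarked in this report: fill the bare values and reattach the labels afterwards, so that the grouped fill() only ever sees a plain vector. `fill_labelled()` is a hypothetical helper name, not a tidyr or haven function. It relies on `vctrs::vec_data()` to drop the `haven_labelled` class, and whether it actually closes the gap would need to be checked with `bench::mark()`.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(haven)

# Hypothetical helper (not part of tidyr or haven): fill one labelled column
# by working on its bare values, then rebuild the labelled vector.
fill_labelled <- function(data, col, direction = "updown") {
  original <- data[[col]]
  # Strip the class and attributes so fill() sees a plain vector.
  data[[col]] <- vctrs::vec_data(original)
  out <- fill(data, all_of(col), .direction = direction)
  # Reattach the value labels and variable label from the original column.
  out[[col]] <- labelled(
    out[[col]],
    labels = attr(original, "labels"),
    label = attr(original, "label")
  )
  out
}

# Small grouped example with the same shape as the data in the report.
small <- tibble(
  group = c("a", "a", "a", "b", "b"),
  lab = labelled(c(NA, 1L, NA, NA, 2L), labels = c(low = 1L, high = 2L))
) %>%
  group_by(group)

res <- fill_labelled(small, "lab")
```

The helper handles a single column given by name. It assumes the column is not a grouping variable and that only the `labels` and `label` attributes need to be preserved.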
