Description
Describe the bug, including details regarding any error messages, version, and platform.
I've spent a few hours trying to pinpoint exactly when this issue appears. The reprex below should make this clear.
The type of dataset creating this issue is a data frame with:
- a column containing very long strings of text (the issue does not seem to emerge with short strings)
- at least one empty element in that column (empty as ""; NA values do not seem to be an issue); if such rows are removed, the issue does not emerge
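To check whether a given text column meets these two conditions, a minimal sketch in base R could look like the following. The helper name `profile_text_column` is hypothetical and not part of arrow or any other package; it only counts empty strings and NA values separately and reports the longest string.

```r
# Hypothetical helper (not part of any package): summarise a character vector
# for the two conditions described above.
profile_text_column <- function(x) {
  stopifnot(is.character(x))
  # nchar(type = "chars") returns NA for NA_character_ elements
  n_chars <- nchar(x, type = "chars")
  list(
    n_empty   = sum(!is.na(x) & x == ""),  # empty strings, excluding NA
    n_na      = sum(is.na(x)),             # missing values, counted separately
    max_chars = max(n_chars, na.rm = TRUE) # length of the longest string
  )
}

# Example with one empty string, one NA, and one long string
profile_text_column(c("abc", "", NA, strrep("x", 5000)))
```

A column with `n_empty > 0` and a large `max_chars` matches the pattern described in this report; the exact length threshold is not known.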
The issue appears only if:
- the data frame is stored with write_dataset() and is then read with open_dataset(). If it is created with arrow_table() in memory, the issue does not appear. If the dataset is stored partitioned (grouped before writing), the issue is apparently limited to groups where an empty string is present.
- the dataset is filtered based on string matching with stringr::str_detect (or grepl) before collecting.
Under these conditions, the filter returns an incomplete set of rows. If the same arrow connection is collected before filtering, it returns the expected result.
Although it returns an incomplete set of rows, it throws no errors or warnings: the user will not notice unless they conduct additional tests.
In my real-world case, this happens with textual corpora; it seems to happen more frequently (i.e. even with shorter strings) in corpora with non-Latin characters, but I haven't found the exact threshold.
Tested with both the current CRAN version and the current development version; details are in the reprex below.
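Since the mismatch is silent, one way to surface it is to count matches both ways and compare. The following is a minimal sketch, not an arrow feature: `compare_filter_counts` is a hypothetical helper, and it assumes the dataset has a character column named `text`, as in the reprex.

```r
library(dplyr)
library(stringr)
library(arrow)

# Hypothetical helper (not part of arrow): count matching rows when filtering
# before collect() and after collect(). Assumes a column named `text`.
compare_filter_counts <- function(ds, pattern) {
  n_filter_first <- ds |>
    filter(str_detect(text, pattern)) |>
    collect() |>
    nrow()
  n_collect_first <- ds |>
    collect() |>
    filter(str_detect(text, pattern)) |>
    nrow()
  c(filter_then_collect = n_filter_first, collect_then_filter = n_collect_first)
}

# Usage on a small in-memory table, where both counts are expected to agree
tab <- arrow_table(data.frame(text = c("abc", "", "xyz")))
compare_filter_counts(tab, "a")
```

If the two counts differ for a dataset opened from disk, the filter has dropped rows as described above. Note that the second count collects the whole dataset into memory, so this check may not be practical for very large datasets.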
library("tibble")
library("dplyr")
library("stringr")
library("purrr")
library("arrow")
set.seed(1)
### Create a data frame with a column with long string, and another for testing grouping
rows <- 100
data_df <- purrr::map(
  .x = 1:rows,
  .progress = TRUE,
  .f = function(x) {
    tibble::tibble(
      text = paste(
        sample(x = c(letters, LETTERS), size = sample(0:10000, size = 1), replace = TRUE),
        collapse = ""
      ),
      category = sample(x = 1:10, size = 1)
    )
  }
) |>
  purrr::list_rbind()
### Add a few empty cells
data_df[["text"]][sample(c(TRUE, FALSE), size = rows, prob = c(0.05, 0.95), replace = TRUE)] <- ""
### Store in a temp folder
test_arrow_path <- file.path(tempdir(), "test_arrow")
write_dataset(
  dataset = data_df |> dplyr::group_by(category),
  path = test_arrow_path
)
### Read from temp folder
arrow_from_disk <- open_dataset(test_arrow_path)
### Read from memory
arrow_from_memory <- arrow_table(
  data_df |> dplyr::group_by(category)
)
arrow_from_disk_filtered <- arrow_from_disk |>
  filter(str_detect(string = text, pattern = "a"))
arrow_from_memory_filtered <- arrow_from_memory |>
  filter(str_detect(string = text, pattern = "a"))
data_df |>
  filter(str_detect(string = text, pattern = "a")) |>
  nrow()
#> [1] 97
arrow_from_disk_filtered_n_rows <- arrow_from_disk_filtered |>
  dplyr::collect() |>
  nrow()
arrow_from_disk_filtered_n_rows
#> [1] 78
arrow_from_memory_filtered_n_rows <- arrow_from_memory_filtered |>
  dplyr::collect() |>
  nrow()
arrow_from_memory_filtered_n_rows
#> [1] 97
## different number of rows, while they should be the same
arrow_from_memory_filtered_n_rows == arrow_from_disk_filtered_n_rows
#> [1] FALSE
## filter before collecting gives wrong result
arrow_from_disk |>
  filter(str_detect(string = text, pattern = "a")) |>
  dplyr::collect() |>
  nrow()
#> [1] 78
## filter after collecting gives correct result
arrow_from_disk |>
  dplyr::collect() |>
  filter(str_detect(string = text, pattern = "a")) |>
  nrow()
#> [1] 97
## write to disk but without grouping
test_arrow_path_no_group <- file.path(tempdir(), "test_arrow_no_group")
write_dataset(
  dataset = data_df,
  path = test_arrow_path_no_group
)
arrow_from_disk_no_group <- open_dataset(test_arrow_path_no_group)
arrow_from_disk_no_group |>
  filter(str_detect(string = text, pattern = "a")) |>
  dplyr::collect() |>
  nrow()
#> [1] 0
arrow_from_disk_no_group |>
  dplyr::collect() |>
  filter(str_detect(string = text, pattern = "a")) |>
  nrow()
#> [1] 97
arrow_from_disk_no_group |>
  dplyr::collect() |>
  filter(str_detect(string = text, pattern = "a")) |>
  nrow()
#> [1] 97
arrow_from_disk_no_group |>
  filter(grepl(x = text, "a")) |>
  dplyr::collect() |>
  nrow()
#> [1] 0
packageVersion("arrow")
#> [1] '15.0.2.9000'
sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora Linux 38 (Workstation Edition)
#>
#> Matrix products: default
#> BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_IE.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_IE.UTF-8
#> [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_IE.UTF-8
#> [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Rome
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] arrow_15.0.2.9000 purrr_1.0.2 stringr_1.5.1 dplyr_1.1.4
#> [5] tibble_3.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.2 knitr_1.46 rlang_1.1.3
#> [5] xfun_0.43 stringi_1.8.3 generics_0.1.3 assertthat_0.2.1
#> [9] bit_4.0.5 glue_1.7.0 htmltools_0.5.8.1 fansi_1.0.6
#> [13] rmarkdown_2.26 evaluate_0.23 fastmap_1.1.1 yaml_2.3.8
#> [17] lifecycle_1.0.4 compiler_4.3.3 fs_1.6.3 pkgconfig_2.0.3
#> [21] rstudioapi_0.16.0 digest_0.6.35 R6_2.5.1 reprex_2.1.0
#> [25] tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3
#> [29] bit64_4.0.5 tools_4.3.3 withr_3.0.0
Created on 2024-04-12 with reprex v2.1.0
Component(s)
R