[R] Filtering based on str_detect character columns with more than 4000 characters and occasional empty cells not working correctly when reading from disk with arrow #41175

### Describe the bug, including details regarding any error messages, version, and platform.

I've spent a few hours trying to pinpoint exactly when this issue appears. The reprex below should make this clear. 

The issue appears with a data frame in which:
- one column contains very long strings of text (the issue does not seem to emerge with short strings)
- that column has at least one empty element (empty as `""`; `NA` values do not seem to be a problem); if such rows are removed, the issue does not emerge

The issue appears only if:
- the data frame is stored with `write_dataset()` and then read with `open_dataset()`. If it is created in memory with `arrow_table()`, the issue does not appear. If the dataset is stored partitioned (grouped before writing), the issue is apparently limited to groups where an empty string is present.
- the dataset is filtered based on string matching with `stringr::str_detect()` (or `grepl()`) *before* collecting.

Under these conditions, the filter returns an incomplete set of rows. If the same `arrow` dataset is collected before filtering, it returns the expected result.

No error or warning is thrown when the incomplete set of rows is returned, so the user will not notice unless they run additional checks.
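As a rough illustration of the kind of additional check meant here, the sketch below compares row counts when the same filter is applied before and after `collect()`. The helper name `compare_filter_counts()` is hypothetical (it is not part of arrow), and it assumes the column to search is called `text`, as in the reprex below.

``` r
library("dplyr")
library("stringr")
library("arrow")

# Hypothetical helper, not part of arrow: apply the same str_detect() filter
# before and after collect() and report both row counts.
# Assumes the dataset has a character column named `text`.
compare_filter_counts <- function(dataset, pat) {
  n_before <- dataset |>
    filter(str_detect(string = text, pattern = pat)) |>
    collect() |>
    nrow()

  n_after <- dataset |>
    collect() |>
    filter(str_detect(string = text, pattern = pat)) |>
    nrow()

  list(filter_before_collect = n_before,
       filter_after_collect = n_after,
       consistent = n_before == n_after)
}

# Small in-memory example, where the two counts are expected to agree
small_table <- arrow_table(data.frame(text = c("abc", "xyz", "")))
check_result <- compare_filter_counts(small_table, "a")
```

If `consistent` is `FALSE` for a dataset opened with `open_dataset()`, the filter applied before collecting has dropped rows.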

In my real-world case, this happens with textual corpora; it seems to happen more frequently (i.e. even with shorter strings) in corpora with non-Latin characters, but I haven't found the exact threshold.
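One possible explanation for the non-Latin observation, offered only as an unverified assumption, is that the relevant threshold depends on the size of the string in bytes rather than in characters. In UTF-8, Cyrillic letters take two bytes each, so a string reaches a given byte size with half as many characters. A minimal sketch using base R `nchar()`:

``` r
# Assumption (not verified against arrow internals): the threshold may be
# related to byte length rather than character count.
latin_text <- paste(rep("a", 3000), collapse = "")
cyrillic_text <- enc2utf8(paste(rep("\u0436", 3000), collapse = ""))  # Cyrillic "zhe"

# Same number of characters...
nchar(latin_text, type = "chars")
nchar(cyrillic_text, type = "chars")

# ...but the Cyrillic string is twice as large in bytes
nchar(latin_text, type = "bytes")
nchar(cyrillic_text, type = "bytes")
```

Comparing `nchar(x, type = "bytes")` for the affected and unaffected rows of a real corpus may help narrow down the threshold.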

Tested with both the current CRAN release and the current development version; details are in the reprex below.


``` r
library("tibble")
library("dplyr")
library("stringr")
library("purrr")
library("arrow")

set.seed(1)

### Create a data frame with a column with long string, and another for testing grouping
rows <- 100
data_df <- purrr::map(.x = 1:rows,
                      .progress = TRUE,
                      .f = function(x) {
  tibble::tibble(text = paste(sample(x = c(letters, LETTERS),
                                     size = sample(0:10000, size = 1), replace = TRUE), collapse = ""),
                 category = sample(x = 1:10, size = 1))
}) |> 
  purrr::list_rbind()

### Add a few empty cells
data_df[["text"]][sample(c(TRUE, FALSE), size = rows, prob = c(0.05, 0.95), replace = TRUE)] <- ""


### Store in a temp folder
test_arrow_path <- file.path(tempdir(), "test_arrow")
write_dataset(dataset = data_df |> 
                dplyr::group_by(category),
              path = test_arrow_path)

### Read from temp folder
arrow_from_disk <- open_dataset(test_arrow_path)
### Read from memory
arrow_from_memory <- arrow_table(data_df |> 
                                   dplyr::group_by(category))


arrow_from_disk_filtered <- arrow_from_disk |> 
  filter(str_detect(string = text, pattern = "a"))

arrow_from_memory_filtered <- arrow_from_memory |> 
  filter(str_detect(string = text, pattern = "a"))

data_df |> 
  filter(str_detect(string = text, pattern = "a")) |> 
  nrow()
#> [1] 97

arrow_from_disk_filtered_n_rows <- arrow_from_disk_filtered |> 
  dplyr::collect() |>
  nrow()

arrow_from_disk_filtered_n_rows
#> [1] 78

arrow_from_memory_filtered_n_rows <- arrow_from_memory_filtered |> 
  dplyr::collect() |> 
  nrow()

arrow_from_memory_filtered_n_rows
#> [1] 97

## different number of rows, while they should be the same
arrow_from_memory_filtered_n_rows==arrow_from_disk_filtered_n_rows
#> [1] FALSE


## filter before collecting gives wrong result
arrow_from_disk |> 
  filter(str_detect(string = text, pattern = "a")) |> 
  dplyr::collect() |> 
  nrow()
#> [1] 78

## filter after collecting gives correct result
arrow_from_disk |> 
  dplyr::collect() |> 
  filter(str_detect(string = text, pattern = "a")) |> 
  nrow()
#> [1] 97


## write to disk but without grouping 

test_arrow_path_no_group <- file.path(tempdir(), "test_arrow_no_group")

write_dataset(dataset = data_df,
              path = test_arrow_path_no_group)

arrow_from_disk_no_group <- open_dataset(test_arrow_path_no_group)

arrow_from_disk_no_group |> 
  filter(str_detect(string = text, pattern = "a")) |> 
  dplyr::collect() |> 
  nrow()
#> [1] 0

arrow_from_disk_no_group |> 
  dplyr::collect() |> 
  filter(str_detect(string = text, pattern = "a")) |> 
  nrow()
#> [1] 97

arrow_from_disk_no_group |> 
  filter(grepl(x = text, "a")) |> 
  dplyr::collect() |> 
  nrow()
#> [1] 0


packageVersion("arrow")
#> [1] '15.0.2.9000'

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora Linux 38 (Workstation Edition)
#> 
#> Matrix products: default
#> BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
#>  [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
#>  [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Rome
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] arrow_15.0.2.9000 purrr_1.0.2       stringr_1.5.1     dplyr_1.1.4      
#> [5] tibble_3.2.1     
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.2         knitr_1.46        rlang_1.1.3      
#>  [5] xfun_0.43         stringi_1.8.3     generics_0.1.3    assertthat_0.2.1 
#>  [9] bit_4.0.5         glue_1.7.0        htmltools_0.5.8.1 fansi_1.0.6      
#> [13] rmarkdown_2.26    evaluate_0.23     fastmap_1.1.1     yaml_2.3.8       
#> [17] lifecycle_1.0.4   compiler_4.3.3    fs_1.6.3          pkgconfig_2.0.3  
#> [21] rstudioapi_0.16.0 digest_0.6.35     R6_2.5.1          reprex_2.1.0     
#> [25] tidyselect_1.2.1  utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
#> [29] bit64_4.0.5       tools_4.3.3       withr_3.0.0
```

<sup>Created on 2024-04-12 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>


### Component(s)

R
