searching in tabular data in PDF fails #289

@xkcd386at

Describe the bug

When a PDF contains a table, rga does not find matches for foo.*bar when bar is in a later column of the same row as foo.

To Reproduce

All the PDFs I have are my financial documents, so I found a publicly accessible one that shows the difference:

$ wget -o /dev/null https://assets.accessible-digital-documents.com/uploads/2017/01/sample-tables.pdf

$ ls
sample-tables.pdf

$ rga 'Financial.*22.5'

As you can see, this produces no output, but there is a match to be found if the .* is allowed to span columns, as pdfgrep shows:

$ pdfgrep 'Financial.*22.5' -r
./sample-tables.pdf:Policy functions            Financial                            22.5           30.57
./sample-tables.pdf:Policy functions            Financial                           22.5      30.57
./sample-tables.pdf:Policy functions          Financial                          22.5        30.57
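
One possible explanation, not verified against rga's PDF adapter, is that the two tools see differently shaped text: pdfgrep's output above keeps each table row on one line, whereas an extraction that emits each cell on its own line would give a line-oriented regex nothing to match. The sketch below illustrates only that mechanism. The two sample strings are made up for illustration and are not real output from either tool.

```shell
# Hypothetical extraction A: layout preserved, the whole row is on one line.
layout='Policy functions   Financial   22.5   30.57'

# Hypothetical extraction B: each table cell ends up on its own line.
flow=$(printf 'Policy functions\nFinancial\n22.5\n30.57')

# grep -c counts matching lines; it exits non-zero on zero matches,
# so "|| true" keeps the script going under "set -e".
layout_hits=$(printf '%s\n' "$layout" | grep -c 'Financial.*22.5' || true)
flow_hits=$(printf '%s\n' "$flow" | grep -c 'Financial.*22.5' || true)

echo "row on one line:  $layout_hits matching line(s)"
echo "one cell per line: $flow_hits matching line(s)"
```

To check whether this applies to the real file, and assuming poppler-utils is installed, comparing the output of `pdftotext sample-tables.pdf -` with that of `pdftotext -layout sample-tables.pdf -` would show whether the row stays on one line in each mode.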

Operating System and Version

Xubuntu 24.04 LTS

Output of rga --version

ripgrep-all 0.10.9

What I searched

I searched the open issues for "table" and "column" before posting and found #232, which did not help me. (Edit: I wasn't sure whether that was the same issue.) Apologies if this has already been reported and I missed it.

Labels: bug