Change in PDF Extraction Results #30

Hi, today I noticed a sudden change in the way text is extracted from PDFs. It seems like a lot of the binary content is being included. This is causing our tests to fail:

![image](https://github.com/user-attachments/assets/0478b567-24ac-440e-8d27-5d6f8d334673)

We were able to resolve this quickly on our end by downgrading the package version, but wanted to give you a heads-up.

EDIT: On further investigation, it looks like a change in the Python API caused the issue. The call that used to return a reader now returns a tuple:

```text
Traceback (most recent call last):
  File "/home/bls/Downloads/code/bbot/bbot/modules/extractous.py", line 135, in extract_text
    buffer = reader.read(4096)
             ^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'read'
```
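
For anyone hitting the same error, here is a minimal workaround sketch that tolerates both return shapes. It assumes the new return value is a `(reader, metadata)` tuple with the reader first; the traceback only confirms that it is a tuple, so check that against the release you have installed. The helper names `unwrap_reader` and `read_text` are hypothetical and not part of the library.

```python
import io


def unwrap_reader(result):
    """Return the stream from a bare reader or from a tuple.

    Assumption: when a tuple is returned, the reader is its first element.
    """
    if isinstance(result, tuple):
        return result[0]
    return result


def read_text(result, chunk_size=4096, encoding="utf-8"):
    """Read the whole stream in chunks and decode it once at the end."""
    reader = unwrap_reader(result)
    chunks = []
    while True:
        buffer = reader.read(chunk_size)
        if not buffer:
            break
        chunks.append(bytes(buffer))
    # Decoding after joining avoids splitting a multi-byte character
    # across two chunks.
    return b"".join(chunks).decode(encoding, errors="replace")


# Stand-ins for the old and new return shapes:
old_style = io.BytesIO(b"extracted text")
new_style = (io.BytesIO(b"extracted text"), {"pages": "1"})
print(read_text(old_style))
print(read_text(new_style))
```

In our module, this would mean passing the result of the extraction call through `unwrap_reader` before the `reader.read(4096)` loop, so the same code runs against either package version.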
