Open
Problem Description
Marketing sites often include information like phone numbers and addresses in the footer. If you need that information in your dataset, you can't use `onlyMainContent`, meaning you end up with `n * numPages` copies of the header and footer.
Proposed Feature
A flag, used with or instead of `onlyMainContent`, that either returns header/footer/etc. data as a separate 'page' or applies `onlyMainContent` to every scraped page except the first.
Alternatives Considered
- Deduplication post-scrape - doable, but a bit messy and sometimes unreliable
- Accept the repeated data - it often eats into LLM context windows
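
For reference, the post-scrape deduplication alternative might look roughly like the sketch below. It assumes each scraped page is available as a plain-text string with blank lines between blocks; the function name `split_repeated_blocks` and the `min_share` parameter are made up for illustration and are not part of any existing API.

```python
from collections import Counter


def split_repeated_blocks(pages, min_share=1.0):
    """Separate blocks that recur across pages from page-specific blocks.

    pages: list of page texts (assumed input: one string per scraped page).
    min_share: fraction of pages a block must appear on to count as shared.
    Returns (shared_blocks, cleaned_pages).
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return [], list(pages)

    # Split every page into paragraph-like blocks on blank lines.
    split_pages = [
        [block.strip() for block in page.split("\n\n") if block.strip()]
        for page in pages
    ]

    # Count on how many pages each block appears (at most once per page).
    page_counts = Counter()
    for blocks in split_pages:
        page_counts.update(set(blocks))

    threshold = min_share * len(pages)
    shared = {block for block, count in page_counts.items() if count >= threshold}

    # Keep each shared block once, in order of first appearance.
    shared_blocks = []
    for blocks in split_pages:
        for block in blocks:
            if block in shared and block not in shared_blocks:
                shared_blocks.append(block)

    # Rebuild each page without the shared blocks.
    cleaned_pages = [
        "\n\n".join(block for block in blocks if block not in shared)
        for blocks in split_pages
    ]
    return shared_blocks, cleaned_pages
```

Exact string matching is where the unreliability comes from: a footer that differs slightly between pages (a highlighted nav item, a changing date) will not be detected as shared, and lowering `min_share` risks dropping content that legitimately repeats.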
Implementation Suggestions
Whatever mechanism is used to exclude the non-main content could be used in reverse to grab it exclusively.
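
A minimal sketch of that idea, assuming the exclusion is based on a fixed set of container tags (I don't know how the scraper actually decides what counts as main content, so `NON_MAIN_TAGS`, `RegionTextExtractor`, and `extract_text` are all hypothetical). The same filter is flipped with a single boolean:

```python
from html.parser import HTMLParser

# Assumed set of "non-main" containers; the scraper's real rules may differ.
NON_MAIN_TAGS = {"header", "footer", "nav", "aside"}


class RegionTextExtractor(HTMLParser):
    """Collects text either inside or outside the non-main containers."""

    def __init__(self, want_non_main):
        super().__init__()
        self.want_non_main = want_non_main
        self.depth = 0  # nesting depth inside non-main containers
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_MAIN_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NON_MAIN_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        inside_non_main = self.depth > 0
        # Keep the text only if it lies on the side of the filter we asked for.
        if text and inside_non_main == self.want_non_main:
            self.chunks.append(text)


def extract_text(html, want_non_main):
    """Return text chunks from the main or the non-main part of a page."""
    parser = RegionTextExtractor(want_non_main)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `extract_text(html, want_non_main=True)` once per site and `extract_text(html, want_non_main=False)` for every page would give the "separate page" behavior described above.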
Use Case
It would allow for a happy medium option of how much non-main content to include.