[Feat] Option to include header/footer once when using onlyMainContent #1518

Open
@iandoesallthethings

Problem Description
Marketing sites often put information like phone numbers and addresses in the footer. If you need that information in your dataset, you can't use onlyMainContent, which means every scraped page carries its own copy of the header and footer (numPages copies in total).

Proposed Feature

A flag, used with or instead of onlyMainContent, that either returns the header/footer/etc. data as a separate 'page' or applies onlyMainContent to every scraped page except the first.

Alternatives Considered

  • Deduplication post-scrape - doable, but a bit messy and sometimes unreliable
  • Accept the repeated data - it often eats into LLM context windows
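
For reference, the post-scrape deduplication workaround can be sketched roughly as below. This is a minimal illustration, not anything Firecrawl provides: `split_shared_lines` and its `threshold` parameter are hypothetical names, and it assumes each page is already plain text or markdown with the header/footer included.

```python
from collections import Counter


def split_shared_lines(pages: list[str], threshold: float = 1.0):
    """Separate lines repeated across pages from page-specific lines.

    Hypothetical helper: `threshold` is the fraction of pages a line must
    appear on before it is treated as shared header/footer content.
    """
    # Count how many pages each non-empty line appears on (once per page).
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # A line must appear on at least two pages to count as repeated.
    min_pages = max(2, int(threshold * len(pages)))
    shared = {ln for ln, n in counts.items() if n >= min_pages}

    seen = set()
    shared_once = []  # shared lines, kept once, in first-seen order
    cleaned = []      # each page with the shared lines removed
    for page in pages:
        kept = []
        for line in page.splitlines():
            key = line.strip()
            if key in shared:
                if key not in seen:
                    seen.add(key)
                    shared_once.append(key)
            else:
                kept.append(line)
        cleaned.append("\n".join(kept).strip())
    return shared_once, cleaned
```

The unreliability shows up quickly: any main-content line that happens to repeat on every page (a shared call to action, for example) is misclassified as header/footer, and a footer that varies slightly between pages is not caught at all.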

Implementation Suggestions
Whatever mechanism is used to exclude the non-main content could be used in reverse to grab it exclusively.
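
To illustrate the idea, here is a rough sketch that assumes the exclusion is tag-based. That is an assumption on my part, not a description of how onlyMainContent actually works; `NON_MAIN_TAGS`, `ContentSplitter`, and `split_content` are hypothetical names. The same rule that drops non-main text can route it into its own bucket instead.

```python
from html.parser import HTMLParser

# Assumption: non-main content is whatever sits inside these tags.
NON_MAIN_TAGS = {"header", "footer", "nav", "aside"}


class ContentSplitter(HTMLParser):
    """Collect text into main and non-main buckets in a single pass."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside non-main tags
        self.main = []      # text outside header/footer/nav/aside
        self.non_main = []  # text inside them

    def handle_starttag(self, tag, attrs):
        if tag in NON_MAIN_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NON_MAIN_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.non_main if self.depth else self.main).append(text)


def split_content(html: str):
    """Return (main_text, non_main_text) as lists of text fragments."""
    parser = ContentSplitter()
    parser.feed(html)
    parser.close()
    return parser.main, parser.non_main
```

With something like this, a crawl could keep the non-main bucket for the first page only (or emit it as its own 'page') and return just the main bucket for the rest.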

Use Case
It would offer a middle ground for how much non-main content to include.
