Skip to content

[pkg/stanza][fileconsumer] - finalize archive's implementation #39901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

VihasMakwana
Copy link
Contributor

@VihasMakwana VihasMakwana commented May 6, 2025

This PR finalizes the archive implementation and adds test case.

It updates following things:

  1. Updates tracker to track unmatched files as well.
  2. Introduces a handleUnmatchedFiles helper to create readers once archive matching is done.

Relates: #38056

@VihasMakwana VihasMakwana force-pushed the archive-implementation branch from 91f3ea8 to 496796f Compare May 6, 2025 18:52
@VihasMakwana VihasMakwana changed the title [pkg/stanza][fileconsumer[ - finalize archive's implementation [pkg/stanza][fileconsumer] - finalize archive's implementation May 6, 2025
@VihasMakwana VihasMakwana force-pushed the archive-implementation branch from 496796f to ecce69e Compare May 6, 2025 19:12
@VihasMakwana
Copy link
Contributor Author

@djaglowski @andrzej-stencel Looking forward to hear your thoughts on this approach 😄

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 30, 2025
@VihasMakwana
Copy link
Contributor Author

cc: @andrzej-stencel

@github-actions github-actions bot removed the Stale label May 31, 2025
@atoulme atoulme removed the request for review from djaglowski June 3, 2025 00:19
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 21, 2025
Copy link
Member

@andrzej-stencel andrzej-stencel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last comments on the docs and we should be good to go.

@@ -65,6 +65,7 @@ Tails and parses logs from files.
| `ordering_criteria.sort_by.format` | | Relevant if `sort_type` is set to `timestamp`. Defines the strptime format of the timestamp being sorted. |
| `ordering_criteria.sort_by.ascending` | | Sort direction |
| `compression` | | Indicate the compression format of input files. If set accordingly, files will be read using a reader that uncompresses the file before scanning its content. Options are ``, `gzip`, or `auto`. `auto` auto-detects file compression type. Currently, gzip files are the only compressed files auto-detected, based on ".gz" filename extension. `auto` option is useful when ingesting a mix of compressed and uncompressed files with the same filelogreceiver. |
| `polls_to_archive` | | This settings control the number of poll cycles to store on disk, rather than being discarded. By default, the receiver will purge the record of readers for existed for 3 generations. Refer [archiving](#archiving) and [polling](../../pkg/stanza/fileconsumer/design.md#polling) for more details. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should mention what the default value is.

@@ -221,6 +222,10 @@ Here is some of the information the file log receiver stores:

Exactly how this information is serialized depends on the type of storage being used.

### Archiving

If `polls_to_archive` setting is used in conjunction with `storage` setting, file offsets older than three poll cycles are stored on disk rather than being discarded. This feature enables the receiver to remember file for a longer period and also aims to use limited amount of memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when polls_to_archive is used but storage is not set? Should describe this in the docs too.

@@ -221,6 +222,10 @@ Here is some of the information the file log receiver stores:

Exactly how this information is serialized depends on the type of storage being used.

### Archiving

If `polls_to_archive` setting is used in conjunction with `storage` setting, file offsets older than three poll cycles are stored on disk rather than being discarded. This feature enables the receiver to remember file for a longer period and also aims to use limited amount of memory.
Copy link
Member

@andrzej-stencel andrzej-stencel Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view, this documentation does not explain to the user why they should use it. Can you describe a specific scenario when using this option changes the behavior of the receiver to the benefit of the user? This section should describe the scenario and what the behavior is without and with this option set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrzej-stencel I've added a scenario. Can you take a look?

@@ -65,6 +65,7 @@ Tails and parses logs from files.
| `ordering_criteria.sort_by.format` | | Relevant if `sort_type` is set to `timestamp`. Defines the strptime format of the timestamp being sorted. |
| `ordering_criteria.sort_by.ascending` | | Sort direction |
| `compression` | | Indicate the compression format of input files. If set accordingly, files will be read using a reader that uncompresses the file before scanning its content. Options are ``, `gzip`, or `auto`. `auto` auto-detects file compression type. Currently, gzip files are the only compressed files auto-detected, based on ".gz" filename extension. `auto` option is useful when ingesting a mix of compressed and uncompressed files with the same filelogreceiver. |
| `polls_to_archive` | | This settings control the number of poll cycles to store on disk, rather than being discarded. By default, the receiver will purge the record of readers for existed for 3 generations. Refer [archiving](#archiving) and [polling](../../pkg/stanza/fileconsumer/design.md#polling) for more details. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mark this option as experimental, so that we can possibly change it without breaking the beta stability guarantees of the component?

I feel this implementation may be not the ultimate solution and would like to reserve the right to change it in the future.

Copy link
Contributor Author

@VihasMakwana VihasMakwana Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can mark this experimental.

Just to confirm, you are suggesting we use this feature under a feature gate. Right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm no, I wasn't thinking about a feature gate. This feature can be available for use as is, let's just mention in the docs that it's experimental.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay got it. Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants