enhancement(file source): skip redundant file fingerprinting for already-watched files #25602

Open
vparfonov wants to merge 2 commits into vectordotdev:master from vparfonov:skip-redundant-fingerprinting

@vparfonov

Contributor

Summary

  • Skip fingerprinting files that are already being watched during glob cycles
  • Reduces disk I/O syscalls by ~79% (open/read/lseek) for stable file sets
  • Detects file truncation via metadata check (1 stat() call) to preserve correctness

Every glob cycle (default 60s), FileServer fingerprints every file returned by the paths provider, even files already tracked in fp_map. Each fingerprint involves ~6 syscalls (open, seek, read magic bytes, seek, read first line, EOF check).

On large Kubernetes clusters with 500+ pods, this causes thousands of unnecessary read syscalls per minute, saturating disk I/O and disrupting other node services (e.g., etcd on control plane nodes).

Changes

Before the glob loop, build a reverse lookup (path → fingerprint) from fp_map. When glob returns a path, check the reverse map first. If the path is already tracked, skip fingerprinting entirely — no file I/O needed.
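A minimal sketch of the reverse lookup described above, not the PR's actual code. `Fingerprint`, `Watcher`, and `build_watched_paths` are hypothetical stand-ins, and a plain `HashMap` stands in for whatever map type fp_map really uses:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for the file source's fingerprint type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Fingerprint(pub u64);

/// Hypothetical stand-in for a per-file watcher: the path being tailed
/// and how many bytes have been read from it so far.
pub struct Watcher {
    pub path: PathBuf,
    pub file_position: u64,
}

/// Build the reverse lookup (path -> fingerprint and read position).
/// Intended to run once per glob cycle, before iterating the globbed paths.
pub fn build_watched_paths(
    fp_map: &HashMap<Fingerprint, Watcher>,
) -> HashMap<PathBuf, (Fingerprint, u64)> {
    fp_map
        .iter()
        .map(|(fp, watcher)| (watcher.path.clone(), (*fp, watcher.file_position)))
        .collect()
}
```

A path returned by glob that is present in this map is a candidate for skipping; a path that is absent goes through normal fingerprinting.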

To handle file truncation (same path, different content), we do one stat() call to compare the current file size against our read position. If the file is smaller than where we last read (metadata.len() < file_position), it was truncated — fall through to full fingerprinting so Vector detects the change and re-reads from the beginning. If stat() fails (file deleted, permissions error), we also fall through to full fingerprinting.
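The truncation check can be sketched roughly as follows. This is an illustration under assumptions: `can_skip_fingerprint` is a hypothetical helper, and it uses the synchronous `std::fs::metadata` for brevity, whereas the real change may use an async call:

```rust
use std::fs;
use std::path::Path;

/// Returns true when fingerprinting can be skipped for an already-watched path.
/// One stat() call: if the file is at least as large as our read position it is
/// treated as not truncated. Any stat() failure (deleted file, permissions error)
/// returns false so the caller falls through to full fingerprinting.
pub fn can_skip_fingerprint(path: &Path, file_position: u64) -> bool {
    match fs::metadata(path) {
        Ok(metadata) => metadata.len() >= file_position,
        Err(_) => false,
    }
}
```

Returning `false` on error keeps the slow path as the default whenever the file's state is uncertain.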

Measured Impact (500 files, 35s trace, glob_minimum_cooldown_ms = 10000 (10s))

| Syscall | Before | After | Reduction |
|---------|--------|-------|-----------|
| open    | 1,503  | 5     | 99.7%     |
| lseek   | 3,000  | 0     | 100%      |
| read    | 4,500  | 2,500 | 44%       |
| total   | 12,033 | 2,555 | 78.8%     |
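For reference, the Reduction column follows from (before - after) / before. A small sketch of that arithmetic, with `reduction_pct` as a hypothetical helper name:

```rust
/// Percentage reduction going from `before` to `after`.
/// Returns 0.0 when `before` is zero to avoid dividing by zero.
pub fn reduction_pct(before: u64, after: u64) -> f64 {
    if before == 0 {
        return 0.0;
    }
    (before as f64 - after as f64) / before as f64 * 100.0
}
```

With the measured counts, `reduction_pct(1503, 5)` rounds to 99.7 and `reduction_pct(12033, 2555)` rounds to 78.8, matching the reported figures.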

Vector configuration

How did you test this PR?

Existing tests pass; no regressions.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

…iles

  On each glob cycle, FileServer fingerprinted every file returned by the
  paths provider, even files already being actively watched. Each
  fingerprint involves syscalls (open, seek, read magic bytes, seek,
  read first line, EOF check). On clusters with 500+ pods this caused
  thousands of unnecessary read syscalls per minute, saturating disk I/O
  and disrupting etcd on control plane nodes.

  Add a path-based reverse lookup before fingerprinting. If a file path
  is already tracked in fp_map and hasn't been truncated (file size >=
  read position), skip fingerprinting entirely. Truncated files still
  fall through to full fingerprinting to preserve correct behavior.

  Measured impact (500 files, 35s trace):
  - open:  1,503 → 5    (99.7% reduction)
  - lseek: 3,000 → 0    (100% reduction)
  - read:  4,500 → 2,500 (44% reduction, remaining are data reads)
  - total: 12,033 → 2,555 (78.8% reduction)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Vitalii Parfonov <vparfono@redhat.com>
@vparfonov vparfonov requested a review from a team as a code owner June 9, 2026 16:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04bac4e09f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/file-source/src/file_server.rs Outdated
Comment thread lib/file-source/src/file_server.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efaf039b73

Comment on lines +203 to +207
if let Some((file_id, file_position, devno, inode)) = watched_paths.get(&path)
    && let Ok(metadata) = fs::metadata(&path).await
    && metadata.portable_dev() == *devno
    && metadata.portable_ino() == *inode
    && metadata.len() >= *file_position

P1: Keep fingerprinting files that may be copy-truncated

When checksum fingerprinting is configured, copytruncate-style rotation keeps the same dev/inode while replacing the file contents; if the file is refilled to at least the old read offset before the next discovery pass, this condition treats it as unchanged and skips fingerprinting. The previous path would detect the changed first-lines checksum and start a new watcher from the beginning, but the fast path leaves the existing watcher at file_position, causing the beginning of the new log file to be dropped in that rotation scenario.
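The scenario can be reproduced with a small sketch. It is Unix-only and illustrative: `Tracked`, `fast_path_applies`, and `copytruncate_is_missed` are hypothetical names, and std's `dev()`/`ino()` stand in for the `portable_dev()`/`portable_ino()` calls in the quoted diff:

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: provides dev() and ino()
use std::path::Path;

/// Hypothetical record of what was known about a file at the previous glob cycle.
pub struct Tracked {
    pub dev: u64,
    pub ino: u64,
    pub file_position: u64,
}

/// Mirrors the quoted condition: same device, same inode, size >= read position.
pub fn fast_path_applies(path: &Path, tracked: &Tracked) -> bool {
    match fs::metadata(path) {
        Ok(m) => {
            m.dev() == tracked.dev
                && m.ino() == tracked.ino
                && m.len() >= tracked.file_position
        }
        Err(_) => false,
    }
}

/// Simulates copytruncate between two glob cycles and reports whether the
/// fast path would still be taken (true means the content change is missed).
pub fn copytruncate_is_missed(path: &Path) -> io::Result<bool> {
    // Cycle 1: the file has been fully read, so file_position equals its size.
    fs::write(path, b"old line 1\nold line 2\n")?;
    let m = fs::metadata(path)?;
    let tracked = Tracked { dev: m.dev(), ino: m.ino(), file_position: m.len() };

    // Rotation: fs::write truncates the existing file in place (same inode),
    // then refills it with new content that grows past the old read offset.
    fs::write(path, b"NEW line 1\nNEW line 2\nNEW line 3\n")?;

    // Cycle 2: metadata alone cannot tell the contents were replaced.
    Ok(fast_path_applies(path, &tracked))
}
```

In this sketch the second cycle still takes the fast path even though every byte before the old offset has changed, which is the case a first-lines checksum would have caught.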

Contributor Author

This optimization skips fingerprinting for files that have the same path, inode, and size >= read position. There is one edge case it does not cover: copytruncate-style log rotation where the file is truncated and refilled past the old read position within a single glob cycle (60s). In that scenario, the fast path would not detect the content change and the beginning of the new log content could be silently missed.

This does NOT affect the kubernetes_logs source — kubelet always creates new numbered files (new inode) on rotation, which the inode check catches. It only affects the generic file source with logrotate copytruncate and very high write throughput.

Before this change, fingerprinting every file on every cycle would catch this case (by detecting the changed first-line CRC). The tradeoff is ~79% fewer disk I/O syscalls in exchange for this narrow edge case.

If needed, a periodic forced re-fingerprint (e.g., every Nth cycle) could be added as a follow-up to cover this scenario without losing the I/O improvement.
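One possible shape for that follow-up, purely as a sketch and not part of this PR: a counter that forces the slow path on every Nth glob cycle. `RefingerprintPolicy` and `begin_cycle` are hypothetical names, and N would presumably need a new config option:

```rust
/// Hypothetical policy that bypasses the fast path on every Nth glob cycle.
pub struct RefingerprintPolicy {
    every_n_cycles: u64,
    cycle: u64,
}

impl RefingerprintPolicy {
    pub fn new(every_n_cycles: u64) -> Self {
        // Treat 0 as "force on every cycle" so the modulo below never divides by zero.
        Self { every_n_cycles: every_n_cycles.max(1), cycle: 0 }
    }

    /// Call once at the start of each glob cycle.
    /// Returns true when all files should be fully fingerprinted this cycle.
    pub fn begin_cycle(&mut self) -> bool {
        self.cycle = self.cycle.wrapping_add(1);
        self.cycle % self.every_n_cycles == 0
    }
}
```

With N = 3, cycles 3, 6, 9, and so on would re-fingerprint everything. That bounds how long a missed copytruncate can go unnoticed to N cycles, though content skipped before detection could still need re-reading.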

WDYT, Vector team?
