-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[8.x](backport #41762) Use fingerprint
file identity by default and migrate file state from native
or path
#42126
Conversation
…m `native` or `path` (#41762) This commit changes the default `file_identity` from `native` to `fingerprint`, any previous state from `native` (or `path`) is automatically migrated to `fingerprint` whe Filestream is starting. The Filestream input has always had the [ability to update file identifiers](https://github.com/elastic/beats/blob/4278366ab03221e8b62183dc06f9505f6ccc5209/filebeat/input/filestream/prospector.go#L104-L122), however it never worked as expected, leading to full data duplication when changing the file identity. This commit fixes it to allow changing the file identity from `native` (inode + device ID) and `path` to `fingerprint` without any data duplication. (cherry picked from commit 78fe7a5) # Conflicts: # filebeat/tests/integration/filestream_test.go
Cherry-pick of 78fe7a5 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
d8c1b46
to
ef081b9
Compare
This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏 |
1 similar comment
This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏 |
The plan is to backport the state migration bit, which is not a breaking change, but remove the defaults change, the breaking change. That's the plan: #41762 (comment) |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
This pull request is now in conflicts. Could you fix it? 🙏
|
ef081b9 reverted the breaking change part of the original PR. |
…to mergify/bp/8.x/pr-41762
|
||
[source,yaml] | ||
---- | ||
file_identity.fingerprint: ~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what is the meaning of the tilde character here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes the object file_identity.fingerprint
to exist. I believe some times when the config is exported/dumped as YAML, it ends up as file_identity.fingerprint: null
.
We need file_identity.fingerprint
defined, but it does not have any attribute, nor can it be a primitive type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, thanks. I suspected this was some kind of placeholder, but I was unsure if this was some kind of yaml/parser thing but it seems to be just a project convention, TIL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏 |
@belimawr looks like the doc build is failing with this error which seems legit: Could you please take a look? |
Yes, I'll take a look today. |
Proposed commit message
The Filestream input has always had the ability to update file identifiers,
however it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from
native
(inode + device ID) andpath
tofingerprint
without any data duplication.Checklist
I have made corresponding change to the default configuration filesCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Disruptive User Impact
None, this backport does not change the default behaviour.
Author's Checklist
Regarding the Elastic-Agent integration tests, most tests actually use the
log
input because when they were written, Filestream was not available as an integration package. The very few other test that use Filestrem either generate a log file large enough or are skipped as flaky.How to test this PR locally
Ensure this backport is not changing the default behaviour
Create a log file with at least a few log lines and more than 1kb (e.g:
/tmp/flog.log
, 15 log lines), you can useflog
with Docker:Start Filebeat with the following configuration
filebeat.yml (native)
Wait until the file is fully ingested (wait for
End of file reached: /tmp/flog.log; Backoff now.
in the logs)Stop Filebeat
Look at the registry log file (cat
data/registry/filebeat/log.json
) and ensure all keys (k
) start withfilestream::test-migrate-ID::native::
, ex:Test the state migration
Create a log file with at least a few log lines and more than 1kb (e.g:
/tmp/flog.log
, 15 log lines), you can useflog
with Docker:Start Filebeat with the following configuration
filebeat.yml (native)
Wait until the file is fully ingested (wait for
End of file reached: /tmp/flog.log; Backoff now.
in the logs)Ensure all events have been published to the output (
wc -l ./output-file*
should return 15)Stop Filebeat
Change the file identity to
fingerprint
. It's the new default, hence it's not explicitly set.filebeat.yml (fingerprint)
Start Filebeat
Wait until the Filebeat "finds the end of the file" (wait for
End of file reached: /tmp/flog.log; Backoff now.
in the logs)Ensure no extra event was published ((
wc -l ./output-file*
should still return 15)Add 10 more lines to the file:
Wait until the new lines are ingested (wait for
End of file reached: /tmp/flog.log; Backoff now.
in the logs)Ensure all events have been published to the output with no duplication (
wc -l ./output-file*
should return 25)Related issues
Use cases
Dealing with identity reuse (e.g: inode reuse) without facing re-ingestion of data with Filestream input
## ScreenshotsLogs
This is an automatic backport of pull request #41762 done by [Mergify](https://mergify.com).