[8.x](backport #41762) Use fingerprint file identity by default and migrate file state from native or path #42126

Merged on Jan 6, 2025 (10 commits)

mergify bot (Contributor) commented Dec 19, 2024

Proposed commit message

The Filestream input has always had the ability to update file identifiers;
however, it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from native (inode + device ID) and
path to fingerprint without any data duplication.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None, this backport does not change the default behaviour.

Author's Checklist

  • Test with dynamic config reload
  • Test with Kubernetes
  • Test with Elastic-Agent
  • Fix all the tests that break with the new behaviour
  • Investigate which integration tests are going to break in the Elastic-Agent repo

Regarding the Elastic-Agent integration tests, most tests actually use the log input because, when they were written, Filestream was not available as an integration package. The very few other tests that use Filestream either generate a log file large enough or are skipped as flaky.

How to test this PR locally

Ensure this backport is not changing the default behaviour

  1. Create a log file with at least a few log lines and larger than 1 KB (e.g. /tmp/flog.log with 15 log lines). You can use flog with Docker:

    docker run -it --rm mingrammer/flog -n 15 > /tmp/flog.log
    
  2. Start Filebeat with the following configuration

    filebeat.yml (native)

    filebeat.inputs:
      - type: filestream
        id: "test-migrate-ID"
        paths:
          - /tmp/flog.log
    
    queue.mem:
      flush.timeout: 0s
    
    output.file:
      path: ${path.home}
      filename: "output-file"
      rotate_on_startup: false
    
    logging:
      level: debug
      selectors:
        - input
        - input.filestream
        - input.filestream.prospector
      metrics:
        enabled: false

  3. Wait until the file is fully ingested (wait for "End of file reached: /tmp/flog.log; Backoff now." in the logs)

  4. Stop Filebeat

  5. Look at the registry log file (cat data/registry/filebeat/log.json) and ensure all keys ("k") start with "filestream::test-migrate-ID::native::", e.g.:

    {"k":"filestream::test-migrate-ID::native::154893-40","v":{"ttl":-1,"updated":[280187419146957,1735836208],"cursor":{"offset":1515},"meta":{"source":"/tmp/flog.log","identifier_name":"native"}}}
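
The check in step 5 can also be scripted. The following is a minimal sketch, assuming the registry log contains one JSON object per line and that lines without a "k" field are not state entries; the helper names `registry_keys` and `keys_without_prefix` are hypothetical and not part of Filebeat.

```python
import json


def registry_keys(lines):
    """Yield the "k" field of every JSON line that carries one."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Assumption: lines without a "k" field (for example operation
        # markers) are not state entries and can be skipped.
        if isinstance(entry, dict) and "k" in entry:
            yield entry["k"]


def keys_without_prefix(lines, prefix):
    """Return the registry keys that do NOT start with the expected prefix."""
    return [key for key in registry_keys(lines) if not key.startswith(prefix)]


# Sample entry copied from the step above; in practice, pass
# open("data/registry/filebeat/log.json") instead of this list.
sample = [
    '{"k":"filestream::test-migrate-ID::native::154893-40","v":{"ttl":-1,'
    '"updated":[280187419146957,1735836208],"cursor":{"offset":1515},'
    '"meta":{"source":"/tmp/flog.log","identifier_name":"native"}}}',
]

unexpected = keys_without_prefix(sample, "filestream::test-migrate-ID::native::")
print(unexpected)  # prints [] when every key has the expected prefix
```

An empty list means every state entry uses the expected identity prefix; any key that is returned points at an entry stored under a different identity.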
    

Test the state migration

  1. Create a log file with at least a few log lines and larger than 1 KB (e.g. /tmp/flog.log with 15 log lines). You can use flog with Docker:

    docker run -it --rm mingrammer/flog -n 15 > /tmp/flog.log
    
  2. Start Filebeat with the following configuration

    filebeat.yml (native)

    filebeat.inputs:
      - type: filestream
        id: "test-migrate-ID"
        paths:
          - /tmp/flog.log
        file_identity.native: ~
        prospector:
          scanner:
            check_interval: 0.1s
            fingerprint.enabled: false
    
    queue.mem:
      flush.timeout: 0s
    
    output.file:
      path: ${path.home}
      filename: "output-file"
      rotate_on_startup: false
    
    logging:
      level: debug
      selectors:
        - input
        - input.filestream
        - input.filestream.prospector
      metrics:
        enabled: false

  3. Wait until the file is fully ingested (wait for "End of file reached: /tmp/flog.log; Backoff now." in the logs)

  4. Ensure all events have been published to the output (wc -l ./output-file* should return 15)

  5. Stop Filebeat

  6. Change the file identity to fingerprint. It's the new default, hence it's not explicitly set.

    filebeat.yml (fingerprint)

    filebeat.inputs:
      - type: filestream
        id: "test-migrate-ID"
        paths:
          - /tmp/flog.log
        prospector:
          scanner:
            check_interval: 0.1s
    
    queue.mem:
      flush.timeout: 0s
    
    output.file:
      path: ${path.home}
      filename: "output-file"
      rotate_on_startup: false
    
    logging:
      level: debug
      selectors:
        - input
        - input.filestream
        - input.filestream.prospector
      metrics:
        enabled: false

  7. Start Filebeat

  8. Wait until Filebeat "finds the end of the file" (wait for "End of file reached: /tmp/flog.log; Backoff now." in the logs)

  9. Ensure no extra event was published (wc -l ./output-file* should still return 15)

  10. Add 10 more lines to the file:

    docker run -it --rm mingrammer/flog -n 10 >> /tmp/flog.log
    
  11. Wait until the new lines are ingested (wait for "End of file reached: /tmp/flog.log; Backoff now." in the logs)

  12. Ensure all events have been published to the output with no duplication (wc -l ./output-file* should return 25)
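
The event counts checked with wc -l in the steps above (15 before appending, 25 after) can be asserted from a script as well. This is a minimal sketch, assuming one event per line in the output files; `count_events` and the file name `output-file-1.ndjson` are hypothetical, and the demo writes throwaway files instead of reading real Filebeat output.

```python
import glob
import os
import tempfile


def count_events(pattern):
    """Count lines across all files matching the glob pattern."""
    total = 0
    for path in glob.glob(pattern):
        with open(path, "r", encoding="utf-8") as handle:
            total += sum(1 for _ in handle)
    return total


# Demo on temporary files: 15 events first, then 10 more appended.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output-file-1.ndjson")  # hypothetical file name
    with open(out, "w", encoding="utf-8") as handle:
        handle.write("{}\n" * 15)
    before = count_events(os.path.join(tmp, "output-file*"))

    with open(out, "a", encoding="utf-8") as handle:
        handle.write("{}\n" * 10)
    after = count_events(os.path.join(tmp, "output-file*"))

print(before, after)  # prints "15 25"
```

Against a real run, calling `count_events("./output-file*")` from the Filebeat home directory after steps 9 and 12 should give the same two numbers if no event was duplicated.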

Related issues

Use cases

Dealing with identity reuse (e.g. inode reuse) without re-ingesting data with the Filestream input


This is an automatic backport of pull request #41762 done by [Mergify](https://mergify.com).

…m `native` or `path` (#41762)

This commit changes the default `file_identity` from `native` to
`fingerprint`; any previous state from `native` (or `path`) is
automatically migrated to `fingerprint` when Filestream starts.

The Filestream input has always had the [ability to update file identifiers](https://github.com/elastic/beats/blob/4278366ab03221e8b62183dc06f9505f6ccc5209/filebeat/input/filestream/prospector.go#L104-L122);
however, it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from `native` (inode + device ID) and
`path` to `fingerprint` without any data duplication.

(cherry picked from commit 78fe7a5)

# Conflicts:
#	filebeat/tests/integration/filestream_test.go
mergify bot requested a review from a team as a code owner December 19, 2024 19:12
mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels Dec 19, 2024
mergify bot requested review from belimawr and faec and removed request for a team December 19, 2024 19:12

mergify bot (Contributor, Author) commented Dec 19, 2024

Cherry-pick of 78fe7a5 has failed:

On branch mergify/bp/8.x/pr-41762
Your branch is up to date with 'origin/8.x'.

You are currently cherry-picking commit 78fe7a5b7.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   CHANGELOG.next.asciidoc
	modified:   filebeat/_meta/config/filebeat.global.reference.yml.tmpl
	modified:   filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
	modified:   filebeat/docs/faq.asciidoc
	modified:   filebeat/docs/inputs/input-filestream-file-options.asciidoc
	modified:   filebeat/docs/inputs/input-filestream.asciidoc
	modified:   filebeat/filebeat.reference.yml
	modified:   filebeat/include/list.go
	modified:   filebeat/input/filestream/environment_test.go
	modified:   filebeat/input/filestream/fswatch.go
	modified:   filebeat/input/filestream/fswatch_test.go
	modified:   filebeat/input/filestream/identifier.go
	modified:   filebeat/input/filestream/identifier_test.go
	modified:   filebeat/input/filestream/input_integration_test.go
	modified:   filebeat/input/filestream/input_test.go
	modified:   filebeat/input/filestream/internal/input-logfile/prospector.go
	modified:   filebeat/input/filestream/internal/input-logfile/store.go
	modified:   filebeat/input/filestream/internal/input-logfile/store_test.go
	modified:   filebeat/input/filestream/legacy_metrics_integration_test.go
	modified:   filebeat/input/filestream/metrics_integration_test.go
	modified:   filebeat/input/filestream/parsers_integration_test.go
	modified:   filebeat/input/filestream/prospector.go
	modified:   filebeat/input/filestream/prospector_creator.go
	modified:   filebeat/input/filestream/prospector_test.go
	new file:   filebeat/input/filestream/testdata/log.log
	modified:   filebeat/tests/integration/event_log_file_test.go
	modified:   filebeat/tests/integration/filestream_truncation_test.go
	modified:   filebeat/tests/integration/store_test.go
	modified:   filebeat/tests/integration/translate_ldap_attribute_test.go
	modified:   filebeat/tests/system/config/filestream-fixup-id.yml.j2
	modified:   filebeat/tests/system/test_reload_inputs.py
	modified:   libbeat/tests/integration/framework.go
	modified:   x-pack/filebeat/filebeat.reference.yml

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   filebeat/tests/integration/filestream_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Dec 19, 2024
belimawr force-pushed the mergify/bp/8.x/pr-41762 branch from d8c1b46 to ef081b9 December 19, 2024 22:15

mergify bot (Contributor, Author) commented Dec 23, 2024

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

mergify bot (Contributor, Author) commented Dec 30, 2024

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

jlind23 (Collaborator) commented Dec 30, 2024

@belimawr as this is a breaking change, do we really want to backport this to 8.x?
cc @cmacknz

belimawr (Contributor) commented Jan 2, 2025

> @belimawr as this is a breaking change, do we really want to backport this to 8.x? cc @cmacknz

The plan is to backport only the state migration, which is not a breaking change, and to leave out the defaults change, which is the breaking part.

That's the plan: #41762 (comment)

belimawr added the Team:Elastic-Agent-Data-Plane label (Label for the Agent Data Plane team) Jan 2, 2025

elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Jan 2, 2025

mergify bot (Contributor, Author) commented Jan 2, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b mergify/bp/8.x/pr-41762 upstream/mergify/bp/8.x/pr-41762
git merge upstream/8.x
git push upstream mergify/bp/8.x/pr-41762

belimawr (Contributor) commented Jan 2, 2025

ef081b9 reverted the breaking change part of the original PR.

jlind23 requested review from a team and VihasMakwana and removed request for a team January 2, 2025 19:46

[source,yaml]
----
file_identity.fingerprint: ~

A Member commented:

Out of curiosity, what is the meaning of the tilde character here?

A Contributor replied:

It makes the file_identity.fingerprint object exist. I believe that sometimes, when the config is exported/dumped as YAML, it ends up as file_identity.fingerprint: null.

We need file_identity.fingerprint to be defined, but it has no attributes and cannot be a primitive type.

A Member replied:

Aha, thanks. I suspected this was some kind of placeholder, but I was unsure whether it was a YAML/parser thing. It seems to be just a project convention. TIL.
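
For context on the thread above: in YAML, `~` is one of the spellings of null, so `file_identity.fingerprint: ~` declares the key with a null value instead of leaving it out. The sketch below only illustrates the difference between "key present with a null value" and "key absent". It assumes the dotted key is expanded into nested objects, uses hand-written dictionaries because Python's standard library has no YAML parser, and `selected_identity` is a hypothetical helper, not Filebeat's actual selection logic.

```python
# Hand-written stand-ins for parsed configuration (assumption: the dotted
# key file_identity.fingerprint expands to a nested object, and YAML's `~`
# becomes None).
with_tilde = {"file_identity": {"fingerprint": None}}
without_key = {"file_identity": {}}


def selected_identity(cfg):
    """Return the single key defined under file_identity, or None."""
    keys = list(cfg.get("file_identity", {}))
    return keys[0] if len(keys) == 1 else None


print(selected_identity(with_tilde))   # prints "fingerprint"
print(selected_identity(without_key))  # prints "None"
```

The null value carries no settings; the key's presence alone is what distinguishes the two configurations.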

mauri870 (Member) left a comment

LGTM

mergify bot (Contributor, Author) commented Jan 6, 2025

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

jlind23 (Collaborator) commented Jan 6, 2025

@belimawr looks like the doc build is failing with this error which seems legit:
INFO:build_docs:asciidoctor: WARNING: invalid reference: filebeat-input-filestream-file-identity-fingerprint

Could you please take a look?

belimawr (Contributor) commented Jan 6, 2025

> @belimawr looks like the doc build is failing with this error which seems legit:
> INFO:build_docs:asciidoctor: WARNING: invalid reference: filebeat-input-filestream-file-identity-fingerprint
>
> Could you please take a look?

Yes, I'll take a look today.

belimawr merged commit 1fff0d3 into 8.x Jan 6, 2025
140 of 142 checks passed
belimawr deleted the mergify/bp/8.x/pr-41762 branch January 6, 2025 20:21
Labels: backport, conflicts (There is a conflict in the backported pull request), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)
4 participants