simdutf: simdutf_connector: in_tail: Implement UTF-16LE/UTF-16BE encoder #9468

Open
wants to merge 24 commits into
base: master


@cosmo0920 cosmo0920 commented Oct 7, 2024

On Windows, many programs write UTF-16LE. This is because "Unicode" on Windows usually means UTF-16LE with a BOM (Byte Order Mark).
In addition, UTF-16LE/UTF-16BE differ from UTF-8 in many ways (for example, 16-bit code units, byte order, and surrogate pairs).
I added unit tests for the in_tail plugin that convert C, J, and subdivision-flag text from UTF-16LE/UTF-16BE to UTF-8 (the unicode_c.log, unicode_j.log, and unicode_subdivision_flags.log fixtures). in_tail is covered first because it is the main place where non-UTF-8 encodings are processed.
As a first step, we need to handle the UTF-16LE and UTF-16BE encodings.
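
To make the difference between UTF-16LE/UTF-16BE and UTF-8 concrete, here is a minimal, self-contained C sketch of the conversion. It is an illustration only, not the code in this PR and not the simdutf API; the names `read_u16` and `utf16_to_utf8` are hypothetical, and the input is assumed to have its BOM already stripped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit code unit in the given byte order (hypothetical helper). */
static uint16_t read_u16(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/*
 * Convert UTF-16 bytes (BOM already removed) to UTF-8.
 * Returns the number of bytes written, or (size_t)-1 if the input is
 * malformed (odd length, unpaired surrogate) or `out` is too small.
 */
static size_t utf16_to_utf8(const unsigned char *in, size_t in_len,
                            int big_endian,
                            unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (in_len % 2 != 0) {
        return (size_t)-1;
    }

    while (i < in_len) {
        uint32_t cp = read_u16(in + i, big_endian);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            uint16_t lo;
            if (i + 2 > in_len) {
                return (size_t)-1;
            }
            lo = read_u16(in + i, big_endian);
            if (lo < 0xDC00 || lo > 0xDFFF) {
                return (size_t)-1;
            }
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(lo - 0xDC00);
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1; /* low surrogate without a high one */
        }

        if (cp < 0x80) {                       /* 1-byte sequence */
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)cp;
        }
        else if (cp < 0x800) {                 /* 2-byte sequence */
            if (o + 2 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000) {               /* 3-byte sequence */
            if (o + 3 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
        else {                                 /* 4-byte sequence */
            if (o + 4 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

The emoji in the test fixtures lie outside the Basic Multilingual Plane, so they are stored as surrogate pairs in UTF-16 and become 4-byte sequences in UTF-8. A scalar loop like this one only shows the mapping; a SIMD library such as simdutf performs the same transformation much faster.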

Note that the simdutf library is written in C++, so we also provide an option (FLB_UNICODE_ENCODER) to turn this feature on or off.

Closes #9321


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
   flush           1
   log_level       trace

[INPUT]
   Name              tail
   Path              <path/to/non-UTF-8_encoded_file.log>
   Read_from_Head    True
   Unicode.Encoding  auto

[OUTPUT]
   Name  stdout
   Match *
  • Debug log output from testing the change
Fluent Bit v3.2.3
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _           _____  _____ 
|  ___| |                | |   | ___ (_) |         |____ |/ __  \
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`' / /'
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \  / /  
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /./ /___
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)_____/


[2024/12/19 14:42:52] [ info] Configuration:
[2024/12/19 14:42:52] [ info]  flush time     | 1.000000 seconds
[2024/12/19 14:42:52] [ info]  grace          | 5 seconds
[2024/12/19 14:42:52] [ info]  daemon         | 0
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  inputs:
[2024/12/19 14:42:52] [ info]      tail
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  filters:
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  outputs:
[2024/12/19 14:42:52] [ info]      stdout.0
[2024/12/19 14:42:52] [ info] ___________
[2024/12/19 14:42:52] [ info]  collectors:
[2024/12/19 14:42:52] [ info] [fluent bit] version=3.2.3, commit=de5ee981a2, pid=1225646
[2024/12/19 14:42:52] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/12/19 14:42:52] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/12/19 14:42:52] [ info] [simd    ] SSE2
[2024/12/19 14:42:52] [ info] [cmetrics] version=0.9.9
[2024/12/19 14:42:52] [ info] [ctraces ] version=0.5.7
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] initializing
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/12/19 14:42:52] [debug] [tail:tail.0] created event channels: read=25 write=26
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] adjusted buf_max_size to 4001
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] adjusted buf_chunk_size to 4001
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inotify watch fd=31
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170377 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log, inode 43170377
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log'
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170323 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [ info] [output:stdout:stdout.0] worker #0 started
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log, inode 43170323
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log'
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170324 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log, inode 43170324
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log'
[2024/12/19 14:42:52] [debug] [stdout:stdout.0] created event channels: read=35 write=36
[2024/12/19 14:42:52] [ info] [sp] stream processor started
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=123, records=1, input=tail.0
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=109, records=1, input=tail.0
[2024/12/19 14:42:52] [trace] [input chunk] update output instances with new chunk size diff=196, records=1, input=tail.0
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] [static files] processed 290b
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170377 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170377 watch_fd=1 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170323 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170323 watch_fd=2 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] inode=43170324 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log promote to TAIL_EVENT
[2024/12/19 14:42:52] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170324 watch_fd=3 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2024/12/19 14:42:52] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [task 0x6177b10] created (id=0)
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [debug] [task] created task=0x6177b10 id=0 OK
[2024/12/19 14:42:52] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] sqlerrorlog: [[1734586972.362419405, {}], {"log"=>"🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁷󠁬󠁳󠁿"}]
[1] sqlerrorlog: [[1734586972.388064603, {}], {"log"=>"用汉字在 Fluent Bit 中处理日志,就像是一个梦一样😀"}]
[2] sqlerrorlog: [[1734586972.389956708, {}], {"log"=>"にほんごテストログふぁいる。文字エンコーディングをUnicodeにできる!?☕😀⚪⚫🔴🔵🟠🟡🟢🟣🟤🇺🇸🇯🇵"}]
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [debug] [out flush] cb_destroy coro_id=0
[2024/12/19 14:42:52] [trace] [coro] destroy coroutine=0x6177db0 data=0x6177dd0
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:52] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2024/12/19 14:42:52] [debug] [task] destroy task=0x6177b10 (task_id=0)
[2024/12/19 14:42:52] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
^C[2024/12/19 14:42:53] [engine] caught signal (SIGINT)
[2024/12/19 14:42:53] [trace] [engine] flush enqueued data
[2024/12/19 14:42:53] [ warn] [engine] service will shutdown in max 5 seconds
[2024/12/19 14:42:53] [ info] [input] pausing tail.0
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [ info] [engine] service has stopped (0 pending tasks)
[2024/12/19 14:42:53] [ info] [input] pausing tail.0
[2024/12/19 14:42:53] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2024/12/19 14:42:53] [trace] [sched] 0 timer coroutines destroyed
[2024/12/19 14:42:53] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170377 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2024/12/19 14:42:53] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170377 watch_fd=1
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170323 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2024/12/19 14:42:53] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170323 watch_fd=2
[2024/12/19 14:42:53] [debug] [input:tail:tail.0] inode=43170324 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1225646== 
==1225646== HEAP SUMMARY:
==1225646==     in use at exit: 0 bytes in 0 blocks
==1225646==   total heap usage: 3,463 allocs, 3,463 frees, 1,050,521 bytes allocated
==1225646== 
==1225646== All heap blocks were freed -- no leaks are possible
==1225646== 
==1225646== For lists of detected and suppressed errors, rerun with: -s
==1225646== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1471

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from d1b404a to 4053bbd Compare October 7, 2024 07:13
@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from 4053bbd to 2a515ea Compare October 7, 2024 07:17
Conversions to UTF-8 from UTF-16LE and UTF-16BE, each with or without
a BOM, are supported.
This could be useful for Windows logs written as "Unicode",
which usually means UTF-16LE with BOM.

Signed-off-by: Hiroshi Hatake <[email protected]>
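
As an illustration of the "with BOM" cases in the commit message above, the following C sketch shows how a leading BOM could be used to tell UTF-16LE from UTF-16BE. It is an assumption-laden example, not the detection logic of this PR: the enum and the function `detect_utf16_bom` are hypothetical names, and input without a BOM is simply reported as undetermined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum utf16_bom {
    BOM_NONE = 0,   /* no UTF-16 BOM found; byte order is undetermined */
    BOM_UTF16LE,    /* buffer starts with FF FE */
    BOM_UTF16BE     /* buffer starts with FE FF */
};

/*
 * Inspect the first two bytes of `buf`.
 * On return, *skip holds the number of BOM bytes to drop before decoding.
 * Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch does not try
 * to tell the two apart.
 */
static enum utf16_bom detect_utf16_bom(const unsigned char *buf, size_t len,
                                       size_t *skip)
{
    static const unsigned char le[2] = {0xFF, 0xFE};
    static const unsigned char be[2] = {0xFE, 0xFF};

    assert(buf != NULL && skip != NULL);
    *skip = 0;

    if (len >= 2) {
        if (memcmp(buf, le, 2) == 0) {
            *skip = 2;
            return BOM_UTF16LE;
        }
        if (memcmp(buf, be, 2) == 0) {
            *skip = 2;
            return BOM_UTF16BE;
        }
    }
    return BOM_NONE;
}
```

When no BOM is present, the byte order has to come from somewhere else, for example an explicit setting instead of `auto`.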
…s not fully support C++11

Signed-off-by: Hiroshi Hatake <[email protected]>
Plus, wait relatively longer than in the ordinary test cases.
This is because these Unicode test cases need to read their contents from
the filesystem.

Signed-off-by: Hiroshi Hatake <[email protected]>
Labels
docs-required, ok-package-test (Run PR packaging tests)
Development

Successfully merging this pull request may close these issues.

Add support for reading files encoded in UTF-16 for Tail Input
5 participants