when using encoding.codec = "raw_message" in conjunction with buffer.type = "disk", empty lines are output by sinks #21578

Open
vbmithr opened this issue Oct 22, 2024 · 3 comments
Labels
domain: buffers Anything related to Vector's memory/disk buffers sink: aws_s3 Anything `aws_s3` sink related type: bug A code related bug.

Comments

@vbmithr
vbmithr commented Oct 22, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Using the file sink, I get the output as expected.
Using the aws_s3 sink, empty files are uploaded.

When using the json encoding.codec, both sinks work fine (I have to use framing.method = "newline_delimited" in aws_s3 to get ndjson, whereas it seems to be the default for the file sink).
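For reference, the working JSON variant described above might look like the fragment below. This is only a sketch: it reuses the sink name, input, and region from the configuration in this report, and keeps the bucket redacted. Only `encoding.codec` differs from the failing setup.

```toml
# Sketch of the aws_s3 sink with the json codec, which is reported to work.
[sinks.aws]
type = "aws_s3"
inputs = ["create_msg"]
bucket = "<redacted>"
region = "eu-west-3"
encoding.codec = "json"
# Needed on aws_s3 to get one JSON object per line (ndjson).
framing.method = "newline_delimited"
```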

I'd like to be able to use the raw_message codec with the aws_s3 sink.

Thanks,

Configuration

data_dir = "/home/vb/.vector"
schema.log_namespace = true

[api]
enabled = true

[sources.fh]
type = "socket"
decoding.codec = "native"
mode = "unix_stream"
framing.method = "length_delimited"
framing.length_delimited.length_field_length = 4
framing.length_field_is_big_endian = true
path = "/tmp/fh.sock"

[transforms.create_msg]
type = "remap"
inputs = ["fh"]
source = '''
.message = join!(["{", "\"ts\":", to_string(to_unix_timestamp!(%fh.timestamp, unit: "microseconds")), ",\"data\":", ., "}"])
# .data = parse_json!(string!(.))
# .ts = %fh.timestamp
set_semantic_meaning(.message, "message")
'''

[sinks.split]
type = "file"
inputs = ["create_msg"]
path = "/home/vb/code/dm/ocaml/vector_out/{{ %fh.xch }}/{{ %fh.sym }}/{{ %fh.source }}/%F.raw"
encoding.codec = "raw_message"
encoding.timestamp_format = "unix_us"

# [sinks.console]
# type = "console"
# inputs = ["create_msg"]
# encoding.codec = "json"

[sinks.aws]
type= "aws_s3"
inputs = ["create_msg"]
bucket = "<redacted>"
compression = "zstd"
buffer.type = "disk"
buffer.max_size = 10737418240 # Max size of the buffer on disk (10 GiB)
batch.max_bytes = 1073741824 # Max batch size (1 GiB)
batch.timeout_secs = 60 # Upload every minute
filename_extension = "raw.zst"
encoding.codec = "raw_message"
framing.method = "newline_delimited"
encoding.timestamp_format = "unix_us"
key_prefix = "{{ %fh.xch }}/{{ %fh.sym }}/{{ %fh.source }}/%F/"
filename_time_format = "%T%.9f"
filename_append_uuid = false
auth.access_key_id = "<redacted>"
auth.secret_access_key = "<redacted>"
region = "eu-west-3"

Version

vector 0.42.0 (x86_64-unknown-linux-gnu)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@vbmithr vbmithr added the type: bug A code related bug. label Oct 22, 2024
@jszwedko
Member

Hi @vbmithr ,

I'm having trouble reproducing this. I slimmed down your example to:

schema.log_namespace = true

[api]
enabled = true

[sources.fh]
type = "stdin"

[transforms.create_msg]
type = "remap"
inputs = ["fh"]
source = '''
.message = encode_json({ "data": . })
set_semantic_meaning(.message, "message")
'''

[sinks.split]
type = "file"
inputs = ["create_msg"]
path = "/tmp/vector.txt"
encoding.codec = "raw_message"
encoding.timestamp_format = "unix_us"

[sinks.aws]
type= "aws_s3"
inputs = ["create_msg"]
bucket = "jszwedko-test"
batch.timeout_secs = 60 # Upload every minute
encoding.codec = "raw_message"
framing.method = "newline_delimited"
encoding.timestamp_format = "unix_us"
key_prefix = "1/"
filename_time_format = "%T%.9f"
filename_append_uuid = false
region = "us-west-1"

In my case I saw both the file written to disk and the file created in S3 have the same contents. Am I maybe missing something? Do you think you could come up with a more minimal example that reproduces?

@jszwedko jszwedko added the sink: aws_s3 Anything `aws_s3` sink related label Oct 24, 2024
@vbmithr
Author

vbmithr commented Oct 24, 2024

I can reproduce your example using my socket source instead, and indeed it works. I will try to bisect the issue by simplifying my config.

Edit: Found the issue. It seems to be caused by disk buffering. Both sinks were actually affected; I just had not set this up on the file sink.

buffer.type = "disk"
buffer.max_size = 10737418240 # Max size of the buffer on disk (10 GiB)

By adding this config to any sink, empty lines are output. This looks like a bug unless I'm missing something. The disk buffer looks mandatory to me in order to avoid losing events in case Vector is restarted.
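A possible minimal reproduction, pieced together from the slimmed-down config earlier in this thread plus the two buffer lines above, is sketched below. It has not been verified here. The `data_dir` value is an assumption (any writable directory should do, since disk buffers are stored under it), and the S3 sink is left out because the problem is reported to affect any sink.

```toml
# Hypothetical minimal reproduction: stdin -> remap -> file sink with a disk buffer.
data_dir = "/tmp/vector-data"   # assumption: a writable directory for the disk buffer
schema.log_namespace = true

[sources.fh]
type = "stdin"

[transforms.create_msg]
type = "remap"
inputs = ["fh"]
source = '''
.message = encode_json({ "data": . })
set_semantic_meaning(.message, "message")
'''

[sinks.split]
type = "file"
inputs = ["create_msg"]
path = "/tmp/vector.txt"
encoding.codec = "raw_message"
# The two lines below are the ones reported to trigger the empty output.
buffer.type = "disk"
buffer.max_size = 10737418240 # Max size of the buffer on disk (10 GiB)
```

If the report holds, removing the two `buffer.*` lines should restore the expected output in `/tmp/vector.txt`, while keeping them should produce empty lines.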

@vbmithr vbmithr changed the title aws_s3 sink inconsistent with file sink for same encoding settings when using encoding.codec = "raw_message" in conjunction with buffer.type = "disk", em Oct 24, 2024
@vbmithr vbmithr changed the title when using encoding.codec = "raw_message" in conjunction with buffer.type = "disk", em when using encoding.codec = "raw_message" in conjunction with buffer.type = "disk", empty lines are output by sinks Oct 24, 2024
@pront pront added the domain: buffers Anything related to Vector's memory/disk buffers label Oct 28, 2024
@jszwedko
Member

> Edit: Found the issue. It seems to be caused by disk buffering. Both sinks were actually affected; I just had not set this up on the file sink.
>
> buffer.type = "disk"
> buffer.max_size = 10737418240 # Max size of the buffer on disk (10 GiB)
>
> By adding this config to any sink, empty lines are output. This looks like a bug unless I'm missing something. The disk buffer looks mandatory to me in order to avoid losing events in case Vector is restarted.

Ah, that's interesting (and surprising!). I'll try that.

6 participants
@jszwedko @vbmithr @pront and others