Avro files miss headers #16994
Comments
Thanks for the report @razumau! Yes, you are correct that the Fixing this would require a bit of an update to the encoding system within Vector which only deals with encoding individual events and not, for example, providing additional data to write at the beginning of each "batch". This same issue would come up with a
@jszwedko Hello! I'm afraid that, at the moment, this sink is almost useless in real use, because almost every tool or system I tried for processing the resulting file expects a schema or at least a header. I'm not an expert on the Avro standard, but is a header without the data schema itself a valid Avro file? If so, perhaps it is possible to add only a header to the Avro file the sink produces, so that the result can be read by standard libraries (in my case, attempts to process the resulting file included Python, Apache Spark and Apache Impala).
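On the question above: the Avro specification lists `avro.schema` as required file metadata, so a header without the schema would not be a valid object container file. The sketch below shows, using only the Python standard library, what wrapping already-encoded datums in a container header and a single data block could look like. It is an illustration of the file layout in the spec, not how Vector works or would implement this; the schema, the function names and the sample bytes are all hypothetical.

```python
import json
import os

# Hypothetical schema, for illustration only.
SCHEMA = {
    "type": "record",
    "name": "Log",
    "fields": [
        {"name": "message", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}


def encode_long(n: int) -> bytes:
    """Encode an Avro long: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Encode Avro bytes/string: length as a long, then the raw bytes."""
    return encode_long(len(b)) + b


def container_header(schema: dict, sync: bytes) -> bytes:
    """Magic, metadata map (schema and codec), then the 16-byte sync marker."""
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = bytearray(b"Obj\x01")
    out += encode_long(len(meta))  # one map block holding all entries
    for key, value in meta.items():
        out += encode_bytes(key.encode()) + encode_bytes(value)
    out += encode_long(0)  # a zero count ends the map
    out += sync
    return bytes(out)


def wrap_datums(schema: dict, datums: list, sync: bytes = None) -> bytes:
    """Put already-encoded datums into one data block after the header."""
    sync = sync or os.urandom(16)
    body = b"".join(datums)
    block = encode_long(len(datums)) + encode_long(len(body)) + body + sync
    return container_header(schema, sync) + block


# One headerless datum for SCHEMA: message="hi", count=3.
wrapped = wrap_datums(SCHEMA, [b"\x04hi\x06"])
```

The datums themselves are unchanged; only the header, the block counts and the sync markers are added around them, which is why the schema has to be known at the point where the file is opened for writing.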
I'm not an Avro expert, but my understanding is that the schema can be provided when decoding data files, so the output is usable, albeit less convenient. Ideally the schema would be included in the file itself too.
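To illustrate why the schema has to come from somewhere: a headerless Avro datum is just the field values written back to back, with no names or types. The sketch below decodes one such datum by hand with the Python standard library, given a schema supplied separately. It handles only `string` and `long` fields, and the schema and sample bytes are hypothetical, not taken from the attached gist.

```python
import io


def read_long(buf) -> int:
    """Decode one Avro long: variable-length base-128, then un-zigzag."""
    shift = 0
    result = 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("truncated Avro long")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(buf) -> str:
    """Decode one Avro string: length as a long, then UTF-8 bytes."""
    length = read_long(buf)
    return buf.read(length).decode("utf-8")


# Only the two primitive types needed for this sketch.
READERS = {"long": read_long, "string": read_string}


def decode_record(data: bytes, schema: dict) -> dict:
    """Decode a single record by reading fields in schema order."""
    buf = io.BytesIO(data)
    return {f["name"]: READERS[f["type"]](buf) for f in schema["fields"]}


# Hypothetical schema, provided out of band because the bytes carry none.
SCHEMA = {
    "type": "record",
    "name": "Log",
    "fields": [
        {"name": "message", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}

raw = b"\x04hi\x06"  # message="hi", count=3
record = decode_record(raw, SCHEMA)
```

Without `SCHEMA`, the same five bytes could not be interpreted at all, which is why tools that expect a self-describing file reject them.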
+👍 as this issue prevents using the generated files with Google BigQuery.
Problem
Avro files generated by sinks are missing the header.
According to Avro’s specification, Avro files should begin with a header that contains the schema. However, Avro files produced by the File or S3 sinks start with data and don’t have headers.
It seems to be happening because the only Avro-writing method used in Vector is to_avro_datum (vector/lib/codecs/src/encoding/format/avro.rs, line 73 in 6542778).
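A quick way to confirm the symptom on a given file: per the Avro specification, an object container file starts with the four bytes `O`, `b`, `j`, 1. The sketch below checks for that magic using only the Python standard library; the function names are made up for this example.

```python
AVRO_MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', followed by the byte 1


def has_container_header(data: bytes) -> bool:
    """Return True if the bytes start with the Avro container magic."""
    return data[:4] == AVRO_MAGIC


def file_has_container_header(path: str) -> bool:
    """Read the first four bytes of a file and check them for the magic."""
    with open(path, "rb") as f:
        return has_container_header(f.read(4))
```

A file that fails this check is not a container file, so readers that expect an embedded schema will not be able to open it.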
Am I missing something? Should there be more in my config? I’ve attached a minimal config that creates a small avro file. Avro files are in this gist.
Configuration
Version
0.28.1