out_s3: add description of parquet compression #1380

Open · wants to merge 12 commits into master
pipeline/outputs/s3.md (48 changes: 47 additions & 1 deletion)
@@ -40,7 +40,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
| sts\_endpoint | Custom endpoint for the STS API. | None |
| profile | Option to specify an AWS Profile for credentials. | default |
| canned\_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
| compression | Compression type for S3 objects. 'gzip' is currently the only supported value by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. | None |
| compression | Compression type for S3 objects. 'gzip' and 'parquet' are the only values supported by default. If Apache Arrow support was enabled at compile time, you can also use 'arrow'. If the [columnify](https://github.com/reproio/columnify) command is installed, records can also be compressed into the parquet format. For gzip compression, the Content-Encoding HTTP Header will be set to 'gzip'. Gzip and parquet compression can be enabled when `use_put_object` is 'on' or 'off' (PutObject and Multipart). Arrow compression can only be enabled with `use_put_object On`. A configuration error will be triggered for invalid combinations. The default is no compression. | None |
| content\_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. | None |
| send\_content\_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
| auto\_retry\_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | true |
@@ -49,6 +49,14 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
| storage\_class | Specify the [storage class](https://docs.aws.amazon.com/AmazonS3/latest/API/API\_PutObject.html#AmazonS3-PutObject-request-header-StorageClass) for S3 objects. If this option is not specified, objects will be stored with the default 'STANDARD' storage class. | None |
| retry\_limit | Integer value to set the maximum number of retries allowed. Note: this configuration option is available since versions 1.9.10 and 2.0.1. For previous versions, the number of retries is 5 and is not configurable. | 1 |
| external\_id | Specify an external ID for the STS API, can be used with the role\_arn parameter if your role requires an external ID. | None |
| parquet.compression | Compression type for parquet. 'uncompressed', 'snappy', 'gzip', and 'zstd' are the supported values; 'lzo', 'brotli', and 'lz4' are not supported for now. Values may also be specified in lower case and are converted to upper case automatically. | SNAPPY |
| parquet.pagesize | Page size of parquet format. Defaults to 8192 bytes (8KiB). | 8192 bytes |
| parquet.row\_group\_size | Row group size of parquet format. Defaults to 134217728 bytes (128MiB). | 134217728 bytes |
| parquet.record\_type | Format of the input records for parquet conversion. Defaults to json. | json |
| parquet.schema\_type | Format of the schema file used for parquet conversion. Defaults to avro. | avro |
| parquet.schema\_file | Specify path to schema file for parquet compression. | None |
| parquet.process\_dir | Specify a temporary directory for processing parquet objects. This parameter is effective only on non-Windows platforms. | /tmp |
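
As a quick illustration of the parquet-related keys, here is a minimal sketch of an output section with the documented defaults spelled out. The bucket name and schema path are placeholders, and numeric byte values are assumed to be accepted as-is:

```
[OUTPUT]
    Name                    s3
    Match                   *
    bucket                  your-bucket
    use_put_object          On
    compression             parquet
    parquet.compression     SNAPPY
    parquet.pagesize        8192
    parquet.row_group_size  134217728
    parquet.record_type     json
    parquet.schema_type     avro
    parquet.schema_file     /path/to/your-schema.avsc
    parquet.process_dir     /tmp
```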


## TLS / SSL

@@ -282,6 +290,44 @@ Example:

Then, the records will be stored into the MinIO server.

## Usage for Parquet Compression

For parquet compression, users need to install [columnify](https://github.com/reproio/columnify) on the running system or in the container at runtime.

After installing that command, out_s3 can handle parquet compression:

```
[OUTPUT]
    Name                 s3
    Match                *
    bucket               your-bucket
    use_put_object       true
    compression          parquet
    parquet.schema_file  /path/to/your-schema.avsc
    parquet.compression  snappy
```
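
Since `parquet.schema_type` defaults to avro, the file referenced by `parquet.schema_file` is typically an Avro schema. A minimal sketch of such a schema file follows; the record and field names are placeholders and must match the keys in your records:

```
{
  "type": "record",
  "name": "FluentBitRecord",
  "fields": [
    {"name": "timestamp", "type": "string"},
    {"name": "message", "type": "string"}
  ]
}
```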

### Build columnify Command

To build the columnify command, users can use a Golang development container as a builder stage:

```
# The golang:1-alpine tag always refers to the latest Go 1.x Alpine-based image.
FROM golang:1-alpine as builder

ENV ROOT=/go/src/cmd
WORKDIR ${ROOT}

RUN apk update && \
apk add git

RUN go install github.com/reproio/columnify/cmd/columnify@latest

FROM debian:bullseye-slim as production

# Put columnify command inside the PATH.
COPY --from=builder /go/bin/columnify /usr/bin/columnify
```
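
If you are not building a container image, a minimal sketch of installing columnify directly on a host is shown below, assuming a Go toolchain is installed and `$GOPATH/bin` (typically `~/go/bin`) is on the `PATH`:

```
# Install the columnify binary with the Go toolchain.
go install github.com/reproio/columnify/cmd/columnify@latest

# Confirm the binary is reachable so out_s3 can invoke it at runtime.
which columnify
```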

## Getting Started

In order to send records into Amazon S3, you can run the plugin from the command line or through the configuration file.
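
For example, a minimal command-line invocation might look like the following; the bucket name, region, and size/timeout values are placeholders for illustration only:

```
fluent-bit -i cpu -o s3 -p bucket=my-bucket -p region=us-east-1 -p use_put_object=On -p total_file_size=1M -p upload_timeout=1m
```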