Reorganize file source part #95

Merged
WanYixian merged 12 commits into main from wyx/file-source-related
Nov 28, 2024

Conversation

WanYixian
Collaborator

@WanYixian WanYixian commented Nov 27, 2024

Description

Summarize file source part, document batch reading and data type mapping.

Related Code PR

risingwavelabs/risingwave#15358
risingwavelabs/risingwave#19561

Related Doc Issues

Resolve #51
Resolve #86

Preview

File source management: https://risingwavelabs-wyx-file-source-related.mintlify.app/ingestion/overview#file-source-management
Supported Parquet format: https://risingwavelabs-wyx-file-source-related.mintlify.app/ingestion/supported-sources-and-formats#parquet

@WanYixian
Collaborator Author

@wcy-fdu let me know if you have any comments, and please help provide the unsupported data types. Thanks!

delivery/overview.mdx Outdated
@@ -105,7 +105,7 @@ When creating an `upsert` sink, note whether or not you need to specify the prim
<Note>
**PUBLIC PREVIEW**

-Sink data in parquet encode is in the public preview stage, meaning it's nearing the final product but is not yet fully stable. If you encounter any issues or have feedback, please contact us through our [Slack channel](https://www.risingwave.com/slack). Your input is valuable in helping us improve the feature. For more information, see our [Public preview feature list](/changelog/product-lifecycle#features-in-the-public-preview-stage).
+Sink data in Parquet encode is in the public preview stage, meaning it's nearing the final product but is not yet fully stable. If you encounter any issues or have feedback, please contact us through our [Slack channel](https://www.risingwave.com/slack). Your input is valuable in helping us improve the feature. For more information, see our [Public preview feature list](/changelog/product-lifecycle#features-in-the-public-preview-stage).
Collaborator

@WanYixian Let's always refer to Parquet as a format. "Sinking data in Parquet format..."

delivery/overview.mdx Outdated
ingestion/overview.mdx Outdated
WanYixian and others added 2 commits November 28, 2024 09:25
Co-authored-by: hengm3467 <[email protected]>
Signed-off-by: IrisWan <[email protected]>
Co-authored-by: hengm3467 <[email protected]>
Signed-off-by: IrisWan <[email protected]>
@@ -63,6 +63,15 @@ FORMAT [ DEBEZIUM | UPSERT | PLAIN ] ENCODE AVRO (

Note that for `map.handling.mode = 'jsonb'`, the value types can only be: `null`, `boolean`, `int`, `string`, or `map`/`record`/`array` with these types.

### Bytes

RisingWave allows you to read data streams without decoding the data by using the `BYTES` row format. However, the table or source can have exactly one field of `BYTEA` data.
Collaborator

Is it possible to have an example here? I don't get this exactly one field part. :)

Collaborator Author

I'm trying to sort these formats in alphabetical order, so I moved this one up from further down 🤣
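
On the "exactly one field" question above: one reading of the quoted sentence is that a table or source using the `BYTES` format must declare a single column, and that column must be of type `BYTEA`. The following is a minimal Python sketch of that rule under that assumption. The function name `is_valid_bytes_schema` and the `(name, type)` pair representation are hypothetical and are not part of RisingWave.

```python
def is_valid_bytes_schema(columns):
    """Return True if the schema consists of exactly one BYTEA column.

    `columns` is a list of (column_name, type_name) pairs. This models one
    reading of the rule quoted in the diff; it is not RisingWave code.
    """
    return len(columns) == 1 and columns[0][1].strip().lower() == "bytea"


# A single BYTEA column satisfies the rule.
print(is_valid_bytes_schema([("payload", "bytea")]))
# A second column, or a single non-BYTEA column, does not.
print(is_valid_bytes_schema([("payload", "bytea"), ("ts", "timestamp")]))
print(is_valid_bytes_schema([("payload", "varchar")]))
```

If this reading is right, the undecoded message lands in that one `BYTEA` column and any parsing is left to downstream queries.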

Collaborator

@hengm3467 hengm3467 left a comment

Otherwise LGTM. Thanks!

ingestion/overview.mdx Outdated
| decimal | decimal |
| Int8 | Int16 |
| UInt8 | Int16 |
| UInt16 | Int32 |
Contributor

I think Int16 should be changed into smallint, Int32 -> int, Int64 -> bigint?
cc @xiangjinwu to confirm.

The right column for RisingWave shall use the names smallint, int, bigint, decimal, real, double precision.

The left column for Parquet shall also be consistent. Names shall be int32 and int64 rather than integer and long. (And seems int16 is missing?)

Contributor

For unsupported data types, I think Int96 and FIXED_LEN_BYTE_ARRAY; refer to https://parquet.apache.org/docs/file-format/types/

I'm not sure whether anything is missing. Can you help confirm?

| UInt16 | Int32 |
| UInt32 | Int64 |
| UInt64 | Decimal |
| Float16 | Float32 |
Contributor

ditto. Float32 -> real
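
The naming suggestions in this thread (Int16 -> smallint, Int32 -> int, Int64 -> bigint, Float32 -> real, and lowercase decimal) can be applied mechanically to the table rows quoted in the diff. The sketch below does that in Python. It covers only the rows visible in the quoted fragments, with the repeated UInt16 row listed once. It leaves the left column untouched, and the names `PROPOSED_RENAMES`, `DRAFT_ROWS`, and `apply_renames` are illustrative assumptions rather than anything from the RisingWave codebase.

```python
# Renames proposed in the review: draft name -> RisingWave SQL type name.
PROPOSED_RENAMES = {
    "Int16": "smallint",
    "Int32": "int",
    "Int64": "bigint",
    "Float32": "real",
    "Decimal": "decimal",
}

# Rows as quoted in the diff fragments: (file type, draft RisingWave type).
DRAFT_ROWS = [
    ("decimal", "decimal"),
    ("Int8", "Int16"),
    ("UInt8", "Int16"),
    ("UInt16", "Int32"),
    ("UInt32", "Int64"),
    ("UInt64", "Decimal"),
    ("Float16", "Float32"),
]


def apply_renames(rows):
    """Rewrite the right-hand column using the proposed SQL type names."""
    return [(src, PROPOSED_RENAMES.get(dst, dst)) for src, dst in rows]


RENAMED_ROWS = apply_renames(DRAFT_ROWS)

# Print the rows in the same pipe-delimited style as the docs table.
for src, dst in RENAMED_ROWS:
    print(f"| {src} | {dst} |")
```

Rows that are not visible in the quoted fragments, and the tentative unsupported types (Int96, FIXED_LEN_BYTE_ARRAY), are deliberately left out because the thread does not confirm them.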

@WanYixian WanYixian merged commit a595aa4 into main Nov 28, 2024
3 checks passed
@WanYixian WanYixian deleted the wyx/file-source-related branch November 28, 2024 07:01
4 participants