Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for decoding built-in types from the replication stream #1

Open
imor opened this issue Apr 30, 2024 · 4 comments
Open

Add support for decoding built-in types from the replication stream #1

imor opened this issue Apr 30, 2024 · 4 comments

Comments

@imor
Copy link
Contributor

imor commented Apr 30, 2024

Support built-in types from the following table: https://www.postgresql.org/docs/current/datatype.html#DATATYPE-TABLE

Done Name Aliases Description
Yes bigint int8 signed eight-byte integer
? bigserial serial8 autoincrementing eight-byte integer
? bit [ (n) ]   fixed-length bit string
? bit varying [ (n) ] varbit [ (n) ] variable-length bit string
Yes boolean bool logical Boolean (true/false)
? box   rectangular box on a plane
Yes bytea   binary data (“byte array”)
Yes character [ (n) ] char [ (n) ] fixed-length character string
Yes character varying [ (n) ] varchar [ (n) ] variable-length character string
? cidr   IPv4 or IPv6 network address
? circle   circle on a plane
? date   calendar date (year, month, day)
? double precision float8 double precision floating-point number (8 bytes)
? inet   IPv4 or IPv6 host address
Yes integer int, int4 signed four-byte integer
? interval [ fields ] [ (p) ]   time span
? json   textual JSON data
? jsonb   binary JSON data, decomposed
? line   infinite line on a plane
? lseg   line segment on a plane
? macaddr   MAC (Media Access Control) address
? macaddr8   MAC (Media Access Control) address (EUI-64 format)
? money   currency amount
? numeric [ (p, s) ] decimal [ (p, s) ] exact numeric of selectable precision
? path   geometric path on a plane
? pg_lsn   PostgreSQL Log Sequence Number
? pg_snapshot   user-level transaction ID snapshot
? point   geometric point on a plane
? polygon   closed geometric path on a plane
? real float4 single precision floating-point number (4 bytes)
Yes smallint int2 signed two-byte integer
? smallserial serial2 autoincrementing two-byte integer
? serial serial4 autoincrementing four-byte integer
Yes text   variable-length character string
? time [ (p) ] [ without time zone ]   time of day (no time zone)
? time [ (p) ] with time zone timetz time of day, including time zone
Yes timestamp [ (p) ] [ without time zone ]   date and time (no time zone)
Yes timestamp [ (p) ] with time zone timestamptz date and time, including time zone
? tsquery   text search query
? tsvector   text search document
? txid_snapshot   user-level transaction ID snapshot (deprecated; see pg_snapshot)
? uuid   universally unique identifier
? xml   XML data
@cvauclair
Copy link

cvauclair commented Sep 18, 2024

Hey @imor, Im working on a project for which I want to use pg_replicate with a custom sink. However, my source DB contains a lot of types that are currently unsupported (most notably the numeric type).

I took a look around the codebase as well as the diesel codebase (the latter does support all built-in types) to see how we could extend the supported types.

Diesel essentially wraps the raw pgoutput bytes in a PgValue type, and implements the FromSql trait to convert the PgValue raw bytes to Rust types (BigDecimal in the case of the numeric postgres type).

Since those trait implementations parse the raw bytes, it seems like most of the FromSql implementations could be reused (if not outright copied) in your CdcEventConverter::from_tuple_data to support additional types but I wanted to check in with you to see what you thought of the idea.

Love the project, keep up the good work!

@imor
Copy link
Contributor Author

imor commented Sep 19, 2024

Your ideas are directionally correct. Just want to point out that apart from CdcEventConverter::from_tuple_data there's also TableRowConverter::get_cell_value which is called when initial table is copied before CdcEvent start arriving.

Apart from the conversions themselves you might face another hurdle if you are implementing a custom sink. Methods in the BatchSink (and the older Sink) traits hard code the error type as SinkError which might not work for a custom sink impl outside the pg_replicate crate. There's an issue for this but haven't found the time to tackle it yet.

@cvauclair
Copy link

Yes I noticed pretty early there were two converters to handle 😅

Quick question about this, is there a reason why the input of both converters is different? As in, why are the bytes in CdcEventConverter::from_tuple_data are parseable with a simple from_utf8 (even the for the numeric type) while the bytes in TableRowConverter::get_cell_value require more elaborate decoding? (sorry if this is obvious, pretty new to this low-level postgres stuff).

On the SinkError issue, I noticed that there is already a fork out there which does implement the error as an associated type (I inspired myself from that fork for my own). Is that something you would consider merging if I opened a PR for it?

@imor
Copy link
Contributor Author

imor commented Sep 20, 2024

is there a reason why the input of both converters is different?

The reason for two separate code paths for decoding tuple values is because they are coming in from two different operations. When we want to replicate a table we want two things: the data in the table as it exists right now and any changes to the table data as they happen in realtime. The TableRowConverter::get_cell_value code path is for the initial table copy (data as it exists in the table right now) and CdcEventConverter::from_tuple_data is for the realtime changes to the table data. If you dig into the types of the input for these two operations (BinaryCopyOutRow and TupleData) you'd notice that the first one is a binary format and the second one is a text format (although clients can request a binary format but it's less portable as mentioned in the Postgres docs). That's another reason for the difference.

Is that something you would consider merging if I opened a PR for it?

PRs are always welcome to make the code better :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants