Records Duplication in BigQuery when Streaming Data from Kafka Topic via kafka-connect-bigquery #369
Comments
Are you sure you don't have equivalent (not duplicate) data in the topic? For example, your producer could be sending the same record more than once, but at different offsets. Maybe you can use the InsertField transform to include offset/partition information.
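For reference, a minimal sketch of what that transform could look like in the sink connector configuration; the target field names (`kafka_offset`, `kafka_partition`, `kafka_topic`) are arbitrary placeholders, not required values:

```json
{
  "transforms": "insertKafkaMeta",
  "transforms.insertKafkaMeta.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.insertKafkaMeta.offset.field": "kafka_offset",
  "transforms.insertKafkaMeta.partition.field": "kafka_partition",
  "transforms.insertKafkaMeta.topic.field": "kafka_topic"
}
```

With these fields landing in BigQuery alongside each row, it becomes easy to tell whether two identical rows came from the same Kafka offset (connector-side re-delivery) or from different offsets (producer-side duplication).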
Yes. Initially I had the same concern that records might be duplicated on the producer's side, but upon further investigation I found that this is not the case. Each record is duplicated exactly once, i.e. it appears twice and never more. This suggests the duplication is being caused by the connector.
Does the console consumer show the same? Can you get offset information for each record via Connect transforms?
Do you mean in the Kafka Connect logs? How do we get that?
Yes, no duplicates exist in the records produced to the topic. I have verified this with Kafdrop and by writing a separate consumer that logs the record_id (the same set of record_ids that was duplicated in BigQuery) and offset values. Both methods showed no duplicates. When the producer creates records, each record receives a unique timestamp and record_id. However, I have noticed that the duplicated records are exactly identical, including the timestamp. It seems the same record is being inserted into BigQuery twice.
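A quick way to double-check this directly in BigQuery is a grouping query along these lines; the project/dataset/table and the record_id/timestamp column names are placeholders based on the description above, not the actual schema:

```sql
-- Placeholders: replace project/dataset/table and column names with the real ones.
SELECT
  record_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT `timestamp`) AS distinct_timestamps
FROM `my-project.my_dataset.my_table`
GROUP BY record_id
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

If `distinct_timestamps` is 1 for every duplicated record_id, the copies are identical re-inserts of the same Kafka record, which points at re-delivery on the sink side rather than duplicate production.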
@spd123 The BigQuery sink is an at-least-once connector, which means it guarantees to ingest each record at least once into the destination table. If you have verified that the topic contains unique data and there is only one BigQuery connector/writer consuming the topic, then duplication can still happen if record ingestion is re-attempted. A re-attempt is made when the same set of records is sent again because they were not committed on the first attempt. Could you check whether there are warn/error logs indicating commit failures?
We also saw a similar situation here several months back. It's extremely rare, but it did happen, and as a workaround a wrapping BigQuery view was unfortunately created to keep only the newest row version, using the PostgreSQL LSN to pick the latest. This was for sinking a Kafka topic originating from the Debezium PostgreSQL connector, stored in the simple UPSERT topic format (that is, Kafka key == table primary key, Kafka value == new row value, or a tombstone when deleting). The Debezium connector snapshots the PG table into Kafka and then follows the PG write-ahead log to stream subsequent updates, so we can safely assume that the most recent Kafka message with a particular key represents the latest row version in the underlying PG table. The BigQuery sink configuration at the time of the record duplication was as follows:
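The actual configuration was not preserved in this thread; the snippet below is only an illustrative sketch of an upsert/delete-enabled sink, using configuration keys from recent (2.x) versions of kafka-connect-bigquery, with placeholder names and values:

```json
{
  "name": "bq-sink-upsert",
  "config": {
    "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "topics": "my_upsert_topic",
    "project": "my-gcp-project",
    "defaultDataset": "my_dataset",
    "kafkaKeyFieldName": "kafka_key",
    "upsertEnabled": "true",
    "deleteEnabled": "true",
    "mergeIntervalMs": "60000"
  }
}
```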
This was before exactly-once support in Debezium, so obviously some duplication can be expected there. But I think that is not a problem, because the duplicates would all have the same Kafka key, and only the most recent message with a given Kafka key should count. @b-goyal also correctly points out that BigQuery streaming inserts are normally at-least-once, so the initial BigQuery inserts could also be subject to even more duplication. However, the key here is that the connector was configured with both upsert and delete support enabled, so it writes batches into a temporary intermediate table and then runs a MERGE statement into the target table. Notice the key USING block when SELECTing from the temporary intermediate table, which seems (to me) to clearly select at most one row from the source - the newest row. The remainder of the MERGE statement seems to clearly use it to update any existing row in the target table rather than insert a new one.
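The MERGE statement itself did not survive into this thread; the sketch below only illustrates the shape being described, with placeholder table and column names, and the statement the connector actually generates differs in detail:

```sql
-- Illustrative sketch only; table, column, and ordering-field names are placeholders.
MERGE `my_dataset.target` AS dst
USING (
  -- The USING block keeps at most one row per Kafka key: the newest one in the batch.
  SELECT newest.* FROM (
    SELECT ARRAY_AGG(x ORDER BY x.ingest_time DESC LIMIT 1)[OFFSET(0)] AS newest
    FROM `my_dataset.target_intermediate` AS x
    GROUP BY x.kafka_key.id
  )
) AS src
ON dst.id = src.kafka_key.id
WHEN MATCHED AND src.value IS NOT NULL THEN
  UPDATE SET dst.col1 = src.value.col1
WHEN MATCHED AND src.value IS NULL THEN
  DELETE
WHEN NOT MATCHED AND src.value IS NOT NULL THEN
  INSERT (id, col1) VALUES (src.kafka_key.id, src.value.col1);
```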
Thus, even though the intermediate table might contain duplicates due to the multiple issues mentioned above, they shouldn't make it into the final target table thanks to the de-duplication that the MERGE statement does. Yet unfortunately it did happen: duplicate rows for the same key ended up in the target table. This was some time ago, so I can't look up the exact logs any more, but the engineer on call did note at the time that the BigQuery sink was being rate-limited around the same period, for an extended duration, with a bunch of rate-limit errors.
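For reference, the de-duplicating view mentioned at the start of this comment would look roughly like the sketch below; the table and column names (including the Debezium source LSN field) are placeholders rather than the real schema:

```sql
-- Hedged sketch of the workaround view; `my_dataset.target`, `id`, and `source_lsn`
-- are placeholders.
CREATE OR REPLACE VIEW `my_dataset.target_latest` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    t.*,
    -- Keep only the newest row version per primary key, ordered by the PostgreSQL LSN.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY source_lsn DESC) AS rn
  FROM `my_dataset.target` AS t
)
WHERE rn = 1;
```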
Issue Description:
When streaming data from a Kafka topic to BigQuery using the kafka-connect-bigquery connector, I have observed instances of records being duplicated in the BigQuery table. This behavior appears to be inconsistent but recurring.
Steps to Reproduce:
Expected Behavior:
Records should be ingested into BigQuery without any duplication.
Actual Behavior:
Some records are duplicated in the BigQuery table during the streaming process.
Dependencies:
Kafka Connect BigQuery Connector Version: 1.6.6