Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG_BASED replication seems to wipe target table before loading #71

Open
dluo-sig opened this issue Jul 8, 2024 · 8 comments
Open

LOG_BASED replication seems to wipe target table before loading #71

dluo-sig opened this issue Jul 8, 2024 · 8 comments

Comments

@dluo-sig
Copy link

dluo-sig commented Jul 8, 2024

I'm using this in conjunction with target-snowflake, and I've noticed that the target table is always truncated before the new records are loaded in. I'm not sure where this behavior occurs, but I do not get this problem with INCREMENTAL. That said, it does seem to correctly pick up on records in the CDC log, though I can't confirm if the deletion is working as expected due to this issue.

@s7clarke10
Copy link
Collaborator

What variant of target-snowflake are you using. I would look at the github repository for target-snowflake and search for the key word truncate. I have not seen this problem with the version / variant that we use pipelinewise-target-snowflake.

I wonder if you have the replication_method set to truncate? Look at this setting.

https://github.com/mjsqu/pipelinewise-target-snowflake/blob/d9dfcdfc222d7b6552bce65ac971dfd029b0eef2/target_snowflake/db_sync.py#L71

And the documentation here. https://github.com/mjsqu/pipelinewise-target-snowflake/blob/d9dfcdfc222d7b6552bce65ac971dfd029b0eef2/README.md?plain=1#L183.

I don't believe this is a feature of tap-mssql.

@dluo-sig
Copy link
Author

dluo-sig commented Jul 8, 2024

I'm using the meltanolabs variant, which is default for meltano. Are you using the forked version that you linked, or is the wise variant listed on meltano hub the same? This other replication_method is a separate one on the target? I will take a look into that, thanks for the suggestions on where to investigate.

@dluo-sig
Copy link
Author

dluo-sig commented Jul 8, 2024

I just checked and the forked version that you linked is the only one that seems to have this replication_method parameter.

@s7clarke10
Copy link
Collaborator

s7clarke10 commented Jul 8, 2024

Yes, we are using the wise variant from the Meltano Hub. We have our own variant so we can keep it patched. The other key reason is we have a setting in tap-mssql and our variant of target-snowflake called use_singer_decimal which will transfer decimal and floating point data without possible truncation. Data is transferred as a string and then converted to the correct datatype as part of the loader. Without this we observed that there can be rounding or truncation of certain numerics.

In terms of the Meltano variant there is a hard_delete option - https://github.com/MeltanoLabs/target-snowflake/blob/b357c52574380345e9253eace99989cd35787ca0/README.md?plain=1#L30 .

I would check this setting and raise and issue on their GitHub for this if you still have an issue.

@dluo-sig
Copy link
Author

dluo-sig commented Jul 8, 2024

Yeah, it's defaulted to false. In any case, I will investigate further. Going to close this out unless I pinpoint the cause otherwise.

@dluo-sig dluo-sig closed this as completed Jul 8, 2024
@dluo-sig
Copy link
Author

Just want to gather some additional information here.. It seems like the cause of the behavior is actually the hard_delete parameter on the snowflake target. When disabled, it does a soft delete and the behavior seems correct. However, when set, it seems to process the activate_version message and delete everything as if it was a full load. It sounds like because the soft delete is working, that means all the right info is at least being sent.

Based on the meltano docs, it seems to suggest that, while not formalized, it should only be sent when replication mode is full sync. What I'm trying to understand is if the activate_version should be sent in the first place, or it the target is processing it incorrectly. I think I can also see another issue where even if the table doesn't get truncated by activate_version, the target would also need to know to process either _sdc_operation_type or _sdc_lsn_deleted_at to actually delete the target record, or is that already something that is standardized? Any insights?

@mjsqu
Copy link
Collaborator

mjsqu commented Jul 24, 2024

I can't promise to devote any time to this, but you are correct in your reading of the ACTIVATE_VERSION section of the documentation.

And the Log Based section of the documentation states:

targets have the information it needs to delete records in the destination.

However, unfortunately there don't seem to be any standards in place regarding the naming of deletion column names. Meltano suggests in Target Specific > Soft Delete that:

Targets that support this capability implement logic that soft deletes records in the destination, usually by populating a deleted_at timestamp field

In itself this is a bit confusing as targets aren't 'populating' the field, it is being set by the tap and should be interpreted by the target.

Things we could do in this tap:

  • Adopt the suggested deleted_at name for _sdc_lsn_deleted_at, or review other targets that use log-based (off the top of my head that would be Oracle, MongoDB,...) - immediately I found another tap that uses _sdc_deleted_at so this reinforces the lack of any standard ☹️
  • Remove the ACTIVATE_VERSION message from all but the FULL_TABLE sync mode

@dluo-sig
Copy link
Author

I wonder if we can leave the current delete column alone to be consistent with the rest of the columns within this tap, but introduce an optional parameter in the yml to override the delete column name. That way, it can supply whatever the target being used is expecting. Obviously, it would be good if there was a standard, but here we are.

The removal of ACTIVATE_VERSION seems to make sense if there's not a good reason to go against the docs.

@dluo-sig dluo-sig reopened this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants