[BUG] Duplicates in eff_sat for initial load #179
Comments
Thanks for this report! Just to let you know it has not gone unnoticed 😄
Hi and thanks for reporting this. We've had some internal discussions and believe, from the information you've provided here, that it is working as intended. A change in a PK/HK should mean a change in the record, so even though nothing depends on PK1, the fact that it is changing does mean something. This is our understanding of what you have provided:

Raw/primed stage data
Link
Eff Sat
I would suggest that your data model might potentially need re-thinking. Having said all of this, after working through the above example I am not sure how you would be configuring your models, or what the metadata would look like, to achieve what you have described. My guess above has a few issues. Let's continue to discuss this and see if we can get to the bottom of it.
I added a payload column to the example and specified what my dbtvault config would look like. The Attr column is dependent on the PK of the source file ([PK1, PK2]). But the source file is actually already a join between several tables from the source, so I really had to do a proper analysis to work out which concepts are in there, which are keys and which are attributes, and to which key each attribute belongs. That's how I found out that FK depends only on PK2, which is why I modeled [PK2, FK] as a separate link with an eff_sat on top of it (a rough sketch of such a configuration follows the list below). If that's completely wrong, that could be my mistake. I would have wanted to receive separate source files for the individual concepts, but unfortunately going back to the source is not an option.
Link1
Sat1
Link2
Eff Sat
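A rough, hypothetical sketch of what the eff_sat model over that second link could look like in dbtvault 0.9.x is shown below; the model and column names are illustrative rather than the reporter's actual metadata, with PK2 as the driving key and FK as the secondary key:

```sql
-- models/eff_sat_link2.sql -- hypothetical sketch only; names are illustrative
{{ config(materialized='incremental') }}

{%- set source_model = "v_stg_source_file" -%}
{%- set src_pk = "LINK2_HK" -%}           {# hash key over PK2 + FK #}
{%- set src_dfk = "PK2_HK" -%}            {# driving key: hub key for PK2 #}
{%- set src_sfk = "FK_HK" -%}             {# secondary key: hub key for FK #}
{%- set src_start_date = "START_DATE" -%}
{%- set src_end_date = "END_DATE" -%}
{%- set src_eff = "EFFECTIVE_FROM" -%}
{%- set src_ldts = "LOAD_DATETIME" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.eff_sat(src_pk=src_pk, src_dfk=src_dfk, src_sfk=src_sfk,
                    src_start_date=src_start_date, src_end_date=src_end_date,
                    src_eff=src_eff, src_ldts=src_ldts, src_source=src_source,
                    source_model=source_model) }}
```

The matching link model would be built from the same staged LINK2_HK / PK2_HK / FK_HK columns via the link macro.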
Hi 👋 - I've run into a very similar (if not the same) issue with regular satellites (sat macro) sourcing from a table at a different grain.

Describe the bug
Environment
To Reproduce
Expected behavior

Why I think this is happening

```sql
WITH source_data AS (
SELECT a.CLIENT_HK, a.CLIENT_CK, a.<client_attribute_columns>, a.EFFECTIVE_FROM, a.LOAD_DATETIME, a.SOURCE
FROM <prime_stage_for_projects_view> AS a
WHERE a.CLIENT_HK IS NOT NULL
),
latest_records AS (
SELECT a.CLIENT_HK, a.CLIENT_CK, a.LOAD_DATETIME
FROM (
SELECT current_records.CLIENT_HK, current_records.CLIENT_CK, current_records.LOAD_DATETIME,
RANK() OVER (
PARTITION BY current_records.CLIENT_HK
ORDER BY current_records.LOAD_DATETIME DESC
) AS rank
FROM <sat_for_clients_table> AS current_records
JOIN (
SELECT DISTINCT source_data.CLIENT_HK
FROM source_data
) AS source_records
ON current_records.CLIENT_HK = source_records.CLIENT_HK
) AS a
WHERE a.rank = 1
),
records_to_insert AS (
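    -- (annotation, not part of the generated SQL: a guess at the mechanism)
    -- The stage view is at a finer grain than CLIENT_HK, so one CLIENT_HK can arrive
    -- with several distinct attribute-column values; the DISTINCT below only removes
    -- exact full-row duplicates, and on an initial load latest_records is empty, so
    -- every staged row survives the anti-join and is inserted.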
SELECT DISTINCT stage.CLIENT_HK, stage.CLIENT_CK, stage.<client_attribute_columns>, stage.EFFECTIVE_FROM, stage.LOAD_DATETIME, stage.SOURCE
FROM source_data AS stage
LEFT JOIN latest_records
ON latest_records.CLIENT_HK = stage.CLIENT_HK
AND latest_records.CLIENT_CK = stage.CLIENT_CK
WHERE latest_records.CLIENT_CK IS NULL
)
SELECT * FROM records_to_insert;
```

On first run (or full-refresh), the

Workarounds
Questions

Hopefully this is enough detail to understand the issue. Thank you!
Hi! We believe this is related to the improvements we recently made to standard Satellites, which addressed a long-standing and fundamental restriction: the macros expected one instance of a key per batch, and if users don't meet that expectation, unpredictable behavior will occur. The recent improvement we've made to standard Satellites is something we're planning to roll out to the other satellite types, including Effectivity Satellites. I cannot 100% confirm that this is the root cause, but from everything contributed here I can make a fair assumption. All of our tests for Effectivity Satellites are currently passing; however, they assume the standard Data Vault 2.0 loading expectations (true delta feeds, only changes). We appreciate the ongoing discussion and information provided. We'll make sure to test the existing effectivity satellites using the information provided here to reproduce the issues being encountered, and of course add regression tests for these edge cases once a fix is in place.
An additional thought, @EliasAthey: are you including the collision key in your hash key? It looks like you're treating it as a composite PK instead. I'm not at my desk right now so I can't look into this further, but I suspect this would solve your issue. Whether or not the composite PK should be treated as unique (I think it should be, and you shouldn't be seeing this issue regardless) will need to be looked at as well.
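For illustration, a minimal sketch of a staging model in which the collision key is part of the hash key, following the dbtvault staging pattern; CLIENT_HK and CLIENT_CK come from the SQL earlier in the thread, while the source model and business key column are invented for the example:

```sql
-- models/stage/v_stg_clients.sql -- hypothetical sketch, not the reporter's actual metadata
{%- set yaml_metadata -%}
source_model: "raw_clients"        # illustrative source model
hashed_columns:
  CLIENT_HK:                       # hash key now includes the collision key
    - "CLIENT_ID"                  # illustrative business key column
    - "CLIENT_CK"                  # collision key
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict['source_model'],
                  derived_columns=none,
                  hashed_columns=metadata_dict['hashed_columns'],
                  ranked_columns=none) }}
```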
I've noticed the same thing on initial loads when building the effectivity sat from a stage view over the source table. It simply loads everything for the link hash key, even if the relationship hasn't changed. Our workaround has been to develop a dedicated raw stage model just to show distinct relationship changes over our source data and feed that to the eff_sat macro (not ideal). I'd have thought that bringing in some of the features of the standard sat macro, such as using a hashdiff so that only changes to the relationship are brought in and new records are only added when the hashdiff changes for the driving key, would be the way to go. That would remove the need to create a dedicated raw stage model just for housing data on a relationship change.
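A minimal sketch of that suggestion, assuming a staged feed that already carries a hashdiff over the link members and a driving-key column (all table and column names here are illustrative, not dbtvault's actual implementation):

```sql
-- Keep only rows where the relationship for a driving key actually changes,
-- so the effectivity satellite only ever sees genuine relationship changes.
WITH ranked AS (
    SELECT
        stg.DRIVING_KEY_HK,
        stg.LINK_HK,
        stg.HASHDIFF,                       -- hash over the link's member keys
        stg.LOAD_DATETIME,
        LAG(stg.HASHDIFF) OVER (
            PARTITION BY stg.DRIVING_KEY_HK
            ORDER BY stg.LOAD_DATETIME
        ) AS PREV_HASHDIFF
    FROM stage_view AS stg                  -- illustrative stage view
)

SELECT DRIVING_KEY_HK, LINK_HK, HASHDIFF, LOAD_DATETIME
FROM ranked
WHERE PREV_HASHDIFF IS NULL                 -- first record seen for this driving key
   OR HASHDIFF != PREV_HASHDIFF;            -- relationship changed
```

Feeding something like this to the eff_sat macro as its source model could make an initial load behave like a true delta feed, which is the loading expectation the maintainers describe above.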
Describe the bug
When I load an eff_sat with a driving key that is not the PK of the source table, I end up with duplicates. This only happens on an initial load.
Environment
dbt version: 1.2.0
dbtvault version: 0.9.2
Database/Platform: Databricks, using the snowflake macro
To Reproduce
Say I have a source model with 3 columns: PK1, PK2 and FK. PK1 and PK2 make up the PK of the model. FK is directly dependent only on PK2, so several rows can share the same PK2 + FK pair (for instance, two rows with PK2 = AAA and FK = ZZZ but different PK1 values).
Model a link and an eff_sat on PK2+FK
Expected behavior
In the link macro that loads from PK2 & FK, a DISTINCT happens, so AAA+ZZZ will only be loaded once. In the eff_sat macro there are DISTINCTs as well, but not in the right place, imo, so AAA+ZZZ will be loaded twice. When I put a DISTINCT in the source_data CTE on line 32, the duplicates are gone. Also, I think the other DISTINCTs (lines 74, 97, 123 and 144) can then be removed. (And maybe even change the UNIONs in the records_to_insert CTE to UNION ALL?)
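A small illustration of the scenario with invented rows and column names, showing roughly what a de-duplicating read of the source could look like:

```sql
-- Hypothetical source rows (PK1 + PK2 form the PK; FK depends only on PK2):
--   PK1 | PK2 | FK
--   1   | AAA | ZZZ
--   2   | AAA | ZZZ
-- Both rows map to the same PK2+FK pair, so the link macro's DISTINCT loads
-- AAA+ZZZ once, while the eff_sat's initial load reportedly inserts it twice.
-- A DISTINCT at the point the source is read (as suggested above) collapses them:
SELECT DISTINCT
    LINK2_HK,          -- hash of PK2 + FK (illustrative name)
    PK2_HK,            -- driving key
    FK_HK,             -- secondary key
    START_DATE,
    END_DATE,
    EFFECTIVE_FROM,
    LOAD_DATETIME,
    SOURCE
FROM stage_view;       -- illustrative stage view name
```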
AB#5368