Duplicate IDs in Several Tables #1545
Comments
No, I don't think we've seen this recently. Are you able to pull just the rows with duplicated IDs and attach those here?
Sure! Files are attached. For the claims tables and encounters, you'll see the dupe IDs are associated with completely different records. For imaging, the records appear to be full dupes. Also, there are hundreds of copies of some of the imaging dupes, whereas for claims and encounters each duplicated ID is associated with only 2 rows.
Attached: claims_transactions.csv
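For anyone who wants to reproduce the split described above (full dupes versus the same ID on different records), here is a minimal sketch using only the Python standard library. The function names and the default ID column name `Id` are assumptions for illustration, not part of Synthea; pass whatever header your export actually uses.

```python
import csv
from collections import defaultdict


def classify_duplicate_ids(rows, id_column="Id"):
    """Group rows by ID and label each ID that occurs more than once.

    rows: iterable of dicts (for example from csv.DictReader).
    Returns {id: (kind, copy_count)} where kind is "identical" when every
    copy of the row has the same content, and "conflicting" otherwise.
    Assumes well-formed rows whose values are plain strings.
    """
    groups = defaultdict(list)
    for row in rows:
        # frozenset of (column, value) pairs makes rows comparable and hashable
        groups[row[id_column]].append(frozenset(row.items()))

    result = {}
    for key, copies in groups.items():
        if len(copies) > 1:
            kind = "identical" if len(set(copies)) == 1 else "conflicting"
            result[key] = (kind, len(copies))
    return result


def classify_csv(path, id_column="Id"):
    """Convenience wrapper (hypothetical helper): classify one CSV file."""
    with open(path, newline="") as handle:
        return classify_duplicate_ids(csv.DictReader(handle), id_column)
```

In the terms used in this thread, the imaging rows would be expected to come back as "identical" with large copy counts, and the claims and encounters rows as "conflicting" with a count of 2.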
Thanks! There are two separate issues here. The imaging dupes issue should be a trivial fix: it's a nested loop and we're using the wrong index in one spot. The duplicate IDs across different records are trickier. I'm worried our ID generation approach just isn't random enough for a large dataset like this, so we'll have to think about how we can improve that.
Sounds good! Thanks so much for your quick reply! My use case for the million-patient dataset was testing performance on an ETL, so there's no huge rush on my end for that second improvement. Though it might be a good idea to assess the threshold above which unique ID generation starts to break down, and to recommend that users stay below that number.
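One rough way to estimate such a threshold is the birthday bound, which approximates the chance of at least one collision among n uniformly random IDs drawn from a space of N values as 1 - exp(-n(n-1)/(2N)). The sketch below is only that approximation; the thread does not say how many bits of randomness Synthea's IDs carry, so `random_bits` is a placeholder you would have to fill in, and both function names are made up for illustration. If the generator is less random than a uniform draw, the real collision risk is higher than this estimate.

```python
import math


def collision_probability(n_ids, random_bits):
    """Birthday-bound approximation of P(at least one collision).

    Assumes IDs are drawn uniformly from 2**random_bits possible values.
    """
    space = 2.0 ** random_bits
    # -expm1(-x) computes 1 - exp(-x) accurately for small x
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def max_ids_for_risk(random_bits, max_probability):
    """Approximate number of IDs that keeps collision risk at max_probability.

    Inverts the bound using n*(n-1) ~ n**2, so treat the result as a
    rough guide rather than an exact limit.
    """
    space = 2.0 ** random_bits
    return int(math.sqrt(2.0 * space * -math.log1p(-max_probability)))
```

For example, `max_ids_for_risk(64, 0.001)` gives a ballpark for how many IDs a 64-bit uniformly random scheme could issue before the collision risk reaches 0.1%. Keep in mind that tables such as claims_transactions contain many rows per patient, so n is the number of generated IDs, not the number of patients.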
What happened?
I generated a 1 million patient Synthea dataset following the Basic Setup instructions: https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running. I output to CSV files. Upon loading the CSVs into DuckDB and running some tests, I discovered duplicated primary keys in the following tables: imaging_studies, claims_transactions, claims, encounters.
Is this a known issue?
Environment
Relevant log output
No response