
Duplicate IDs in Several Tables #1545

Open
katy-sadowski opened this issue Dec 27, 2024 · 4 comments
@katy-sadowski
What happened?

I generated a 1 million patient Synthea dataset following the Basic Setup instructions: https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running. I output to CSV files. Upon loading the CSVs into DuckDB and running some tests, I discovered duplicated primary keys in the following tables: imaging_studies, claims_transactions, claims, encounters.
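A check along these lines can be sketched with the standard library alone; the file names and the `Id` primary-key column are assumptions based on Synthea's CSV exporter, so adjust them to your output:

```python
import csv
from collections import Counter

def find_duplicate_ids(path, id_column="Id"):
    """Return {id: count} for IDs appearing more than once in a CSV file."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row[id_column]] += 1
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical usage against the affected tables:
# for table in ["encounters.csv", "claims.csv",
#               "claims_transactions.csv", "imaging_studies.csv"]:
#     print(table, find_duplicate_ids(table))
```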

Is this a known issue?

Environment

- OS: macOS 14.6.1
- Java: java 15.0.1

Relevant log output

No response

@dehall
Contributor

dehall commented Dec 27, 2024

No I don't think we've seen this recently. Are you able to pull just the rows with duplicated IDs and attach those here?

@katy-sadowski
Author

Sure! Files are attached. For the claims tables and encounters, you'll see the dupe IDs are associated with completely different records. For imaging, the records appear to be full dupes. Also there are hundreds of copies of some of the imaging dupes, whereas for claims and encounters each duplicated ID is associated with only 2 rows.

claims_transactions.csv
claims.csv
encounter.csv
imaging.csv

@dehall
Contributor

dehall commented Dec 30, 2024

Thanks! There are two separate issues here. The imaging dupes issue should be a trivial fix; that's a nested loop and we're using the wrong index in one spot. The duplicate IDs across different records are a trickier issue. I'm worried our ID generation approach just isn't random enough for a large dataset like this, so we'll have to think about how we can improve that.

@katy-sadowski
Author

Sounds good! Thanks so much for your quick reply! My use case for the million-patient dataset was testing ETL performance, so there's no huge rush on my end for that second improvement. Though it might be a good idea to assess the threshold above which unique ID generation starts to break down, and recommend that users not go above that number.
