
Duplicate IDs in Several Tables #1545

Open
katy-sadowski opened this issue Dec 27, 2024 · 4 comments
@katy-sadowski
What happened?

I generated a 1 million patient Synthea dataset following the Basic Setup instructions: https://github.com/synthetichealth/synthea/wiki/Basic-Setup-and-Running. I output to CSV files. Upon loading the CSVs into DuckDB and running some tests, I discovered duplicated primary keys in the following tables: imaging_studies, claims_transactions, claims, encounters.
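A check along these lines can be sketched with the standard library alone; the file names and the `Id` primary-key column are assumptions based on Synthea's CSV exporter, so adjust them to your output:

```python
import csv
from collections import Counter

def find_duplicate_ids(path, id_column="Id"):
    """Return {id: count} for IDs appearing more than once in a CSV file."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row[id_column]] += 1
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical usage against the affected tables:
# for table in ["encounters.csv", "claims.csv",
#               "claims_transactions.csv", "imaging_studies.csv"]:
#     print(table, find_duplicate_ids(table))
```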

Is this a known issue?

Environment

- OS: macOS 14.6.1
- Java: java 15.0.1

Relevant log output

No response

@dehall
Contributor

dehall commented Dec 27, 2024

No I don't think we've seen this recently. Are you able to pull just the rows with duplicated IDs and attach those here?

@katy-sadowski
Author

Sure! Files are attached. For the claims tables and encounters, you'll see the dupe IDs are associated with completely different records. For imaging, the records appear to be full dupes. Also there are hundreds of copies of some of the imaging dupes, whereas for claims and encounters each duplicated ID is associated with only 2 rows.

claims_transactions.csv
claims.csv
encounter.csv
imaging.csv

@dehall
Contributor

dehall commented Dec 30, 2024

Thanks! There are two separate issues here. The imaging dupes issue should be a trivial fix; that's a nested loop and we're using the wrong index in one spot. The duplicate IDs across different records are a trickier issue. I'm worried our ID generation approach just isn't random enough for a large dataset like this, so we'll have to think about how we can improve that.

@katy-sadowski
Author

Sounds good! Thanks so much for your quick reply! My use case for the million-patient dataset was testing ETL performance, so there's no huge rush on my end for that second improvement. Though it might be a good idea to assess the threshold above which unique ID generation starts to break down, and recommend that users not go above that number.
