---
title: "C-PATH / JH guide to getting started with OMOP for CURE ID"
date-modified: last-modified
bibliography: references.bib
---
## Welcome
This site provides guidance, scripts, and resources for sites transitioning data to the OMOP common data model as part of CURE ID.
CURE ID is a platform designed and developed by the US Food and Drug Administration (FDA) and the National Institutes of Health (NIH) National Center for Advancing Translational Sciences (NCATS) to capture real-world data (RWD) about how existing drugs are used in new ways (e.g., drug repurposing) to treat diseases of high unmet clinical need. The documentation on this site is part of a project funded by the US Department of Health and Human Services (HHS) Assistant Secretary for Planning and Evaluation (ASPE)'s Patient-Centered Outcomes Research Trust Fund (under Interagency Agreement #75F40121S35006) to develop and disseminate tools facilitating the extraction of RWD from the electronic health record (EHR).
## Introduction
The goal of OMOP is to improve our understanding of human health by strengthening our ability to analyze observational health data.
Each health system stores its data in its own unique way, which limits the ability to collaborate and learn from larger populations. This matters because certain clinical conditions are infrequent, and a sufficient number of cases is needed to perform statistical tests. De-identified data shared across institutions holds the promise of allowing scientists to develop important insights about health, which is the motivation for enabling greater collaboration.
#### What is a common data model?
A common data model (CDM) allows multiple health care systems, each with its own databases, to join forces, creating larger populations that can power more compelling scientific studies. A simple way to think about this is the difficulty of finding the silverware drawer in a new kitchen: if there is a unifying rule for where the silverware drawer is located in every kitchen (immediately to the right of the sink), it becomes much easier to find. A common data model creates a system so that data elements like "systolic blood pressure" or "sepsis" are recorded and organized in the same way. OMOP stands for Observational Medical Outcomes Partnership and is the common data model of the [Observational Health Data Sciences and Informatics](https://www.ohdsi.org/) program.
#### What is an ETL?
ETL stands for Extract, Transform, and Load. [The Book of OHDSI, Chapter 6](https://ohdsi.github.io/TheBookOfOhdsi/ExtractTransformLoad.html) states:
"In order to get from the native/raw data to the OMOP Common Data Model (CDM) we have to create an extract, transform, and load (ETL) process. This process should restructure the data to the CDM, and add mappings to the Standardized Vocabularies, and is typically implemented as a set of automated scripts, for example SQL scripts. It is important that this ETL process is repeatable, so that it can be rerun whenever the source data is refreshed.
Creating an ETL is usually a large undertaking. Over the years, we have developed best practices, consisting of four major steps:
1. Data experts and CDM experts together design the ETL.
2. People with medical knowledge create the code mappings.
3. A technical person implements the ETL.
4. All are involved in quality control."
## Getting started
The code for the CURE ID project is hosted on [GitHub](https://github.com/). GitHub is a web-based platform that allows developers to host, review, and collaborate on code repositories. It is widely used for version control and source code management, and it provides features such as issue tracking, wikis, and project management tools to facilitate team collaboration.
The repository containing the code and concept sets can be accessed at [OHDSI/CureIdRegistry](https://github.com/OHDSI/CureIdRegistry). There are four branches within the repository that differ only in the clinical variables included in the cohort curation scripts. Be sure to select the appropriate branch based on your health system's data and ETL capacity:
1. Main branch: the fundamental set of variables for Registry submission, covering demographics, comorbidities, vitals, labs, medications, vaccinations, oxygen devices, viral assays, and COVID complications.
2. Refresh Basic: the main branch variables plus additional concepts to allow for Charlson Comorbidity Index calculation.
3. Refresh Advanced: the Refresh Basic variables plus additional concepts for Area Deprivation Index calculation (this branch requires geocoding patient addresses to determine Census Blocks).
4. Main CDM v5.3: the same set of variables as the main branch, but with cohort curation scripts written for OMOP CDM version 5.3 (four field names changed between versions 5.3 and 5.4).
## CURE ID technical support checklist
1. Core site team identified and technical kickoff call scheduled
2. Determine feasibility of using OHDSI tools
3. If using OHDSI tools is not possible, has the site determined a path forward, for example using scripts already developed by other sites?
4. Members of core team send proof of course completion to technical team for the following free courses:
- [Course: CURE ID](https://view.genially.com/667178efe55dd50015cd9de5)
- [OMOP CDM and Standardised Vocabularies (ehden.eu)](https://academy.ehden.eu/course/view.php?id=4)
- [Extract, Transform and Load (ehden.eu)](https://academy.ehden.eu/course/view.php?id=7)
5. Site Completes CURE ID Manual OMOP Data Mapping Template
6. The technical team reviews steps for performing ETL process by walking through GitHub scripts
7. Meetings with OHDSI experts and software contractors arranged, if needed
8. Site confirms that they are using most recent GitHub scripts before proceeding
9. Data Quality Dashboard (DQD) and profile scripts run, output sent to technical team
10. Data quality approved or issues identified and reviewed with technical team, re-run until resolved
11. Run CURE ID scripts (see OMOP Cohort Creation and De-identification Guide)
12. De-identified data exported to CSV files
13. Pre-Export Checklist and naming convention document instructions followed
14. Technical team reviews CSV files in live meeting before transfer
15. Any caveats in the data documented in Data Export Cover Sheet
16. Data transfer arrangements made by contacting C-Path
## OMOP Extract, Transform, and Load Guide
### [OHDSI ETL Resources]{.underline}
- [OHDSI Wiki: ETL Creation Best Practices](https://www.ohdsi.org/web/wiki/doku.php?id=documentation:etl_best_practices)
- [The Book of OHDSI, Chapter 6](https://ohdsi.github.io/TheBookOfOhdsi/ExtractTransformLoad.html)
- [EHDEN Academy: OMOP CDM and Standardised Vocabularies](https://academy.ehden.eu/enrol/index.php?id=4)
- [EHDEN Academy: Extract, Transform and Load](https://academy.ehden.eu/course/view.php?id=7)
- [EHDEN Academy: Introduction to Usagi & Code Mappings for an ETL](https://academy.ehden.eu/enrol/index.php?id=18)
### [OHDSI ETL Steps]{.underline}
1. **Find your people.**
    a. Assemble the ETL team: data experts, medical experts, and CDM/vocabulary experts
b. Assign persons with medical knowledge to create code mappings
c. Assign persons with technical knowledge to implement the ETL
d. Schedule regular meetings for quality control maintenance
2. **Understand your data.**
a. Take inventory of source data (tools: [White Rabbit](https://github.com/OHDSI/WhiteRabbit))
i. List of tables in database
ii. List fields in each table
iii. List distinct values of each field
iv. Summarize the frequency of each value of each field.
b. Define relationships between source data tables and fields (tools: [Rabbit-in-a-Hat](https://github.com/OHDSI/WhiteRabbit))
i. Record both relationship definition decisions and reasoning behind these decisions
c. Determine standardized vocabularies that already exist in your source data (ICD-10, CPT, HCPCS, LOINC, etc.)
i. Many commonly used standardized coding systems have already been mapped to the OMOP vocabulary -- utilize the work done by others across the OHDSI community to accelerate the process of mapping your data systems to OMOP.
d. Determine the coding systems in your source data that are proprietary or not already mapped to OMOP (proprietary patient/visit identifiers, account numbers, charge codes, etc.)
3. **Map your codes to OMOP concepts.**
a. Summarize the frequency of each code from your source data code sets that will require mapping to OMOP concepts.
b. Create your source data to OMOP mapping. (tools: [Usagi](https://github.com/OHDSI/Usagi))
i. Assign this task to the ETL team members with appropriate medical knowledge to discern which codes are most synonymous based on their descriptions. Medical knowledge is key for this step as semantic understanding of clinical descriptions is required to make decisions on mapping source data concepts to OMOP concepts.
ii. Start with the most frequently used codes and determine your threshold of code frequency to include in the mapping.
iii. Focus on a particular project or clinical domain to limit the scope of data needed for capture and conversion to OMOP.
iv. Utilize the Usagi tool for mapping suggestions and searching based on similarity of code descriptions.
4. **OMOPify your data!**
a. Determine the technology and software used or approved at your site for data storage, querying, and transfer. Determine the tools and technologies that team members and internal staff have expertise in.
b. Generate the OMOP DDLs. DDL (Data Definition Language) is a set of SQL commands used to define the structure of a database, including the creation, alteration, and deletion of tables and other database objects. In the context of the OMOP Common Data Model, generating DDLs involves creating the necessary SQL scripts to establish the standardized schema, ensuring that all data conform to the specified format and relationships within the OMOP CDM.
        i. Using R: <https://github.com/OHDSI/CommonDataModel>
        ii. Using SQL: <https://github.com/OHDSI/CommonDataModel/tree/v5.4.0/inst/ddl/5.4>
c. Implement the ETL.
i. Assign this task to the ETL team members with technical knowledge and permission for accessing and extracting source data and writing to a database.
ii. Many tools and technologies can be used for this step. Utilize the existing tools, technologies, and expertise of your ETL team and internal staff.
5. **Evaluate your data.**
a. Involve everyone in the evaluation and maintenance of ETL and data quality.
b. Review the ETL design documentation, computer code used in implementation, and concept mappings
c. Perform a manual audit using a sample of patients from both source data and derived OMOP data
d. Compare summary level counts of key fields between source data and OMOP derived data. (tools: Achilles)
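The inventory work in step 2 can be sketched in a few lines. The snippet below profiles a toy SQLite table in the spirit of a White Rabbit scan: listing fields and summarizing the frequency of each value. The table and column names are illustrative only.

```python
import sqlite3
from collections import Counter

def profile_table(conn, table):
    """Summarize each field of a table: distinct values and their frequencies.
    'table' is assumed to be a trusted identifier (never untrusted input)."""
    cur = conn.execute(f"SELECT * FROM {table} LIMIT 0")
    fields = [d[0] for d in cur.description]
    profile = {}
    for field in fields:
        rows = conn.execute(f"SELECT {field} FROM {table}").fetchall()
        profile[field] = Counter(r[0] for r in rows)
    return profile

# Tiny demonstration table standing in for a source EHR extract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id INT, loinc_code TEXT)")
conn.executemany("INSERT INTO lab_results VALUES (?, ?)",
                 [(1, "94500-6"), (2, "94500-6"), (3, "94309-2")])
summary = profile_table(conn, "lab_results")
print(summary["loinc_code"].most_common())
```

White Rabbit produces exactly this kind of per-field value/frequency report, at scale and with export to a scan document that Rabbit-in-a-Hat can consume.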
**★ Ask for help along the way!**
The CURE ID support team is here to help you. We also encourage you to connect with the OHDSI community to find others with experience, expertise, and guidance for each step of your ETL to OMOP journey. The open, collaborative community of OHDSI is the most powerful tool at your disposal. Use it (us)!
### [Places to Connect]{.underline}
- OHDSI.org: <https://www.ohdsi.org/>
- OHDSI Wiki: <https://www.ohdsi.org/web/wiki/doku.php>
- OHDSI Forums: <https://forums.ohdsi.org/>
- OHDSI Workgroups: <https://www.ohdsi.org/workgroups/>
## OMOP Cohort Creation and De-identification Guide
```{r, echo=FALSE}
message("Updated: 7/26/2023")
```
The following scripts are to be run on a site's full OMOP dataset in order to prepare the relevant data for sharing with the VIRUS registry. Each script should be run on the same server as the OMOP data but can be customized to run on the preferred database and schema.
**Instructions:** Replace the database name and schema in each of these scripts with your own, then run the cohort creation and deidentification scripts in the following sequence:
### [00 -- Concept Table Creation Script]{.underline}
**Filename**: [00_CURE_ID_create_concept_table.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/00_CURE_ID_create_concept_table.sql)
**Purpose**: This script creates a table of standard concepts required for the CURE ID Registry project. It is used in conjunction with the CONCEPT_ANCESTOR table in the 02_CURE_ID_All_Tables.sql script.
**Description**: Fields particularly important to the process are "is_standard" and "include_descendants".
"is_standard" determines the standardization of the concept, either: a "C", "S", or "N" -- C is for classification. This concept will not be used, but it may have useable descendants -- S is for Standard. These codes will be used. They may or may not have descendants -- N is Non-standard. These codes will not be used. If they have descendants
"include_descendants" determines whether the script should look for descendents -- Values are either TRUE or FALSE
They will be used in 02_CURE_ID_All_Tables.sql in the FROM clauses: Measurement example: INNER JOIN omop.CONCEPT_ANCESTOR ON descendant_concept_id = m.measurement_concept_id INNER JOIN \[Results\].\[cure_id_concepts\] ON ancestor_concept_id = concept_id WHERE domain = 'Measurement' AND (include_descendants = 'TRUE' OR ancestor_concept_id = descendant_concept_id)
If "include_descendants" is either 'TRUE' or if the ancestor_concept_id is the same as descendant_concept_id, the concept will be used. The "is_standard" field is informational only and does not participate in the script.
**Dependencies**: None
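As a rough illustration of how "include_descendants" drives concept selection, the Python sketch below mimics the join logic with in-memory stand-ins for the cure_id_concepts and CONCEPT_ANCESTOR tables. All concept IDs and hierarchy entries are made up for the example.

```python
# Hypothetical stand-ins; column names follow the script's usage.
cure_id_concepts = [
    # (concept_id, domain, include_descendants)
    (3027018, "Measurement", True),   # use this concept and its descendants
    (3004249, "Measurement", False),  # use only this exact concept
]
concept_ancestor = {
    # ancestor_concept_id -> descendant_concept_ids
    # (in OMOP, every concept is also listed as its own descendant)
    3027018: {3027018, 40762499},
    3004249: {3004249, 99999999},
}

def concepts_to_capture(domain):
    """Replicate the WHERE clause: keep descendants only when
    include_descendants is TRUE, otherwise keep just the concept itself."""
    keep = set()
    for concept_id, concept_domain, include_desc in cure_id_concepts:
        if concept_domain != domain:
            continue
        if include_desc:
            keep |= concept_ancestor[concept_id]
        else:
            keep.add(concept_id)
    return keep

print(sorted(concepts_to_capture("Measurement")))
```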
### [01 -- Cohort Creation Script]{.underline}
**Filename**: [01_CURE_ID_Cohort.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/01_CURE_ID_Cohort.sql)
**Purpose**: This script creates a cohort of patients for the CURE ID registry. The patient list is saved in the cohort table, along with other useful data elements.
**Description**: This SQL script creates a cohort of COVID-positive hospitalized patients based on specific criteria. The script performs several steps to identify and filter the patients before finally creating the cohort table. The script sets the context to use a specific database, but the actual name of the database is meant to be provided by the user.
**Dependencies**: None
**Steps**:
1. Create cohort table.
2. Identify patients (inpatient and outpatient) with COVID-positive lab results
    - Use OMOP concepts that represent the LOINC codes for SARS-CoV-2 nucleic acid tests
    - The concept IDs here represent LOINC or SNOMED codes for standard ways to code a positive lab result
3. Identify the first positive COVID test per patient (after January 1, 2020).
4. Limit to COVID-positive patients with inpatient encounters.
5. Apply all inclusion/exclusion criteria to identify all patients hospitalized with symptomatic COVID-19 up to 21 days after, or up to 7 days before, a positive SARS-CoV-2 test.
6. Find the inpatient encounter closest to the first positive SARS-CoV-2 test (for patients hospitalized more than once).
7. Account for edge cases where a patient has two hospitalizations the same number of absolute days from the SARS-CoV-2 test (e.g., a patient hospitalized separately 3 days before and 3 days after the test).
8. Create the cohort by adding on birth date and death date.
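The closest-encounter selection and tie-breaking in steps 6 and 7 can be illustrated in miniature. This sketch assumes ties are broken in favor of the later admission; the actual tie-break rule should be confirmed against 01_CURE_ID_Cohort.sql.

```python
from datetime import date

def closest_encounter(test_date, admission_dates):
    """Pick the admission closest in absolute days to the first positive
    test. On a tie (e.g. admissions 3 days before and 3 days after the
    test), the second tuple element prefers the later admission."""
    return min(admission_dates,
               key=lambda d: (abs((d - test_date).days), test_date - d))

test = date(2020, 7, 10)
stays = [date(2020, 7, 7), date(2020, 7, 13)]  # 3 days before and after
print(closest_encounter(test, stays))
```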
### [02 -- CURE ID Tables Script]{.underline}
**Filename**: [02_CURE_ID_All_Tables.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/02_CURE_ID_All_Tables.sql)
**Purpose**: This script takes your OMOP dataset and generates a copy of key tables that have been filtered down to only include people and records related to the CURE ID registry.
**Description**: Creates CURE_ID tables from the generated CURE_ID cohort.
**Dependencies**: - 01_CURE_ID_Cohort.sql
**Steps**:
1. Load Person table
2. Load Measurements table
3. Load Drug Exposure table
4. Load Death table
5. Load Observation data
6. Load Procedure Occurrence Table
7. Load Condition Occurrence Table
8. Load Visit Occurrence table
9. Load Device Exposure table
### [03 -- Replace Rare Conditions Script]{.underline}
**Filename**: [03_CURE_ID_replace_rare_conditions_with_parents.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/03_CURE_ID_replace_rare_conditions_with_parents.sql)
**Purpose**: Replace conditions occurring 10 or fewer times in the dataset with parent concepts that have at least 10 counts
**Description**: This script is run after scripts 01 and 02
**Dependencies**: - 01_CURE_ID_Cohort.sql - 02_CURE_ID_All_Tables.sql
**Steps**:
1. Condition roll up: concepts are mapped to their corresponding ancestor concept(s)
2. Create table that counts the ancestor concepts for each original concept
3. Create table that counts the original concepts
4. Filter to only include conditions that have more than 10 counts
5. Get just the most specific condition in the ancestor-descendant hierarchy
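The roll-up above can be sketched as follows. This simplified version uses a flat child-to-parent map instead of walking the CONCEPT_ANCESTOR hierarchy, and it does not re-check that the parent itself clears the 10-count threshold, as the real script must.

```python
from collections import Counter

# Hypothetical child-to-parent map standing in for CONCEPT_ANCESTOR.
PARENT = {"viral pneumonia (strain X)": "viral pneumonia"}

def roll_up_rare(condition_rows, threshold=10):
    """Replace any condition occurring <= threshold times with its parent
    concept; conditions above the threshold are kept as-is."""
    counts = Counter(condition_rows)
    return [c if counts[c] > threshold else PARENT.get(c, c)
            for c in condition_rows]

rows = ["hypertension"] * 12 + ["viral pneumonia (strain X)"] * 3
print(Counter(roll_up_rare(rows)))
```

The rare strain-specific diagnosis (3 occurrences) is replaced by its parent, while the common condition (12 occurrences) passes through unchanged.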
### [04 -- Deidentified Data DDL Script]{.underline}
**Filename**: [04_DE_ID_CDM_Table_ddl.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/04_DE_ID_CDM_Table_ddl.sql)
**Purpose**: Generate the necessary tables for the de-identified version of the CURE ID Cohort
**Description**: This script will create tables in the Results schema and prefix the table names with 'deident'. The prefix can be set to whatever value you desire.
**Dependencies**: None
**Steps**:
1. Create the Person table
2. Create the Death table
3. Create the Visit Occurrence table
4. Create the Drug Exposure table
5. Create the Device Exposure table
6. Create the Condition Occurrence table
7. Create the Measurement table
8. Create the Observation table
### [05 -- Deidentification Script]{.underline}
**Filename**: [05_DE_ID_script.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/05_DE_ID_script.sql)
**Purpose**: This script creates a copy of the Cohort and removes identifying characteristics to prepare the data for sharing with the VIRUS registry.
**Description**: Run this script to generate a deidentified copy of your target data. The following actions are performed: - Reassignment of Person IDs: Person IDs are regenerated sequentially from a sorted copy of the Person table. These new Person IDs are carried throughout the CDM to all tables that reference it.
- Date Shifting: Each person is assigned a random date shift value between -186 and +186 days. All dates for that person are then shifted by that amount.
- Birthdays: After date shifting a person’s birthday, the day is then set to the first of the new birth month. If the person would be \> 89 years old then they are assigned a random birth year that would make them 90-99 years old.
- Date Truncation: User-defined start and end dates are used to exclude any date-shifted data that falls outside the target date range (e.g., procedure and condition occurrences). Birth dates are not truncated.
- Removal of Other Identifiers: Other potentially identifying datapoints are removed from the dataset such as location_id, provider_id, and care_site_id
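The date-shifting and birthday rules above can be sketched in Python. The reference date and function name below are assumptions for illustration; the authoritative logic is in 05_DE_ID_script.sql.

```python
import random
from datetime import date, timedelta

REFERENCE_DATE = date(2023, 7, 26)  # assumed "today" for the age check

def deidentify_person(birth_date, event_dates, rng):
    """Sketch of the 05 script's date rules: one random shift in
    [-186, +186] days per person is applied to every date; the shifted
    birthday is snapped to the first of its month, and anyone over 89
    gets a random birth year making them 90-99 years old."""
    shift = timedelta(days=rng.randint(-186, 186))
    shifted_birth = (birth_date + shift).replace(day=1)
    age_years = (REFERENCE_DATE - shifted_birth).days // 365
    if age_years > 89:
        shifted_birth = shifted_birth.replace(
            year=REFERENCE_DATE.year - rng.randint(90, 99))
    return shifted_birth, [d + shift for d in event_dates]

rng = random.Random(0)  # seeded for a reproducible demo
birth, events = deidentify_person(date(1950, 6, 15), [date(2020, 7, 1)], rng)
print(birth, events)
```

Note that the same shift is reused for every date belonging to one person, which preserves intervals between that person's events while obscuring the calendar dates.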
**Dependencies**: - 01_CURE_ID_Cohort.sql - 02_CURE_ID_All_Tables.sql - 03_CURE_ID_replace_rare_conditions_with_parents.sql - 04_DE_ID_CDM_Table_ddl.sql
**Steps**:
1. Use find and replace to set source and target DB and Schema names
2. Load the OMOP Person table, and de-identify
3. Load the OMOP Visit Occurrence table, and de-identify
4. Load the OMOP Condition Occurrence table, and de-identify
5. Load the OMOP Procedure Occurrence table, and de-identify
6. Load the OMOP Drug Exposure table, and de-identify
7. Load the OMOP Observation table, and de-identify
8. Load the OMOP Death table, and de-identify
9. Load the OMOP Device Exposure table, and de-identify
10. Load the OMOP Measurement table, and de-identify
### [06 -- Quality Checks Script (optional)]{.underline}
**Filename**: [06_DE_ID_Quality_Checks.sql](https://github.com/OHDSI/CureIdRegistry/blob/main/Cohort%20curation%20scripts/06_DE_ID_Quality_Checks.sql)
**Purpose**: This script checks basic metrics for each table in the deidentified dataset to ensure the previous scripts were successful.
**Description**: This script runs a number of summary level quality checks for each table to audit basic data counts and date ranges.
**Dependencies**: - 01_CURE_ID_Cohort.sql - 02_CURE_ID_All_Tables.sql - 03_CURE_ID_replace_rare_conditions_with_parents.sql - 04_DE_ID_CDM_Table_ddl.sql - 05_DE_ID_script.sql
**Steps**:
1. Count distinct person_ids and find the maximum and minimum birthdates in the OMOP Person table.
2. Count distinct person_ids in the OMOP Death table.
3. Count distinct person_ids, count number of records per observation_concept_id, and find the maximum and minimum observation dates for all records in the OMOP Observation table.
4. Count distinct person_ids, count number of records per procedure_concept_id, and find the maximum and minimum procedure dates for all records in the OMOP Procedure Occurrence table.
5. Count distinct person_ids, count number of records per condition_concept_id, and find the maximum and minimum condition dates for all records in the OMOP Condition Occurrence table.
6. Count distinct person_ids, count number of records per measurement_concept_id, and find the maximum and minimum measurement dates for all records in the OMOP Measurement table.
7. Count distinct person_ids, count number of records per device_concept_id, and find the maximum and minimum device exposure dates for all records in the OMOP Device Exposure table.
8. Count distinct person_ids, count number of records per drug_concept_id, and find the maximum and minimum drug exposure dates for all records in the OMOP Drug Exposure table.
### [07 -- Cohort Profile Scripts]{.underline}
**Dependencies**: These scripts require the populated deidentified OMOP tables generated from the sequence of running scripts 1-5:
- 01_CURE_ID_Cohort.sql
- 02_CURE_ID_All_Tables.sql
- 03_CURE_ID_replace_rare_conditions_with_parents.sql
- 04_DE_ID_CDM_Table_ddl.sql
- 05_DE_ID_script.sql
##### 07-A – Condition Profile
**Filename**: 07_A_condition_profile.sql
**Purpose**: Generate a profile of condition prevalence in the final cohort.
**Description**: Condition counts are calculated per patient and are aggregated by parent concepts for each condition concept present in the final OMOP Condition Occurrence table.
##### 07-B – Measurement Profile
**Filename**: 07_B_measurement_profile.sql
**Purpose**: Generate a profile of measurement prevalence in the final cohort.
**Description**: Measurement counts are calculated per patient and are aggregated by parent concepts for each measurement concept present in the final OMOP Measurement table.
##### 07-C – Drug Exposure Profile
**Filename**: 07_C_drug_exposure_profile.sql
**Purpose**: Generate a profile of drug prevalence in the final cohort.
**Description**: Drug counts are calculated per patient and are aggregated by ingredient for each drug concept present in the final OMOP Drug Exposure table.
##### 07-D – Unmapped Drugs Profile
**Filename**: 07_D_review_unmapped_drugs.sql
**Purpose**: Generate a profile of drugs that are not mapped to drug_concept_ids in the final cohort.
**Description**: This file filters drugs that were unsuccessfully mapped to a drug_concept_id when running the 02_CURE_ID_All_Tables.sql script. Drug source values for which the drug_concept_id is “0” and that have at least 20 instances in the final cohort are aggregated for manual review. **Note:** Drug source values can contain PHI. Please review the output for PHI before sharing.
##### 07-E – Device Profile
**Filename**: 07_E_device_profile.sql
**Purpose**: Generate a profile of device prevalence in the final cohort.
**Description**: Device counts are calculated per patient and are aggregated by parent concepts for each device concept present in the final OMOP Device Exposure table.
## CURE ID Profile Scripts -- Output & Interpretation
This section provides synthetic examples of the output generated from the CURE ID condition, measurement, drug exposure, and device profile SQL scripts, as well as examples of how to interpret the output from each script.
### 07_A_condition_profile.sql
Generates a table showing condition prevalence in cohort by individual condition concept.
Example Interpretation
In the highlighted row, the table shows that 6,499 patients in the cohort have a recorded diagnosis of hypertension, representing 58% of the total cohort.
![Profile_script_7A](images/Profile_script_7A.png)
### 07_B_measurement_profile.sql
Generates a table which includes the measurement concepts included in the cohort and their names.
Example Interpretation
The highlighted row shows the distribution of the number of times heart rate was measured per patient in the cohort: 25% of patients had 17 or fewer heart rate measurements recorded, half had 34 or fewer (the median), 75% had 72 or fewer, and 95% had 398 or fewer. The four columns can also be interpreted as the median of the bottom 50% (percentile_25), the overall median (median), the median of the top 50% (percentile_75), and the median of the top 10% (percentile_95), in terms of how many times the measurement was taken for each patient in the cohort.
![Profile_script_7B](images/Profile_script_7B.png)
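To make the percentile columns concrete, here is a toy reconstruction in Python. The data are made up, and the profile SQL scripts may use a different percentile definition (e.g., PERCENTILE_CONT), so exact values can differ at small sample sizes.

```python
from statistics import quantiles

# Made-up per-patient measurement counts for one concept.
per_patient_counts = [5, 17, 20, 34, 40, 72, 100, 398]

# "inclusive" interpolates between observed values, like PERCENTILE_CONT.
q = quantiles(per_patient_counts, n=100, method="inclusive")
summary = {"percentile_25": q[24], "median": q[49],
           "percentile_75": q[74], "percentile_95": q[94]}
print(summary)
```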
### 07_C_drug_exposure_profile.sql
Generates a table of all drugs given in the cohort.
Example Interpretation
The highlighted row shows that acetaminophen was administered to 7,937 total patients which represents 68% of the patients in the cohort. Acetaminophen was administered an average of 2 times per patient with the highest recorded value of 26 administrations for a single patient.\
It is important to note that these are *ingredients* - which is why sodium chloride is the most frequent "drug" because most IV drugs contain sodium chloride as an ingredient.
![Profile_script_7C](images/Profile_script_7C.png)
### 07_E_device_profile.sql
Generates a table which includes the device concepts included in the cohort and their counts.
Example Interpretation
The highlighted row shows that an oxygen nasal cannula device was used for 9,238 patients in the cohort. That patient count represents 44% of the total cohort. Use of this device was recorded an average of 225.3 times per patient, with a maximum of 7,239 recordings for a single patient, when counting all flowsheet entries.
![Profile_script_7E](images/Profile_script_7E.png)
## OMOP Table and Field Basics
Adapted from OHDSI CDM Site: [Data Model Conventions](https://ohdsi.github.io/CommonDataModel/dataModelConventions.html)
#### Tables
For the tables of the main domains of the CDM it is imperative that concepts used are strictly limited to the domain. For example, the CONDITION_OCCURRENCE table contains only information about conditions (diagnoses, signs, symptoms), but no information about procedures. Not all source coding schemes adhere to such rules. For example, ICD-9-CM codes, which contain mostly diagnoses of human disease, also contain information about the status of patients having received a procedure. The ICD-9-CM code V20.3 'Newborn health supervision' defines a continuous procedure and is therefore stored in the PROCEDURE_OCCURRENCE table.
#### Fields
Variable names across all tables follow one convention:
| Notation | Description |
|------------------|------------------------------------------------------|
| \_SOURCE_VALUE | Verbatim information from the source data, typically used in ETL to map to CONCEPT_ID, and not to be used by any standard analytics. For example, CONDITION_SOURCE_VALUE = '787.02' was the ICD-9 code captured as a diagnosis from the administrative claim. |
| \_ID | Unique identifiers for key entities, which can serve as foreign keys to establish relationships across entities. For example, PERSON_ID uniquely identifies each individual. VISIT_OCCURRENCE_ID uniquely identifies a PERSON encounter at a point of care. |
| \_CONCEPT_ID | Foreign key into the Standardized Vocabularies (i.e., the standard_concept attribute for the corresponding term is true), which serves as the primary basis for all standardized analytics. For example, CONDITION_CONCEPT_ID = [31967](http://athena.ohdsi.org/search-terms/terms/31967) contains the reference value for the SNOMED concept of 'Nausea'. |
| \_SOURCE_CONCEPT_ID | Foreign key into the Standardized Vocabularies representing the concept and terminology used in the source data, when applicable. For example, CONDITION_SOURCE_CONCEPT_ID = [45431665](http://athena.ohdsi.org/search-terms/terms/45431665) denotes the concept of 'Nausea' in the Read terminology; the analogous CONDITION_CONCEPT_ID might be 31967, since SNOMED-CT is the Standardized Vocabulary for most clinical diagnoses and findings. |
| \_TYPE_CONCEPT_ID | Delineates the origin of the source information, standardized within the Standardized Vocabularies. For example, DRUG_TYPE_CONCEPT_ID can allow analysts to discriminate between 'Pharmacy dispensing' and 'Prescription written'. |
For more information, see:

- [Data Model Conventions](https://ohdsi.github.io/CommonDataModel/dataModelConventions.html#Data_Model_Conventions)
- [How to Calculate Drug Dose](https://ohdsi.github.io/CommonDataModel/drug_dose.html)
- [Clinical Data Tables](https://ohdsi.github.io/CommonDataModel/cdm53.html#Clinical_Data_Tables)
# Cure ID Concept Mapping Support
![](images/CureID_Concept_Mapping_Process.png)
#### Collaborative OMOP Concept Mapping Process
1. CureID clinical informaticists work with each site to manually map internal concepts to OMOP concepts outlined for capture in the project plan.
2. The combined concepts from all sites are maintained as a single, updated concept set using the OHDSI ATLAS open-source tool.
3. All concepts identified through the collaboration between Cure ID staff and each site, along with all corresponding descendant concepts, are exported from ATLAS as .json or .csv files for use in cohort creation, profiling, and clinical registry submission.
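The exported concept set can then be consumed programmatically in downstream steps. A minimal sketch, assuming the ATLAS concept-set expression JSON layout (the `items`/`concept`/`includeDescendants`/`isExcluded` field names are assumptions to verify against your ATLAS version):

```python
import json

# Read a concept set exported from ATLAS and collect the concept IDs,
# noting which ones are flagged to include descendant concepts.
atlas_export = json.loads("""
{
  "items": [
    {"concept": {"CONCEPT_ID": 31967, "CONCEPT_NAME": "Nausea"},
     "includeDescendants": true, "isExcluded": false},
    {"concept": {"CONCEPT_ID": 437663, "CONCEPT_NAME": "Fever"},
     "includeDescendants": false, "isExcluded": false}
  ]
}
""")

included = [i["concept"]["CONCEPT_ID"]
            for i in atlas_export["items"] if not i["isExcluded"]]
needs_descendants = [i["concept"]["CONCEPT_ID"]
                     for i in atlas_export["items"]
                     if i["includeDescendants"] and not i["isExcluded"]]
```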
# Identifying Oxygen Devices in Epic/Clarity
[Identifying Oxygen Devices Flowsheet Measures to Map to OMOP Concepts](https://userweb.epic.com/Thread/118440/Identifying-Oxygen-Devices-Flowsheet-Measures-to-Map-to-OMOP/)
# Geocoding and Integrating SDOH Data
To link geocoded addresses to the Neighborhood Atlas and obtain the Area Deprivation Index (ADI) for patients in a dataset, you will typically follow these steps:
1. **Prepare your dataset.** Ensure you have a dataset of patients with addresses, anonymized to protect patient privacy. Each address then needs to be geocoded: converting addresses into geographic coordinates (latitude and longitude).
2. **Geocode addresses.** Tools like the Google Maps API, ArcGIS, or OpenStreetMap can be used to geocode addresses, giving you the precise geographic location for each address. Check accuracy: incorrect geocoding leads to wrong ADI assignments.
3. **Understand the Neighborhood Atlas.** The ADI is a measure developed by the Neighborhood Atlas that provides a ranked, searchable list of neighborhoods by level of disadvantage. Understand the format of the Neighborhood Atlas data; it typically includes geographical identifiers such as census tracts.
4. **Link geocoded data to the Neighborhood Atlas.** Use the geographic coordinates to determine the corresponding census tract for each patient (with GIS software such as ArcGIS or QGIS), then cross-reference these tracts with the Neighborhood Atlas to find the ADI for each tract.
5. **Integrate ADI into your dataset.** For each patient, add the ADI corresponding to their census tract. This step integrates socioeconomic context into your patient data.
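Steps 4 and 5 reduce to a table join once each patient has been resolved to a census geography. A minimal sketch, assuming patients are already mapped to census block group FIPS codes and the Atlas file provides a national percentile rank (all column names and values here are illustrative):

```python
import pandas as pd

# Geocoded patients, already resolved to a census block group via GIS
# point-in-polygon matching (step 4).
patients = pd.DataFrame({
    "person_id": [1, 2],
    "fips_block_group": ["550250012001", "550250013002"],
})

# Illustrative Neighborhood Atlas extract: one ADI national rank per
# block group (1 = least disadvantaged, 100 = most disadvantaged).
adi = pd.DataFrame({
    "fips_block_group": ["550250012001", "550250013002"],
    "adi_natrank": [87, 12],
})

# Step 5: attach each patient's neighborhood ADI to their record.
linked = patients.merge(adi, on="fips_block_group", how="left")
```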
### Geocoding
The primary consideration when linking geospatial data such as social determinants of health and environmental risk factors with patient data is protecting PHI while geocoding patient home address and location data. Do not use web-based API services for geocoding, as they require patient addresses to be sent to an external server for batch geocoding. Geocoding requires software that can match patient addresses to the appropriate shapefiles to determine a precise location and output latitude/longitude coordinates. Two software options for local geocoding:

- Use ESRI ArcGIS or a similar program that can perform geocoding locally and has all appropriate shapefiles downloaded. This software is expensive, but many institutions already have a license in place.
- Install geocoding software locally (e.g., the open-source DeGauss geocoder: https://www.degauss.org/geocoder), either via manual installation or as a Docker container.
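As a sketch of the containerized option, the Docker invocation for a local DeGauss run can be assembled like this. The image tag and command-line shape are assumptions; check the DeGauss documentation for current usage:

```python
import shlex

# Build (but do not run) a Docker command for a local DeGauss geocoding pass.
# DeGauss mounts the working directory and writes the geocoded output CSV
# next to the input file, so no address data leaves the institution.
def degauss_command(csv_file: str,
                    image: str = "ghcr.io/degauss-org/geocoder:3.3.0") -> str:
    # Image name/tag is an assumed example, not a pinned recommendation.
    return f"docker run --rm -v $PWD:/tmp {image} {shlex.quote(csv_file)}"

cmd = degauss_command("patient_addresses.csv")
```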
### Area Deprivation Index
The Area Deprivation Index (ADI) is based on a measure created by the Health Resources & Services Administration (HRSA) over three decades ago, and has since been refined, adapted, and validated to the Census block group neighborhood level by Amy Kind, MD, PhD and her research team at the University of Wisconsin-Madison. It allows for rankings of neighborhoods by socioeconomic disadvantage in a region of interest (e.g., at the state or national level). It includes factors for the theoretical domains of income, education, employment, and housing quality. It can be used to inform health delivery and policy, especially for the most disadvantaged neighborhood groups. "Neighborhood" is defined as a Census block group.\
from: https://www.neighborhoodatlas.medicine.wisc.edu/
### ADI as an OHDSI Concept
ADI is not included as an explicit concept in any of the appropriate standard vocabularies. The OHDSI Vocabularies Team recommended that this information be included in the OMOP data as a custom concept in the Observation table. The workflow is simply to add a line item for Area Deprivation Index to the concept mapping table, like the following example:

- concept_id = 2000000999
- concept_code = NA
- concept_name = Area deprivation index
- domain = observation
- vocabulary = Custom
- is_standard = C
- include_descendants = False
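A minimal sketch of how such a custom concept might be stored and queried, using deliberately simplified CONCEPT and OBSERVATION schemas (real CDM tables have many more required columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Simplified stand-ins for the CDM CONCEPT and OBSERVATION tables.
con.execute("CREATE TABLE concept (concept_id INT, concept_name TEXT, "
            "domain_id TEXT, vocabulary_id TEXT)")
con.execute("CREATE TABLE observation (person_id INT, "
            "observation_concept_id INT, value_as_number REAL)")

# Register the custom ADI concept in the >2 billion custom-concept range.
con.execute("INSERT INTO concept VALUES "
            "(2000000999, 'Area deprivation index', 'Observation', 'Custom')")

# Store one patient's ADI value as an observation (value is illustrative).
con.execute("INSERT INTO observation VALUES (42, 2000000999, 87)")

row = con.execute("""
    SELECT o.person_id, c.concept_name, o.value_as_number
    FROM observation o
    JOIN concept c ON c.concept_id = o.observation_concept_id
""").fetchone()
```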
# OHDSI Tool Suite
The Edge Tool Suite is a set of OHDSI tools, deployed at each site, that provide value around the OMOP CDM. This work was funded by the CURE ID initiative (<https://cure.ncats.io>).
The OHDSI open-source software configured for deployment includes:
- The Atlas data science platform
- The WebAPI backend for Atlas
- The HADES statistical analysis packages
- The Data Quality Dashboard
- The Perseus ETL management system
#### Simplifying the ETL process
The OHDSI community has created a series of individual software packages to facilitate the ETL from proprietary EHRs to OMOP, evaluate data quality, define cohorts, and perform analyses. The "Edge Tool" packages these individual tools to facilitate the performance of an OMOP ETL and subsequent use of the data for defining cohorts for observational research. In contrast to registry approaches which ingest data represented in various data models and perform data harmonization centrally, software components of the "Edge Tool" facilitate ETL performance locally at the "edge." This suite of software aims to drastically reduce the labor and effort required to go from "zero to OMOP." We anticipate that institutions that use the full suite of offered software will be able to reduce the person-hours required for an OMOP ETL to as little as 50 hours.
#### Software components
The Edge Tool encompasses the Perseus ETL management solution, the HADES R analysis packages within an RStudio Server integrated development environment, and the ATLAS cohort discovery tool with WebAPI web services integration (Figure). Perseus's graphical user interface (GUI) provides source-to-concept mapping for the ETL, with assisted extraction of data from the EHR such as flowsheets (vital signs, nursing assessments), test measurements, and diagnoses. Rather than performing a series of SQL queries with wildcards to identify data elements of interest from primary source EHR tables, users may enter desired data element terms into a browser text field, which are then matched by term similarity to source table entries. Users may then evaluate the completeness and quality of the ETL using the Data Quality Dashboard, which performs \>3,000 individual data quality checks on the OMOP-formatted data and reports results through a web-based reporting system.
In tandem with Perseus, OHDSI HADES and OHDSI ATLAS are the two projects within the Edge Tool that allow for advanced analysis once data have been harmonized to the OMOP CDM, such as generating cohorts for research, patient-level prediction, treatment pathways, large-scale population analytics, automated reporting and, optionally, participation in OHDSI network studies. The OHDSI applications within the Edge Tool have been containerized using OHDSI Broadsea, allowing for even easier deployment. Current use of the Edge Tool has proven promising, and while limitations still exist (e.g., it is not currently capable of extracting data from unstructured fields such as notes or free text), further process optimization and tool development will reduce implementation time and effort further.
#### Ways to deploy the software
1. Cloud vendor software configured for use.
[OHDSI on Azure (Includes Perseus, Atlas, and Hades)](https://github.com/microsoft/OHDSIonAzure)
[OHDSI on AWS (Includes Atlas and Hades)](https://github.com/OHDSI/OHDSIonAWS)
2. Broadsea provides a set of Docker containers that lower the cost of implementation.
[Broadsea (Includes Atlas and Hades)](https://github.com/OHDSI/Broadsea)
3. Sites can compile the tools from the source repositories
#### OHDSI Specific
<https://github.com/OHDSI/CommonDataModel>
<https://github.com/OHDSI/Broadsea>
<https://github.com/OHDSI/Athena>
#### The ETL Process
<https://github.com/OHDSI/Perseus>
<https://github.com/OHDSI/WhiteRabbit>
<https://github.com/OHDSI/Usagi>
#### ATLAS and Cohort Discovery
<https://github.com/OHDSI/Atlas>
<https://github.com/OHDSI/WebAPI>
#### Broadsea
<https://www.youtube.com/watch?v=a9ZJURNRbUg>
#### Data Analysis
<https://github.com/OHDSI/Achilles>
<https://github.com/OHDSI/Hades>
<https://github.com/OHDSI/DataQualityDashboard>
# Best Practices
Data Quality Dos and Don'ts
- Do install the Data Quality Dashboard early on in the process.
- Don't include source values in your export - they might include PHI.
- Don't forget to check GitHub for the most recent version of the script before you run it.
- Don't send your data to the coordinating center until the tech team has a chance to review it with you in a live session.
- Duplicated results have been reported at some sites when running script 02_CURE_ID_ALL_Tables.sql, but no root cause has yet been found. Adding DISTINCT to the various SELECT statements may resolve the issue; there is also some concern about the latest updates to script 00_CURE_ID_create_concept_table.sql, since it includes standard concepts as well as their standard descendants. Review your data carefully and report any signs of duplication so they can be investigated.
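Until a root cause is found, a simple pre-submission check can flag fully identical rows for review. A minimal sketch (the table, column names, and values are illustrative, not actual script output):

```python
import pandas as pd

# Illustrative export of a CDM table; rows 0 and 1 are fully identical.
drug_exposure = pd.DataFrame({
    "person_id": [1, 1, 2],
    "drug_concept_id": [1125315, 1125315, 1125315],
    "drug_exposure_start_date": ["2021-01-01", "2021-01-01", "2021-01-05"],
})

# keep=False marks every member of a duplicate group, so all suspect rows
# can be reviewed and reported rather than silently dropped.
dupes = drug_exposure[drug_exposure.duplicated(keep=False)]
n_dupes = len(dupes)
```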
# Manuscripts
[Heavner SF, Anderson W, Kashyap R, Dasher P, Mathé EA, Merson L, Guerin PJ, Weaver J, Robinson M, Schito M, Kumar VK, Nagy P. A Path to Real-World Evidence in Critical Care Using Open-Source Data Harmonization Tools. Crit Care Explor. 2023 Apr 3;5(4):e0893. doi: 10.1097/CCE.0000000000000893. PMID: 37025303; PMCID: PMC10072311.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072311/pdf/cc9-5-e0893.pdf)
# Webinars
{{< video https://www.youtube.com/watch?v=H8Ur1xzVjS0 >}}
# OMOP Scripts for Epic
["The Spectrum Code"](https://userweb.epic.com/User/788dbdff-6712-4b16-aabc-98cdb1be4ff9): scripts and documentation created by Roger Carlson at Corewell Health
[RUMC OMOP transition scripts from Clarity, Caboodle](https://userweb.epic.com/User/788dbdff-6712-4b16-aabc-98cdb1be4ff9): Code created by Rush University Medical Center
# Using synthetic data
[MIMIC-IV](https://physionet.org/content/mimic-iv-demo-omop/0.9/1_omop_data_csv/#files-panel) 100-patient demo dataset based on MIMIC for creating an OMOP instance
[Synthea](https://github.com/OHDSI/ETL-Synthea) GitHub site with files to use the synthetic patient generator Synthea
[Eunomia](https://ohdsi.github.io/Eunomia/) Link to the Eunomia GitHub.io site with instructions and standard dataset files
# Useful Resources
[Web site for the OHDSI community](https://www.ohdsi.org/): Main site for the OHDSI community
[The Book of OHDSI:](https://ohdsi.github.io/TheBookOfOhdsi/) What is OHDSI and why should I care?
[EHDEN Academy](https://academy.ehden.eu/) : EHDEN Academy is a site with courses for developing skills working with OHDSI
[Tutorials & Workshops](https://www.youtube.com/playlist?list=PLpzbqK7kvfeXRQktX0PV-cRpb3EFA2e7Z) : Tutorial sessions 1-8 provide a comprehensive overview from vocabularies and creating cohorts to prediction
[OHDSI Forums](https://forums.ohdsi.org/) : Searchable, active user community
[OHDSI Community Dashboard](https://dash.ohdsi.org/) : Tracks publications, citations, researchers and activity within the OHDSI community
[Clinical Registry Efforts Within OHDSI (Sept. 13 Community Call)](https://www.youtube.com/watch?v=HhQ9x3bsv4o): Video discussing Perseus and Broadsea
[OHDSI-in-a-Box on GitHub](https://github.com/OHDSI/OHDSI-in-a-Box) : A learning environment created for the OHDSI community
[Integrating Flowsheet Data in OMOP Common Data Model for Clinical Research](https://arxiv.org/ftp/arxiv/papers/2109/2109.08235.pdf) Paper written by informatics teams at Stanford University and The Hospital for Sick Children
[Guide to privacy issues in OMOP journey](https://ohdsi.github.io/CommonDataModel/cdmPrivacy.html): Article outlining how to manage protected health information
[DQD](https://github.com/OHDSI/DataQualityDashboard) GitHub repository for Data Quality Dashboard tool
[PheKB](https://phekb.org/phenotypes) A phenotype repository
[Analyze observational patient data by using OHDSI with the OMOP CDM](https://learn.microsoft.com/en-us/azure/architecture/example-scenario/digital-health/patient-data-ohdsi-omop-cdm) Microsoft guide to OMOP CDM
[OHDSI on Azure GitHub](https://github.com/microsoft/OHDSIonAzure) Automation code and documentation for deploying OHDSI CDM in Azure
[OHDSI/Perseus GitHub site](https://github.com/OHDSI/Perseus) OHDSI/Perseus on GitHub
[How to develop capacity for observational research within a health system](https://www.ohdsi.org/2022showcase-79/) Presentation on building capacity from 2022 OHDSI Collaborator Showcase