Adding exercises by @ericaVoss and @clairblacketer to ETL chapters
schuemie committed Aug 23, 2019
1 parent a4abd72 commit 2182344
Showing 3 changed files with 88 additions and 1 deletion.
53 changes: 52 additions & 1 deletion ExtractTransformLoad.Rmd
@@ -25,7 +25,7 @@ Two closely-integrated tools have been developed to support the ETL design proce

To initiate an ETL process on a database you need to understand your data, including the tables, fields, and content. This is where the [White Rabbit](https://github.com/OHDSI/WhiteRabbit) tool comes in. White Rabbit is a software tool to help prepare for ETLs of longitudinal healthcare databases into the [OMOP CDM](https://github.com/OHDSI/CommonDataModel). White Rabbit scans your data and creates a report containing all the information necessary to begin designing the ETL. All source code and installation instructions, as well as a link to the manual, are available on GitHub [^whiteRabbitGithubUrl]. \index{White Rabbit} \index{data profiling|see {White Rabbit}}

-[^whiteRabbitGithubUrl]: https://github.com/OHDSI/WhiteRabbit].
+[^whiteRabbitGithubUrl]: https://github.com/OHDSI/WhiteRabbit.

#### Scope and Purpose {-}

@@ -382,3 +382,54 @@ The ETL process is a difficult one to master for many reasons, not the least of
- There are many ETL examples and agreed upon conventions you can use as a guide
```

## Exercises

```{exercise, exerciseEtl1}
Put the steps of the ETL process in the proper order:
A) Data experts and CDM experts together design the ETL
B) A technical person implements the ETL
C) People with medical knowledge create the code mappings
D) All are involved in quality control
```

```{exercise, exerciseEtl2}
Using OHDSI resources of your choice, spot four issues with the PERSON record shown in Table \@ref(tab:exercisePersonTable) (table abbreviated for space):
Table: (\#tab:exercisePersonTable) A PERSON table.
Column | Value
:---------------- |:-----------
PERSON_ID | A123B456
GENDER_CONCEPT_ID | 8532
YEAR_OF_BIRTH | NULL
MONTH_OF_BIRTH | NULL
DAY_OF_BIRTH | NULL
RACE_CONCEPT_ID | 0
ETHNICITY_CONCEPT_ID | 8527
PERSON_SOURCE_VALUE | A123B456
GENDER_SOURCE_VALUE | F
RACE_SOURCE_VALUE | WHITE
ETHNICITY_SOURCE_VALUE | NONE PROVIDED
```

```{exercise, exerciseEtl3}
Let us try to generate VISIT_OCCURRENCE records. Here is some example logic written for Synthea:
Sort the data in ascending order by PATIENT, START, END. Then, for each PERSON_ID, collapse claim lines as long as the time between the END of one line and the START of the next is <=1 day. Each consolidated inpatient claim is then considered one inpatient visit. Set:
- MIN(START) as VISIT_START_DATE
- MAX(END) as VISIT_END_DATE
- "IP" as PLACE_OF_SERVICE_SOURCE_VALUE
If you see a set of visits as shown in Figure \@ref(fig:exerciseSourceData) in your source data, how would you expect the resulting VISIT_OCCURRENCE record(s) to look in the CDM?
```

```{r exerciseSourceData, fig.cap='Example source data.',echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("images/ExtractTransformLoad/exerciseSourceData.png")
```

Suggested answers can be found in Appendix \@ref(Etlanswers).
36 changes: 36 additions & 0 deletions SuggestedAnswers.Rmd
@@ -128,6 +128,42 @@ cat(" OBSERVATION_PERIOD_ID PERSON_ID OBSERVATION_PERIOD_START_DATE ...



## Extract Transform Load {#Etlanswers}

#### Exercise \@ref(exr:exerciseEtl1) {-}

A) Data experts and CDM experts together design the ETL
C) People with medical knowledge create the code mappings
B) A technical person implements the ETL
D) All are involved in quality control

#### Exercise \@ref(exr:exerciseEtl2) {-}

Column | Value | Answer
:---------------- |:----------- |:-----------------------
PERSON_ID | A123B456 | This column has a data type of integer so the source record value needs to be translated to a numeric value.
GENDER_CONCEPT_ID | 8532 |
YEAR_OF_BIRTH | NULL | If we do not know the month or day of birth, we do not guess. A person can exist without a month or day of birth. If a person lacks a birth year, that person should be dropped. This person would have to be dropped because there is no year of birth.
MONTH_OF_BIRTH | NULL |
DAY_OF_BIRTH | NULL |
RACE_CONCEPT_ID | 0 | The race is WHITE, which should be mapped to 8527.
ETHNICITY_CONCEPT_ID | 8527 | No ethnicity was provided, so this should be mapped to 0.
PERSON_SOURCE_VALUE | A123B456 |
GENDER_SOURCE_VALUE | F |
RACE_SOURCE_VALUE | WHITE |
ETHNICITY_SOURCE_VALUE | NONE PROVIDED |
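
The four checks in the table above could be expressed as code along the following lines. This is only an illustrative sketch in Python: the function name `check_person` and its messages are hypothetical and not part of any OHDSI tool, and the rules are limited to the ones stated in the answer table.

```python
# Hypothetical helper: encodes only the four checks from the answer table above.
def check_person(record):
    """Return a list of issues found in a single PERSON record (a dict)."""
    issues = []
    # PERSON_ID has an integer data type in the CDM.
    if not isinstance(record["PERSON_ID"], int):
        issues.append("PERSON_ID must be an integer")
    # Month and day of birth may be missing, but year of birth may not.
    if record["YEAR_OF_BIRTH"] is None:
        issues.append("YEAR_OF_BIRTH is missing; the person should be dropped")
    # A source race of WHITE should map to concept 8527.
    if record["RACE_SOURCE_VALUE"] == "WHITE" and record["RACE_CONCEPT_ID"] != 8527:
        issues.append("RACE_CONCEPT_ID should be 8527 for WHITE")
    # No ethnicity provided should map to concept 0.
    if (record["ETHNICITY_SOURCE_VALUE"] == "NONE PROVIDED"
            and record["ETHNICITY_CONCEPT_ID"] != 0):
        issues.append("ETHNICITY_CONCEPT_ID should be 0 when none is provided")
    return issues


# The record from the exercise, with NULL represented as None.
person = {
    "PERSON_ID": "A123B456",
    "GENDER_CONCEPT_ID": 8532,
    "YEAR_OF_BIRTH": None,
    "MONTH_OF_BIRTH": None,
    "DAY_OF_BIRTH": None,
    "RACE_CONCEPT_ID": 0,
    "ETHNICITY_CONCEPT_ID": 8527,
    "PERSON_SOURCE_VALUE": "A123B456",
    "GENDER_SOURCE_VALUE": "F",
    "RACE_SOURCE_VALUE": "WHITE",
    "ETHNICITY_SOURCE_VALUE": "NONE PROVIDED",
}

for issue in check_person(person):
    print(issue)
```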

#### Exercise \@ref(exr:exerciseEtl3) {-}

Column | Value
:---------------- |:-----------
VISIT_OCCURRENCE_ID | 1
PERSON_ID | 11
VISIT_START_DATE | 2004-09-26
VISIT_END_DATE | 2004-09-30
VISIT_CONCEPT_ID | 9201
VISIT_SOURCE_VALUE | inpatient
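
The collapsing logic from the exercise can be sketched in code as follows. This is a minimal illustration in Python, not the actual Synthea ETL: the function name `collapse_visits` is hypothetical, and the input rows are invented for the example (they are not the rows shown in the exercise figure, although the rows for patient 11 were chosen to produce the same dates as the answer above). The sketch also compares each START against the running MAX(END) of the visit being built, which is an assumption made here to handle overlapping lines.

```python
from datetime import date


def collapse_visits(lines, max_gap_days=1):
    """Merge claim lines per patient when the gap between lines is <= max_gap_days."""
    visits = []
    # Sort in ascending order by PATIENT, START, END, as the exercise describes.
    for line in sorted(lines, key=lambda l: (l["PATIENT"], l["START"], l["END"])):
        last = visits[-1] if visits else None
        if (last is not None
                and last["PERSON_ID"] == line["PATIENT"]
                and (line["START"] - last["VISIT_END_DATE"]).days <= max_gap_days):
            # Same patient and close enough: extend the current visit (MAX(END)).
            last["VISIT_END_DATE"] = max(last["VISIT_END_DATE"], line["END"])
        else:
            # New visit; because of the sort, this START is the MIN(START).
            visits.append({
                "PERSON_ID": line["PATIENT"],
                "VISIT_START_DATE": line["START"],
                "VISIT_END_DATE": line["END"],
                "PLACE_OF_SERVICE_SOURCE_VALUE": "IP",
            })
    return visits


# Invented example rows (hypothetical, not taken from the figure).
claim_lines = [
    {"PATIENT": 11, "START": date(2004, 9, 28), "END": date(2004, 9, 30)},
    {"PATIENT": 11, "START": date(2004, 9, 26), "END": date(2004, 9, 27)},
    {"PATIENT": 12, "START": date(2004, 9, 29), "END": date(2004, 10, 1)},
]

for visit in collapse_visits(claim_lines):
    print(visit["PERSON_ID"], visit["VISIT_START_DATE"], visit["VISIT_END_DATE"])
```

The two lines for patient 11 are one day apart, so they merge into a single visit; the line for patient 12 stays separate because visits are never merged across patients.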


## Data Analytics Use Cases {#UseCasesanswers}
