Adding exercises by @ericaVoss and @clairblacketer to ETL chapters

ohdsi-japan · Aug 23, 2019 · 2182344
1 parent a4abd72

Showing 3 changed files with 88 additions and 1 deletion.
diff --git a/ExtractTransformLoad.Rmd b/ExtractTransformLoad.Rmd
@@ -25,7 +25,7 @@ Two closely-integrated tools have been developed to support the ETL design proce
 
 To initiate an ETL process on a database you need to understand your data, including the tables, fields, and content. This is where the [White Rabbit](https://github.com/OHDSI/WhiteRabbit) tool comes in. White Rabbit is a software tool to help prepare for ETLs of longitudinal healthcare databases into the [OMOP CDM](https://github.com/OHDSI/CommonDataModel). White Rabbit scans your data and creates a report containing all the information necessary to begin designing the ETL. All source code and installation instructions, as well as a link to the manual, are available on GitHub [^whiteRabbitGithubUrl]. \index{White Rabbit} \index{data profiling|see {White Rabbit}}
 
-[^whiteRabbitGithubUrl]: https://github.com/OHDSI/WhiteRabbit].
+[^whiteRabbitGithubUrl]: https://github.com/OHDSI/WhiteRabbit.
 
 #### Scope and Purpose  {-}
 
@@ -382,3 +382,54 @@ The ETL process is a difficult one to master for many reasons, not the least of
 - There are many ETL examples and agreed upon conventions you can use as a guide
 
 ```
+
+## Exercises
+
+```{exercise, exerciseEtl1}
+Put the steps of the ETL process in the proper order:
+  
+A) Data experts and CDM experts together design the ETL
+B) A technical person implements the ETL
+C) People with medical knowledge create the code mappings
+D) All are involved in quality control 
+
+```
+
+```{exercise, exerciseEtl2}
+Using OHDSI resources of your choice, spot four issues with the PERSON record shown in Table \@ref(tab:exercisePersonTable) (table abbreviated for space):
+
+Table: (\#tab:exercisePersonTable) A PERSON table.
+
+Column | Value
+:---------------- |:-----------
+PERSON_ID | A123B456
+GENDER_CONCEPT_ID | 8532
+YEAR_OF_BIRTH | NULL
+MONTH_OF_BIRTH | NULL
+DAY_OF_BIRTH | NULL
+RACE_CONCEPT_ID | 0
+ETHNICITY_CONCEPT_ID | 8527
+PERSON_SOURCE_VALUE | A123B456
+GENDER_SOURCE_VALUE | F
+RACE_SOURCE_VALUE | WHITE
+ETHNICITY_SOURCE_VALUE | NONE PROVIDED
+
+```
+
+```{exercise, exerciseEtl3}
+Let us try to generate VISIT_OCCURRENCE records.  Here is some example logic written for Synthea:
+Sort data in ascending order by PATIENT, START, END. Then, for each PERSON_ID, collapse claim lines as long as the time between the END of one line and the START of the next is <=1 day. Each consolidated inpatient claim is then considered one inpatient visit. Set:
+  
+- MIN(START) as VISIT_START_DATE
+- MAX(END) as VISIT_END_DATE
+- "IP" as PLACE_OF_SERVICE_SOURCE_VALUE
+
+If you see a set of visits as shown in Figure \@ref(fig:exerciseSourceData) in your source data, how would you expect the resulting VISIT_OCCURRENCE record(s) to look in the CDM?
+
+```
+
+```{r exerciseSourceData, fig.cap='Example source data.',echo=FALSE, out.width='100%', fig.align='center'}
+knitr::include_graphics("images/ExtractTransformLoad/exerciseSourceData.png")
+```
+
+Suggested answers can be found in Appendix \@ref(Etlanswers).
diff --git a/SuggestedAnswers.Rmd b/SuggestedAnswers.Rmd
@@ -128,6 +128,42 @@ cat("  OBSERVATION_PERIOD_ID PERSON_ID OBSERVATION_PERIOD_START_DATE ...
 
 
 
+## Extract Transform Load {#Etlanswers}
+
+#### Exercise \@ref(exr:exerciseEtl1) {-}
+
+A) Data experts and CDM experts together design the ETL
+C) People with medical knowledge create the code mappings
+B) A technical person implements the ETL
+D) All are involved in quality control 
+
+#### Exercise \@ref(exr:exerciseEtl2) {-}
+
+Column | Value | Answer
+:---------------- |:----------- |:-----------------------
+PERSON_ID | A123B456 | This column has a data type of integer so the source record value needs to be translated to a numeric value.
+GENDER_CONCEPT_ID | 8532 | 
+YEAR_OF_BIRTH | NULL | If we do not know the month or day of birth, we do not guess. A person can exist without a month or day of birth, but a person who lacks a birth year should be dropped. This person would have to be dropped because the year of birth is missing.
+MONTH_OF_BIRTH | NULL | 
+DAY_OF_BIRTH | NULL | 
+RACE_CONCEPT_ID | 0 | The race is WHITE, which should be mapped to 8527.
+ETHNICITY_CONCEPT_ID | 8527 | No ethnicity was provided, so this should be mapped to 0.
+PERSON_SOURCE_VALUE | A123B456 | 
+GENDER_SOURCE_VALUE | F | 
+RACE_SOURCE_VALUE | WHITE | 
+ETHNICITY_SOURCE_VALUE | NONE PROVIDED | 
+
+#### Exercise \@ref(exr:exerciseEtl3) {-}
+
+Column | Value
+:---------------- |:----------- 
+VISIT_OCCURRENCE_ID | 1
+PERSON_ID | 11
+VISIT_START_DATE | 2004-09-26
+VISIT_END_DATE | 2004-09-30
+VISIT_CONCEPT_ID | 9201
+VISIT_SOURCE_VALUE | inpatient
+
 
 ## Data Analytics Use Cases {#UseCasesanswers}
 

diff --git a/images/ExtractTransformLoad/exerciseSourceData.png b/images/ExtractTransformLoad/exerciseSourceData.png
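
Note on exercise `exerciseEtl3`: the collapsing rule described in the exercise text can be sketched in Python as below. This is only a minimal illustration under stated assumptions, not the Synthea ETL itself. The function name `collapse_inpatient_lines`, the `max_gap_days` parameter, and the sample rows are hypothetical, and the rows are not the data shown in `exerciseSourceData.png`. The sketch compares each line's START with the latest END seen so far in the current visit, which is one reasonable reading of "the END of one line and the START of the next".

```python
from datetime import date, timedelta
from itertools import groupby
from operator import itemgetter


def collapse_inpatient_lines(lines, max_gap_days=1):
    """Merge claim lines per patient into inpatient visits.

    Hypothetical helper: lines are dicts with PATIENT, START and END keys.
    A line joins the current visit when its START is at most max_gap_days
    after the latest END already in that visit.
    """
    max_gap = timedelta(days=max_gap_days)
    # Sort in ascending order by PATIENT, START, END, as the exercise states.
    ordered = sorted(lines, key=itemgetter("PATIENT", "START", "END"))
    visits = []
    for patient, group in groupby(ordered, key=itemgetter("PATIENT")):
        current = None  # visit being built for this patient
        for line in group:
            if current is not None and line["START"] - current["VISIT_END_DATE"] <= max_gap:
                # Close enough to the running visit: extend to MAX(END).
                current["VISIT_END_DATE"] = max(current["VISIT_END_DATE"], line["END"])
            else:
                # Gap too large (or first line): start a new visit.
                # Because lines are sorted, this START is MIN(START).
                current = {
                    "PERSON_ID": patient,
                    "VISIT_START_DATE": line["START"],
                    "VISIT_END_DATE": line["END"],
                    "PLACE_OF_SERVICE_SOURCE_VALUE": "IP",
                }
                visits.append(current)
    return visits


# Hypothetical sample rows, invented for illustration only.
example_lines = [
    {"PATIENT": 1, "START": date(2020, 1, 4), "END": date(2020, 1, 6)},
    {"PATIENT": 1, "START": date(2020, 1, 1), "END": date(2020, 1, 3)},
    {"PATIENT": 1, "START": date(2020, 1, 10), "END": date(2020, 1, 12)},
    {"PATIENT": 2, "START": date(2020, 2, 1), "END": date(2020, 2, 2)},
]
visits = collapse_inpatient_lines(example_lines)
```

With these sample rows, the first two lines for patient 1 are one day apart and collapse into a single visit, the third line starts four days later and becomes its own visit, and patient 2 keeps a single visit.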