Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instructions for using sqlite DDLs for non-R users #693

Open
barabo opened this issue May 28, 2024 · 3 comments
Open

Instructions for using sqlite DDLs for non-R users #693

barabo opened this issue May 28, 2024 · 3 comments

Comments

@barabo
Copy link

barabo commented May 28, 2024

Hi there!

I've been playing around trying to get the DDLs loaded for sqlite3, version 5.4 using the sqlite3 command line and I've been having a pretty rough go of it using the provided definitions.

I didn't find any documentation here or elsewhere about the @cdmDatabaseSchema notation in the ddl files, so I went ahead and removed them. I assume they are something used by the sqlrender utility - but I'm not an R user - I'm just trying to create a working OMOPCDM database in sqlite.

I also discovered that sqlite does not allow you to apply primary or foreign keys to tables using the ALTER TABLE commands provided. So, I wrote a brittle little script to apply the PK and FK definitions to the main _ddl.sql file.

This allows me to use the patched _ddl.sql file to define tables with PKs and FKs. The _indices.sql file seems fine as-is.

I did notice that the COHORT table is missing a PRIMARY KEY, so the foreign key check fails in a clean install for just that table. It almost seems like COHORT should have cohort_id integer PRIMARY KEY and COHORT_DEFINITION should have cohort_definition_id integer PRIMARY KEY - and then the COHORT table should have a FK on COHORT_DEFINITION.cohort_definition_id (instead, it seems backwards). But as it is now, if I make COHORT.cohort_definition_id the PK of the COHORT table - it gets me past the failing foreign key checks. Anyway, I expect some of that will change in future versions - my script includes a hack for this.

For posterity, I'm providing my script here. However, are there instructions for people who don't want to use R but just want to create a working DB out there somewhere?

Click to see my script
#!/bin/bash
#
# File: update-ddl.sh
#
# Converts the OMOP-provided DDL database definition scripts into a format
# that can be loaded by sqlite3.
#
# This has been tested against the provided sqlite 5.4 DDL files.
#
# To run this, place copies of the provided .sql files into this directory
# and run the script.  It will modify the sql files and create an empty
# database called cdm.db
#

set -e
set -o pipefail
set -u


# There should be only one DDL file in the local directory.
DDL=$( ls OMOPCDM*_ddl.sql )
[ ! -e ${DDL} ] && echo "Failed to find an OMOPCDM*_ddl.sql schema!" && exit 1


# Awk scriptlet to add a PRIMARY KEY declaration to a table.column in a table
# definition.
#
# Assumptions:
#   'table' is padded with spaces
#   'column' begins with a tab and ends with a space
#
AWK_PK='
/^CREATE TABLE / {
  if ($0 ~ table) { target=1 } else { target=0 }
}

{
  if (target && $0 ~ column && (!($0 ~ / PRIMARY KEY,/))) {
    sub(/,/, " PRIMARY KEY,");
  }
  print $0
}
'

# Awk scriptlet to add a FOREIGN KEY declaration to a table.column in a table
# definition.
#
# Assumptions:
#   'table' is padded with spaces
#   'column' begins with a tab and ends with a space
#   'ref' is multiple words long
#
AWK_FK='
/^CREATE TABLE / {
  if ($0 ~ table) { target=1 } else { target=0 }
}

{
  if (target && $0 ~ column && (!($0 ~ / REFERENCES /))) {
    if ($0 ~ /,$/) {
      sub(/,$/, " REFERENCES " ref ",");
    } else {
      sub(/ );$/, " REFERENCES " ref ");");
    }
  }
  print $0
}
'

##
# Gets the table names and PK columns from constraints.
#
function get_pks() {
  grep '^ALTER TABLE ' OMOPCDM*_primary_keys.sql \
    | sed -e 's:.* \(.*\) ADD CONSTRAINT .* PRIMARY KEY .\(.*\).;:\1 \2:'
}


##
# Decorate the PK columns with PRIMARY KEY if not already decorated.
#
function mark_pk() {

  # Apply the change to the ddl file content, saving into a temp file.
  cat ${DDL} | awk -v table=" ${1} " -v column="        ${2} " "${AWK_PK}" \
    > ${DDL}.tmp

  # diff the files to assert diff count is 1 changed line only.
  (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1

  # Apply the temp file to the DDL file.
  mv ${DDL}.tmp ${DDL}
}


function get_fks() {
  grep '^ALTER TABLE ' OMOPCDM*_constraints.sql \
    | sed -e 's:ALTER TABLE \(.*\) ADD CONSTRAINT .* FOREIGN KEY .\([^ ]*\). REFERENCES:\1 \2:' \
    | sed -e 's:;::'
}


##
# Decorate the PK columns with PRIMARY KEY if not already decorated.
#
function mark_fk() {

  # Apply the change to the ddl file content, saving into a temp file.
  cat ${DDL} \
    | awk -v table=" ${1} " -v column=" ${2} " -v ref="${3}" "${AWK_FK}" \
    > ${DDL}.tmp

  # diff the files to assert diff count is 1 changed line only.
  (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1

  # Apply the temp file to the DDL file.
  mv ${DDL}.tmp ${DDL}
}


# Remove the template schema reference, which isn't supported in sqlite.
sed -i -e 's:@cdmDatabaseSchema.::g' OMOPCDM*.sql && rm -f OMOPCDM*.sql-e

# Update the table definitions to insert the PRIMARY KEY declarations.
echo "Adding PRIMARY KEYS to DDL"
get_pks | while read table column; do
  mark_pk ${table} ${column}
done

# Insert the FK constraints into the table.
echo "Adding FOREIGN KEYS to DDL"
get_fks | while read table column reference; do
  mark_fk ${table} ${column} "${reference}"
done

# HACK: no PK in the cohort table, so this is needed to satisfy FK checks.
mark_pk cohort cohort_definition_id

# Populate the DB tables and indices.
echo "Creating empty cdm.db database"
rm -f cdm.db
touch cdm.db
sqlite3 cdm.db < ${DDL}
sqlite3 cdm.db < OMOPCDM*_indices.sql
sqlite3 cdm.db "pragma foreign_key_check"

# Success!
echo "DONE!"

This updates the _ddl.sql file to look like this:

CREATE TABLE person (
                        person_id integer NOT NULL PRIMARY KEY,
                        gender_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID),
                        year_of_birth integer NOT NULL,
                        month_of_birth integer NULL,
                        day_of_birth integer NULL,
                        birth_datetime REAL NULL,
                        race_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID),
                        ethnicity_concept_id integer NOT NULL REFERENCES CONCEPT (CONCEPT_ID),
                        location_id integer NULL REFERENCES LOCATION (LOCATION_ID),
                        provider_id integer NULL REFERENCES PROVIDER (PROVIDER_ID),
                        care_site_id integer NULL REFERENCES CARE_SITE (CARE_SITE_ID),
                        person_source_value TEXT NULL,
                        gender_source_value TEXT NULL,
                        gender_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID),
                        race_source_value TEXT NULL,
                        race_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID),
                        ethnicity_source_value TEXT NULL,
                        ethnicity_source_concept_id integer NULL REFERENCES CONCEPT (CONCEPT_ID));

When the script runs correctly, you see output like this:

(base) anderson.carl3@home test % ./update-ddl.sh
Adding PRIMARY KEYS to DDL
Adding FOREIGN KEYS to DDL
Creating empty cdm.db database
DONE!
@thisismexp
Copy link

Thanks for your script, saved me a lot of effort. I extended your script to replace tabs with spaces:

#!/bin/bash
#
# File: update-ddl.sh
#
# Converts the OMOP-provided DDL database definition scripts into a format
# that can be loaded by sqlite3.
#
# This has been tested against the provided sqlite 5.4 DDL files.
#
# To run this, place copies of the provided .sql files into this directory
# and run the script.  It will modify the sql files and create an empty
# database called cdm.db
#

set -e
set -o pipefail
set -u


# @thisismexp: replace \t with spaces
d="$(mktemp -d)"
for filename in *.sql; do
    expand -t 4 "$filename" > "$d"/"$filename"
done
mv "$d"/* .
rmdir "$d"


# There should be only one DDL file in the local directory.
DDL=$( ls OMOPCDM*_ddl.sql )
[ ! -e ${DDL} ] && echo "Failed to find an OMOPCDM*_ddl.sql schema!" && exit 1


# Awk scriptlet to add a PRIMARY KEY declaration to a table.column in a table
# definition.
#
# Assumptions:
#   'table' is padded with spaces
#   'column' begins with a tab and ends with a space
#
AWK_PK='
/^CREATE TABLE / {
  if ($0 ~ table) { target=1 } else { target=0 }
}

{
  if (target && $0 ~ column && (!($0 ~ / PRIMARY KEY,/))) {
    sub(/,/, " PRIMARY KEY,");
  }
  print $0
}
'

# Awk scriptlet to add a FOREIGN KEY declaration to a table.column in a table
# definition.
#
# Assumptions:
#   'table' is padded with spaces
#   'column' begins with a tab and ends with a space
#   'ref' is multiple words long
#
AWK_FK='
/^CREATE TABLE / {
  if ($0 ~ table) { target=1 } else { target=0 }
}

{
  if (target && $0 ~ column && (!($0 ~ / REFERENCES /))) {
    if ($0 ~ /,$/) {
      sub(/,$/, " REFERENCES " ref ",");
    } else {
      sub(/ );$/, " REFERENCES " ref ");");
    }
  }
  print $0
}
'

##
# Gets the table names and PK columns from constraints.
#
function get_pks() {
  grep '^ALTER TABLE ' OMOPCDM*_primary_keys.sql \
    | sed -e 's:.* \(.*\) ADD CONSTRAINT .* PRIMARY KEY .\(.*\).;:\1 \2:'
}


##
# Decorate the PK columns with PRIMARY KEY if not already decorated.
#
function mark_pk() {

  # Apply the change to the ddl file content, saving into a temp file.
  cat ${DDL} | awk -v table=" ${1} " -v column="        ${2} " "${AWK_PK}" \
    > ${DDL}.tmp

  # diff the files to assert diff count is 1 changed line only.
  (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1

  # Apply the temp file to the DDL file.
  mv ${DDL}.tmp ${DDL}
}


function get_fks() {
  grep '^ALTER TABLE ' OMOPCDM*_constraints.sql \
    | sed -e 's:ALTER TABLE \(.*\) ADD CONSTRAINT .* FOREIGN KEY .\([^ ]*\). REFERENCES:\1 \2:' \
    | sed -e 's:;::'
}


##
# Decorate the PK columns with PRIMARY KEY if not already decorated.
#
function mark_fk() {

  # Apply the change to the ddl file content, saving into a temp file.
  cat ${DDL} \
    | awk -v table=" ${1} " -v column=" ${2} " -v ref="${3}" "${AWK_FK}" \
    > ${DDL}.tmp

  # diff the files to assert diff count is 1 changed line only.
  (( $( diff -by --suppress-common-lines ${DDL} ${DDL}.tmp | wc -l ) == 1 )) || exit 1

  # Apply the temp file to the DDL file.
  mv ${DDL}.tmp ${DDL}
}


# Remove the template schema reference, which isn't supported in sqlite.
sed -i -e 's:@cdmDatabaseSchema.::g' OMOPCDM*.sql && rm -f OMOPCDM*.sql-e

# Update the table definitions to insert the PRIMARY KEY declarations.
echo "Adding PRIMARY KEYS to DDL"
get_pks | while read table column; do
  mark_pk ${table} ${column}
done

# Insert the FK constraints into the table.
echo "Adding FOREIGN KEYS to DDL"
get_fks | while read table column reference; do
  mark_fk ${table} ${column} "${reference}"
done

# HACK: no PK in the cohort table, so this is needed to satisfy FK checks.
mark_pk cohort cohort_definition_id

# Populate the DB tables and indices.
echo "Creating empty cdm.db database"
rm -f cdm.db
touch cdm.db
sqlite3 cdm.db < ${DDL}
sqlite3 cdm.db < OMOPCDM*_indices.sql
sqlite3 cdm.db "pragma foreign_key_check"

# Success!
echo "DONE!"```

</details>

@wardle
Copy link

wardle commented Jun 9, 2024

This is at an early stage of development- but would be grateful for any feedback - but this might help efforts with SQLite. See https://github.com/wardle/iort - it can now dynamically generate and execute DDL and handle the vocab import and is working for SQLite and PostgreSQL. I need to add customisation to choose different databases in a similar way to schema in PostgreSQL for example, but might help?

@barabo
Copy link
Author

barabo commented Jun 9, 2024

For those on this thread, or anyone who lands here. I'm working with a sqlite cdm database for a demo, which uses an experimental new jq module library to quickly map FHIR resources into cdm records.

Demo: https://github.com/barabo/fhir-to-omop-demo
fhir-jq: https://github.com/barabo/fhir-jq

Both are still WIP, but this is what motivated me to update the DDL like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

4 participants