-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct translation of quotes (e.g for offset
field in nlp_note
)
#341
Comments
select offset from cdmv5.note_nlp limit 1;
valid ohdsi-sql?
select offset from cdmv5.note_nlp limit 1;
valid ohdsi-sql?select offset from cdmv5.note_nlp;
valid ohdsi-sql?
It appears 'offset' is a reserved word in PostgreSQL. It is unfortunate that it is used in the CDM specifications. It seems the CDM folks are aware of this, since the field name is quoted on the CDM website. So it is not valid OhdsiSql, but to your point, there's no way a user would know this. The current workaround would be to quote the field name, but that would make the field name case sensitive as well, which might cause problems for some sites in OHDSI. |
I tried testing In general, how do I know that my SQL is valid OHDSI-SQL or not? |
The workaround seems to work across many of the ohdsi platforms but I think there is an issue with bigquery. library(DatabaseConnector)
# cross platform sql
ohdsisql <- 'SELECT "offset" FROM @cdmDatabaseSchema.NOTE_NLP;'
# postgres ----
connection <- connect(dbms = "postgresql",
user = "postgres",
password = "",
server = "localhost/covid")
#> Connecting using PostgreSQL driver
cdmDatabaseSchema <- "cdm5"
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# oracle ----
connectionDetails <- createConnectionDetails(dbms = "oracle",
user = Sys.getenv("CDM5_ORACLE_USER"),
password = Sys.getenv("CDM5_ORACLE_PASSWORD"),
server = Sys.getenv("CDM5_ORACLE_SERVER"))
connection <- connect(connectionDetails)
#> Connecting using Oracle driver
#> - using THIN to connect
cdmDatabaseSchema <- Sys.getenv("CDM5_ORACLE_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "OHDSI")
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# redshift ----
connection <- connect(dbms = "redshift",
server = Sys.getenv("CDM5_REDSHIFT_SERVER"),
user = Sys.getenv("CDM5_REDSHIFT_USER"),
password = Sys.getenv("CDM5_REDSHIFT_PASSWORD"),
port = Sys.getenv("CDM5_REDSHIFT_PORT"))
#> Connecting using Redshift driver
cdmDatabaseSchema <- Sys.getenv("CDM5_REDSHIFT_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# snowflake ----
connection <- connect(
dbms = "snowflake",
connectionString = Sys.getenv("SNOWFLAKE_CONNECTION_STRING"),
user = Sys.getenv("SNOWFLAKE_USER"),
password = Sys.getenv("SNOWFLAKE_PASSWORD")
)
#> Connecting using Snowflake driver
cdmDatabaseSchema <- Sys.getenv("SNOWFLAKE_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# bigquery ----
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
user="",
password='',
pathToDriver = bqDriverPath)
connection <- connect(connectionDetails)
#> Connecting using BigQuery driver
cdmDatabaseSchema <- "synpuf_110k"
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] F0_
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# The returned column name is unexpected on bigquery. Maybe I'm doing something wrong here...
# sql server ----
connectionDetails <- createConnectionDetails(
dbms = "sql server",
server = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
user = Sys.getenv("CDM5_SQL_SERVER_USER"),
password = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
port = Sys.getenv("CDM5_SQL_SERVER_PORT")
)
connection <- connect(connectionDetails)
#> Connecting using SQL Server driver
cdmDatabaseSchema <- "cdmv54.dbo"
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)
# sqlite ----
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
#> Connecting using SQLite driver
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "main")
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection) Created on 2023-08-11 with reprex v2.0.2 |
In general, quoting columns seems to work on some database but not all. library(DatabaseConnector)
testOhdsiSql <- function(ohdsisql) {
renderTranslateQuerySql <- purrr::safely(renderTranslateQuerySql, quiet = F)
cli::cat_rule("postgres ----")
cli::cat_line()
connection <- connect(dbms = "postgresql",
user = "postgres",
password = "",
server = "localhost/covid")
cdmDatabaseSchema <- "cdm5"
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("oracle ----")
connectionDetails <- createConnectionDetails(dbms = "oracle",
user = Sys.getenv("CDM5_ORACLE_USER"),
password = Sys.getenv("CDM5_ORACLE_PASSWORD"),
server = Sys.getenv("CDM5_ORACLE_SERVER"))
connection <- connect(connectionDetails)
cdmDatabaseSchema <- Sys.getenv("CDM5_ORACLE_CDM_SCHEMA")
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("redshift ----")
cli::cat_line()
connection <- connect(dbms = "redshift",
server = Sys.getenv("CDM5_REDSHIFT_SERVER"),
user = Sys.getenv("CDM5_REDSHIFT_USER"),
password = Sys.getenv("CDM5_REDSHIFT_PASSWORD"),
port = Sys.getenv("CDM5_REDSHIFT_PORT"))
cdmDatabaseSchema <- Sys.getenv("CDM5_REDSHIFT_CDM_SCHEMA")
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("snowflake ----")
cli::cat_line()
connection <- connect(
dbms = "snowflake",
connectionString = Sys.getenv("SNOWFLAKE_CONNECTION_STRING"),
user = Sys.getenv("SNOWFLAKE_USER"),
password = Sys.getenv("SNOWFLAKE_PASSWORD")
)
cdmDatabaseSchema <- Sys.getenv("SNOWFLAKE_CDM_SCHEMA")
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("bigquery ----")
cli::cat_line()
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
user="",
password='',
pathToDriver = bqDriverPath)
connection <- connect(connectionDetails)
cdmDatabaseSchema <- "synpuf_110k"
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("sql server ----")
cli::cat_line()
connectionDetails <- createConnectionDetails(
dbms = "sql server",
server = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
user = Sys.getenv("CDM5_SQL_SERVER_USER"),
password = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
port = Sys.getenv("CDM5_SQL_SERVER_PORT")
)
connection <- connect(connectionDetails)
cdmDatabaseSchema <- "cdmv54.dbo"
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
))
disconnect(connection)
cli::cat_rule("sqlite ----")
cli::cat_line()
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
print(dplyr::tibble(
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "main")[[1]]
))
disconnect(connection)
}
# cross platform sql
testOhdsiSql('SELECT TOP 1 "person_id" FROM @cdmDatabaseSchema.person;')
#> ── postgres ---- ───────────────────────────────────────────────────────────────
#> Connecting using PostgreSQL driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 1
#> ── oracle ---- ─────────────────────────────────────────────────────────────────
#> Connecting using Oracle driver
#> - using THIN to connect
#> Error: Error executing SQL:
#> java.sql.SQLSyntaxErrorException: ORA-00904: "person_id": invalid identifier
#>
#> An error report has been created at /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmp1SsB43/reprex-f4484725db51-cocky-doe/errorReportSql.txt
#> # A tibble: 0 × 0
#> ── redshift ---- ───────────────────────────────────────────────────────────────
#> Connecting using Redshift driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 2
#> ── snowflake ---- ──────────────────────────────────────────────────────────────
#> Connecting using Snowflake driver
#> Error: Error executing SQL:
#> net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: error line 1 at position 8
#> invalid identifier '"person_id"'
#> An error report has been created at /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmp1SsB43/reprex-f4484725db51-cocky-doe/errorReportSql.txt
#> # A tibble: 0 × 0
#> ── bigquery ---- ───────────────────────────────────────────────────────────────
#> Connecting using BigQuery driver
#> # A tibble: 1 × 1
#> F0_
#> <chr>
#> 1 person_id
#> ── sql server ---- ─────────────────────────────────────────────────────────────
#> Connecting using SQL Server driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 0
#> ── sqlite ---- ─────────────────────────────────────────────────────────────────
#> Connecting using SQLite driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 6
testOhdsiSql('SELECT TOP 1 person_id FROM @cdmDatabaseSchema.person;')
#> ── postgres ---- ───────────────────────────────────────────────────────────────
#> Connecting using PostgreSQL driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 1
#> ── oracle ---- ─────────────────────────────────────────────────────────────────
#> Connecting using Oracle driver
#> - using THIN to connect
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 540
#> ── redshift ---- ───────────────────────────────────────────────────────────────
#> Connecting using Redshift driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 2
#> ── snowflake ---- ──────────────────────────────────────────────────────────────
#> Connecting using Snowflake driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 1
#> ── bigquery ---- ───────────────────────────────────────────────────────────────
#> Connecting using BigQuery driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 112989
#> ── sql server ---- ─────────────────────────────────────────────────────────────
#> Connecting using SQL Server driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 0
#> ── sqlite ---- ─────────────────────────────────────────────────────────────────
#> Connecting using SQLite driver
#> # A tibble: 1 × 1
#> PERSON_ID
#> <dbl>
#> 1 6 Created on 2023-08-11 with reprex v2.0.2 |
Thanks! This is really helpful. The weird behavior on BigQuery is because it uses backticks instead of double quotes. I added a translation rule in the develop branch that I still need to test a bit more. For SnowFlake I'll need a testing server to figure out what is going on. |
Thanks to Odysseus I now have a testing server for Snowflake! On Snowflake the issue here was that by default field names are created in uppercase, while the search for So with the develop versions of SqlRender and DatabaseConnector quotes should work for BigQuery and Snowflake. (Note that I probably will change the DatabaseConnector dbplyr implementation to also use quotes. I had not done so in the past, to avoid issues with case like in Snowflake, but now I think it better to be more consistent with how dbplyr works for other platforms) |
Woo! So I know we have discussed this a bit in the past and may have different opinions which is totally fine but just bringing it up again.... I would like to use DatabaseConnector with dbplyr's sql translations. So DatabaseConnector is the DBI driver and dbplyr provides the SQL translation. One reason is that I don't think the sql generated by dbplyr can be considered valid OHDSI-SQL in part due to all the quoting of identifiers. Even though OHDSI-SQL is based on MSSQL you cannot in general expect any valid MSSQL to be valid OHDSI-SQL. In Darwin we are relying on dbplyr quite heavily for SQL translation and it works well on almost all OHDSI databases (Bigquery is still a work in progress). so in summary I want to have two separate/orthogonal "layers" to the architecture that cleanly divide responsibility: sql translation, and DBI driver. What do you think about this approach? If quotes are valid in OHDSI-SQL then wouldn't SqlRender need to translate those qoute symbols? Or maybe all dbms use the same quote symbols for identifiers and string literals? |
Now that we dynamically inherit the DatabaseConnector connection from the SQL Server class we could also dynamically inherit from other classes, thus invoking other dbplyr backends, avoiding the need to translate the dbplyr-generated SQL. But we'd still be 'overriding someone else's class`, so I think you still would not be happy with that? |
Yes you're right. I think we should add some code to dbplyr that will generate the correct sql for a DatabaseConnector connection. I'm happy to take that on if you're ok with that. But yea I do think DatabaseConnector connections should have their own class and not use the Part of my interest here is to see if we can rely on these sql generation/translation for real studies and from my limited experience it is easier to write cross platform dplyr than cross platform ohdsi-sql (probably just my limited experience with ohdsi-sql though). Ideally someday we will have a function that will tell you if your ohdsi-sql is correct without needing to have test data in every supported dbms. Yesterday I wrote some ohdsi-sql but had no example data with realistic measurements to test it on. |
It would be great if dbplyr could generate actual OhdsiSql. I suspect that will be lot of work though. I think you'll find you'll need to test your dbplyr code on all platforms too, so in that sense it is no different from OhdsiSql. OhdsiSql is defined, just not very formally. It basically is SQL Server SQL, but restricted to the SQL functions and structures listed here. If we could find an open source SQL validator it should be easy to adapt it for OhdsiSql. But I can't seem to find any. |
select offset from cdmv5.note_nlp;
valid ohdsi-sql?offset
field in nlp_note
offset
field in nlp_note
offset
field in nlp_note
)
Is this query valid ohdsi-sql?
select offset from cdmv5.note_nlp;
If it is not valid then how would a new user know that this is incorrect OHDSI-SQL and how to fix it?
If it is valid then I think there is a bug in translation.
Created on 2023-08-09 with reprex v2.0.2
contents of errorReportSql.txt
The text was updated successfully, but these errors were encountered: