Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate and fix CSV format incompatibilities between readr::write_csv and PostgreSQL COPY FROM #38

Open
logstar opened this issue Sep 17, 2021 · 0 comments

Comments

@logstar
Copy link
Contributor

logstar commented Sep 17, 2021

In database building procedure, R dataframes are output into CSV files, and the CSV files are loaded into a PostgreSQL database using COPY FROM SQL command.

The R output CSV format is not completely compatible with PostgreSQL COPY FROM command.

The following case is known to be incompatible:

  • \.
    • COPY FROM: "Because backslash is not a special character in the CSV format, ., the end-of-data marker, could also appear as a data value."
    • readr::write_csv does not quote \. in output CSV.

The following cases need to be further investigated for compatibility:

  • NULL/NA/missing values
    • COPY FROM: "The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted, while a non-NULL value matching the NULL parameter string is quoted. For example, with the default settings, a NULL is written as an unquoted empty string, while an empty string data value is written with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns."
    • readr::write_csv probably writes both NA and empty string as blank in output CSV, with na = "".
  • Values that are not delimiter/quote/CR/LF/NULL/NA/missing but contain them:
    • COPY FROM: "If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character."
    • readr::write_csv probably quotes such values in CSV output, but edge cases need to be evaluated.

Currently, PostgreSQL COPY FROM compatible CSV files are output by the following function.

pgc_write_csv <- function(x, file, append, col_names) {
readr::write_csv(
x, file, na = "", append = append, col_names = col_names,
quote = "needed", escape = "double", progress = FALSE)
}

The values in ${BULK_EXP_SCHEMA}_${BULK_EXP_TPM_HISTOLOGY_TBL}.csv, which is the only CSV file as of 09/17/2021, are all compatible with PostgreSQL COPY FROM. However, CSV files that will be added in #37 may contain incompatible values.

Refs:

@logstar logstar changed the title Investigate and fix CSV format incompatibilities with PostgreSQL COPY FROM SQL command Investigate and fix CSV format incompatibilities between readr::write_csv and PostgreSQL COPY FROM Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant