Create JSON schemas for import #356

Open
rbroth opened this issue Mar 17, 2023 · 9 comments · Fixed by #363

Comments

@rbroth
Collaborator

rbroth commented Mar 17, 2023

We want to create a simple way for the scientists to check whether their csv files can be imported into the database/system. JSON schemas look like a good way of doing this.

@rbroth
Collaborator Author

rbroth commented Mar 23, 2023

A problem I've run into: JSON schemas are made to validate data in JSON format. Ergo, before we can validate the data we need to load it and convert it to JSON. The use of CSV presents two problems:

  • Data types: JSON has a boolean data type, CSV does not. I've been futzing around with automatic converters (pandas), but maybe we should use `{ "type": ["number", "string"], "enum": [0, 1, "True", "False"] }` for boolean attributes.
  • Null values: some CSV libraries import `,,` as an empty string, others as null.
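
The two points above can be illustrated with Python's standard `csv` module, which returns an empty field as an empty string. This is only a minimal sketch, assuming we want empty fields to become `None` and the four encodings from the proposed enum to become real booleans. The names `normalize_row` and `BOOLEAN_VALUES` and the column names are hypothetical and not taken from the import script.

```python
import csv
import io

# Hypothetical mapping for the boolean encodings proposed in the enum above.
BOOLEAN_VALUES = {"0": False, "1": True, "False": False, "True": True}


def normalize_row(row, boolean_fields):
    """Turn empty strings into None and known boolean encodings into bools."""
    cleaned = {}
    for key, value in row.items():
        if value == "":
            cleaned[key] = None  # csv gives "" for `,,`; JSON wants null
        elif key in boolean_fields and value in BOOLEAN_VALUES:
            cleaned[key] = BOOLEAN_VALUES[value]
        else:
            cleaned[key] = value  # everything else stays a string
    return cleaned


# Hypothetical sample data: the last field of the first row is empty.
sample = io.StringIO("id,is_control,comment\n1,True,\n2,0,ok\n")
rows = [normalize_row(r, {"is_control"}) for r in csv.DictReader(sample)]
```

Doing this normalisation before validation would let the schema use a plain `"type": "boolean"` for such attributes, at the cost of an extra conversion step.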

@rbroth
Collaborator Author

rbroth commented Jul 26, 2023

We have a first version of JSON schemas for data import. We also have an initial version of a function that checks a csv file against the JSON schemas (validate_against_json_schema()). However, the function isn't currently being used in the import process.

  • Validate csv files during the import process
  • Possibly: modularize the validation function so that it's easy for the scientists to use themselves.

@rbroth rbroth self-assigned this Jul 26, 2023
@rbroth
Collaborator Author

rbroth commented Jul 27, 2023

Once the JSON schemas are being used, we can delete the view bad_biomarker_vals, which is currently used for validating biomarker data.

@rbroth rbroth linked a pull request Jul 27, 2023 that will close this issue
2 tasks
@rbroth
Collaborator Author

rbroth commented Jul 27, 2023

Using NA for null values could trip us up later; we need to ensure that there are none.

@rbroth
Collaborator Author

rbroth commented Jul 28, 2023

MS Teams discussion

@rbroth
Collaborator Author

rbroth commented Jul 28, 2023

Current problem:

The import script uses the Python csv library to read the CSV files into memory as Python dictionaries. It then runs the Python jsonschema library against those dictionaries to check validity. The problem comes from the lack of automatic data type conversion, as the csv documentation explains:

Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).

So, either all fields will be loaded as strings, or we need to ensure that string fields are quoted and non-string fields are not quoted.
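
The behaviour described above can be reproduced with a few lines of standard-library Python. This is a small sketch with made-up sample data, not an excerpt from the import script.

```python
import csv
import io

# Made-up data: string fields are quoted, numeric fields are not.
data = '"sample_a",1.5\n"sample_b",2\n'

# Default quoting: every field comes back as a string.
default_rows = list(csv.reader(io.StringIO(data)))

# QUOTE_NONNUMERIC: unquoted fields are converted to float,
# quoted fields stay strings.
nonnumeric_rows = list(
    csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONNUMERIC)
)

# An unquoted string field cannot be converted to float and raises ValueError.
try:
    list(csv.reader(io.StringIO("sample_c,3\n"), quoting=csv.QUOTE_NONNUMERIC))
    unquoted_text_fails = False
except ValueError:
    unquoted_text_fails = True
```

The last case is why relying on QUOTE_NONNUMERIC would require every string field in every file to be quoted consistently.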

@rbroth
Collaborator Author

rbroth commented Jul 28, 2023

Alternatives:

  1. Allow Python to load every field as a string and use regex to check that the various numeric fields look like numbers. The loading into the db works either way. Drawback: we won't be able to use JSON schemas to enforce that numbers fall within a particular range.
  2. Perform data conversion in Python. Benefit: more forgiving when receiving input, which will be useful in the future when we get user data. Drawback: complicated to code.
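
For comparison, option 1 might look roughly like the sketch below. The regular expression, the helper name `non_numeric_fields` and the field names are assumptions made for illustration only.

```python
import re

# Assumed pattern: optional minus sign, digits, optional decimal part.
NUMERIC_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def non_numeric_fields(row, numeric_fields):
    """Return the names of fields whose string value does not look like a number."""
    return [
        name
        for name in numeric_fields
        # Missing or None values are treated as empty strings, which never match.
        if not NUMERIC_PATTERN.fullmatch(row.get(name) or "")
    ]


# Hypothetical row as it would come out of csv.DictReader (all strings).
row = {"sample_id": "A1", "zinc": "12.5", "iron": "abc"}
bad = non_numeric_fields(row, ["zinc", "iron"])
```

The same check could be expressed in the schema itself with the `pattern` keyword, but `minimum` and `maximum` only apply to numeric values, which is the drawback noted in option 1.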

@rbroth
Collaborator Author

rbroth commented Jul 28, 2023

I've decided to go for option 2. The idea is to load the CSV with every field as a string, then use the JSON schema to determine which attributes should have the "number" data type, then cast those attributes to floats, and finally check the data against the JSON schema.
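
A rough sketch of the casting step could look like this. It is not the code from the merge request: the schema fragment, the helper names `numeric_properties` and `cast_numbers`, and the column names are all hypothetical.

```python
import csv
import io

# Hypothetical schema fragment; the real schemas live in the repository.
schema = {
    "type": "object",
    "properties": {
        "sample_id": {"type": "string"},
        "zinc": {"type": "number", "minimum": 0},
    },
}


def numeric_properties(schema):
    """Names of properties whose declared type is (or includes) "number"."""
    names = set()
    for name, spec in schema.get("properties", {}).items():
        declared = spec.get("type")
        types = declared if isinstance(declared, list) else [declared]
        if "number" in types:
            names.add(name)
    return names


def cast_numbers(row, numeric_names):
    """Cast numeric properties to float; leave uncastable values unchanged."""
    result = dict(row)
    for name in numeric_names:
        try:
            result[name] = float(row[name])
        except (KeyError, TypeError, ValueError):
            pass  # left as-is so that schema validation can report it
    return result


# Every field is loaded as a string first, then cast according to the schema.
reader = csv.DictReader(io.StringIO("sample_id,zinc\nA1,12.5\nA2,n/a\n"))
rows = [cast_numbers(r, numeric_properties(schema)) for r in reader]
```

Each resulting row could then be passed to `jsonschema.validate(instance=row, schema=schema)`. Leaving uncastable values as strings means the validator, rather than the casting code, reports the type error. Note that `float()` also accepts strings such as "nan" and "inf", which may need separate handling.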

@rbroth
Collaborator Author

rbroth commented Jul 28, 2023

Draft MR for me to continue working on next sprint: https://kwvmxgit.ad.nerc.ac.uk/bmgf-maps/data/db-test-data/-/merge_requests/80
