Remove unnecessary fields in inputs/dataset section of config files #485

Open
alexdewar opened this issue Aug 12, 2024 · 7 comments

@alexdewar
Contributor

This section currently looks something like this:

```json
"inputs": {
    "dataset": {
        "name": "India.DataFile.csv",
        "format": "csv",
        "delimiter": ",",
        "encoding": "ASCII",
        "columns": {
            "Gender": "integer",
            "Age": "integer",
            "Sector": "integer",
            "Income": "integer",
            "Carbohydrate": "double",
            "Protein": "double",
            "Sodium": "double",
            "Fat": "double",
            "PhysicalActivity": "double",
            "EnergyIntake": "double",
            "Weight": "double",
            "Height": "double"
        }
    }
}
```

This seems a bit bloated and I'm wondering if we can cut it down (which would simplify #449).

Some thoughts after going through the examples repo:

  • format: This is always "csv". Not used by HGPS anywhere. HGPS can only read CSV files anyway, so I suggest we just drop this.
  • delimiter: This is always ",", though it looks like the code should work with other delimiters. That said, I can't think why users would actually want to do this and I don't think we should encourage users to separate their fields with heart emojis or whatever, so let's just drop this too.
  • encoding: This is always "ASCII" or "UTF8", though all of the examples are in plain ASCII. Not used by HGPS anywhere. I don't know, but I suspect if you tried throwing some unicode in there then HGPS would fall over (on Windows, at least -- C++ doesn't have built-in unicode support). (Maybe we should investigate this and open an issue if necessary?) I think we should only support ASCII or UTF8 and drop this field.
  • columns: This one specifies the names and types for the different columns. I'm not sure about this one, as these fields are actually used in the program. An alternative would be to infer the types of the columns based on the contents of the CSV file (or just assume everything's a double).
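For illustration only, here's roughly what the section would shrink to if format, delimiter and encoding were dropped (keeping columns for now):

```json
"inputs": {
    "dataset": {
        "name": "India.DataFile.csv",
        "columns": {
            "Gender": "integer",
            "Age": "integer",
            "Sector": "integer",
            "Income": "integer",
            "Carbohydrate": "double",
            "Protein": "double",
            "Sodium": "double",
            "Fat": "double",
            "PhysicalActivity": "double",
            "EnergyIntake": "double",
            "Weight": "double",
            "Height": "double"
        }
    }
}
```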

Any thoughts @jamesturner246?

@alexdewar
Contributor Author

See also: imperialCHEPI/healthgps-examples#12

@alexdewar
Contributor Author

Just to add, one alternative to removing these fields would be to make some/all of them constants for now. We could enforce this with a schema. If we add support to HGPS for changing them at some point in the future, we can update the schema accordingly.

That said, I'd still be inclined to remove delimiter and encoding, because they seem pointless.

@jamesturner246
Contributor

Things like format and encoding can go.

If we remove `delimiter`, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.
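To illustrate the concern, here's a quick sketch of how naive inference could bite (the inference rule here is hypothetical, not anything HGPS actually does):

```python
def infer_type(values: list[str]) -> str:
    """Naively infer a column type: 'integer' if every value parses as an
    int, otherwise 'double'. Purely illustrative."""
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        return "double"

# A nominally integer column written as "1.0", "2.0" gets misclassified:
print(infer_type(["1", "2"]))      # integer
print(infer_type(["1.0", "2.0"]))  # double -- downstream code may expect ints
```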

I think as long as it doesn't cause problems, go for it.

@alexdewar
Contributor Author

> If we remove `delimiter`, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I think the answer there is to only allow commas as delimiters and dots as decimal points. If we open the door to users providing the data in weird formats, then we make everyone's life harder.

> I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.

I was assuming that these CSV files always have the same columns -- is that not the case? If not, then it does make sense to leave things as is. I was hoping to be able to use table schemas to validate CSV files (which similarly indicate the data type of each column), but obviously that will only work if we're expecting CSV files to be in a specific format.
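For reference, assuming something like the Frictionless Table Schema spec is what I mean here, a schema declares each column's name and type roughly like this (using a few of the column names from the example above):

```json
{
    "fields": [
        {"name": "Gender", "type": "integer"},
        {"name": "Age", "type": "integer"},
        {"name": "Weight", "type": "number"}
    ]
}
```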

@jamesturner246
Contributor

Yeah, columns can be added, for instance. Come to think of it, I think these India.Datafile.csv and France.Datafile.csv are only ever used to print the initial status table at the beginning, and are not actually used from then on. This caused confusion and extra work in the past: people went looking for data to populate these files, even though it wasn't actually used beyond the initial status message.

@alexdewar
Contributor Author

That's a bit strange...

So, to be clear, are you saying that there are:

  1. Some columns which must be present and whose type we know in advance
  2. And possibly extra columns, which aren't used anywhere (and therefore whose type doesn't matter)?

If so, we could define table schemas which require certain columns to be present while allowing arbitrary extra columns, if the user wants.
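Sketching what that validation could look like (hedged: `REQUIRED_COLUMNS` and the function are hypothetical, and this checks only column names, not types):

```python
import csv
import io

# Hypothetical schema: columns that must be present (types shown for
# documentation only; this sketch doesn't check them).
REQUIRED_COLUMNS = {"Gender": int, "Age": int, "Weight": float}

def missing_columns(csv_text: str) -> list[str]:
    """Return required columns absent from the CSV header.
    Extra, unrecognised columns are simply ignored."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(set(REQUIRED_COLUMNS) - set(header))

sample = "Gender,Age,Weight,ExtraColumn\n1,30,70.5,foo\n"
print(missing_columns(sample))  # [] -- ExtraColumn is allowed and ignored
```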

@jamesturner246
Contributor

I mean that the whole file is used only in displaying the initial stat table, I think. Never used elsewhere, 97% sure.
