Remove unnecessary fields in inputs/dataset section of config files #485

Open
alexdewar opened this issue Aug 12, 2024 · 7 comments

@alexdewar
Contributor

This section currently looks something like this:

```json
"inputs": {
    "dataset": {
        "name": "India.DataFile.csv",
        "format": "csv",
        "delimiter": ",",
        "encoding": "ASCII",
        "columns": {
            "Gender": "integer",
            "Age": "integer",
            "Sector": "integer",
            "Income": "integer",
            "Carbohydrate": "double",
            "Protein": "double",
            "Sodium": "double",
            "Fat": "double",
            "PhysicalActivity": "double",
            "EnergyIntake": "double",
            "Weight": "double",
            "Height": "double"
        }
    }
}
```

This seems a bit bloated and I'm wondering if we can cut it down (which would simplify #449).

Some thoughts after going through the examples repo:

  • format: This is always "csv". Not used by HGPS anywhere. HGPS can only read CSV files anyway, so I suggest we just drop this.
  • delimiter: This is always ",", though it looks like the code should work with other delimiters. That said, I can't think why users would actually want to do this and I don't think we should encourage users to separate their fields with heart emojis or whatever, so let's just drop this too.
  • encoding: This is always "ASCII" or "UTF8", though all of the examples are in plain ASCII. Not used by HGPS anywhere. I don't know, but I suspect if you tried throwing some unicode in there then HGPS would fall over (on Windows, at least -- C++ doesn't have built-in unicode support). (Maybe we should investigate this and open an issue if necessary?) I think we should only support ASCII or UTF8 and drop this field.
  • columns: This one specifies the names and types for the different columns. I'm not sure about this one, as these fields are actually used in the program. An alternative would be to infer the types of the columns based on the contents of the CSV file (or just assume everything's a double).
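For illustration only, here's roughly what the section would shrink to if format, delimiter and encoding were dropped (keeping columns for now):

```json
"inputs": {
    "dataset": {
        "name": "India.DataFile.csv",
        "columns": {
            "Gender": "integer",
            "Age": "integer",
            "Sector": "integer",
            "Income": "integer",
            "Carbohydrate": "double",
            "Protein": "double",
            "Sodium": "double",
            "Fat": "double",
            "PhysicalActivity": "double",
            "EnergyIntake": "double",
            "Weight": "double",
            "Height": "double"
        }
    }
}
```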

Any thoughts @jamesturner246?

@alexdewar
Contributor Author

See also: imperialCHEPI/healthgps-examples#12

@alexdewar
Contributor Author

Just to add, one alternative to removing these fields would be to make some/all of them constants for now. We could enforce this with a schema. If we add support to HGPS for changing them at some point in the future, we can update the schema accordingly.

That said, I'd still be inclined to remove delimiter and encoding, because they seem pointless.

@jamesturner246
Contributor

Things like format and encoding can go.

If we remove `delimiter`, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.
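To illustrate the concern, here's a quick sketch of how naive inference could bite (the inference rule here is hypothetical, not anything HGPS actually does):

```python
def infer_type(values: list[str]) -> str:
    """Naively infer a column type: 'integer' if every value parses as an
    int, otherwise 'double'. Purely illustrative."""
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        return "double"

# A nominally integer column written as "1.0", "2.0" gets misclassified:
print(infer_type(["1", "2"]))      # integer
print(infer_type(["1.0", "2.0"]))  # double -- downstream code may expect ints
```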

I think as long as it doesn't cause problems, go for it.

@alexdewar
Contributor Author

> If we remove `delimiter`, I wonder if the CSV lib we use is smart enough to figure out European decimal points? Had all sorts of trouble when Ali was feeding me mixed period/comma config CSVs.

I think the answer there is to only allow commas as delimiters and dots as decimal points. If we open the door to users providing the data in weird formats, then we make everyone's life harder.

> I'm a bit hesitant about inferring column type from the files, as I've seen mixed 1.0s in integer columns and 1s in float columns. If we start passing in floats to code that expects ints, I wonder if things start breaking. But on the other hand I agree it does look bloated.

I was assuming that these CSV files always have the same columns -- is that not the case? If not, then it does make sense to leave things as is. I was hoping to be able to use table schemas to validate CSV files (which similarly indicate the data type of each column), but obviously that will only work if we're expecting CSV files to be in a specific format.
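For reference, assuming something like the Frictionless Table Schema spec is what I mean here, a schema declares each column's name and type roughly like this (using a few of the column names from the example above):

```json
{
    "fields": [
        {"name": "Gender", "type": "integer"},
        {"name": "Age", "type": "integer"},
        {"name": "Weight", "type": "number"}
    ]
}
```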

@jamesturner246
Contributor

Yeah, columns can be added, for instance. Come to think of it, I think these India.Datafile.csv and France.Datafile.csv are only ever used to print the initial status table at the beginning, and are not actually used from then on. This caused confusion and extra work in the past: people went looking for data to populate these files, even though it wasn't actually used beyond the initial status message.

@alexdewar
Contributor Author

That's a bit strange...

So, to be clear, are you saying that there are:

  1. Some columns which must be present and whose type we know in advance
  2. And possibly extra columns, which aren't used anywhere (and therefore whose type doesn't matter)?

If so, we could define table schemas which require certain columns to be present while allowing arbitrary extra columns, if the user wants.
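Sketching what that validation could look like (hedged: `REQUIRED_COLUMNS` and the function are hypothetical, and this checks only column names, not types):

```python
import csv
import io

# Hypothetical schema: columns that must be present (types shown for
# documentation only; this sketch doesn't check them).
REQUIRED_COLUMNS = {"Gender": int, "Age": int, "Weight": float}

def missing_columns(csv_text: str) -> list[str]:
    """Return required columns absent from the CSV header.
    Extra, unrecognised columns are simply ignored."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(set(REQUIRED_COLUMNS) - set(header))

sample = "Gender,Age,Weight,ExtraColumn\n1,30,70.5,foo\n"
print(missing_columns(sample))  # [] -- ExtraColumn is allowed and ignored
```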

@jamesturner246
Contributor

I mean that the whole file is used only in displaying the initial stat table, I think. Never used elsewhere, 97% sure.
