pass arguments to read_csv #26

Open
alexwalkerepi opened this issue Feb 28, 2020 · 4 comments

Comments

@alexwalkerepi
Contributor

When using bq.cached_read on the query below, the subpara column is initially read as a str, but when it is read back from the cache, it comes back as an int. Ideally, we'd be able to pass arguments such as dtype={'subpara': str} through to pd.read_csv.

SELECT
  DISTINCT SUBSTR(bnf_code, 1, 7) AS subpara,
FROM
  ebmdatalab.hscic.normalised_prescribing_standard AS prescribing
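
For anyone trying to reproduce this without BigQuery access, here is a minimal sketch of the round trip using plain pandas. The sample values are made up for illustration and are not taken from the real table; the in-memory buffer stands in for the cache file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the query result: 7-character codes held as strings.
df = pd.DataFrame({"subpara": ["0101010", "0407010"]})

# Write to an in-memory CSV, standing in for the cache file.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Reading back with no hints: pandas infers an integer column,
# so the leading zeros are lost.
buf.seek(0)
inferred = pd.read_csv(buf)

# Reading back with an explicit dtype keeps the original strings.
buf.seek(0)
explicit = pd.read_csv(buf, dtype={"subpara": str})

print(inferred["subpara"].tolist())
print(explicit["subpara"].tolist())
```
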
@evansd
Contributor

evansd commented Mar 2, 2020

One option would be to muck about with the CSV headers and encode the type in them (e.g. we'd store str:subpara rather than just subpara), then strip the types out and use them when we read the cached file back. I don't know how easy Pandas makes that, but I'm sure it's possible.
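
A rough sketch of how the header idea might look, assuming column names are strings that never contain a colon-prefixed tag of their own. The helper names write_typed_csv and read_typed_csv are hypothetical, not part of the library, and only the "str" tag gets special handling; every other tag is just the pandas dtype name.

```python
import io

import pandas as pd
from pandas.api.types import is_string_dtype


def write_typed_csv(df, handle):
    """Write df as CSV with headers such as 'str:subpara' (hypothetical format)."""
    tagged = {}
    for col in df.columns:
        tag = "str" if is_string_dtype(df[col]) else df[col].dtype.name
        tagged[col] = f"{tag}:{col}"
    df.rename(columns=tagged).to_csv(handle, index=False)


def read_typed_csv(handle):
    """Read a CSV written by write_typed_csv, restoring the recorded dtypes."""
    start = handle.tell()
    # Read only the header row to recover the tagged column names.
    tagged_cols = pd.read_csv(handle, nrows=0).columns
    handle.seek(start)

    dtypes, plain_names = {}, {}
    for tagged in tagged_cols:
        tag, name = tagged.split(":", 1)
        dtypes[tagged] = str if tag == "str" else tag
        plain_names[tagged] = name

    # Parse with the recorded dtypes, then strip the tags from the headers.
    return pd.read_csv(handle, dtype=dtypes).rename(columns=plain_names)


# Usage with made-up data.
original = pd.DataFrame({"subpara": ["0101010"], "items": [3]})
cache = io.StringIO()
write_typed_csv(original, cache)
cache.seek(0)
restored = read_typed_csv(cache)
```

The downside is that the cached file is no longer a plain CSV that other tools can read without knowing about the tags.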

@evansd
Contributor

evansd commented Mar 2, 2020

It also looks like Pandas supports comments in CSV files (at least to the extent of ignoring them on read) so we could write the types into an initial comment line in the file if that's easier than modifying the headers.
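
A possible shape for the comment-line variant, again only a sketch. The "# dtypes: " marker and both helper names are assumptions made up for this example. The reader consumes the first line itself and hands the rest of the file to pandas, rather than relying on read_csv's comment parameter, so that a "#" occurring in the data is not at risk of being treated as a comment.

```python
import io
import json

import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical marker for the comment line that carries the column types.
PREFIX = "# dtypes: "


def write_with_dtype_comment(df, handle):
    """Write a JSON dtype map as a leading comment line, then the CSV itself."""
    recorded = {
        col: "str" if is_string_dtype(df[col]) else df[col].dtype.name
        for col in df.columns
    }
    handle.write(PREFIX + json.dumps(recorded) + "\n")
    df.to_csv(handle, index=False)


def read_with_dtype_comment(handle):
    """Read the dtype comment line, then parse the remaining CSV with those dtypes."""
    first_line = handle.readline()
    if not first_line.startswith(PREFIX):
        raise ValueError("cache file is missing the dtype comment line")
    recorded = json.loads(first_line[len(PREFIX):])
    dtypes = {col: (str if tag == "str" else tag) for col, tag in recorded.items()}
    # The handle is now positioned at the header row, so pandas reads from there.
    return pd.read_csv(handle, dtype=dtypes)


# Usage with made-up data.
original = pd.DataFrame({"subpara": ["0101010"], "items": [3]})
cache = io.StringIO()
write_with_dtype_comment(original, cache)
cache.seek(0)
restored = read_with_dtype_comment(cache)
```

Files written this way can still be loaded by anything that skips comment lines, for example pd.read_csv(path, comment="#"), although that path falls back to inferred types.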

@sebbacon
Contributor

sebbacon commented Mar 2, 2020

Or we could store the types in the fingerprint_path file. But that would require a bigger refactoring. I like the comments idea.

@alexwalkerepi
Contributor Author

This is definitely a more robust solution than passing arguments to read_csv. It took me a while to realise that the types had changed between the query and the cache. This way it should keep the same types by default.
