Save of source file can cause problems with sparse JSON #183

Open
lisad opened this issue Dec 17, 2024 · 0 comments
Labels: question (Further information is requested)

Comments

Owner

lisad commented Dec 17, 2024

What's the right behavior for phaser in this situation?

  • The pipeline defaults to CSV format. It's desirable that the checkpoints and the output be in CSV format.
  • The original source file is in JSON records format. The data is fine; some records have more fields than others, which is pretty normal.
  • Phaser wants to save a copy of the source file to the working directory immediately, with line numbers, so that it can run diffs later if asked and detect deleted or changed rows.

Without special logic, this fails, because the library that saves to CSV stumbles over the extra fields in some JSON records and throws a ValueError.
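
The failure can be reproduced outside phaser. This is a minimal sketch, assuming the CSV save goes through the standard library's `csv.DictWriter` with the header taken from the first record; the issue doesn't name the library. `save_as_csv` and the sample records are hypothetical, not phaser code.

```python
import csv
import io

# Hypothetical sparse records: the second has a field the first lacks.
records = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta", "note": "extra field"},
]


def save_as_csv(rows):
    """Write rows as CSV, taking the header from the first row only."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    # DictWriter's default extrasaction="raise" rejects the unexpected "note" key.
    writer.writerows(rows)
    return buffer.getvalue()


try:
    save_as_csv(records)
    error = None
except ValueError as exc:
    error = str(exc)
```

In this sketch, records that are missing a field are not the problem: `DictWriter` fills absent keys with its `restval`. Only keys that are not in `fieldnames` trigger the `ValueError`.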

Some possibilities:

  • If saving the source copy fails, proceed without fixing it. Diffs won't work later, but the rest of the pipeline could still run.
  • Go through the data and make sure each dict has every field that appears in any dict, before or as we pass it to the CSV writer, so that saving as CSV with row numbers works.
  • Raise the failure and suggest that the user do... what? Switch to default JSON save behavior, which they might not want? Fix the data before bringing it into phaser, when fixing the data IN phaser is the whole point?
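
The second possibility could look roughly like the sketch below, again assuming the standard library's `csv.DictWriter`. The names `collect_fieldnames`, `save_sparse_as_csv` and the `__row_num__` column are hypothetical, not phaser's actual API. It also assumes the records don't already carry a row-number key.

```python
import csv
import io


def collect_fieldnames(rows):
    """Return every key seen in any row, in first-seen order."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)


def save_sparse_as_csv(rows, out, row_num_field="__row_num__"):
    """Write sparse dicts as CSV with a 1-based row-number column.

    Assumes no record already contains row_num_field.
    """
    fieldnames = [row_num_field] + collect_fieldnames(rows)
    # restval fills in any field a given record lacks.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        writer.writerow({row_num_field: number, **row})


out = io.StringIO()
save_sparse_as_csv([{"id": 1, "name": "alpha"}, {"id": 2, "note": "x"}], out)
```

With this approach the dicts themselves don't need to be modified, since the writer only needs the union of field names up front. The cost is a full pass over the data before the first row is written. Once saved, a missing field is also indistinguishable from an empty string in the CSV.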

We'll have this question again for saving data between phases, won't we?

lisad added the question label on Dec 17, 2024