Save of source file can cause problems with sparse JSON #183

lisad · 2024-12-17T23:06:58Z

What's the right behavior for phaser in this situation?

The pipeline defaults to CSV format. It's desirable that the checkpoints and the output be in CSV format.
The original source file is in JSON records format. The data is fine; some records have more fields than others, which is pretty normal.
Phaser wants to save a copy of the source file immediately to the working directory, with line numbers, in order to be able to do diffs later if asked and detect deleted/changed rows.

Without special logic, this fails, because the library that saves to CSV stumbles over the extra fields in some JSON records and throws a ValueError.
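For anyone who wants to reproduce this, here is a minimal sketch. It assumes the CSV writer behaves like Python's standard `csv.DictWriter`, which may not be exactly what phaser uses internally. The `records` data and the `write_csv_naive` helper are made up for illustration.

```python
import csv
import io

# Hypothetical sparse records: the second one has a field the first lacks.
records = [
    {"id": 1, "name": "apple"},
    {"id": 2, "name": "pear", "color": "green"},
]


def write_csv_naive(rows):
    """Write rows using only the first record's keys as the CSV header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    # DictWriter defaults to extrasaction="raise", so a row with a key
    # that is not in fieldnames raises ValueError.
    writer.writerows(rows)
    return buffer.getvalue()


try:
    write_csv_naive(records)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The header is taken from the first record only, so the later record's extra `color` field is what trips the writer.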

Some possibilities:

If saving the source copy fails, proceed without fixing it. Diffs will not work later, but the rest of the pipeline could still run.
Go through the data and ensure each dict has every field that appears in any dict, before or as we pass the data to the CSV writer, so that saving as CSV with row numbers works.
Raise the failure and suggest that the user do... what? Switch to the default JSON save behavior, which they might not want? Fix the data before bringing it into phaser, when fixing the data IN phaser is the whole point?
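The second possibility could look roughly like the sketch below. It is only an illustration under assumptions: it uses the standard library `csv.DictWriter` rather than whatever phaser actually calls, and the names `collect_fieldnames`, `write_source_copy`, and the `__line_number__` column are hypothetical.

```python
import csv
import io


def collect_fieldnames(rows):
    """Return every key seen in any row, in first-seen order."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)


def write_source_copy(rows, out, line_number_field="__line_number__"):
    """Write rows as CSV with a line-number column, padding missing fields.

    Assumes no record already uses line_number_field as a key.
    """
    fieldnames = [line_number_field] + collect_fieldnames(rows)
    # restval fills in fields that a given row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        writer.writerow({line_number_field: number, **row})


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "apple"},
        {"id": 2, "name": "pear", "color": "green"},
    ]
    buffer = io.StringIO()
    write_source_copy(sample, buffer)
    print(buffer.getvalue())
```

Two trade-offs with this approach: it needs a full pass over the data before the header can be written, and once saved, a field that was missing from a record looks the same as a field that was present but empty, which may matter for the later diffs.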

We'll have this question again for saving data between phases, won't we?

lisad added the "question" label (Further information is requested) on Dec 17, 2024
