Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/maccor conversion #7

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

erikamundson
Copy link
Contributor

@erikamundson erikamundson commented Oct 21, 2024

Conversion code for maccor csv. Test file is the first 25 cycles of one test from the UMich formation experiment.

For the conversion function, I assumed that the incoming dataframe would have a device_id and a test_id column. For the job, we can maybe accept a regex pattern in the job args to get this from the file name?

Other notes

  • Made some of the differential columns nullable in the telemetry schema since they'll be null on the first record.
  • There's a missing row in the CSV, do we want to accept the given record_number or re-index with a row_number() operation?
  • For the metadata column, I didn't cast any of the values but that should be fine.
  • The test is pretty simple (schema, row count, cycle number), let me know if there's anything else you think we should test for.
  • Added a cycler_type metadata entry with the value Maccor. Open to changing the name or value of this.

@erikamundson erikamundson requested a review from dyllamt October 21, 2024 18:09
@dyllamt
Copy link
Contributor

dyllamt commented Oct 21, 2024

is it possible to write out a schema, like we have for telemetry, statistics_step, etc.? that would make it more explicit what is expected, and could specify the partition scheme for the table, sort order, a comment for the table, and ensure compatibility of data files with the expected schema. hard to tell if the [maccor] columns are the ones that usually are included in maccor files, and the other columns are added in from the people running the experiment. do you think that's why they are named that way in the test file?

I was imaging something like this for the application:

  1. file bucket: file -> maccor
  2. incremental processing: maccor -> telemetry

For tests, i've been doing:

  1. schema validation
  2. value correctness for one or two rows
  3. null handling (easy
  4. empty frame testing

I think we can treat each of these test suppliers exactly the same as the harmonized schemas (telemetry etc.). Each file has an expected schema and a transformation function

@erikamundson
Copy link
Contributor Author

I added empty data handling, otherwise the tests have enough coverage IMO.

I was imaging something like this for the application:

  1. file bucket: file -> maccor
  2. incremental processing: maccor -> telemetry

I was thinking for the files, it makes more sense to go straight from file -> telemetry. If we want to include another step that's fine, but I don't see the benefit. I think the incremental processing should still work for steps/cycles aggregation.

I could come up with an expected schema for the CSV, although I think the VARIABLE columns and the auxiliary columns won't be expected to be the same for each test file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants