Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As a user, I want to receive a warning if records in file are greater than records value specified in label #535

Closed
Tracked by #717
mit3ch opened this issue Aug 22, 2022 · 19 comments Β· Fixed by #686
Closed
Tracked by #717
Assignees

Comments

@mit3ch
Copy link

mit3ch commented Aug 22, 2022

πŸ§‘β€πŸ”¬ User Persona(s)

Data Engineer

πŸ’ͺ Motivation

...so that I can know when the records value may be invalid

πŸ“– Additional Details

Validate does not, but should, give a warning if "records" in label is less than the actual records in a table. The attached pair, uvis_euv_2008_003_solar_time_series_ingress, passes, but should give a warning.

βš–οΈ Acceptance Criteria

Given
When I perform
Then I expect

βš™οΈ Engineering Details

insufficient content validation.zip

@jordanpadams jordanpadams self-assigned this Aug 22, 2022
@jordanpadams
Copy link
Member

@rchenatjpl can you check this out for us?

@jordanpadams jordanpadams added needs:triage bug Something isn't working labels Aug 22, 2022
@jordanpadams jordanpadams changed the title Validate (both 2.1.4 & 2.3.0) insufficient content validation for number of records in a table Validate insufficient content validation for number of records in a table Aug 22, 2022
@jordanpadams
Copy link
Member

Validate does not, but should, give a warning if "records" in label is less than the actual records in a table.
The attached pair, uvis_euv_2008_003_solar_time_series_ingress, passes, but should give a warning.

@mit3ch unfortunately, we will think about this, but we have received the exact opposite request in the future because PDS4 does not preclude someone from putting a footer on a data object. Anne is actually about to submit an SCR to no longer allow that, but until then, it is difficult for validate to make these guesses on a case by case basis.

per the record count issue, we will take a look

@mit3ch
Copy link
Author

mit3ch commented Aug 22, 2022 via email

@rchenatjpl
Copy link
Contributor

In case this is still relevant, validate is reporting weirdly for Mitch's #2 in the initial issue. The data file has 1 header record and 1316 data records. The .xml has Table_Character/records = 1322. Then, as Mitch wrote, validate's error message says there are 1318 records. Actually, the weirdness is a little deeper:
Table_Character/records, #records that validate says it read
1317, 1316 (correct)
1318, 1316 (correct)
1319, 1318 (wrong)
1320, 1318 (wrong)
1321, 1318 (wrong)
1322, 1318 (wrong, as Mitch reported)
9999, 1318 (wrong)

@Mitch, your label uses dictionary PDS4_RINGS_1G00_1B00.sch and .xsd. That one is not on the PDS web site, which presumably didn't affect this stuff.

@jordan, I think that's it for me on this issue, and it may already be moot. If you want me to look at something else, please say so.

@jordanpadams
Copy link
Member

thanks @rchenatjpl . we will take a look

@jordanpadams
Copy link
Member

jordanpadams commented Aug 23, 2022

@mit3ch maybe this is something we can add to the new SCR Anne just created, CCB-353, as what validate should do for older data, and then have the model handle this better in the future.

@jordanpadams jordanpadams changed the title Validate insufficient content validation for number of records in a table As a user, I want to receive a warning if records in file are greater than records value specified in label Dec 2, 2022
@jordanpadams jordanpadams added requirement New requirements p.should-have icebox and removed needs:triage bug Something isn't working labels Dec 2, 2022
@jordanpadams jordanpadams removed their assignment Dec 2, 2022
@jordanpadams
Copy link
Member

@mit3ch I split this original ticket out into 2, the new requirement desired is here and the bug your identified is here: #568

@mace-space
Copy link

mace-space commented Dec 21, 2022

I’ve written a simple Python script that compares each label and data file pair for the UVIS solar ring occs bundle. As @mit3ch mentioned, we know the data file contains a header in the first record, followed by the table, and nothing else. The script simply counts lines in the data file, subtracts one, and compares it to the value for records in the corresponding label (<Table_Character>.<records>); printing out the file name and the two record counts, and warning for any that don't match.

I'm happy to be contacted if you'd like to discuss this further.

@al-niessner
Copy link
Contributor

@jordanpadams

Is this really about un-described data at the end of a file more than about an extra record? I get that the specific scenario was extra record of same format etc but reading past the described end of the table is fraught with peril and if the real desire is a warning for data at the end of the file that is not described then it is simpler and very doable. Saying that you have extra rows corresponding to previously described table seems like a niche of the general case of unexplained data in general. I presume there is a way to say there 10 bytes of reserved (no details provided) data at the end of the file and that it may be variable.

@jordanpadams
Copy link
Member

@al-niessner I agree with your thoughts. Let's just output a warning there are X undescribed bits at end of file. I agree reading past the end of the table and assuming it is still that specific table introduces all kinds of other possible issues.

@al-niessner
Copy link
Contributor

@jordanpadams

Sorry, this is just not working out. Can you have a file area observational in a label where table A points and covers 80% of it then where table B covers 40% of it (yes, overlap)? Even without the overlap, since validate processes each table independently, it makes it impossible for the current code to make extraneous data assessment. It would have to be a different rule in the label chain that processed all content at once to identify not described portions of files. In other words, have to approach the table+array in file problem too.

@jordanpadams
Copy link
Member

@al-niessner

Can you have a file area observational in a label where table A points and covers 80% of it then where table B covers 40% of it (yes, overlap)?

That is invalid. We should catch that when the offset of table B overlaps with the end of table A, no? Can we use the singleton in the code to pass around where we are at in the data file in terms of bytes read?

@al-niessner
Copy link
Contributor

@jordanpadams

I do not think that we catch table A and table B overlapping and certainly not Table A and Array A in the same file. The bit mapping with offsets is done for each table without consideration to other tables in the same file area and certainly not array area. I see no reason why we would not allow Fred's favorite table to be all the odd columns and Cindy's favorite to be all the even columns. They would be used for two different use cases like engineering analysis vs science analysis. The two tables would cover the same data block in the file but give different views or perspectives of that data.

Suppose could use a max with the singleton to know the high water mark - out of order is a warning #683 not an error - but it does seem fraught with unknown peril. I will look at it.

@al-niessner
Copy link
Contributor

@jordanpadams

Can we make this an INFO instead of warning? Detecting this breaks nearly every regression check that we have -- means that almost all of our example data from other users and sources have data at the end of files that is not described in the file area. Since our regression tests check for errors and warnings then using INFO would be safe. I just think that all the users having validate suddenly inundate them with warnings about stuff they know and are alright with is going to cause us to undo these changes.

We could add a switch --fully-described that would enable these checks for those that have labels that qualify for this level of detail checking.

@rchenatjpl
Copy link
Contributor

@al-niessner @jordanpadams Regarding overlap, the Standards section 2B.1.1 says: "PDS requires that each digital object be physically distinct and contiguous in a single file; objects may not overlap." It then has a good example concerning RGB. So please throw an error if two Tables or Arrays or whatever overlap. And in case it's on the table, I'm against adding this as a message that only shows up with 'validate -v1'. That flag generates way too much output, and overlap is something I'd really want to call out in a review.

@al-niessner
Copy link
Contributor

@jordanpadams @rchenatjpl

Sorry for the confusion that I created. Overlaps are errors and will remain so without adding extra flags. They are checked, just not where I expected. It is possible to circumvent this rule but may be caught elsewhere.

The INFO or extra flag is just for detecting extraneous bytes at the end of files which appears to be very common currently. I would prefer the new switch or flag on the command line with validate using warnings or errors myself but have to give both options.

@rchenatjpl
Copy link
Contributor

@al-niessner I probably read the earlier stuff lazily. What you just wrote sounds fine. Thanks.

@jordanpadams
Copy link
Member

@al-niessner I like the flag idea. Let's go for that.

@mit3ch
Copy link
Author

mit3ch commented Aug 28, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants