As a user, I want to receive a warning if records in file are greater than records value specified in label #535

mit3ch · 2022-08-22T16:28:00Z

🧑‍🔬 User Persona(s)

Data Engineer

💪 Motivation

...so that I can know when the records value may be invalid

📖 Additional Details

Validate does not, but should, give a warning if "records" in label is less than the actual records in a table. The attached pair, uvis_euv_2008_003_solar_time_series_ingress, passes, but should give a warning.

⚖️ Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

insufficient content validation.zip

jordanpadams · 2022-08-22T18:52:07Z

@rchenatjpl can you check this out for us?

jordanpadams · 2022-08-22T18:56:12Z

Validate does not, but should, give a warning if "records" in label is less than the actual records in a table.
The attached pair, uvis_euv_2008_003_solar_time_series_ingress, passes, but should give a warning.

@mit3ch unfortunately, we will think about this, but we have received the exact opposite request in the past because PDS4 does not preclude someone from putting a footer on a data object. Anne is actually about to submit an SCR to no longer allow that, but until then, it is difficult for validate to make these guesses on a case by case basis.

per the record count issue, we will take a look

mit3ch · 2022-08-22T21:11:58Z

Hi Jordan,

I understand that argument, sort of. If a data file has content after the table, and that content is described in the label, there is no problem. If it is not described in the label, then there might be a problem; hence my suggestion that validate provide a warning, not an error message. In the files I'm looking at, the additional records are immediately adjacent to the table records, in exactly the same format, and are not described in the label.

At this time, the only way I can verify that every label which passes validate accurately describes the table is to open each label and data file pair and compare them. At RMS, we're now looking at writing a Python script to do just that for this bundle. We know the data file contains a header in the first record, followed by the table, and nothing else. The script will probably just count records in the data file, subtract one, compare the result to the value for records in the label, and print out the file name and the two numbers, highlighting any that don't match.

Since you implemented content validation (thank you, by the way), it has been obvious that if the value for records is greater than the actual number of records, validate identifies an error. I'm concerned that it was not equally obvious, at least to me, that instances where the value for records is less than the actual number of records are not flagged. I'm concerned that I or others may have previously archived tables which may have this same undetected flaw.

Is there a review mechanism (discuss in DDWG?) we can activate before putting this in the do-it-never pile?

Mitch

Dr. Mitch Gordon
SETI Institute
Deputy Manager, PDS Ring-Moon Systems Node

rchenatjpl · 2022-08-22T21:56:03Z

In case this is still relevant, validate is reporting weirdly for Mitch's #2 in the initial issue. The data file has 1 header record and 1316 data records. The .xml has Table_Character/records = 1322. Then, as Mitch wrote, validate's error message says there are 1318 records. Actually, the weirdness is a little deeper:
Table_Character/records, #records that validate says it read
1317, 1316 (correct)
1318, 1316 (correct)
1319, 1318 (wrong)
1320, 1318 (wrong)
1321, 1318 (wrong)
1322, 1318 (wrong, as Mitch reported)
9999, 1318 (wrong)

@Mitch, your label uses dictionary PDS4_RINGS_1G00_1B00.sch and .xsd. That one is not on the PDS web site, which presumably didn't affect this stuff.

@jordan, I think that's it for me on this issue, and it may already be moot. If you want me to look at something else, please say so.

jordanpadams · 2022-08-23T13:52:09Z

thanks @rchenatjpl . we will take a look

jordanpadams · 2022-08-23T13:58:58Z

@mit3ch maybe this is something we can add to the new SCR Anne just created, CCB-353, as what validate should do for older data, and then have the model handle this better in the future.

jordanpadams · 2022-12-02T19:26:58Z

@mit3ch I split this original ticket into two: the new requirement is here, and the bug you identified is here: #568

mace-space · 2022-12-21T19:41:51Z

I’ve written a simple Python script that compares each label and data file pair for the UVIS solar ring occs bundle. As @mit3ch mentioned, we know the data file contains a header in the first record, followed by the table, and nothing else. The script simply counts lines in the data file, subtracts one, and compares it to the value for records in the corresponding label (<Table_Character>.<records>); printing out the file name and the two record counts, and warning for any that don't match.
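
The script itself is not attached to this thread. A minimal sketch of the check described above might look like the following; the function names are hypothetical, and it assumes one header record, one `Table_Character` per label, and newline-delimited records:

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def declared_records(label_xml: bytes) -> int:
    """Return <records> from the first Table_Character found in the label."""
    root = ET.fromstring(label_xml)
    for elem in root.iter():
        if local_name(elem.tag) == "Table_Character":
            for child in elem:
                if local_name(child.tag) == "records":
                    return int(child.text.strip())
    raise ValueError("no Table_Character/records element found")


def actual_records(data: bytes, header_records: int = 1) -> int:
    """Count lines in the data file and subtract the header record(s)."""
    return len(data.splitlines()) - header_records


def check_pair(name: str, label_xml: bytes, data: bytes) -> str:
    """Build a one-line report comparing the declared and actual counts."""
    declared = declared_records(label_xml)
    actual = actual_records(data)
    status = "OK" if declared == actual else "WARNING: mismatch"
    return f"{name}: label records={declared}, file records={actual} [{status}]"
```

In practice the two byte strings would come from reading each label and data file pair in the bundle, for example with `pathlib.Path.read_bytes()`.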

I'm happy to be contacted if you'd like to discuss this further.

al-niessner · 2023-08-17T15:03:04Z

@jordanpadams

Is this really about un-described data at the end of a file more than about an extra record? I get that the specific scenario was an extra record of the same format, etc., but reading past the described end of the table is fraught with peril, and if the real desire is a warning for data at the end of the file that is not described, then it is simpler and very doable. Saying that you have extra rows corresponding to a previously described table seems like a niche of the general case of unexplained data. I presume there is a way to say there are 10 bytes of reserved (no details provided) data at the end of the file and that it may be variable.

jordanpadams · 2023-08-17T15:16:16Z

@al-niessner I agree with your thoughts. Let's just output a warning there are X undescribed bits at end of file. I agree reading past the end of the table and assuming it is still that specific table introduces all kinds of other possible issues.
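
As an illustration of that idea (not validate's actual implementation), the number of undescribed bytes after a single fixed-width table can be derived from the label's offset, records, and record length. The function name and the 80-byte record length below are assumptions for the sketch:

```python
def trailing_undescribed_bytes(file_size: int, offset: int,
                               records: int, record_length: int) -> int:
    """Bytes left in the file after the described end of the table."""
    table_end = offset + records * record_length
    return max(0, file_size - table_end)


# Hypothetical file: one 80-byte header record plus 1316 data records,
# while the label only declares 1310 records.
file_size = 80 + 1316 * 80
extra = trailing_undescribed_bytes(file_size, offset=80,
                                   records=1310, record_length=80)
if extra:
    print(f"WARNING: {extra} undescribed bytes at end of file")
```

This reports a byte count only; it makes no attempt to interpret the extra bytes as further rows of the table.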

al-niessner · 2023-08-17T17:24:51Z

@jordanpadams

Sorry, this is just not working out. Can you have a file area observational in a label where table A points and covers 80% of it then where table B covers 40% of it (yes, overlap)? Even without the overlap, since validate processes each table independently, it makes it impossible for the current code to make extraneous data assessment. It would have to be a different rule in the label chain that processed all content at once to identify not described portions of files. In other words, have to approach the table+array in file problem too.

jordanpadams · 2023-08-17T21:32:12Z

@al-niessner

Can you have a file area observational in a label where table A points and covers 80% of it then where table B covers 40% of it (yes, overlap)?

That is invalid. We should catch that when the offset of table B overlaps with the end of table A, no? Can we use the singleton in the code to pass around where we are at in the data file in terms of bytes read?

al-niessner · 2023-08-19T16:17:20Z

@jordanpadams

I do not think that we catch table A and table B overlapping and certainly not Table A and Array A in the same file. The bit mapping with offsets is done for each table without consideration to other tables in the same file area and certainly not array area. I see no reason why we would not allow Fred's favorite table to be all the odd columns and Cindy's favorite to be all the even columns. They would be used for two different use cases like engineering analysis vs science analysis. The two tables would cover the same data block in the file but give different views or perspectives of that data.

I suppose we could use a max with the singleton to know the high water mark (out of order is a warning per #683, not an error), but it does seem fraught with unknown peril. I will look at it.
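
A minimal sketch of the high-water-mark idea, assuming each table or array is processed independently and reports its byte extent to a shared object. The class and method names are hypothetical and do not reflect validate's code:

```python
class HighWaterMark:
    """Track the furthest described byte seen so far in one data file."""

    def __init__(self) -> None:
        self._end = 0

    def record(self, offset: int, length: int) -> None:
        # max() keeps the mark correct even if objects arrive out of order.
        self._end = max(self._end, offset + length)

    def undescribed_tail(self, file_size: int) -> int:
        """Bytes between the high-water mark and the end of the file."""
        return max(0, file_size - self._end)


mark = HighWaterMark()
mark.record(offset=100, length=50)   # second object, processed first
mark.record(offset=0, length=100)    # first object, processed second
print(mark.undescribed_tail(file_size=200))
```

Because only the maximum is kept, this approach sees undescribed bytes at the end of the file but not gaps between objects.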

al-niessner · 2023-08-19T18:16:31Z

@jordanpadams

Can we make this an INFO instead of a warning? Detecting this breaks nearly every regression check that we have, which means that almost all of our example data from other users and sources have data at the end of files that is not described in the file area. Since our regression tests check for errors and warnings, using INFO would be safe. I just think that having validate suddenly inundate users with warnings about things they already know and are fine with is going to cause us to undo these changes.

We could add a switch --fully-described that would enable these checks for those that have labels that qualify for this level of detail checking.

rchenatjpl · 2023-08-21T17:39:52Z

@al-niessner @jordanpadams Regarding overlap, the Standards section 2B.1.1 says: "PDS requires that each digital object be physically distinct and contiguous in a single file; objects may not overlap." It then has a good example concerning RGB. So please throw an error if two Tables or Arrays or whatever overlap. And in case it's on the table, I'm against adding this as a message that only shows up with 'validate -v1'. That flag generates way too much output, and overlap is something I'd really want to call out in a review.
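
For illustration only, an overlap check across all objects in one file can be sketched by sorting their byte extents and comparing each start against the furthest end seen so far. The `Extent` type and `find_overlaps` function are hypothetical names, not part of validate:

```python
from typing import List, NamedTuple, Optional, Tuple


class Extent(NamedTuple):
    name: str
    offset: int   # 0-based byte offset of the object in the file
    length: int   # number of bytes the object covers


def find_overlaps(extents: List[Extent]) -> List[Tuple[str, str]]:
    """Return (earlier, later) name pairs where the later object starts
    before the furthest-reaching earlier object has ended."""
    overlaps = []
    furthest: Optional[Extent] = None
    for ext in sorted(extents, key=lambda e: e.offset):
        if furthest is not None and ext.offset < furthest.offset + furthest.length:
            overlaps.append((furthest.name, ext.name))
        if furthest is None or ext.offset + ext.length > furthest.offset + furthest.length:
            furthest = ext
    return overlaps


# The scenario raised earlier: in a 100-byte file, table A covers the
# first 80% and table B covers the last 40%.
print(find_overlaps([Extent("A", 0, 80), Extent("B", 60, 40)]))
```

Objects that merely touch, such as one ending at byte 80 and the next starting at byte 80, are not reported.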

al-niessner · 2023-08-22T16:17:52Z

@jordanpadams @rchenatjpl

Sorry for the confusion that I created. Overlaps are errors and will remain so without adding extra flags. They are checked, just not where I expected. It is possible to circumvent this rule but may be caught elsewhere.

The INFO or extra flag is just for detecting extraneous bytes at the end of files which appears to be very common currently. I would prefer the new switch or flag on the command line with validate using warnings or errors myself but have to give both options.

rchenatjpl · 2023-08-22T16:20:15Z

@al-niessner I probably read the earlier stuff lazily. What you just wrote sounds fine. Thanks.

jordanpadams · 2023-08-23T21:57:48Z

@al-niessner I like the flag idea. Let's go for that.

mit3ch · 2023-08-28T16:55:52Z

Hi all,

I've been out of touch for a while and am just catching up. Thanks for giving this some thought; I hadn't realized the full potential of the can of worms I was opening. Thanks to Richard for ensuring we throw an error for overlapping objects.

I know that Jordan has closed this and I can live with the current solution, but I'd really prefer a warning rather than info for additional bytes at the end of the file. I thought about totaling the bytes of all of the objects and comparing to the file total, but there is no rule preventing extraneous bytes between objects.

This originally came up because a provider described a table indicating fewer records than the table actually contained. I've been thinking of a split message: an info message if there are fewer than some threshold number of extraneous bytes (~10) at the end, and a warning if there are more than the threshold number. Clearly, I'm thinking of just the last table, but it would be possible for the miscount of records to occur in the description of any table and not trigger a message if the following object's start_byte was entered correctly.

My initial problem is addressed by the current solution: information that there are undescribed bytes at the end of the file. A more complete (and perhaps not achievable now) solution would be a message identifying all undescribed bytes, something like:

Undescribed bytes: (start/byte / number of Undescribed bytes) 128/45, 422/2, 877/200, ...

This would probably be easier to determine if we required that all objects be described in the label in the same order as they appear in the data file, but that was not approved by the DDWG. Sorry.

Regards,
Mitch

Dr. Mitch Gordon
SETI Institute
Senior Astronomical Archivist, PDS Ring-Moon Systems Node
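
The two suggestions in this comment, a full list of undescribed byte ranges and a threshold that separates info from warning, can be sketched as follows. This is an illustration under stated assumptions, not anything validate implements: the function names are hypothetical, offsets are taken as 0-based, and a tail exactly at the threshold is treated as info because the comment leaves that case open.

```python
from typing import Iterable, List, Optional, Tuple


def undescribed_ranges(file_size: int,
                       extents: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (start, count) for every byte range not covered by any
    (offset, length) extent, including gaps between objects."""
    gaps = []
    cursor = 0
    for offset, length in sorted(extents):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < file_size:
        gaps.append((cursor, file_size - cursor))
    return gaps


def tail_severity(extra_bytes: int, threshold: int = 10) -> Optional[str]:
    """Map a count of extraneous bytes to a message level (None if zero)."""
    if extra_bytes == 0:
        return None
    return "INFO" if extra_bytes <= threshold else "WARNING"


# Extents chosen so the gaps match the example numbers in the comment.
gaps = undescribed_ranges(1077, [(0, 128), (173, 249), (424, 453)])
print("Undescribed bytes: " + ", ".join(f"{s}/{n}" for s, n in gaps))
```

Sorting the extents first means the labels need not describe objects in file order for the gaps to be found.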

jordanpadams self-assigned this Aug 22, 2022

jordanpadams added needs:triage bug Something isn't working labels Aug 22, 2022

jordanpadams changed the title ~~Validate (both 2.1.4 & 2.3.0) insufficient content validation for number of records in a table~~ Validate insufficient content validation for number of records in a table Aug 22, 2022

jordanpadams changed the title ~~Validate insufficient content validation for number of records in a table~~ As a user, I want to receive a warning if records in file are greater than records value specified in label Dec 2, 2022

jordanpadams added requirement New requirements p.should-have icebox and removed needs:triage bug Something isn't working labels Dec 2, 2022

jordanpadams removed their assignment Dec 2, 2022

jordanpadams assigned al-niessner Aug 15, 2023

jordanpadams added B14.0 sprint-backlog and removed icebox labels Aug 15, 2023

al-niessner mentioned this issue Aug 19, 2023

New --complete-descriptions flag to warn for data not described by metadata in label #686

Merged

jordanpadams closed this as completed in #686 Aug 24, 2023

jordanpadams mentioned this issue Sep 29, 2023

Verify documentation for requirements #367, #388, #415, #462, #476, #482, #524, #535, #604, #605, #617 #717

Closed

11 tasks

jordanpadams removed the sprint-backlog label Aug 6, 2024
