As a user, I want to receive a warning if records in file are greater than records value specified in label #535
Comments
@rchenatjpl can you check this out for us? |
@mit3ch unfortunately, we will think about this, but we have received the exact opposite request in the past, because PDS4 does not preclude someone from putting a footer on a data object. Anne is actually about to submit an SCR to no longer allow that, but until then, it is difficult for validate to make these guesses on a case-by-case basis. Per the record count issue, we will take a look. |
Hi Jordan,
I understand that argument, sort of. If a data file has content after the table, and that content is described in the label there is no problem. If it is not described in the label, then there might be a problem; hence my suggestion that validate provide a warning, not an Error message.
In the files I'm looking at, the additional records are immediately adjacent to the table records, in exactly the same format, and are not described in the label. At this time, the only way I can verify that every label which passes validate accurately describes the table is to open each label and data file pair and compare them. At RMS, we're now looking at writing a Python script to do just that for this bundle. We know the data file contains a header in the first record, followed by the table, and nothing else. The script will probably just count records in the data file, subtract one, compare the result to the value for records in the label, and print out the file name and the two numbers, highlighting any that don't match.
Once you implemented content validation (thank you by the way), it has been obvious that if the value for records is greater than the actual number of records validate identifies an error. I'm concerned that it was not equally obvious, at least to me, that instances where the value for records is less than the actual number of records are not flagged. I'm concerned that I or others may have previously archived tables which may have this same undetected flaw.
Is there a review mechanism (discuss in DDWG?) we can activate before putting this in the do-it-never pile?
Mitch
Dr. Mitch Gordon
SETI Institute
Deputy Manager
PDS Ring-Moon Systems Node
276-393-8822
Pronouns: he, him, his
|
In case this is still relevant, validate is reporting weirdly for Mitch's #2 in the initial issue. The data file has 1 header record and 1316 data records. The .xml has Table_Character/records = 1322. Then, as Mitch wrote, validate's error message says there are 1318 records. Actually, the weirdness is a little deeper: @Mitch, your label uses dictionary PDS4_RINGS_1G00_1B00.sch and .xsd. That one is not on the PDS web site, which presumably didn't affect this stuff. @jordan, I think that's it for me on this issue, and it may already be moot. If you want me to look at something else, please say so. |
thanks @rchenatjpl . we will take a look |
@mit3ch maybe this is something we can add to the new SCR Anne just created, CCB-353, as what validate should do for older data, and then have the model handle this better in the future. |
I've written a simple Python script that compares each label and data file pair for the UVIS solar ring occs bundle. As @mit3ch mentioned, we know the data file contains a header in the first record, followed by the table, and nothing else. The script simply counts lines in the data file, subtracts one, and compares it to the value for records in the corresponding label. I'm happy to be contacted if you'd like to discuss this further. |
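For readers who want to try the same check, here is a minimal sketch of the approach described above. It is not the script from this comment. It assumes one header record, line-delimited records, and that the first `records` element in the label belongs to the table. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def label_records(label_path):
    """Return the first <records> value in a PDS4 label, ignoring XML namespaces."""
    for elem in ET.parse(label_path).iter():
        # Tags look like "{namespace}records"; keep only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "records":
            return int(elem.text.strip())
    raise ValueError(f"no <records> element in {label_path}")


def count_data_records(data_path, header_records=1):
    """Count line-delimited records in the data file, excluding the header."""
    with open(data_path, "rb") as fh:
        total = sum(1 for _ in fh)
    return total - header_records


def compare(label_path, data_path):
    """Return one report line, flagging a label/file record count mismatch."""
    expected = label_records(label_path)
    actual = count_data_records(data_path)
    flag = "" if expected == actual else "  <-- MISMATCH"
    return f"{Path(data_path).name}: label={expected} file={actual}{flag}"
```

Calling `compare("x.xml", "x.tab")` for every pair in a bundle and printing the result gives a quick list of labels to inspect by hand.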
Is this really about undescribed data at the end of a file more than about an extra record? I get that the specific scenario was an extra record of the same format, etc., but reading past the described end of the table is fraught with peril. If the real desire is a warning for data at the end of the file that is not described, then it is simpler and very doable. Extra rows corresponding to a previously described table seem like a niche of the general case of unexplained data. I presume there is a way to say there are 10 bytes of reserved (no details provided) data at the end of the file and that it may be variable. |
@al-niessner I agree with your thoughts. Let's just output a warning that there are X undescribed bits at the end of the file. I agree that reading past the end of the table and assuming it is still that specific table introduces all kinds of other possible issues. |
Sorry, this is just not working out. Can you have a file area observational in a label where table A points to and covers 80% of it, and then table B covers 40% of it (yes, overlap)? Even without the overlap, since validate processes each table independently, the current code cannot assess extraneous data. It would have to be a different rule in the label chain that processes all content at once to identify undescribed portions of files. In other words, we have to approach the table+array in file problem too. |
That is invalid. We should catch that when the offset of table B overlaps with the end of table A, no? Can we use the singleton in the code to pass around where we are at in the data file in terms of bytes read? |
I do not think that we catch table A and table B overlapping, and certainly not Table A and Array A in the same file. The bit mapping with offsets is done for each table without consideration of other tables in the same file area, and certainly not the array area. I see no reason why we would not allow Fred's favorite table to be all the odd columns and Cindy's favorite to be all the even columns. They would be used for two different use cases, like engineering analysis vs. science analysis. The two tables would cover the same data block in the file but give different views or perspectives of that data. I suppose we could use a max with the singleton to know the high-water mark (out of order is a warning per #683, not an error), but it does seem fraught with unknown peril. I will look at it. |
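The high-water-mark idea can be sketched roughly as follows. This is an illustration only, not validate's code. The `HighWaterMark` class and its method names are hypothetical, and offsets are assumed to be zero-based byte offsets.

```python
class HighWaterMark:
    """Track the furthest described byte across all objects in one data file."""

    def __init__(self):
        self.end = 0  # one past the last described byte seen so far

    def record(self, offset, length):
        # Objects may be reported in any order, so keep the maximum end.
        self.end = max(self.end, offset + length)

    def trailing_bytes(self, file_size):
        """Bytes after the last described object (0 if none)."""
        return max(0, file_size - self.end)


hwm = HighWaterMark()
hwm.record(100, 50)  # second table, seen first
hwm.record(0, 100)   # first table, seen second
print(hwm.trailing_bytes(160))
```

Because only the maximum is kept, the result does not depend on the order in which tables and arrays are processed. It only sees bytes after the last object, not gaps between objects.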
Can we make this an INFO instead of a warning? Detecting this breaks nearly every regression check that we have, which means that almost all of our example data from other users and sources have data at the end of files that is not described in the file area. Since our regression tests check for errors and warnings, using INFO would be safe. I just think that if validate suddenly inundates users with warnings about stuff they know about and are alright with, it is going to cause us to undo these changes. We could add a switch |
@al-niessner @jordanpadams Regarding overlap, the Standards section 2B.1.1 says: "PDS requires that each digital object be physically distinct and contiguous in a single file; objects may not overlap." It then has a good example concerning RGB. So please throw an error if two Tables or Arrays or whatever overlap. And in case it's on the table, I'm against adding this as a message that only shows up with 'validate -v1'. That flag generates way too much output, and overlap is something I'd really want to call out in a review. |
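As an illustration of the overlap rule quoted above, a check over all objects in one file might look like the sketch below. It assumes each object is reduced to a hypothetical `(name, offset, length)` tuple with zero-based byte offsets. It does not show how validate implements the check.

```python
def find_overlaps(objects):
    """Return pairs of object names whose byte ranges overlap.

    objects: list of (name, offset, length) tuples for one data file.
    """
    ordered = sorted(objects, key=lambda o: o[1])
    overlaps = []
    for i, (name_a, off_a, len_a) in enumerate(ordered):
        end_a = off_a + len_a
        for name_b, off_b, _ in ordered[i + 1:]:
            if off_b >= end_a:
                break  # sorted by offset, so no later object can overlap A
            overlaps.append((name_a, name_b))
    return overlaps
```

With the earlier example scaled to a 100-byte file, `find_overlaps([("A", 0, 80), ("B", 60, 40)])` reports the pair `("A", "B")`, while two adjacent objects such as `("A", 0, 50)` and `("B", 50, 50)` report nothing.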
Sorry for the confusion that I created. Overlaps are errors and will remain so without adding extra flags. They are checked, just not where I expected. It is possible to circumvent this rule but may be caught elsewhere. The INFO or extra flag is just for detecting extraneous bytes at the end of files which appears to be very common currently. I would prefer the new switch or flag on the command line with validate using warnings or errors myself but have to give both options. |
@al-niessner I probably read the earlier stuff lazily. What you just wrote sounds fine. Thanks. |
@al-niessner I like the flag idea. Let's go for that. |
Hi all,
I've been out of touch for a while and am just catching up. Thanks for giving this some thought; I hadn't realized the full potential of the can of worms I was opening. Thanks to Richard for ensuring we throw an error for overlapping objects.
I know that Jordan has closed this and I can live with the current solution, but I'd really prefer a warning rather than info for additional bytes at the end of the file. I thought about totaling the bytes of all of the objects and comparing to the file total, but there is no rule preventing extraneous bytes between objects.
This originally came up because a provider described a table indicating fewer records than the table actually contained. I've been thinking of a split message: an info message if there are fewer than some threshold number of extraneous bytes (~10) at the end, and a warning if there are more than the threshold number of extraneous bytes at the end. Clearly, I'm thinking of just the last table, but it would be possible for the miscount of records to occur in the description of any table and not trigger a message if the following object's start_byte was entered correctly.
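The split message could be sketched like this. The function name and the default threshold of 10 are illustrative. Treating a count exactly equal to the threshold as a warning is an assumption, since the proposal leaves that case open.

```python
def classify_trailing_bytes(trailing, threshold=10):
    """Pick a severity for undescribed bytes at the end of a file.

    Returns None when there is nothing to report, "INFO" below the
    threshold, and "WARNING" at or above it (the equal case is an assumption).
    """
    if trailing <= 0:
        return None
    return "INFO" if trailing < threshold else "WARNING"
```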
My initial problem is addressed by the current solution - information that there are undescribed bytes at the end of the file. A more complete (and perhaps not achievable now) solution would be a message identifying all undescribed bytes, something like:
Undescribed bytes: (start_byte / number of undescribed bytes) 128/45, 422/2, 877/200, ...
This would probably be easier to determine if we required that all objects be described in the label in the same order as they appear in the data file, but that was not approved by the DDWG. Sorry.
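A rough sketch of such a report follows. It assumes each object is known only by a hypothetical `(offset, length)` pair with zero-based byte offsets, and it sorts the objects first, so the label does not need to list them in file order. The object sizes in the usage note are invented so that the output reproduces the example message.

```python
def undescribed_ranges(objects, file_size):
    """Return (start, count) for every run of bytes no object describes.

    objects: iterable of (offset, length) pairs, in any order.
    """
    gaps = []
    cursor = 0  # first byte not yet known to be described
    for offset, length in sorted(objects):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < file_size:
        gaps.append((cursor, file_size - cursor))
    return gaps


def format_gaps(gaps):
    """Render gaps in the start/count style of the example message."""
    return "Undescribed bytes: " + ", ".join(f"{s}/{n}" for s, n in gaps)
```

For a 1077-byte file with objects at `(0, 128)`, `(173, 249)` and `(424, 453)`, `format_gaps(undescribed_ranges(...))` yields `Undescribed bytes: 128/45, 422/2, 877/200`.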
Regards,
Mitch
Dr. Mitch Gordon
SETI Institute
Senior Astronomical Archivist
PDS Ring-Moon Systems Node
276-393-8822
Pronouns: he, him, his
Closed #535 as completed via #686.
|
User Persona(s)
Data Engineer
Motivation
...so that I can know when the records value may be invalid
Additional Details
Validate does not, but should, give a warning if "records" in label is less than the actual records in a table. The attached pair, uvis_euv_2008_003_solar_time_series_ingress, passes, but should give a warning.
Acceptance Criteria
Given
When I perform
Then I expect
Engineering Details
insufficient content validation.zip