
Marking error datasets and warnings/caveats #575

Closed
max-zilla opened this issue May 9, 2019 · 13 comments

@max-zilla
Contributor

Have an ERROR.txt or something in the dataset & on disk to indicate dataset should be skipped for processing.

@max-zilla max-zilla added this to the TERRA Sprint - April 2019 milestone May 9, 2019
@dlebauer
Member

dlebauer commented May 9, 2019

See terraref/reference-data#218

For data that a human has recognized as being in error (e.g. blurry FLIR data, point clouds clipped at some height):

  1. Add a text file named "ERROR" with optional content: an explanation, a pointer to a GitHub issue, or a set of key: value pairs, perhaps in YAML, that get parsed directly into JSON metadata.
  2. Have an extractor that finds these files and adds a tag such as "quality": { "ERROR": "TRUE", "description": "", "key2": "value2" }.
  3. Add a general rule, perhaps at the level of extractors or at the level of RabbitMQ, that says any time an error (file or flag) is found, skip processing the dataset.
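The skip rule in step 3 could be as simple as scanning a dataset's file list for a marker before processing. A minimal sketch, assuming the marker filenames ("ERROR", "ERROR.txt", "ERROR.yml") that have been floated in this thread rather than any agreed convention:

```python
# Filenames that mark a dataset as known-bad (assumed names, not a standard).
ERROR_MARKERS = {"ERROR", "ERROR.txt", "ERROR.yml"}

def should_skip(filenames):
    """Return True if the dataset's file list contains an error marker file."""
    return any(name in ERROR_MARKERS for name in filenames)
```

An extractor (or a RabbitMQ-level filter) would call this on the dataset's file listing and decline the message when it returns True.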

@dlebauer
Member

dlebauer commented May 9, 2019

For example (consider this a draft; should probably use a consistent / standard way of encoding this information), every FLIR dataset following May 2017 could have a file named "ERROR.yml" that contains:

quality:
  status: ERROR
  description: Sand and water contaminated FLIR Camera lens so temperature values are invalid  
  url: https://github.com/terraref/reference-data/issues/182
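A sketch of how an extractor might turn a flat ERROR.yml like the one above into the JSON metadata tag proposed in step 2. This hand-parses only the simple one-level "quality:" layout shown; a real implementation would likely use a YAML library such as PyYAML instead:

```python
def parse_error_yml(text):
    """Parse a flat 'quality:' block of key: value lines into a metadata dict."""
    quality = {}
    for line in text.splitlines()[1:]:  # skip the "quality:" header line
        if ":" in line:
            # maxsplit=1 keeps colons inside values (e.g. in URLs) intact
            key, value = line.split(":", 1)
            quality[key.strip()] = value.strip()
    return {"quality": quality}
```

The returned dict can be attached directly as dataset metadata, giving e.g. `{"quality": {"status": "ERROR", "description": "...", "url": "..."}}`.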

@max-zilla
Contributor Author

duplicate of #557

@max-zilla
Contributor Author

There are different severities of error: FLIR 2017 is the prominent example, but the stair-stepping in the laser3D data is not so cut-and-dried and may still contain valuable data.

@max-zilla max-zilla changed the title Marking error datasets Marking error datasets and warnings/caveats May 20, 2019
@max-zilla
Contributor Author

Proposed script will add a new metadata entry from the Maricopa Field user with a body like:

{
  "quality": "ERROR",
  "description": "Sand and water contaminated FLIR Camera lens so temperature values are invalid",
  "url": "https://github.com/terraref/reference-data/issues/182"
}

Other statuses could be WARNING, ADVISORY, etc. We would also write a corresponding YAML file to the Globus directory with those contents as suggested. Perhaps at the day level rather than repeated for the entire dataset? Or do we want it repeated at the dataset (timestamp) level?

@dlebauer
Member

We should also add a file named "ERROR" that contains the description and url to the affected dataset. I think repeating this at the dataset level will be good. There may be use for having a tag at a higher level, but that would be in addition to the dataset level flag.

@max-zilla
Contributor Author

My script is prepared to generate the YAML files & metadata; however, because the raw_data directories are owned by dlebauer, I am unable to write into them. We can discuss how to handle this during the meeting. Probably one of:

  • run as sudo and chmod the YAML files when created so they are owned by dlebauer, consistent with the others
  • have dlebauer run the script

@dlebauer
Member

I am glad that the raw data folder is locked down. I am not sure it makes sense for me to be the folder owner (as opposed to a user or group like ‘terraref’), but ... the idea is that we don’t touch the raw_data folder.

In the end, the key requirement is that any data that have known errors (or other issues) are clearly labeled as such.

It makes sense (at this point) to have to use sudo to touch the raw_data folder, if we should ever touch it at all. But maybe there is a ‘better’ way to handle this. Certainly none of the existing files should be touched, but allowing the same user that transfers the files to be able to create a new file would also seem reasonable.

For the FLIR, we did process the data to Level 1. Is the plan to also add an error file to the level 1 data?

@max-zilla
Contributor Author

Script is running now. Will close this when completed.

I would argue that in the FLIR case we don't add an error to the Level_1 data; the goal was to flag the raw data so that in the future we don't even process these erroneous datasets. I would also argue for deleting Level 1+ data from this time period for FLIR, to be consistent with that.

@max-zilla
Contributor Author

Standardized taxonomy for different error cases: ERROR that should not be sent through processing vs. WARNING vs. other classes.
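One possible shape for that taxonomy, where only ERROR blocks processing while lesser classes are recorded but let the pipeline proceed. The class names beyond ERROR and WARNING are illustrative assumptions, not an agreed standard:

```python
from enum import Enum

class Quality(Enum):
    ERROR = "ERROR"        # known-bad data; skip processing entirely
    WARNING = "WARNING"    # suspect data; process but flag the outputs
    ADVISORY = "ADVISORY"  # informational caveat only

def blocks_processing(status):
    """Only ERROR prevents a dataset from being sent through extractors."""
    return status is Quality.ERROR
```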

@max-zilla
Contributor Author

Max will write up some documentation/wiki once this is done to propose a standard approach to handling this.

@max-zilla
Contributor Author

Support the ability to explicitly define affected files (vs. the entire dataset); 'all' could be the default value.
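A sketch of that per-file scoping, assuming the flag carries a hypothetical "files" field whose default "all" marks the whole dataset:

```python
def affected_files(flag, dataset_files):
    """Return the subset of dataset_files covered by a quality flag.

    A missing "files" field (or the literal value "all") means the
    whole dataset is affected; otherwise only the listed files are.
    """
    scope = flag.get("files", "all")
    if scope == "all":
        return list(dataset_files)
    return [f for f in dataset_files if f in scope]
```

The schema here is a proposal to be pinned down in the follow-up issue, not a final format.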

@max-zilla
Contributor Author

Created #589 to follow this.
