Use tag templates to validate tag metadata on taggables #79

Open
6 tasks done
cortadocodes opened this issue May 10, 2021 · 9 comments
@cortadocodes
Member

cortadocodes commented May 10, 2021

We currently have tags that can have any number of subtags. We are going to move to:

  • Tags with no subtags: added to a keywords/labels container attribute
  • Tags with one subtag: added as attributes to the Taggable instance
  • Tags with more than one subtag: disallowed
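As a rough sketch of the three rules above, the function below sorts old-style tag strings into labels and key-value pairs. It assumes subtags are separated by a colon (as in strings like "manufacturer:vestas"); the name `sort_tags` is hypothetical and not part of the SDK.

```python
def sort_tags(raw_tags, separator=":"):
    """Sort old-style tag strings into labels and key-value tags.

    Assumption: a subtag is whatever follows the separator in the string.
    """
    labels = set()
    tags = {}

    for raw_tag in raw_tags:
        parts = raw_tag.split(separator)

        if len(parts) == 1:
            # No subtags: the tag is a plain keyword/label.
            labels.add(parts[0])
        elif len(parts) == 2:
            # Exactly one subtag: treat it as a key-value pair.
            key, value = parts
            tags[key] = value
        else:
            # More than one subtag is disallowed under the new rules.
            raise ValueError(f"Tag {raw_tag!r} has more than one subtag.")

    return labels, tags


labels, tags = sort_tags(["met", "mast", "manufacturer:vestas"])
```

Values stay as strings here; converting them to typed values would be the job of the tag template.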

The tags should conform to a tag schema/template that includes:

  • Title
  • Description
  • Type: enum of different possible types

We may need a custom JSON parser for this.

To do:

  • Add schema for tags for datafiles
  • Remove the filter field from manifest files
  • Remove the multi-dataset enum from manifest schema
  • Decide on naming for tags, key-value pair (attributes), and keywords (or labels)
  • additionalProperties needs to be true for tags in the schema to allow extra metadata on files
  • Whether a key-value pair is required should be specified, but keywords should all be optional
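To make the last two points concrete, here is a minimal sketch of what a tag template and a check against it might look like. The template contents, the `validate_tags` name, and the hand-rolled type check are all assumptions for illustration; a real implementation would more likely hand the template to a JSON schema validator.

```python
# Hypothetical tag template in JSON-schema style: each entry has a title,
# a description and a type.
TAG_TEMPLATE = {
    "type": "object",
    "properties": {
        "manufacturer": {"title": "Manufacturer", "description": "Who made it", "type": "string"},
        "height": {"title": "Height", "description": "Height of the item", "type": "number"},
        "is_recycled": {"title": "Is recycled", "description": "Recycled or not", "type": "boolean"},
    },
    "required": ["manufacturer"],
    "additionalProperties": True,  # Allow extra metadata on files.
}

PYTHON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def validate_tags(tags, template):
    """Return a list of error messages (empty if the tags fit the template)."""
    errors = []
    properties = template.get("properties", {})

    for name in template.get("required", []):
        if name not in tags:
            errors.append(f"Missing required tag {name!r}.")

    for name, value in tags.items():
        if name not in properties:
            if not template.get("additionalProperties", True):
                errors.append(f"Unexpected tag {name!r}.")
            continue

        expected = properties[name]["type"]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected != "boolean"
        if bool_as_number or not isinstance(value, PYTHON_TYPES[expected]):
            errors.append(f"Tag {name!r} should be of type {expected!r}.")

    return errors
```

With this template, `{"manufacturer": "vestas", "colour": "white"}` passes because extra keys are allowed, while `{"height": "tall"}` fails twice: the required tag is missing and the height is not a number.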
@cortadocodes
Member Author

@thclark do we want to require that tag templates are provided in twine.json or make it optional? It won't be backwards compatible if we require it, but it will ensure this feature is used.

@thclark
Collaborator

thclark commented May 10, 2021

I think best not to require it. Some files may well not require metadata; in that event we don't really want people to have to add empty sections to twine.json.

@cortadocodes
Member Author

cortadocodes commented May 12, 2021

What do you think about this naming convention?

  • Tags - any key-value pair or keyword added to a file, of which:
    • Labels - keyword tags become labels (I prefer "label" over "keyword" because it's a verb as well as a noun - you can label a file but you can't keyword a file)
    • Custom attributes - key-value tags become custom attributes

An alternative to "tags" could be "custom metadata", which maybe better reflects that some of the tags become labels/keywords while others become attributes of the datafile.

@thclark
Collaborator

thclark commented May 12, 2021

Agree re labels rather than keywords, good thought about verb usage (and no longer ambiguous, now that GCS has moved to using "custom metadata" instead of "labels")

Is your suggestion we then retain "tags" as being a superset of labels and custom attributes? Like this:

metadata
    |-> fixed metadata (GCS stuff like content_type)
    |-> custom metadata
          |-> tags
                 |-> labels
                 |-> custom attributes

I'm slightly worried that "attribute" is a word that is meaningful for us, and for Python, but is likely not for an amateur programmer or someone who's come from e.g. MATLAB or C++.

What about using tags as an alternative to custom attributes?

metadata
    |-> fixed metadata (GCS stuff like content_type)
    |-> custom metadata
          |-> labels
          |-> tags

Note: The above are taxonomies, not object hierarchies (because of course tags/attributes would be expanded to live directly in custom GCS metadata)

@thclark
Collaborator

thclark commented May 12, 2021

Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.

@cortadocodes
Member Author

cortadocodes commented May 12, 2021

Some thoughts in reply:

  • On second thoughts, maybe we shouldn't use the phrase custom metadata because it would separately refer to GCS custom metadata and the superset {labels, tags}
  • If we use tags for key-value pairs, then I think we should rename the Tag class to Label or we'd have two different meanings for tag (and also taggable to labellable or something)
  • I wonder if tag really captures the key-value nature of the required tags? I suppose it's because my main language is Python, but attribute implies to me a key-value nature i.e. a name and a value

@cortadocodes
Member Author

> Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.

Namespace it in GCS?

@thclark
Collaborator

thclark commented May 12, 2021

> Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.

> Namespace it in GCS?

Yeah, not on the Datafile objects on our side.
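
One way the namespacing could work is sketched below: the prefix is applied only when building the metadata sent to GCS and stripped again when reading it back, so Datafile objects never see it. The function names and the exact prefix handling are assumptions, and the sketch works on plain dictionaries rather than any real storage client.

```python
NAMESPACE = "octue__"


def to_gcs_metadata(fixed, custom):
    """Merge fixed and custom metadata, prefixing only the fixed keys."""
    metadata = {f"{NAMESPACE}{key}": value for key, value in fixed.items()}

    # A custom key that already looks namespaced would be ambiguous on read.
    clashes = metadata.keys() & custom.keys()
    if clashes:
        raise ValueError(f"Custom metadata clashes with fixed metadata: {sorted(clashes)}")

    metadata.update(custom)
    return metadata


def from_gcs_metadata(metadata):
    """Split GCS metadata back into (fixed, custom), removing the prefix."""
    fixed = {}
    custom = {}

    for key, value in metadata.items():
        if key.startswith(NAMESPACE):
            fixed[key[len(NAMESPACE):]] = value
        else:
            custom[key] = value

    return fixed, custom


stored = to_gcs_metadata({"id": "123"}, {"manufacturer": "vestas"})
fixed, custom = from_gcs_metadata(stored)
```

A side effect of the prefix is that a user-supplied tag called `id` no longer collides with the fixed `id`.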

@cortadocodes
Member Author

cortadocodes commented May 13, 2021

New requirements

  • Twine schema fixes - move to different PR
  • Twine schema - in the manifest datafile level
    • Labels field - strings
    • Tag template field - JSON validated types
    • Ensure that the tag template items in $defs.tags_template.properties.properties are not of type array or object so that they are flat
  • twine.py
    • Validate each datafile in the dataset against the tag template schema
    • Put example tag template in public bucket exposed on octue domain
    • Locally mock the getting of the schema in the bucket
  • SDK:
    • Add tags to TagSet (i.e. don't add them as attributes to Datafile)
    • Be able to instantiate like:
      Datafile(path="here", id="123", tags={**stuff_from_somewhere}, labels=['one','two','three'])
      NB can reuse the instantiations from json src like for input_values etc (reduced number of ways to instantiate)
    • Add a Labelable mixin
    • Rewrite filterset to be compatible with new tags system
# Example filtering syntax
dataset.files.filter(tags__manufacturer__equals="vestas")
dataset.files.filter(labels__contains="mykeyword1")

# Equivalents in django
Datasets.objects.filter(files__id__equals="dfg")
Datasets.objects.filter(tags__manufacturer__equals="vestas")
Datasets.objects.filter(id__in=['one', 'two'], tags__manufacturer__equals="vestas")
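
The double-underscore syntax above could be parsed roughly as follows. This is only a sketch over plain dictionaries, not the SDK's filterset: the `filter_files` name, the small operator table, and the choice to treat a missing key as a non-match are all assumptions.

```python
# Hypothetical operator table; the real filterset may support many more.
OPERATORS = {
    "equals": lambda value, expected: value == expected,
    "contains": lambda value, expected: expected in value,
    "in": lambda value, expected: value in expected,
}

_MISSING = object()


def _resolve(datafile, path):
    """Follow a list of keys into nested dictionaries."""
    value = datafile
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return _MISSING
        value = value[key]
    return value


def filter_files(files, **conditions):
    """Keep the files that satisfy every `path__operator=expected` condition."""
    results = []

    for datafile in files:
        for lookup, expected in conditions.items():
            # The last segment is the operator; the rest is the path to the value.
            *path, operator = lookup.split("__")
            value = _resolve(datafile, path)
            if value is _MISSING or not OPERATORS[operator](value, expected):
                break
        else:
            # No condition failed for this file.
            results.append(datafile)

    return results


files = [
    {"id": "a", "labels": ["mykeyword1"], "tags": {"manufacturer": "vestas"}},
    {"id": "b", "labels": [], "tags": {"manufacturer": "other"}},
]
vestas_files = filter_files(files, tags__manufacturer__equals="vestas")
```

Because the path is split on every double underscore, tag names containing `__` themselves would need different handling.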

Example manifest contents:

{
    "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
    "datasets": [
        {
            "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
            "name": "my meteorological dataset",
            "tags": ["met", "mast", "wind"],
            "files": [
                {
                    "path": "input/datasets/7ead7669/file_1.csv",
                    "cluster": 0,
                    "sequence": 0,
                    "extension": "csv",
                    "labels": ["mykeyword1", "mykeyword2"],
                    "tags": {
                        "manufacturer": "vestas",
                        "height": 500,
                        "is_recycled": true,
                        "number_of_blades": 3
                    },
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "name": "file_1.csv"
                },
                {
                    "path": "input/datasets/7ead7669/file_1.csv",
                    "cluster": 0,
                    "sequence": 1,
                    "extension": "csv",
                    "tags": ["manufacturer:Zestas", "height:350", "is_recycled:true", "number_of_blades:3"],
                    "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
                    "name": "file_1.csv"
                }
            ]
        }
    ]
}
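
A manifest like the one above could be checked for flat key-value tags with something like the sketch below. It assumes the manifest has already been parsed into a dictionary; `find_invalid_tags` is a hypothetical helper, not part of twine.py. Note that the second file in the example still uses the list style for its tags, so a check like this would flag it.

```python
SCALAR_TYPES = (str, int, float, bool)


def find_invalid_tags(manifest):
    """Return (path, message) pairs for datafiles whose tags are not flat."""
    problems = []

    for dataset in manifest.get("datasets", []):
        for datafile in dataset.get("files", []):
            path = datafile.get("path")
            tags = datafile.get("tags", {})

            if not isinstance(tags, dict):
                problems.append((path, "tags is not a key-value mapping"))
                continue

            for key, value in tags.items():
                # Arrays and objects are not allowed as tag values.
                if not isinstance(value, SCALAR_TYPES):
                    problems.append((path, f"tag {key!r} is not a flat value"))

    return problems


manifest = {
    "datasets": [
        {
            "files": [
                {"path": "file_1.csv", "tags": {"manufacturer": "vestas", "height": 500}},
                {"path": "file_2.csv", "tags": ["manufacturer:vestas"]},
                {"path": "file_3.csv", "tags": {"size": {"height": 500}}},
            ]
        }
    ]
}
problems = find_invalid_tags(manifest)
```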
