hxl-yml-spec-to-hxl-json-spec: HXL Data processing specs exporter #14

Open
fititnt opened this issue Mar 12, 2021 · 5 comments
Labels
data-transformation https://en.wikipedia.org/wiki/Data_transformation proof-of-concept-already-exist Do exist proof of concept (or better) for this issue

Comments

@fititnt
Member

fititnt commented Mar 12, 2021

Quick links:


Let's do a proof of concept of the thing!

fititnt added a commit that referenced this issue Mar 13, 2021
@fititnt
Member Author

fititnt commented Mar 13, 2021

Maybe this

hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

and a file like this

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      source:
        - iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0
          filters:
            - filter: with_columns
              with_columns: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            - filter: without_rows
              without_rows: "#vocab+code+v_6391="

could be a good starting point. But from my experience with Ansible (and very, very large Ansible playbooks), we could allow parsing several YAML files at once from the start and just output all the JSON specs line by line.

But to get to that point, hdpcli needs to implement some way to at least concatenate more than one YAML file (the include_file options may be something for later).
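As a rough illustration of the "several YAML files in, one JSON document per line out" idea, here is a minimal sketch. It assumes the YAML files were already parsed into Python lists (so no YAML library is needed here); the function names `concat_hdp_files` and `recipes_as_json_lines` are hypothetical and not part of hdpcli, and the actual conversion of each recipe into an HXL JSON spec is left out.

```python
import json


def concat_hdp_files(parsed_files):
    """Concatenate already-parsed HDP YAML files (each one a list of hsilo items)."""
    merged = []
    for content in parsed_files:
        merged.extend(content)
    return merged


def recipes_as_json_lines(hsilos):
    """Emit one compact JSON document per hrecipe, one document per line."""
    lines = []
    for hsilo in hsilos:
        for hrecipe in hsilo.get("hrecipe", []):
            lines.append(json.dumps(hrecipe, sort_keys=True))
    return "\n".join(lines)


# Hypothetical stand-ins for the parsed content of two YAML files.
file_a = [{"hsilo": "test1", "hrecipe": [{"id": "recipe1"}]}]
file_b = [{"hsilo": "test2", "hrecipe": [{"id": "recipe2"}, {"id": "recipe3"}]}]

merged = concat_hdp_files([file_a, file_b])
output = recipes_as_json_lines(merged)
print(output)
```

One JSON document per line keeps the output easy to split with standard tools, at the cost of losing the pretty-printed layout.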

fititnt added a commit that referenced this issue Mar 13, 2021
…_init, HDP._safer_zone_hosts, HDP._safer_zone_list, HDP.export_schema_json(), HDP.export_yml()
fititnt added a commit that referenced this issue Mar 13, 2021
@fititnt
Member Author

fititnt commented Mar 13, 2021

This is the YAML file (there is some extra markup, but ignore it for now).

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | grep "^[^#;]"

---
- hsilo:
    name: "test1"
    desc: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      iri_example:
        - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          sheet_index: 1
      recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f

This is the json spec result

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

{
    "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
    "recipe": [
        {
            "aggregators": [
                "sum(population) as Population#population"
            ],
            "filter": "count",
            "patterns": "adm1+name,adm1+code"
        },
        {
            "filter": "clean_data",
            "number": "population",
            "number_format": ".0f"
        }
    ],
    "sheet_index": 1
}

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | hxlspec

ERROR (hxl.io): Skipping column(s) with malformed hashtag specs: #
Gov,Gov Pcode,Population
#adm1+name,#adm1+code,#population
Abyan,YE12,618892
Ad Dali',YE30,818507
Aden,YE24,1053455
Al Bayda,YE14,795107
Al Hodeidah,YE18,2996334
Al Jawf,YE16,633596
Al Maharah,YE28,175606
Al Mahwit,YE27,770920
Amran,YE29,1221908
Dhamar,YE20,2194159
Hadramawt,YE19,1551347
Hajjah,YE17,2630678
Ibb,YE11,3143818
Lahj,YE25,1076296
Ma'rib,YE26,1086663
Raymah,YE31,562930
Sa'dah,YE22,934201
Sana'a,YE23,1370798
Sana'a City,YE13,3296342
Shabwah,YE21,676408
Socotra,YE32,69004
Ta'iz,YE15,3104579

And redirecting the output to the hxlspec command-line tool actually worked. There is just a quick warning, but it worked!

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ pip3 show libhxl | grep Version

Version: 4.22
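The mapping from the YAML keys above to the JSON spec can be sketched roughly as follows. This is not the hdpcli implementation, only a guess at the shape of the transformation based on the input and output shown in this comment; `hrecipe_to_json_spec` is a hypothetical name, and the YAML is assumed to be already parsed into a Python dict.

```python
import json


def hrecipe_to_json_spec(hrecipe):
    """Build an HXL JSON processing spec from one parsed hrecipe item."""
    spec = {"recipe": hrecipe["recipe"]}
    examples = hrecipe.get("iri_example", [])
    if examples:
        # Assumption: only the first example source is exported.
        first = examples[0]
        spec["input"] = first["iri"]
        if "sheet_index" in first:
            spec["sheet_index"] = first["sheet_index"]
    return spec


hrecipe = {
    "id": "example-processing-with-a-JSON-spec",
    "iri_example": [
        {
            "iri": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
            "sheet_index": 1,
        }
    ],
    "recipe": [
        {
            "filter": "count",
            "patterns": "adm1+name,adm1+code",
            "aggregators": ["sum(population) as Population#population"],
        },
        {"filter": "clean_data", "number": "population", "number_format": ".0f"},
    ],
}

spec = hrecipe_to_json_spec(hrecipe)
print(json.dumps(spec, indent=4, sort_keys=True))
```

Sorting the keys when dumping is one way to get the alphabetical key order seen in the exported JSON.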

@fititnt fititnt added the proof-of-concept-already-exist Do exist proof of concept (or better) for this issue label Mar 13, 2021
@fititnt
Member Author

fititnt commented Mar 14, 2021

We need some way to define 'inline' data tables, which could also serve as a way to test whether an HXL Data processing spec is working (and this needs to work offline).

This implies adding some new attributes, in particular the concepts of inline data and expected result data. Or maybe the concept of an 'example'.
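A minimal sketch of what such an offline check could look like, assuming an example carries both an inline input table and an expected result table. The key names `fontem.datum` and `objectivum.datum` are the ones that appear later in this thread; `check_exemplum` and `keep_first_column` are hypothetical, and the runner is a toy stand-in for whatever engine would really apply the recipe.

```python
def check_exemplum(recipe, exemplum, run_recipe):
    """Return True/False for a self-contained example, None if it cannot run offline."""
    fontem = exemplum.get("fontem", {})
    objectivum = exemplum.get("objectivum", {})
    if "datum" not in fontem or "datum" not in objectivum:
        # No inline input or no expected result: nothing to compare offline.
        return None
    return run_recipe(recipe, fontem["datum"]) == objectivum["datum"]


def keep_first_column(recipe, table):
    """Toy stand-in for a real HXL engine: ignores the recipe, keeps column 0."""
    return [row[:1] for row in table]


exemplum = {
    "fontem": {"datum": [["#item +id", "#item +name"], ["ACME1", "ACME Inc."]]},
    "objectivum": {"datum": [["#item +id"], ["ACME1"]]},
}

result = check_exemplum([], exemplum, keep_first_column)
online_only = check_exemplum(
    [], {"fontem": {"iri": "https://example.org/data.csv"}}, keep_first_column
)
print(result, online_only)
```

Passing the runner in as a parameter keeps the comparison logic independent of any particular HXL implementation or proxy.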

fititnt added a commit that referenced this issue Mar 14, 2021
…fy more than one source; also content without translation will be prefixed with a single _
@fititnt
Member Author

fititnt commented Mar 14, 2021

Current example

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json

# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298

# To inspect the result (pretty print)
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (first item of array, use jq '.[0]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[0]' | hxlspec
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec

---

# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0

- hsilo: 
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

[
    {
        "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    }
]

fititnt added a commit that referenced this issue Mar 14, 2021
…input_data', 'output_data', as part of hrecipe.exemplum; the underlining inplementation still not ready, but the idea is be able to specify self-contained example when creating recipes with YAML; the hrecipe.exemplum[N]objectivum.datum can be used for self-contained testing!
@fititnt
Member Author

fititnt commented Mar 14, 2021

Now, for hdpcli --export-to-hxl-json-processing-specs to generate the input parameter defined by the HXL data processing specs, the source should be given as the first item of an array. In the internal language, this means putting it in something like hrecipe.[0].exemplum.[0].fontem.iri instead of hrecipe.[0].iri_example.[0].iri.

The idea of using 'exemplum' is that, if one goal of recipes is reusability, any input data there would be... just an example/reference.

The impact of this is that now the first item of the export will always be without example inputs, while the second one will be like before.

'input_data' and 'output_data' are a way to express data inline: 'input_data' is the input (so no external link is needed), and 'output_data' is there in case we eventually implement some way for a recipe to be tested on different proxies.

Also, 'input_data' / 'output_data', even if ignored by HXL data processing specs, let someone get an idea of what the recipe would do just by looking at one YAML file. (OK, the idea is actually to test whether it really works, but at least for documentation it already serves!)
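The export rule described above (one bare spec per recipe, then one spec per exemplum) could be sketched like this. It is inferred from the YAML and JSON shown in this thread, not taken from the hdpcli source; `export_json_specs` is a hypothetical name and the YAML is assumed to be already parsed into Python lists and dicts.

```python
def export_json_specs(hsilos):
    """Export each hrecipe as a bare spec, followed by one spec per exemplum."""
    specs = []
    for hsilo in hsilos:
        for hrecipe in hsilo.get("hrecipe", []):
            recipe = hrecipe.get("_recipe", [])
            # First item: the reusable recipe, without any example input.
            specs.append({"recipe": recipe})
            for exemplum in hrecipe.get("exemplum", []):
                spec = {"recipe": recipe}
                fontem = exemplum.get("fontem", {})
                objectivum = exemplum.get("objectivum", {})
                if "iri" in fontem:
                    spec["input"] = fontem["iri"]
                if "_sheet_index" in fontem:
                    spec["sheet_index"] = fontem["_sheet_index"]
                if "datum" in fontem:
                    spec["input_data"] = fontem["datum"]
                if "datum" in objectivum:
                    spec["output_data"] = objectivum["datum"]
                specs.append(spec)
    return specs


table = [["#item +id", "#item +value"], ["ACME1", "123"]]
hsilos = [
    {
        "hsilo": {"nomen": "hello-world.hrecipe.hdp.yml"},
        "hrecipe": [
            {
                "id": "example-processing-with-a-JSON-spec",
                "_recipe": [{"filter": "clean_data", "number": "population"}],
                "exemplum": [
                    {
                        "fontem": {
                            "iri": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
                            "_sheet_index": 1,
                        }
                    },
                    {"fontem": {"datum": table}, "objectivum": {"datum": table}},
                ],
            }
        ],
    }
]

specs = export_json_specs(hsilos)
print(len(specs))
```

With this layout, `.[0]` of the export is always the recipe alone, and each following index corresponds to one exemplum entry.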


fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats/tests/hrecipe$ cat hello-world.hrecipe.hdp.yml

# cd tests/hrecipe
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml | jq '.[1]' | hxlspec
---
- hsilo:
    nomen: hello-world.hrecipe.hdp.yml
    linguam: mul # https://iso639-3.sil.org/code/mul
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      # iri_example:
      #   - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
      #     sheet_index: 1
      exemplum:
        # Example one
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

        # Example two includes both inline input data and the expected output data
        - fontem:
            # Note: fontem.datum not fully implemented. But the idea here is
            #       be able to create an ad-hoc table instead of use
            #       external input. So help show as quick example or...
            #       as some sort of unitary test for an HXL data processing
            #       spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]
          objectivum:
            # Note: fontem.objectivum not fully implemented. But the idea here
            #       is (like the fontem.datum) work as ad-hoc table, but is
            #       really allow create some sort of unitary test for a HXL
            #       data processing spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats/tests/hrecipe$ hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml

[
    {
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ],
        "sheet_index": 1
    },
    {
        "input_data": [
            [
                "header 1",
                "header 2",
                "header 3"
            ],
            [
                "#item +id",
                "#item +name",
                "#item +value"
            ],
            [
                "ACME1",
                "ACME Inc.",
                "123"
            ],
            [
                "XPTO1",
                "XPTO org",
                "456"
            ]
        ],
        "output_data": [
            [
                "header 1",
                "header 2",
                "header 3"
            ],
            [
                "#item +id",
                "#item +name",
                "#item +value"
            ],
            [
                "ACME1",
                "ACME Inc.",
                "123"
            ],
            [
                "XPTO1",
                "XPTO org",
                "456"
            ]
        ],
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    }
]

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json

# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298

# To inspect the result (pretty print)
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec
# To pipe the result direct to hxlspec (fourth item of array, use jq '.[3]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[3]' | hxlspec

---

# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0

- hsilo: 
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

[
    {
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ],
        "sheet_index": 1
    }
]

@fititnt fititnt added the data-transformation https://en.wikipedia.org/wiki/Data_transformation label Mar 28, 2021
@fititnt fititnt changed the title hxl-yml-spec-to-hxl-json-spec hxl-yml-spec-to-hxl-json-spec: HXL Data processing specs exporter Apr 2, 2021