Filter external repos #396

Open
wants to merge 12 commits into
base: main
48 changes: 42 additions & 6 deletions docs/user_guide/config.rst
@@ -49,19 +49,55 @@ multiple external repositories can be used as the example below illustrates for
url: https://github.com/IAMconsortium/common-definitions.git/
definitions:
region:
repository: common-definitions
repository:
name: common-definitions
variable:
repositories:
- common-definitions
- legacy-definitions
- name: common-definitions
- name: legacy-definitions
mappings:
repository: common-definitions
repository:
name: common-definitions

The value in *definitions.region.repository* needs to reference the repository in the
*repositories* section.
The value in *definitions.region.repository* can be a single entry or a list. Each entry
needs the ``name`` keyword, which must reference a repository in the *repositories* section.

For model mappings the process is analogous using *mappings.repository*.

Filter code lists imported from external repositories
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since importing the entirety of, for example, common-definitions is too much for most
projects, the list can be filtered using ``include`` and ``exclude`` keywords. Under
these keywords, lists of filters can be given that will be applied to the code list from
the given repository.

The filtering can be done by any attribute:

.. code:: yaml

  repositories:
    common-definitions:
      url: https://github.com/IAMconsortium/common-definitions.git/
  definitions:
    variable:
      repository:
        name: common-definitions
        include:
          - name: [Primary Energy*, Final Energy*]
          - name: "Population*"
            tier: 1
        exclude:
          - name: "Final Energy|*|*"
Member

This is a misleading example - it seems to show only level-2 exclusion when in fact it excludes all variables at level 2 or below. Better to use the level-argument explicitly.

Contributor Author

Seems that this is subjective then. For me it was totally clear that this excludes anything at level 2 and beyond.
I can see your point, though, about this being ambiguous.
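
To make the level question concrete, here is a small check using only the standard-library `fnmatch`, which is what the filter in this PR calls for string values. The variable names are made up for illustration and are not taken from common-definitions.

```python
from fnmatch import fnmatch

pattern = "Final Energy|*|*"

# Hypothetical variable names at increasing depth (illustration only)
names = [
    "Final Energy",
    "Final Energy|Industry",
    "Final Energy|Industry|Electricity",
    "Final Energy|Industry|Electricity|Steel",
]

# `*` matches any run of characters, including the "|" separator, so every
# name with at least two separators after "Final Energy" matches the pattern.
matched = [name for name in names if fnmatch(name, pattern)]

for name in matched:
    print(name)
```

Under this reading the pattern removes the third and fourth names alike, so it does not single out one level.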


In the example above we are including:

1. All variables starting with *Primary Energy* or *Final Energy*
2. All variables starting with *Population* **and** with the tier attribute equal to 1

From this list we are then **excluding** all variables that match "Final Energy|*|*".
This means that the final resulting list will contain no Final Energy variables with
three or more levels.
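
The include-then-exclude logic described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: codes are plain dicts, all function names are hypothetical, and a code is dropped as soon as it matches any exclude rule, which is how I read the documentation.

```python
from fnmatch import fnmatch


def attribute_matches(code_value, filter_value):
    """Match one attribute: lists mean "any of", strings are glob patterns."""
    if isinstance(filter_value, list):
        return any(attribute_matches(code_value, v) for v in filter_value)
    if isinstance(filter_value, str):
        return isinstance(code_value, str) and fnmatch(code_value, filter_value)
    return code_value == filter_value


def matches(code, rule):
    """A rule matches only if all of its attributes match (logical AND)."""
    return all(attribute_matches(code.get(k), v) for k, v in rule.items())


def apply_filters(codes, include, exclude):
    """Keep codes matching any include rule, then drop any matching an exclude rule."""
    kept = [c for c in codes if any(matches(c, r) for r in include)]
    return [c for c in kept if not any(matches(c, r) for r in exclude)]


# Hypothetical codes, represented as plain dicts for this sketch
codes = [
    {"name": "Primary Energy|Coal"},
    {"name": "Final Energy|Industry"},
    {"name": "Final Energy|Industry|Electricity"},
    {"name": "Population", "tier": 1},
    {"name": "Population|Urban", "tier": 2},
    {"name": "GDP|PPP"},
]
include = [
    {"name": ["Primary Energy*", "Final Energy*"]},
    {"name": "Population*", "tier": 1},
]
exclude = [{"name": "Final Energy|*|*"}]

result = [c["name"] for c in apply_filters(codes, include, exclude)]
print(result)
```

Entries within one rule are combined with AND, while the rules in a list are combined with OR, which matches the two numbered cases in the text.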


Adding countries to the region codelist
---------------------------------------
14 changes: 13 additions & 1 deletion nomenclature/code.py
@@ -171,7 +171,7 @@ class VariableCode(Code):
)
method: str | None = None
check_aggregate: bool | None = Field(default=False, alias="check-aggregate")
components: Union[List[str], List[Dict[str, List[str]]]] | None = None
components: Union[List[str], Dict[str, list[str]]] | None = None
drop_negative_weights: bool | None = None
model_config = ConfigDict(populate_by_name=True)

@@ -187,6 +187,18 @@ def deserialize_json(cls, v):
def convert_none_to_empty_string(cls, v):
return v if v is not None else ""

    @field_validator("components", mode="before")
    def cast_variable_components_args(cls, v):
        """Cast "components" list of dicts to a codelist"""

        # translate a list of single-key dictionaries to a simple dictionary
        if v is not None and isinstance(v, list) and isinstance(v[0], dict):
            comp = {}
            for val in v:
                comp.update(val)
            return comp
        return v

@field_serializer("unit")
def convert_str_to_none_for_writing(self, v):
return v if v != "" else None
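
As a standalone illustration of what the `components` validator above does, the following sketch merges a list of single-key dictionaries into one dictionary. The function name and the sample data are hypothetical, and the guard for an empty list is my addition rather than something taken from the diff.

```python
def merge_single_key_dicts(components):
    """Collapse a list of single-key dicts into one dict; pass anything else through."""
    if isinstance(components, list) and components and isinstance(components[0], dict):
        merged = {}
        for item in components:
            merged.update(item)
        return merged
    return components


# Hypothetical component specifications (illustration only)
as_list_of_dicts = [
    {"Final Energy|Industry": ["Final Energy|Industry|Electricity"]},
    {"Final Energy|Transport": ["Final Energy|Transport|Electricity"]},
]
merged = merge_single_key_dicts(as_list_of_dicts)
plain = merge_single_key_dicts(["Final Energy|Industry", "Final Energy|Transport"])
empty = merge_single_key_dicts([])
```

A plain list of strings and `None` pass through unchanged, so only the list-of-dicts form is rewritten.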
49 changes: 18 additions & 31 deletions nomenclature/codelist.py
@@ -209,13 +209,12 @@ def from_directory(
for repo in getattr(
config.definitions, name.lower(), CodeListConfig()
).repositories:
code_list.extend(
cls._parse_codelist_dir(
config.repositories[repo].local_path / "definitions" / name,
file_glob_pattern,
repo,
)
repository_code_list = cls._parse_codelist_dir(
config.repositories[repo.name].local_path / "definitions" / name,
file_glob_pattern,
repo.name,
)
code_list.extend(repo.filter_list_of_codes(repository_code_list))
errors = ErrorCollector()
mapping: Dict[str, Code] = {}
for code in code_list:
@@ -567,21 +566,6 @@ def check_weight_in_vars(cls, v):
)
return v

@field_validator("mapping")
@classmethod
def cast_variable_components_args(cls, v):
"""Cast "components" list of dicts to a codelist"""

# translate a list of single-key dictionaries to a simple dictionary
for var in v.values():
if var.components and isinstance(var.components[0], dict):
comp = {}
for val in var.components:
comp.update(val)
v[var.name].components = comp

return v

def vars_default_args(self, variables: List[str]) -> List[VariableCode]:
"""return subset of variables which does not feature any special pyam
aggregation arguments and where skip_region_aggregation is False"""
@@ -706,21 +690,25 @@ def from_directory(

# importing from an external repository
for repo in config.definitions.region.repositories:
repo_path = config.repositories[repo].local_path / "definitions" / "region"
repo_path = (
config.repositories[repo.name].local_path / "definitions" / "region"
)

code_list = cls._parse_region_code_dir(
code_list,
repo_list_of_codes = cls._parse_region_code_dir(
repo_path,
file_glob_pattern,
repository=repo,
repository=repo.name,
)
code_list = cls._parse_and_replace_tags(
code_list, repo_path, file_glob_pattern
repo_list_of_codes = cls._parse_and_replace_tags(
repo_list_of_codes, repo_path, file_glob_pattern
)
code_list.extend(repo.filter_list_of_codes(repo_list_of_codes))

# parse from current repository
code_list = cls._parse_region_code_dir(code_list, path, file_glob_pattern)
code_list = cls._parse_and_replace_tags(code_list, path, file_glob_pattern)
local_code_list = cls._parse_region_code_dir(path, file_glob_pattern)
code_list.extend(
cls._parse_and_replace_tags(local_code_list, path, file_glob_pattern)
)

# translate to mapping
mapping: Dict[str, RegionCode] = {}
@@ -756,13 +744,12 @@ def hierarchy(self) -> List[str]:
@classmethod
def _parse_region_code_dir(
cls,
code_list: List[Code],
path: Path,
file_glob_pattern: str = "**/*",
repository: str | None = None,
) -> List[RegionCode]:
""""""

code_list: List[RegionCode] = []
for yaml_file in (
f
for f in path.glob(file_glob_pattern)
108 changes: 87 additions & 21 deletions nomenclature/config.py
@@ -1,6 +1,7 @@
from enum import Enum
from pathlib import Path
from typing import Annotated, Optional
from typing import Any
from fnmatch import fnmatch

import yaml
from git import Repo
@@ -11,29 +12,83 @@
field_validator,
model_validator,
ConfigDict,
BeforeValidator,
)
from nomenclature.code import Code


class RepositoryWithFilter(BaseModel):
    name: str
    include: list[dict[str, Any]] = [{"name": "*"}]
    exclude: list[dict[str, Any]] = Field(default_factory=list)

    def filter_function(self, code: Code, filter: dict[str, Any], keep: bool):
        # if is list -> recursive
        # if is str -> fnmatch
        # if is int -> match exactly
        # if is None -> Attribute does not exist therefore does not match
        def check_attribute_match(code_value, filter_value):
            if isinstance(filter_value, int):
                return code_value == filter_value
            if isinstance(filter_value, str):
                return fnmatch(code_value, filter_value)
            if isinstance(filter_value, list):
                return any(
                    check_attribute_match(code_value, value) for value in filter_value
                )
            if filter_value is None:
                return False
            raise ValueError("Something went wrong with the filtering")

        filter_match = all(
            check_attribute_match(getattr(code, attribute, None), value)
            for attribute, value in filter.items()
        )
        if keep:
            return filter_match
        else:
            return not filter_match

    def filter_list_of_codes(self, list_of_codes: list[Code]) -> list[Code]:
        # include first
        filter_result = [
            code
            for code in list_of_codes
            if any(
                self.filter_function(
                    code,
                    filter,
                    keep=True,
                )
                for filter in self.include
            )
        ]

        if self.exclude:
            # keep a code only if none of the exclude filters matches it
            filter_result = [
                code
                for code in filter_result
                if all(
                    self.filter_function(code, filter, keep=False)
                    for filter in self.exclude
                )
            ]


def convert_to_set(v: str | list[str] | set[str]) -> set[str]:
match v:
case set(v):
return v
case list(v):
return set(v)
case str(v):
return {v}
case _:
raise TypeError("`repositories` must be of type str, list or set.")
return filter_result


class CodeListConfig(BaseModel):
dimension: str | None = None
repositories: Annotated[set[str], BeforeValidator(convert_to_set)] = Field(
default_factory=set, alias="repository"
repositories: list[RepositoryWithFilter] = Field(
default_factory=list, alias="repository"
)
model_config = ConfigDict(populate_by_name=True)

@field_validator("repositories", mode="before")
def convert_to_set_of_repos(cls, v):
if not isinstance(v, list):
return [v]
return v

@property
def repository_dimension_path(self) -> str:
return f"definitions/{self.dimension}"
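
The `mode="before"` validator above lets the YAML give either one repository entry or a list of them. A framework-free sketch of that normalization, with a hypothetical function name and sample entries, looks like this:

```python
def as_repository_list(value):
    """Wrap a single repository entry in a list; leave lists untouched."""
    if not isinstance(value, list):
        return [value]
    return value


# The two shapes accepted in the configuration file (sample values only)
single = as_repository_list({"name": "common-definitions"})
several = as_repository_list(
    [{"name": "common-definitions"}, {"name": "legacy-definitions"}]
)

single_names = [entry["name"] for entry in single]
several_names = [entry["name"] for entry in several]
```

Normalizing before validation means the rest of the code can always iterate over a list, whichever form the user wrote.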
@@ -109,8 +164,8 @@ class DataStructureConfig(BaseModel):

"""

region: Optional[RegionCodeListConfig] = Field(default_factory=RegionCodeListConfig)
variable: Optional[CodeListConfig] = Field(default_factory=CodeListConfig)
region: RegionCodeListConfig = Field(default_factory=RegionCodeListConfig)
variable: CodeListConfig = Field(default_factory=CodeListConfig)

@field_validator("region", "variable", mode="before")
@classmethod
@@ -126,12 +181,22 @@ def repos(self) -> dict[str, str]:
}


class MappingRepository(BaseModel):
name: str
Member

The mapping will also have to inherit the region filters, right? Otherwise, a model could map to a region that is not included in the DataStructureDefinition.



class RegionMappingConfig(BaseModel):
repositories: Annotated[set[str], BeforeValidator(convert_to_set)] = Field(
default_factory=set, alias="repository"
repositories: list[MappingRepository] = Field(
default_factory=list, alias="repository"
)
model_config = ConfigDict(populate_by_name=True)

@field_validator("repositories", mode="before")
def convert_to_set_of_repos(cls, v):
if not isinstance(v, list):
return [v]
return v


class DimensionEnum(str, Enum):
model = "model"
@@ -157,8 +222,9 @@ def check_definitions_repository(
mapping_repos = {"mappings": v.mappings.repositories} if v.mappings else {}
repos = {**v.definitions.repos, **mapping_repos}
for use, repositories in repos.items():
if repositories - v.repositories.keys():
raise ValueError((f"Unknown repository {repositories} in '{use}'."))
repository_names = [repository.name for repository in repositories]
if unknown_repos := repository_names - v.repositories.keys():
raise ValueError((f"Unknown repository {unknown_repos} in '{use}'."))
return v

def fetch_repos(self, target_folder: Path):
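
The unknown-repository check in this hunk amounts to a set difference between the names referenced under `definitions`/`mappings` and the keys of the top-level `repositories` section. A minimal sketch with made-up data follows; it converts the names to a set explicitly, which is a choice made for this sketch rather than a copy of the diff.

```python
# Hypothetical configuration content (illustration only)
known_repositories = {
    "common-definitions": "https://github.com/IAMconsortium/common-definitions.git/",
}
referenced_names = ["common-definitions", "legacy-definitions"]

# Names that are referenced but not declared in the `repositories` section
unknown = set(referenced_names) - known_repositories.keys()

if unknown:
    message = f"Unknown repository {unknown} in 'variable'."
    print(message)
```

Reporting only the unknown names, rather than every referenced repository, keeps the error message pointed at the entry that needs fixing.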
2 changes: 1 addition & 1 deletion nomenclature/processor/region.py
@@ -487,7 +487,7 @@ def from_directory(cls, path: DirectoryPath, dsd: DataStructureDefinition):
mapping_files.extend(
f
for f in (
dsd.config.repositories[repository].local_path / "mappings"
dsd.config.repositories[repository.name].local_path / "mappings"
).glob("**/*")
if f.suffix in {".yaml", ".yml"}
)
2 changes: 1 addition & 1 deletion poetry.lock

Some generated files are not rendered by default.
17 changes: 17 additions & 0 deletions tests/data/config_filter/nomenclature.yaml
@@ -0,0 +1,17 @@
repositories:
Member

Suggest to add more structure to the validation test data by using subfolders.

Contributor Author

Sure, good idea.

Contributor Author

#399 implements a cleanup of the test data folder; once that PR is merged, I'll rebase this one.

common-definitions:
url: https://github.com/IAMconsortium/common-definitions.git/
legacy-definitions:
url: https://github.com/IAMconsortium/legacy-definitions.git/
definitions:
variable:
repository:
- name: common-definitions
filters:
- name: [Primary Energy*, Final Energy*]
- name: "Population*"
tier: 1
- name: legacy-definitions
region:
repository: common-definitions
country: true
6 changes: 4 additions & 2 deletions tests/data/general-config-only/nomenclature.yaml
@@ -3,6 +3,8 @@ repositories:
url: https://github.com/IAMconsortium/common-definitions.git/
definitions:
region:
repository: common-definitions
repository:
name: common-definitions
variable:
repository: common-definitions
repository:
name: common-definitions
6 changes: 4 additions & 2 deletions tests/data/general-config/nomenclature.yaml
@@ -4,6 +4,8 @@ repositories:
url: https://github.com/IAMconsortium/common-definitions.git/
definitions:
region:
repository: common-definitions
repository:
name: common-definitions
variable:
repository: common-definitions
repository:
name: common-definitions