Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added configuration management using pydantic #986

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

benmalef
Copy link
Contributor

@benmalef benmalef commented Feb 12, 2025

add Pydantic configuration

Fixes #ISSUE_NUMBER

Proposed Changes

  • Add Pydantic configuration
  • Refactor the config_manager

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master .
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].
  • The logging library is being used and no print statements are left.

add Pydantic configuration
Copy link
Contributor

github-actions bot commented Feb 12, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@sarthakpati
Copy link
Collaborator

Please check the codacy errors: https://app.codacy.com/gh/mlcommons/GaNDLF/pull-requests/986/issues

@sarthakpati sarthakpati changed the title Add Pydantic configuration (new) Added configuration management using pydantic Feb 12, 2025
@sarthakpati
Copy link
Collaborator

@szmazurek could you please take a first pass?

@sarthakpati sarthakpati mentioned this pull request Feb 17, 2025
11 tasks
@szmazurek
Copy link
Collaborator

@szmazurek could you please take a first pass?

yup, will do in like 1hr or tomorrow morning

save_output: bool = Field(
default=False, description="Save outputs during validation/testing."
)
in_memory: bool = Field(default=False, description="Pin data to CPU memory.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to "pin data to cpu/gpu memory"? Also, is the 'in_memory' really enforcing a page-lock on memory storing given chunk of data or just keeps it all in RAM?

data_postprocessing: Union[dict, set] = Field(
default={}, description="Default data postprocessing configuration."
)
grid_aggregator_overlap: str = Field(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of other options can we have here? I believe this cannot be an arbitraty string, therefore it should be an optional literal here with available strings?

model_config = ConfigDict(
extra="allow"
) # it allows extra fields in the model dict
dimension: Optional[int] = Field(description="Dimension.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the dimension optional? Also, maybe it should accept only 2 or 3, as no other dimensionalities are supported. And perhaps the description can be made more expressive - like 'model input dimension (2D or 3D).'?

) # it allows extra fields in the model dict
dimension: Optional[int] = Field(description="Dimension.")
architecture: Union[ARCHITECTURE_OPTIONS, dict] = Field(description="Architecture.")
final_layer: str = Field(description="Final layer.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are also limited to certain amount of acceptable values - leveraging literal seems like good option.

),
default=3,
) # TODO: check it
type: Optional[str] = Field(description="Type of model.", default="torch")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it also be literal? Probably options are torch, openvino? @sarthakpati am I right?

default=3,
) # TODO: check it
type: Optional[str] = Field(description="Type of model.", default="torch")
data_type: str = Field(description="Data type.", default="FP32")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true that we support such field in the config and it really influences anything in base gandlf? I tought that precision is chaning only when amp is enabled

type: Optional[str] = Field(description="Type of model.", default="torch")
data_type: str = Field(description="Data type.", default="FP32")
save_at_every_epoch: bool = Field(default=False, description="Save at every epoch.")
amp: bool = Field(default=False, description="Amplifier.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amp stands for automatic mixed precision, not amplifier

default=-5,
description="this controls the number of validation data folds to be used for model *selection* during training (not used for back-propagation)",
)
proportional: Optional[bool] = Field(default=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this parameter do? also, if it's boolean, can't we set default as False?

description="this will perform stratified k-fold cross-validation but only with offline data splitting",
)
testing: int = Field(
default=-5,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question - are there any limits to values that can be set in this field? Like, what happens if I set testing to 10? If there are limits, maybe it's worth to include possible ranges when defining this field, what do you think ? @sarthakpati @benmalef



class PatchSampler(BaseModel):
type: str = Field(default="uniform")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any other options available for type and padding_mode? if so, maybe we should use literal here?

@model_validator(mode="after")
def validate_version(self) -> Self:
if version_check(self.model_dump(), version_to_check=version("GANDLF")):
return self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we rise an error here if the condition is not met?

)
# min_lr: 0.00001, #TODO: this should be defined ??
# max_lr: 1, #TODO: this should be defined ??
step_size: float = Field(description="step_size", default=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to think on different classes that would allow for definition of params for separate schedulers - for example, if we use a scheduler which does reduce on plateau, then we need a field to define tracked metric. Not really sure how to implement that nicely tho. Other approach is define all possible fields that any scheduler can take and later provide validation logic which takes care of conditionality - i.e if reduce_on_plateau type chosen, then we require monitor field

class UserDefinedParameters(DefaultParameters):
version: Version = Field(
default=Version(minimum=version("GANDLF"), maximum=version("GANDLF")),
description="Whether weighted loss is to be used or not.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Descritpion is not valid I believe :P

patch_size: Union[list[Union[int, float]], int, float] = Field(
description="Patch size."
)
model: Model = Field(..., description="The model to use. ")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be list of avaiable strings no?

description="Scheduler.", default=Scheduler(type="triangle_modified")
)
optimizer: Union[str, Optimizer] = Field(
description="Optimizer.", default=Optimizer(type="adam")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about naming - here I first assumed we are initializing real torch optimizer - perhaps the names of config classes should be suffixed with Config/params?

description="Inference mechanism.", default=InferenceMechanism()
)
data_postprocessing_after_reverse_one_hot_encoding: dict = Field(
description="data_postprocessing_after_reverse_one_hot_encoding.", default={}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any options we support here? Not sure if this can be an arbitrary dict defined by user

data_postprocessing_after_reverse_one_hot_encoding: dict = Field(
description="data_postprocessing_after_reverse_one_hot_encoding.", default={}
)
differential_privacy: Any = Field(description="Differential privacy.", default=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be just a boolean field?

Field(description="Data preprocessing."),
AfterValidator(validate_data_preprocessing),
] = {}
# TODO: It should be defined with a better way (using a BaseModel class)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the comments, it would allow for a lot of clarity

file.write("\n".join(markdown))


def initialize_key(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need such utility? Meaning, if there are default parameters to be set, ideally they are defined via pydantic and automatically populated if user did not set them explicitly

@@ -85,6 +87,7 @@
"openslide-bin",
"openslide-python==1.4.1",
"lion-pytorch==0.2.2",
"pydantic",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would fix version, maybe in the future releases there are going to be some breaking changes of pydantic (little chance but I am paranoid a little)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants