
Estimator #57

Open · wants to merge 15 commits into main
Conversation

lionelkusch
Collaborator

The estimators Dnn_learner and RandomForestModified were no longer tested after the BBI file was removed. I implemented some tests to improve coverage.

I moved these files into a sub-module of hidimstat because they are not part of the core set of methods proposed by the library.
Moreover, Dnn_learner depends on torch and torchmetrics, which are not essential for the other methods of the library. Consequently, moving these estimators into a sub-module removes this requirement for users of the rest of the library.

@@ -241,6 +241,7 @@ def fit(self, X, y=None):
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
raise Warning("The model can't be fit with n_ensemble = 1")
Collaborator Author

@jpaillard Can you check whether it is correct that, without multiple ensembles, there is no fitting?

Collaborator

I think a fit does happen, as suggested by the call to joblib_ensemble_dnnet (by the way, should this be a private function? Should we integrate it into the same module for clarity?)

The following lines select the n_ensemble best models (which is useless when n_ensemble==1).

I would suggest adding a function _keep_n_ensemble, called at line 243 and gathering the code up to line 261, to handle this distinction; see the sketch below.
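
A minimal sketch of such a helper, assuming the per-model losses are available as a NumPy array (the name _keep_n_ensemble and its exact inputs are only illustrative):

import numpy as np

def _keep_n_ensemble(ensemble_models, loss, n_ensemble):
    """Keep the n_ensemble sub-models with the lowest loss.

    When n_ensemble == 1 the selection is trivial, so the special case
    discussed above is handled in a single place.
    """
    if n_ensemble == 1:
        return ensemble_models, loss
    best = np.argsort(loss)[:n_ensemble]  # indices of the best models
    return [ensemble_models[i] for i in best], loss[best]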

Collaborator Author (lionelkusch, Dec 19, 2024)

See issue #75; this will be modified in the future.

@@ -283,6 +284,9 @@ def encode_outcome(self, y, train=True):
y = y.reshape(-1, 1)
if self.problem_type == "regression":
list_y.append(y)
# Encoding the target with the ordinal case
Collaborator Author

@jpaillard @bthirion
Can you tell me what the "ordinal" method is, if you know?
If so, do you think it is worth keeping? (The function was only half implemented.)
If it is worth keeping, can you check whether my modification is correct?

Contributor

It stands for regression problems where the ordering of values matters, but not the values themselves. Usually the values are discretized. I propose to keep it for the moment.
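
For reference, a common encoding for this case is the cumulative ("thermometer") encoding, where class k is represented by k leading ones; this is only an illustration of the idea, not necessarily what encode_outcome implements:

import numpy as np

def ordinal_encode(y, n_classes):
    """Cumulative encoding: class k becomes k leading ones.

    With 4 classes, y = [0, 2, 3] gives [[0, 0, 0], [1, 1, 0], [1, 1, 1]].
    """
    y = np.asarray(y, dtype=int)
    encoded = np.zeros((len(y), n_classes - 1))
    for threshold in range(n_classes - 1):
        encoded[:, threshold] = (y > threshold).astype(float)
    return encoded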

Collaborator Author

This will be addressed in the future. See #76.

if samples_in_leaf.size > 0:
leaf_samples.append(
y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
Collaborator Author

@jpaillard @bthirion
I modified the function to handle the case where the leaf samples are empty.
However, I am not sure what the function was doing.
Can you confirm whether this is the correct way to handle it?

Contributor

This should never be empty by construction (random forests partition the samples in a tree structure). By default, there is a minimum number of samples in each leaf.
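
As a small sketch of that guarantee with a scikit-learn forest (bootstrap is disabled here so the bound holds exactly on the training data; the variable names are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(
    n_estimators=10, min_samples_leaf=5, bootstrap=False, random_state=0
)
forest.fit(X, y)

# Every leaf of every tree contains at least min_samples_leaf training samples,
# so the set of samples falling in a leaf is never empty for the training data.
for tree in forest.estimators_:
    leaf_ids = tree.apply(X)  # leaf index of each training sample
    _, counts = np.unique(leaf_ids, return_counts=True)
    assert counts.min() >= 5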

Collaborator Author

The usage of samples_in_leaf is explained in #42.

Collaborator Author

This will be corrected in issues #77 and #78.

assert not np.all(predict_prob[:, 0] == 0)
assert not np.all(predict_prob[:, 1] == 0)
# Check if the predicted probabilities are not all ones for each class
assert not np.all(predict_prob[:, 0] == 1)
Collaborator Author

@jpaillard
Can you check whether there are too many assertions and whether I missed any?

Contributor

There are probably enough ;-)
We should just make sure they're not redundant.
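
For instance, a compact set of checks on the probability matrix could look like this (a sketch for the binary case; a single non-constancy check replaces the four all-zeros / all-ones assertions above):

import numpy as np

def check_predict_proba(predict_prob):
    """Sanity checks for an (n_samples, 2) matrix of predicted probabilities."""
    # Probabilities lie in [0, 1] and each row sums to 1.
    assert np.all(predict_prob >= 0) and np.all(predict_prob <= 1)
    assert np.allclose(predict_prob.sum(axis=1), 1.0)
    # Predictions are not constant across samples (a weak non-degeneracy check).
    assert predict_prob[:, 0].std() > 0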

Collaborator Author

Advice on new tests is addressed in issue #79.

learner = DnnLearnerSingle(do_hypertuning=True, problem_type="ordinal", n_jobs=10, verbose=0)
learner.fit(X, y)
predict_prob = learner.predict_proba(X)[:,0]
# Check if the predicted class labels match the true labels for at least one instance
Collaborator Author

@jpaillard @bthirion
Can you help me define some tests for this method?
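
One possible test, sketched under the assumptions that the ordinal predict_proba returns cumulative probabilities (so their row sum is a monotone score), that DnnLearnerSingle is imported as in the existing tests, and that (X, y) is simulated with a known ordering:

from scipy.stats import spearmanr

def test_ordinal_ranking(X, y):
    learner = DnnLearnerSingle(
        do_hypertuning=True, problem_type="ordinal", n_jobs=10, verbose=0
    )
    learner.fit(X, y)
    # Expected rank of each sample: sum of the cumulative probabilities.
    scores = learner.predict_proba(X).sum(axis=1)
    rho, _ = spearmanr(scores, y)
    assert rho > 0.5  # threshold is arbitrary; adjust to the simulated data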

Collaborator Author

This will be addressed in #79.

# Check if the feature importances are not all close to zero
assert not np.allclose(learner.feature_importances_, 0)
# Check if the feature importances are not all close to one
assert not np.allclose(learner.feature_importances_, 1)
Collaborator Author

@jpaillard
Can you check whether there are too many assertions and whether I missed any?
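
A few complementary checks, sketched under the assumption that feature_importances_ is a 1-D array with one entry per feature:

import numpy as np

def check_feature_importances(learner, n_features):
    importances = np.asarray(learner.feature_importances_)
    assert importances.shape == (n_features,)
    assert np.all(np.isfinite(importances))
    # Not degenerate: the importances are not all identical.
    assert importances.std() > 0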

Collaborator Author

This will be addressed in #79.


codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 90.81081% with 34 lines in your changes missing coverage. Please review.

Project coverage is 94.32%. Comparing base (e95b4f6) to head (bbab162).
Report is 1 commit behind head on main.

Files with missing lines                    Patch %   Lines
hidimstat/estimator/_utils/Dnn.py           83.82%    33 Missing ⚠️
hidimstat/estimator/Dnn_learner_single.py   75.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #57       +/-   ##
===========================================
+ Coverage   77.09%   94.32%   +17.23%     
===========================================
  Files          46       52        +6     
  Lines        2471     2608      +137     
===========================================
+ Hits         1905     2460      +555     
+ Misses        566      148      -418     


Contributor

bthirion left a comment

Thx for opening this! I have a bunch of comments.

leaf_samples.append(
y_minus_i[rng.choice(samples_in_leaf, size=random_samples)]
)
if samples_in_leaf.size > 0:
Contributor

You can remove the condition here too.

Collaborator Author

This will be corrected in issues #77 and #78.

Contributor

Is this needed?

Data matrix
y : np.array
Target vector
grps : np.array
Contributor

Suggested change
grps : np.array
groups : np.array

self.loss = 0

def forward(self, x):
if self.group_stacking:
Contributor

docstring?

Collaborator Author

This will be addressed by the issue #81.

x = torch.cat(list_stacking, dim=1)
return self.layers(x)

def training_step(self, batch, device, problem_type):
Contributor

docstring?

Collaborator Author

This will be addressed by the issue #81.

loss = F.binary_cross_entropy_with_logits(y_pred, y)
return loss

def validation_step(self, batch, device, problem_type):
Contributor

docstring?

Collaborator Author

This will be addressed by the issue #81.

"batch_size": len(X),
}

def validation_epoch_end(self, outputs, problem_type):
Contributor

docstring?

print("Epoch [{}], val_mse: {:.4f}".format(epoch + 1, result["val_mse"]))


def _evaluate(model, loader, device, problem_type):
Contributor

docstring (1 line)

Collaborator Author

This will be addressed by the issue #81.

@@ -241,6 +241,7 @@ def fit(self, X, y=None):
loss = np.array(res_ens[4])

if self.n_ensemble == 1:
raise Warning("The model can't be fit with n_ensemble = 1")

Collaborator

As discussed with you, as a user, I don't like integrating the ensembling (n_ensemble) and hyper-parameter tuning (do_hypertuning, dict_hypertuning) in a single class, which becomes huge.

Also, I think other libraries (sklearn for ensembling, optuna for hyperparameters) offer more & better options for these advanced training strategies.

I suggest separating these aspects from the DNN_learner class and leaving it up to the user to optimize the training separately from hidimstat.
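
As a sketch of what that separation could look like on the user side, assuming DnnLearnerSingle follows the scikit-learn estimator API and scikit-learn >= 1.2 is available (the parameter name in param_grid is purely hypothetical):

from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Hyper-parameter tuning handled outside the class (instead of do_hypertuning):
search = GridSearchCV(
    DnnLearnerSingle(problem_type="regression", verbose=0),
    param_grid={"lr": [1e-3, 1e-2]},  # hypothetical parameter name
    cv=3,
)
search.fit(X, y)

# Ensembling handled outside the class (instead of n_ensemble):
ensemble = BaggingRegressor(estimator=search.best_estimator_, n_estimators=10)
ensemble.fit(X, y)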

@lionelkusch
Collaborator Author

The primary aim of this pull request was to increase test coverage and separate the estimation functions from the other methods to reduce dependency on torch.

Most of your comments concern code that was already there. I did not intend to deal with it right now because, for me, it was not the priority. However, if you think it is very important, I can do it.

@lionelkusch
Collaborator Author

These new commits address the simplest modifications. The other requests will be addressed in future PRs; to track them, 7 issues were opened.

This is a particular case because the estimators require substantial refactoring and are not the focus of the library. Additionally, this PR is important to reduce the dependency on torch.
