IGNITE-20139 Random forest stopping criteria check fixed. Gain calculation implemented. #256
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The issue happens when one “pure“ node (with impurity* = 0) is presented in the tree. We calculate an impurity only for children nodes and not for the current node, as well as do not check whether the node is “pure“ and contains just one label, due to that, the “bestSplit” calculation is executed for the already “pure“ node, which decides that all items should be moved to the left child node and no items to the right (leaf node), which gives 2 “pure“ children nodes. Since we don’t calculate impurity for the current (parent) node the
parentNode.getImpurity() - split.get().getImpurity() > minImpurityDelta
check is always true, and we continue to split the already “pure“ node until the max tree depth is reached.The following changes were made to resolve the issue:
(1 - sum(p^2))
to get the correct values in the range from 0 to 0.5 as required for the Gini index.* Impurity - is a value from 0 to 0.5, which shows whether the node is “pure“ (impurity = 0) having just 1 label or “impure” with impurity=0.5, which is the worst scenario where the label ratio is 1:1.
** Gain - is a difference between the parent node’s impurity and weighted children nodes' impurity. The split which provides the maximum gain value is considered the best. See https://www.learndatasci.com/glossary/gini-impurity/