-
Notifications
You must be signed in to change notification settings - Fork 117
[HIVEMALL-233] RandomForest regressor accepts sparse vector input #178
Conversation
@@ -70,7 +70,7 @@ public DoubleArrayList add(@Nonnull double[] values) { | |||
private void expand(int max) { | |||
while (data.length < max) { | |||
final int len = data.length; | |||
double[] newArray = new double[len * 2]; | |||
double[] newArray = new double[(len + 1) * 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takuti Why this change has been made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See d7695d4
In case # of samples is less than 10, SmileExtUtils#sort
calls DoubleArrayList(0)
, and zero-sized array list falls into infinite loop in the expand
method as newArray = new double[0 * 2]
. The change prevents this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this PR does not resolves a potential bug in expand. Returning array should be >= max
and max
should be minCapacity
where expand
's argument is expected to be >=1.
https://github.com/karussell/fastutil/blob/master/src/it/unimi/dsi/fastutil/doubles/DoubleArrayList.java#L203
https://github.com/karussell/fastutil/blob/master/src/it/unimi/dsi/fastutil/doubles/DoubleArrays.java#L136
Let me fix this in another PR.
It should be completed by HIVEMALL-75. However, since some critical parts in RandomForestClassifierUDTF are not properly implemented in RandomForestRegressionUDTF, RF regressor doesn't work efficiently for sparse vector input.
7de3e27
to
8938342
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takuti OutputObjectInspector should be configured as follows:
if (denseInput) {
fieldOIs.add(ObjectInspectorFactory.getStandardListObjectInspector(
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector));
} else {
fieldOIs.add(ObjectInspectorFactory.getStandardMapObjectInspector(
PrimitiveObjectInspectorFactory.writableIntObjectInspector,
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector));
}
I'll fix it another PR.
What changes were proposed in this pull request?
Enable RandomForestRegressor to accept sparse vector input as RandomForestClassifier already does.
#51 made incomplete modifications on RandomForestRegressor, while its classifier counterpart has been properly updated.
What type of PR is it?
Improvement
What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-233
How was this patch tested?
Unit tests and manual test with tiny sample data:
Checklist
./bin/format_code.sh
, for your commit?