tribuo changes types between input dataset and prediction #391

Open
behrica opened this issue Jan 30, 2024 · 6 comments

@behrica
Contributor

behrica commented Jan 30, 2024

see https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/tribuo.20prediction.20datatype.20does.20not.20match

(ns scicloj.ml.tribuo
  (:require
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.modelling :as ds-model]
   [tech.v3.libs.tribuo :as tribuo])
  (:import
   (com.oracle.labs.mlrg.olcut.config DescribeConfigurable)
   (org.tribuo.classification.sgd.linear LogisticRegressionTrainer)))



(def logreg-trainer (LogisticRegressionTrainer.))

(def dummy-ds
  (->
   (ds/->dataset {:x [1 1] :y [0 1]})
   (ds-model/set-inference-target :y)))
(-> dummy-ds :y seq)
;; => (0 1)


(def m (tribuo/train-classification logreg-trainer dummy-ds))

(->
 (tribuo/predict-classification m dummy-ds)
 :prediction
 seq)
;; => ("0" "0")
@behrica
Contributor Author

behrica commented Feb 4, 2024

This is problematic because accuracy is usually calculated by comparing the true labels with the predictions, here
[0 1] and ["0" "0"]. In automatic evaluations nobody may notice the differing types,
so a prediction gets evaluated as "non-matching" even where the values do match (0 vs. "0").
In metamorph.ml and its predict method I will try to fail in all such situations.
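
A minimal sketch of the comparison problem described above, in plain Clojure with no dependencies. The `accuracy` and `same-label-types?` functions are hypothetical helpers written for illustration; they are not part of metamorph.ml or tech.ml.dataset.

```clojure
;; Hypothetical helper: fraction of positions where truth and
;; prediction are equal under Clojure's `=`.
(defn accuracy
  [truth predicted]
  (/ (count (filter true? (map = truth predicted)))
     (double (count truth))))

;; The long 0 and the string "0" are never `=`, so every row counts as wrong.
(accuracy [0 1] ["0" "0"])
;; => 0.0

;; With matching types the first row is counted as correct.
(accuracy [0 1] [0 0])
;; => 0.5

;; Hypothetical guard: do both columns contain the same set of value types?
(defn same-label-types?
  [truth predicted]
  (= (set (map type truth))
     (set (map type predicted))))

(same-label-types? [0 1] ["0" "0"])
;; => false
```

A check along the lines of `same-label-types?` is one way an evaluation step could fail loudly instead of silently reporting a lower accuracy.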

@behrica changed the title from "tribuo chnages types between input dataset and prediction" to "tribuo changes types between input dataset and prediction" on Feb 4, 2024
@behrica
Contributor Author

behrica commented Apr 7, 2024

Any news on this issue?
The fact that the tribuo trainer changes the "datatype" in its prediction from a number to a string makes it a bad player among the different models.
A model which is trained on [0 1 0 1 ..] (as int) should never predict "0" or "1".
I think a well-behaved (classification) model should never predict anything it never saw in the training data.
0 and "0" are not the same in this context.

@cnuernber
Collaborator

I agree

@behrica
Contributor Author

behrica commented May 14, 2024

I think the "problem" here is that classification in Tribuo only anticipates a training target column of type "Label" (which is fundamentally a String).
In Java this is guaranteed by the type system and its generic types.
In Clojure we circumvent this, which in some form exposes a "bug" in Tribuo, but one that cannot happen when using Java code
(so it's not a bug).

@behrica
Contributor Author

behrica commented May 14, 2024

I went deeper.
The issue is this line:

(dtype/make-reader :object n-rows (Label. (->string (data idx))))

in which the code converts keywords, numbers and Strings to "string"
and forgets the initial type.
We probably need to "remember" the original type in some way and convert back after prediction.
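
A rough sketch of that "remember and convert back" idea, in plain Clojure. The names `label-lookup` and `restore-labels` are hypothetical, and the sketch assumes the string form of a label is what `str` produces, which may differ from what the `->string` function in the line above does (for keywords in particular).

```clojure
;; Hypothetical helper: map the string form of each distinct training
;; label back to its original value.
;; Assumption: `str` stands in for the library's own string conversion.
(defn label-lookup
  [labels]
  (into {} (map (juxt str identity)) (distinct labels)))

;; Hypothetical helper: replace each predicted string with the original
;; value; strings that were never seen in training are left unchanged.
(defn restore-labels
  [lookup predictions]
  (mapv #(get lookup % %) predictions))

(def lookup (label-lookup [0 1 0 1]))
;; lookup => {"0" 0, "1" 1}

(restore-labels lookup ["0" "0"])
;; => [0 0]
```

The lookup has to be built at training time and be available at prediction time, so it would need to travel with the trained model. It also breaks down if two distinct labels share a string form, for example 0 and "0" in the same column.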

@behrica
Contributor Author

behrica commented Nov 24, 2024

I am now thinking that this is not solvable at the level of tech.ml.dataset,
as it would require "train" and "predict" to communicate the "label type", which is not possible in the current setup.
