Import the outlier detection benchmark results from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ #17
http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ is a repository of outlier detection benchmark data and results.

Every data set comes with a downloadable "raw algorithm results" package containing the results of a few hundred (algorithm, parameter) combinations on these data sets, and there is a separate file with generated evaluation results, too. Alternatively, you could also import only the best-of results.

As mentioned in openml/openml-java#6, it would also be nice to have a "submit to OpenML" function in ELKI; and on the other hand, OpenML could use ELKI for evaluating outlier and clustering results (ELKI has 19 supervised evaluation measures for clustering and 9 internal evaluation measures, with 3 different strategies for handling noise objects; for outlier evaluation, it has 4 measures plus an adjustment for chance for them, which yields 7 interesting measures). Except for the internal cluster evaluation measures (which may need O(n^2) memory and pairwise distances), they are all very fast to compute.

I don't have the capacity right now to do the integration myself, but I can assist, e.g., with adapting the scripts used to generate the above results. Or we could simply transfer the data as ASCII for submission?

From the API documentation, I do not understand how to format result data for submission. Are arbitrary file types allowed, or only ARFF? How are evaluation results uploaded?
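As an illustration of the "adjustment for chance" mentioned above, here is a minimal Python sketch (not ELKI's implementation; the toy data and the use of scikit-learn are assumptions for illustration). For precision-based measures, the expected value under a random ranking is the outlier rate, so a measure M can be rescaled as (M - E[M]) / (1 - E[M]); ROC AUC needs no such rescaling, since 0.5 already corresponds to a random ranking.

```python
# Sketch of chance-adjusted outlier evaluation measures (illustrative,
# not ELKI's code). A measure M with expected value E[M] under a random
# ranking is rescaled to (M - E[M]) / (1 - E[M]): 0 = random, 1 = perfect.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def adjusted(value, expected):
    return (value - expected) / (1.0 - expected)

labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])  # 1 marks an outlier (toy data)
scores = np.array([.1, .2, .9, .3, .2, .1, .5, .8, .2, .3])  # larger = more outlying

rate = labels.mean()  # expected precision of a random ranking
ap = average_precision_score(labels, scores)
n = int(labels.sum())
prec_at_n = labels[np.argsort(-scores)][:n].mean()  # precision at n = #outliers

print("ROC AUC   :", roc_auc_score(labels, scores))
print("Avg. prec.:", ap, "adjusted:", adjusted(ap, rate))
print("Prec@n    :", prec_at_n, "adjusted:", adjusted(prec_at_n, rate))
```
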
Hi Erich,

Thanks, this looks super-interesting. A lot of them seem to originate from UCI. Are there some for which the original data is not on OpenML already? Am I correct that the data is the same as the UCI data, but with an additional feature indicating the outliers? How do you establish the ground truth about which points are outliers?

For the integration, we'll need the following:

It seems you are focusing on the unsupervised setting. Should we consider the supervised setting as well (classification and regression)?

We're doing a hackathon in Munich on Feb 27 - Mar 3. Would you be able to help us during that week?

Cheers,
Joaquin

Most data sets are *derived* from UCI data sets. In order to obtain "outlier" data sets, authors often downsample all but the largest class, but few publish their exact resulting data set. For reproducibility, we made static samples and uploaded them.

Also, most methods won't accept non-numeric attributes, so there are variants with different approaches of, e.g., one-hot encoding categorical attributes.

All the data and experiments in that repository are for unsupervised outlier detection.

We don't have an ARFF writer yet. But if I have some spare time, I can try to first add ARFF output (which is likely of wide interest) and then provide you with a command-line call for evaluation.

There is also the problem that some outlier methods assign small values to outliers, while most assign large values, so some metadata is necessary to interpret results. The result files you have on the web page are easy: one row per run, one column per object with the outlier score only. This does not include the above metadata - you have to know that, e.g., FastABOD uses small scores for outliers; our evaluation tool has a regexp to recognize known methods based on the row label.

Supervised setting: ELKI only has very basic supervised methods such as kNN classification, largely because Weka etc. were already rather good here, and it makes more sense to implement methods that aren't already available somewhere else.

I don't know about the end of February yet.

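To make the described result layout concrete, here is a rough reader sketch in Python. The whitespace-separated format, the file name, and the regexp for small-score methods are assumptions for illustration; ELKI's evaluation tool maintains its own list of known methods.

```python
# Sketch of a reader for the layout described above: one row per run,
# a row label, then one outlier score per object. The separator and the
# method regexp are assumptions, not the actual DAMI/ELKI conventions.
import re
import numpy as np

# Methods assumed to assign *small* scores to outliers, e.g. FastABOD.
SMALL_SCORE_METHODS = re.compile(r"ABOD", re.IGNORECASE)

def read_runs(path):
    """Yield (run_label, scores), flipped so that larger = more outlying."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            label = fields[0]
            scores = np.asarray(fields[1:], dtype=float)
            if SMALL_SCORE_METHODS.search(label):
                scores = -scores  # normalize polarity before evaluation
            yield label, scores

for label, scores in read_runs("raw-results.txt"):  # hypothetical file name
    print(label, scores[:5])
```
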
Hi Erich,
Would it be interesting to import these as well?
https://arxiv.org/abs/1503.01158
Cheers,
Joaquin
The data sets are rather similar; I'm not sure that adding another 100+ variants of the same UCI "mother set" (as it is called in their article) adds much, in particular as the UCI data sets are not very well suited for anomaly detection and need such derivations in the first place. Their work also follows a much more complex derivation procedure (with a kernel logistic regression (KLR) to estimate the difficulty of an anomaly, and very tight control of the exact rate of anomalies), while our repository follows the procedure found in various published literature: random downsampling of the minority class, i.e., we tried to reproduce the data sets that were used in earlier work.
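For concreteness, a minimal sketch of the downsampling procedure described in these comments, in Python (the column handling, the 3% target rate, and the choice of the smallest class as the outlier source are illustrative assumptions):

```python
# Sketch: derive an outlier data set from a classification data set by
# keeping the largest class as inliers and randomly downsampling a minority
# class to a small outlier rate. Rate and class choice are assumptions.
import pandas as pd

def make_outlier_dataset(df, class_col, outlier_rate=0.03, seed=0):
    counts = df[class_col].value_counts()
    inliers = df[df[class_col] == counts.idxmax()]   # largest class = inliers
    minority = df[df[class_col] == counts.idxmin()]  # smallest class = outlier pool
    # number of outliers so that outliers / (inliers + outliers) ~ outlier_rate
    n_out = max(1, round(outlier_rate * len(inliers) / (1 - outlier_rate)))
    outliers = minority.sample(n=min(n_out, len(minority)), random_state=seed)
    result = pd.concat([inliers, outliers]).drop(columns=class_col)
    result["outlier"] = [0] * len(inliers) + [1] * len(outliers)
    return result.reset_index(drop=True)
```

Fixing the random seed here is what makes such a sample a reproducible "static" data set, as described above.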