New Dataset: Hotgym #7

Open · 2 tasks
breznak opened this issue Jul 25, 2019 · 2 comments
Labels: data dataset

Comments

@breznak
Member

breznak commented Jul 25, 2019

If it's not already provided(?)

  • add "hotgym" data from htm.core (this is the "Hellow world" time-series dataset used in HTM commonly)
  • install NAB as a part for QA in htm.core and run htmcore detector results on (some) datasets
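A minimal sketch of that conversion, assuming a local copy of htm.core's hotgym CSV (the file name rec-center-hourly.csv, its m/d/yy H:MM timestamp format, and its column order are assumptions) and NAB's two-column timestamp,value data format:

```python
# Hypothetical conversion of htm.core's hotgym CSV into NAB's "timestamp,value"
# data format. Input file name, timestamp format, and column order are assumed.
import csv
from datetime import datetime

with open("rec-center-hourly.csv") as src, \
     open("hotgym_nab.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(["timestamp", "value"])
    next(reader)  # skip the column-name header; OPF-style files may carry extra type/flag rows
    for row in reader:
        # NAB expects "%Y-%m-%d %H:%M:%S" timestamps and a single float value
        ts = datetime.strptime(row[0], "%m/%d/%y %H:%M")
        writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S"), float(row[1])])
```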
@breznak added the data dataset label Jul 25, 2019
@ctrl-z-9000-times
Collaborator

I don't think we should add more data to the benchmark, because it will invalidate any old results from before we added the data. After changing the benchmark data we would need to clear the scoreboard.

@breznak
Member Author

breznak commented Jul 26, 2019

> should [not] add more data to the benchmark, because it will invalidate any old results from before we added the data.

well, I would say the point of NAB is to offer a framework for "anomaly detection benchmarks for algorithms on time-series datasets (with a focus on HTM)".

I have some concerns about the data currently included:

  • human annotations (can be error-prone)
  • no synthetic datasets for benchmarking behavior under controlled conditions (e.g. the boosting effect on a "flat line"); see the sketch after this list
  • no multi-modal datasets (the advantage of HTM compared to (simple) threshold-based approaches is detection of irregularities in co-occurring patterns)
  • add more well-established AD datasets (e.g. the ECG data from PhysioNet)
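A rough sketch of such a synthetic dataset in NAB's "timestamp,value" CSV layout: a constant signal with one injected spike, so detector behavior (e.g. boosting on a flat input) can be observed under controlled conditions. The file name, sampling interval, and anomaly position are made up for illustration:

```python
# Generate a hypothetical "flat line with one spike" dataset in NAB's CSV format.
import csv
from datetime import datetime, timedelta

start = datetime(2019, 7, 1)
rows = []
for i in range(5000):
    value = 10.0          # constant signal ("flat line")
    if i == 4000:
        value = 100.0     # single injected anomaly
    rows.append((start + timedelta(minutes=5 * i), value))

with open("synthetic_flatline.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "value"])
    for ts, v in rows:
        writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S"), v])
```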

> After changing the benchmark data we would need to clear the scoreboard.

Yes, but there are these options:

  • tagged versions. Even Numenta suggests keeping NAB tagged, so results are reproducible
    • should that be a problem, would it make sense to separate NAB+detectors from "datasets+results+scoreboard"? It could be a sub-repo of NAB, which even Numenta could share, and users could easily run any version they want
    • we can re-run the algorithms on the new dataset and update the overall results
      • that's what we want: to keep an updated, overall comparison between the detectors
      • for detectors that are not reproducible (not OSS, or not runnable by us), I'd say scratch them; not being able to reproduce them renders the results untrustworthy.

TL;DR: Suggested approaches:

  • keep NAB git-tagged
  • separate NAB-datasets repo
  • just update the results with all detectors re-run on the new datasets (a rough re-run sketch follows)
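A hedged sketch of that last option: drive NAB's run.py once per detector to re-detect and re-score against the updated data. The detector names are illustrative, and the exact flag set you need (e.g. whether to add --optimize to re-tune thresholds) may differ:

```python
# Hypothetical driver: re-run each detector over the updated data and rebuild the
# scores by invoking NAB's run.py per detector. Run from the NAB repository root.
import subprocess

detectors = ["numenta", "htmcore"]  # assumed detector names to refresh

for name in detectors:
    subprocess.run(
        ["python", "run.py", "-d", name, "--detect", "--score", "--normalize"],
        check=True,
    )
```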
