Create a set of benchmark datasets #127
Comments
I can help with this. It'd be best to run this with a testing framework, so the tests or CI can check whether changes to the models/code (e.g. defaults or improvements) break or reduce performance.
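As a rough sketch of what such a CI regression check could look like (the dataset, model, and accuracy floor below are illustrative placeholders, not the project's actual benchmark or TabNet itself):

```python
# Sketch of a pytest-style accuracy-regression test. A real check would
# train TabNetClassifier on a fixed benchmark dataset with a pinned seed;
# here a small sklearn dataset and model stand in as placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACCURACY_FLOOR = 0.90  # assumed baseline; would be tuned per dataset/model


def test_accuracy_does_not_regress():
    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    acc = model.score(X_te, y_te)
    assert acc >= ACCURACY_FLOOR, f"accuracy {acc:.3f} fell below {ACCURACY_FLOOR}"
```

CI would then run this on every PR, so a change to defaults that hurt benchmark accuracy would fail the build rather than go unnoticed.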
Anecdotally, I recently noticed a drop in accuracy (or maybe convergence speed) on Forest Cover Type when upgrading the version of PyTorch... I'd be interested to see whether others experience the same, and to understand whether there's some issue that needs addressing or it's just statistical variation. Stopping at 200 epochs, I observed test accuracies of:
@athewsey thanks for reporting that, it seems like quite a lot for just changing the torch version. Have you been experimenting on the latest release, changing only the pytorch version? I understand that random seeds could change from one version to another, but after 200 epochs there should not be such a gap. @Hartorn @eduardocarvp did you notice such strong changes when monitoring tabnet scores?
@Optimox those figures were, I believe, all using develop code as of my recent PR #164. I took a random 80/10/10 training/validation/test split of Forest Cover Type and just tried the different PyTorch framework versions via AWS's provided deep learning container images on SageMaker, so all with Python 3.6, Ubuntu 16.04, and (if I interpret the container versioning correctly) CUDA 10.1... But there's a chance there are some small, relevant library differences between them. All the training was run on an ... I appreciate the library versions aren't as controlled as they could be between tests, and I'll try to re-run in a fully controlled/local env with only PyTorch differing if possible, but it's tricky as my current workflow is mostly set up for those pre-built images. I just thought it was worth mentioning for this ticket's prioritization, and because I hadn't seen discussion of cross-version benchmarking/accuracy checks elsewhere on the project.
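For reference, an 80/10/10 split like the one described can be produced with two chained `train_test_split` calls (a sketch; the dummy data and seed are arbitrary):

```python
# Sketch: random 80/10/10 train/validation/test split via two chained
# splits. The dummy X/y arrays stand in for the Forest Cover Type data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First carve off 20%, then split that remainder evenly into valid/test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)
print(len(X_train), len(X_valid), len(X_test))  # 800 100 100
```

Pinning `random_state` in both calls is what makes the split reproducible across runs, which matters when comparing framework versions.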
Feature request
I created some Research Issues that would be interesting to work on. But it's hard to tell whether an idea is a good one without a clear benchmark on different datasets.
So it would be great to have a few notebooks that could run on different datasets in order to monitor the performance uplift of a new implementation.
What is the expected behavior?
The idea would be to run this for each improvement proposal and see whether it helped or not.
How should this be implemented in your opinion?
This issue could be closed little by little by adding new notebooks, each performing a benchmark on one well-known dataset.
Or maybe it's a better idea to incorporate tabnet into existing benchmarks like the CatBoost benchmarks: https://github.com/catboost/benchmarks
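A minimal harness that the per-dataset notebooks could wrap might look like this (a sketch: the registered datasets and model factory are stand-ins, not the project's actual benchmark suite):

```python
# Sketch of a minimal benchmark harness: run each registered dataset
# through a model factory and collect test accuracy. The sklearn toy
# datasets and model here are placeholders for real benchmark datasets.
from sklearn.datasets import load_iris, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

DATASETS = {"iris": load_iris, "wine": load_wine}


def run_benchmark(model_factory, seed=0):
    results = {}
    for name, loader in DATASETS.items():
        X, y = loader(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        model = model_factory().fit(X_tr, y_tr)
        results[name] = model.score(X_te, y_te)
    return results


scores = run_benchmark(lambda: LogisticRegression(max_iter=5000))
print(scores)
```

Running the same harness before and after an improvement proposal would give the side-by-side numbers needed to judge whether the change helped.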
Are you willing to work on this yourself?
Yes, of course, but any help would be appreciated!