This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Investigate rebalancing the training set #61

Open
redshiftzero opened this issue Oct 12, 2016 · 2 comments

Comments

@redshiftzero
Contributor

We have a very imbalanced machine learning problem: we have far fewer SecureDrop users than non-SecureDrop users. There are many ways of handling this situation, including oversampling the minority class or undersampling the majority class. Some of the techniques used for machine learning with very skewed classes are implemented in this library: https://github.com/scikit-learn-contrib/imbalanced-learn, so we could give some of them a try.
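As a sketch of the simplest of those techniques, random oversampling (what imbalanced-learn exposes as `RandomOverSampler`) just resamples minority-class rows with replacement until the classes match. The helper name and toy data below are illustrative, not from this repo:

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Naive rebalancing: draw minority-class rows with replacement
    until both classes are the same size. This replicates traces, which
    is exactly the caveat discussed below."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # Sample enough extra minority indices to close the gap.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy data: 2 "SecureDrop" traces (label 1) vs 8 others (label 0).
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # balanced: [8 8]
```

imbalanced-learn also implements synthetic approaches like SMOTE that interpolate new minority samples rather than duplicating existing ones.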

@psivesely
Contributor

@redshiftzero and I discussed this in person for a minute, including whether we should increase the monitored_nonmonitored_ratio in fpsd/config.ini. We decided to leave it for now, but if we later decide we want more SecureDrop data, it might be better to bump that from 10 to 100, which would give us roughly a 50:50 class split in terms of frontpage_traces. That's not to say the linked library doesn't have good stuff in it, or that we shouldn't see what we can get out of some of its functionality. The conclusion was that collecting more raw data will give more accurate results than oversampling from the same dataset, where you are essentially replicating traces. Let me know if I missed anything here, @redshiftzero.

@psivesely
Contributor

Matthews correlation coefficient (sklearn.metrics.matthews_corrcoef) "is used in machine learning as a measure of the quality of binary (two-class) classifications... generally regarded as a balanced measure which can be used even if the classes are of very different sizes."
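For concreteness, here is MCC computed directly from the confusion-matrix counts; this mirrors what sklearn.metrics.matthews_corrcoef returns in the binary case (the function and toy labels below are illustrative):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1).
    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when a marginal count is zero, as sklearn does."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Why it suits skewed classes: always predicting the majority class
# scores 90% accuracy on a 9:1 split, but MCC gives it 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(mcc(y_true, y_pred))  # 0.0
```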
