This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Investigate rebalancing the training set #61

Open
redshiftzero opened this issue Oct 12, 2016 · 2 comments

Comments

@redshiftzero
Contributor

We have a very imbalanced machine learning problem: we have far fewer SecureDrop users than non-SecureDrop users. There are many ways of handling this situation, including oversampling the minority class or undersampling the majority class. Some of the techniques used for machine learning with very skewed classes are implemented in this library: https://github.com/scikit-learn-contrib/imbalanced-learn, so we could give some of them a try.
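As a sketch of the simplest of those techniques, random oversampling (what imbalanced-learn exposes as `RandomOverSampler`) just resamples minority-class rows with replacement until the classes match. The helper name and toy data below are illustrative, not from this repo:

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Naive rebalancing: draw minority-class rows with replacement
    until both classes are the same size. This replicates traces, which
    is exactly the caveat discussed below."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # Sample enough extra minority indices to close the gap.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy data: 2 "SecureDrop" traces (label 1) vs 8 others (label 0).
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # balanced: [8 8]
```

imbalanced-learn also implements synthetic approaches like SMOTE that interpolate new minority samples rather than duplicating existing ones.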

@psivesely
Contributor

@redshiftzero and I discussed this in person for a minute, including whether we should increase the monitored_nonmonitored_ratio in fpsd/config.ini. We decided to leave it for now, but if we later decide we want more SecureDrop data, it might be better to bump that from 10 to 100, which would give us roughly a 50:50 class split in terms of frontpage_traces. That's not to say the linked library doesn't have good stuff in it, or that we shouldn't see what we can get out of some of its functionality. The conclusion was that collecting more raw data will give more accurate results than oversampling from the same dataset, where you are essentially replicating traces. Let me know if I missed anything here, @redshiftzero.

@psivesely
Contributor

Matthews correlation coefficient (sklearn.metrics.matthews_corrcoef) "is used in machine learning as a measure of the quality of binary (two-class) classifications... generally regarded as a balanced measure which can be used even if the classes are of very different sizes."
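For concreteness, here is MCC computed directly from the confusion-matrix counts; this mirrors what sklearn.metrics.matthews_corrcoef returns in the binary case (the function and toy labels below are illustrative):

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1).
    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when a marginal count is zero, as sklearn does."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Why it suits skewed classes: always predicting the majority class
# scores 90% accuracy on a 9:1 split, but MCC gives it 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(mcc(y_true, y_pred))  # 0.0
```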
