
A recommendation from Machine learning safety measures:

Build software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process. Unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on them.
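
For instance, such a check might be nothing more than a function that inspects X and y before training and emits a warning. A minimal sketch for two of the problems mentioned above (unscaled features and class imbalance), assuming NumPy arrays; the function names and thresholds are placeholders of mine, not from any existing library:

```python
# Minimal sketch of pre-training checks; names and thresholds are placeholders.
import warnings
import numpy as np

def check_feature_scaling(X, tol=10.0):
    """Warn if feature scales differ wildly, suggesting unscaled inputs."""
    stds = np.nanstd(np.asarray(X, dtype=float), axis=0)
    stds = stds[stds > 0]
    if stds.size > 1 and stds.max() / stds.min() > tol:
        warnings.warn(f"Feature scales differ by more than {tol}x; consider standardizing.")

def check_class_balance(y, threshold=0.1):
    """Warn if the rarest class makes up less than `threshold` of the records."""
    _, counts = np.unique(y, return_counts=True)
    if len(counts) > 1 and counts.min() / counts.sum() < threshold:
        warnings.warn("Severe class imbalance; consider resampling or class weights.")
```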

Things that can go wrong, from Functional but unsafe machine learning:

  • Allowing information leakage across features or across records, resulting in erroneously high accuracy claims. For example, splitting related (e.g. nearby) records into the training and validation sets (see the group-aware split sketched after this list).
  • Not accounting for under-represented classes, so that predictions are biased towards over-represented ones. This kind of error was commonly seen in work on the McMurray Formation of Alberta, which is 80% pay.
  • Forgetting to standardize or normalize numerical inputs to a model in production, producing erroneous predictions. For example, training on gamma-ray Z-scores of roughly –3 to +3, then asking for a prediction for a value of 75.
  • Using cost functions that do not reflect the opinions of humans with expertise about ‘good’ vs ‘bad’ predictions.
  • Racial or gender bias in a human resource model, such as might be used for hiring or career mapping.
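
One way to guard against the leakage described in the first bullet is to split by group (e.g. by well or by location) rather than by record, so that related records never straddle the train/validation boundary. A sketch using scikit-learn's GroupShuffleSplit; the well-ID grouping and the synthetic data are hypothetical:

```python
# Sketch: a group-aware split so that records from the same group (e.g. the
# same well) never end up on both sides of the train/validation boundary.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(1000, 5)             # hypothetical feature matrix
y = np.random.randint(0, 2, 1000)       # hypothetical labels
groups = np.repeat(np.arange(50), 20)   # e.g. 50 wells, 20 records per well

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

# No well appears in both sets, so nearby records cannot leak across the split.
assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```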

From my initial planning notebook (a few of these checks are sketched in code after the list):

  • Class imbalance
  • Redundant column (all same value)
  • Monotonic column
  • Test for unimodal vs multimodal (find peaks in KDE of 1 feature)
  • Highly correlated features
  • Self-correlated features
  • Looks like noise (uncorrelated with all other columns, no structure, white)
  • Standardized features (not sure how to check this robustly; the mean and stdev might not be exactly 0 and 1)
  • Non-numerical feature (not allowed into sklearn)
  • Apparently categorical feature (or check type if dataframe)
  • Multioutput (>1 col in y) - constrains model types
  • Missing values
  • Significant outliers
  • Spikes / weird values or clumps of values
  • Clipped features (spike(s) at end(s) of histogram)
  • Otherwise weirdly shaped histogram
  • Non-normal distributions, esp power or exponential
  • Negative values in mostly positive data
  • Out-of-bounds values in mostly 0-1 or 0-100 data <-- Could be interesting
  • Confounding variable? (Can we tell if one is present? Not sure)
  • How much of the distribution is sampled for this dimensionality?
  • Check for record independence?
    • Check autocorrelation is ~0 for non-zero lags?
    • Check spectrum is ~white?
    • Compare train/test on slice vs shuffle (on any prediction... slice should not be dramatically different)
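
Several of these are easy to prototype. A rough sketch of the redundant-column, monotonic-column, clipping, and record-independence checks, assuming 1-D NumPy arrays; all thresholds are arbitrary placeholders:

```python
# Rough sketches of a few of the checks listed above; thresholds are placeholders.
import numpy as np

def is_constant(x):
    """Redundant column: every (non-NaN) value is the same."""
    return np.nanmin(x) == np.nanmax(x)

def is_monotonic(x):
    """Monotonic column: values only ever increase, or only ever decrease."""
    d = np.diff(x)
    return bool(np.all(d >= 0) or np.all(d <= 0))

def is_clipped(x, factor=3.0):
    """Clipped feature: a disproportionate spike in the first or last histogram bin."""
    counts, _ = np.histogram(x[~np.isnan(x)], bins=100)
    return bool(counts[0] > factor * counts[1] or counts[-1] > factor * counts[-2])

def looks_independent(x, max_lag=10, tol=0.1):
    """Record independence: autocorrelation should be ~0 at non-zero lags."""
    x = (x - np.mean(x)) / np.std(x)
    acf = np.correlate(x, x, mode='full') / len(x)
    centre = len(acf) // 2
    return bool(np.all(np.abs(acf[centre + 1:centre + 1 + max_lag]) < tol))
```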

If we get a train/test/val flag too, then (a distribution comparison is sketched after this list):

  • Non-stratified features wrt train/val/test
  • Different distributions of features wrt train/val/test
  • Different distributions of target / labels wrt train/val/test
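
The distribution comparisons could lean on a two-sample test. A sketch using SciPy's Kolmogorov-Smirnov test, applied feature by feature (the same idea works for the target); the significance level and the synthetic data are placeholders:

```python
# Sketch: flag features whose train and test distributions look different.
import numpy as np
from scipy.stats import ks_2samp

def same_distribution(train_values, test_values, alpha=0.05):
    """Two-sample KS test; False suggests the split is not representative."""
    statistic, p_value = ks_2samp(train_values, test_values)
    return p_value >= alpha

# Hypothetical example data standing in for real train/test feature matrices.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(800, 3)), rng.normal(size=(200, 3))
suspect = [i for i in range(X_train.shape[1])
           if not same_distribution(X_train[:, i], X_test[:, i])]
```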
