[FEATURE] Preprocessing - IQR Transformer #717

Open
fabioscantamburlo opened this issue Nov 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@fabioscantamburlo
Contributor

fabioscantamburlo commented Nov 13, 2024

Hello,

In my Kaggle journey I quite often use the IQR technique to replace out-of-scale values with predefined or data-driven values.

I already have a scikit-learn-compatible implementation of this method, which I use in pipelines to easily validate my models with KFold.

I think it would be a waste not to include this feature in Sklego, so I'm proposing it to the community. 🧑‍🤝‍🧑

Use case scenario:

import pandas as pd
import numpy as np 


data = {
    'A': np.random.randint(10, 20, size=10),
    'B': np.random.randint(100, 200, size=10),
    'C': np.random.randint(50, 80, size=10),
    'D': np.random.randint(1, 3, size=10)
}
df = pd.DataFrame(data)
df = pd.concat([df, pd.DataFrame({
    # Manually append some out-of-scale values
    'A': [300, -100],
    'B': [1200, -200],
    'C': [360, -10],
    'D': [30, -40]
    })], axis=0, ignore_index=True)
df.to_numpy()
array([[  11,  168,   62,    1],
       [  12,  154,   64,    2],
       [  16,  156,   76,    2],
       [  10,  176,   50,    2],
       [  19,  121,   57,    2],
       [  14,  130,   73,    1],
       [  17,  107,   56,    1],
       [  12,  184,   67,    1],
       [  17,  139,   60,    1],
       [  18,  128,   54,    2],
       [ 300, 1200,  360,   30],
       [-100, -200,  -10,  -40]])

In this example I replace the outliers with the column mean, computed after excluding the out-of-scale values detected by IQR.
After transformation:

array([[ 11. , 189. ,  77. ,   1. ],
       [ 14. , 151. ,  50. ,   1. ],
       [ 10. , 177. ,  53. ,   1. ],
       [ 19. , 197. ,  63. ,   1. ],
       [ 19. , 146. ,  65. ,   2. ],
       [ 10. , 189. ,  62. ,   2. ],
       [ 10. , 197. ,  54. ,   1. ],
       [ 19. , 146. ,  56. ,   1. ],
       [ 14. , 162. ,  69. ,   1. ],
       [ 12. , 148. ,  75. ,   2. ],
       [ 13.8, 170.2,  62.4,   1.3],
       [ 13.8, 170.2,  62.4,   1.3]])

Do you think such a feature would add value to the lego toolkit?

@fabioscantamburlo fabioscantamburlo added the enhancement New feature or request label Nov 13, 2024
@koaning
Owner

koaning commented Nov 14, 2024

I am not super sure what you mean when you refer to the IQR technique. Could you elaborate on it and also explain why it is so beneficial in ML pipelines?

@fabioscantamburlo
Contributor Author

fabioscantamburlo commented Nov 15, 2024

Hello Vincent!

Yeah, happy to do that.

IQR TRICK
The idea is to use the following approach: Credits

  • Calculate the first quartile (Q1) and third quartile (Q3) of the column (feature), e.g. the 0.25 and 0.75 quantiles, or any other pair of choice.
  • Compute the interquartile range (IQR) as the difference between Q3 and Q1 (IQR = Q3 – Q1).
  • Define the lower outlier threshold as Q1 – (constant * IQR) and the upper outlier threshold as Q3 + (constant * IQR).
  • Identify any data points that fall below the lower threshold or above the upper threshold. These observations are considered outliers.
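
The four steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name `iqr_bounds` is hypothetical, and `constant=1.5` is an assumed default (the conventional choice), not something fixed by the proposal.

```python
import numpy as np


def iqr_bounds(x, q_low=0.25, q_high=0.75, constant=1.5):
    """Return (lower, upper) outlier thresholds for a 1-D array.

    `constant=1.5` is an assumed default, not part of the proposal.
    """
    q1, q3 = np.quantile(x, [q_low, q_high])
    iqr = q3 - q1
    return q1 - constant * iqr, q3 + constant * iqr


# Column 'A' from the first comment, including the two hand-added values
a = np.array([11, 12, 16, 10, 19, 14, 17, 12, 17, 18, 300, -100])
lower, upper = iqr_bounds(a)

# Points below the lower or above the upper threshold are flagged
outlier_mask = (a < lower) | (a > upper)
print(lower, upper)
print(a[outlier_mask])
```

With these settings only the two hand-added values, 300 and -100, fall outside the thresholds.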

It is a rather simple methodology, but in some cases I think it is a nice starting point to get rid of some crazy values in the data without having specific domain knowledge about the features.

In the proposed transformer, the idea is just to take the "IQR identified outliers" and replace them with some specific values.
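
A minimal sketch of what such a transformer could look like is shown here. It is not the implementation mentioned in the issue: the class name `IQRReplacer`, its parameters, and the choice to always fill with the inlier mean are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class IQRReplacer(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: replace IQR-flagged outliers with the inlier mean."""

    def __init__(self, q_low=0.25, q_high=0.75, constant=1.5):
        self.q_low = q_low
        self.q_high = q_high
        self.constant = constant

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column quantiles and thresholds, learned on the training data
        q1, q3 = np.quantile(X, [self.q_low, self.q_high], axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.constant * iqr
        self.upper_ = q3 + self.constant * iqr
        # Column means computed over inliers only
        inliers = (X >= self.lower_) & (X <= self.upper_)
        self.fill_ = np.nanmean(np.where(inliers, X, np.nan), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        outliers = (X < self.lower_) | (X > self.upper_)
        # Broadcast the per-column fill value into the flagged cells
        return np.where(outliers, self.fill_, X)


# Columns 'A' and 'D' from the pre-transformation array in the first comment
X = np.array([[11, 1], [12, 2], [16, 2], [10, 2], [19, 2], [14, 1],
              [17, 1], [12, 1], [17, 1], [18, 2], [300, 30], [-100, -40]])
X_clean = IQRReplacer().fit_transform(X)
print(X_clean[-2:])
```

Because the thresholds and fill values are learned in `fit`, the same statistics are reused on validation folds, which is what makes the approach usable inside a pipeline with KFold.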

The RobustScaler in scikit-learn is built around more or less the same idea: it centers each feature on its median and scales it by the IQR, so the scaling is much less affected by extreme values.
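
For comparison, here is a short sketch of `RobustScaler` on column 'A' from the first comment. Note that it rescales values but does not replace them, so extreme points stay extreme after scaling.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Column 'A' from the first comment, as a single-feature 2-D array
a = np.array([11, 12, 16, 10, 19, 14, 17, 12, 17, 18, 300, -100],
             dtype=float).reshape(-1, 1)

# quantile_range is given in percentiles; (25.0, 75.0) is the default
scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(a)
print(scaler.center_)  # per-column median
print(scaler.scale_)   # per-column IQR

scaled = scaler.transform(a)
print(scaled[-2:])     # the hand-added values are still far from the rest
```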


2 participants