I have found edge cases where transforming new, unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This applies specifically to the Hotelling T2 statistic.
Digging into it, the cause is that hotellingsT2(), called by compute_outliers() from transform(), operates on all rows of the PC dataframe.
Because the mean and variance are locked, using all rows does not change the calculation of y_score for the original data, nor the y_proba or Pcomb variables.
But the calculation of Pcorr via multitest_correction() is directly affected by running over more rows than before, and it is this column that is compared to alpha to determine the y_bool column in results['outliers'].
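To illustrate the mechanism, here is a minimal sketch (not the library's code; the p-values are made up, and I'm assuming a Benjamini-Hochberg-style correction such as statsmodels' multipletests with method='fdr_bh'). The same small p-value can pass the corrected threshold when the correction runs over the fitted rows alone, but fail once the new rows' p-values are pooled in:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

alpha = 0.05

# Hypothetical p-values for five fitted rows; one is small enough to be flagged
p_orig = np.array([0.008, 0.20, 0.35, 0.50, 0.80])

# Correction over the fitted rows only (the situation right after fit_transform)
reject_before, _, _, _ = multipletests(p_orig, alpha=alpha, method='fdr_bh')

# After transform(), the correction runs over fitted + new rows pooled together
p_new = np.array([0.15, 0.30, 0.45, 0.60, 0.90])
reject_after, _, _, _ = multipletests(np.concatenate([p_orig, p_new]),
                                      alpha=alpha, method='fdr_bh')

print(reject_before)                # [ True False False False False] -> 1 outlier
print(reject_after[:len(p_orig)])   # [False False False False False] -> 0 outliers
```

With 5 p-values, 0.008 clears the BH threshold for the smallest p-value (alpha * 1/5 = 0.01); with 10 pooled p-values that threshold drops to alpha * 1/10 = 0.005, and the same row is no longer flagged.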
In short: fitting data and then transforming new data with update_outlier_params=False can change the y_proba and y_bool of the originally fitted data in certain cases.
I experimented and put together a simple dummy-data example that replicates this. To be fair, I'm not sure this is a huge concern, but I figure the expectation is that the outlier params of previously fitted data won't change when update_outlier_params=False, and it showed up in the usage I'm building.
This example changes the number of Hotelling T2 outliers (as determined by y_bool) in the original fit data from 1 to 0.
```python
import numpy as np
import pandas as pd
from pca import pca

# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))

# Insert outliers
X_orig.iloc[500:510, 8:] = 15

# PCA training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)
outliers_original = model.results['outliers']

# Create new data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))

# Transform new data with the outlier parameters frozen
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']

# Compare the original points' outlier results before and after the transform
n_total = len(X_orig)  # number of originally fitted rows
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:n_total]['y_bool'].value_counts())
```
I'm not sure what the fix is from a statistics standpoint (running the multitest differently, checking for changes, etc.), but I wanted to raise the question.
I understand that it inherently makes sense for the y_proba of previous data to change once more data is added, so this seems more a philosophical problem than a statistical one; but for someone tracking outliers as more and more data is transformed, it shows up.
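In the meantime, one possible user-side workaround (a sketch built on the example above, not a library fix) is to snapshot the fitted rows' outlier results and restore them after each transform, so that only the newly transformed rows carry fresh values:

```python
# Sketch of a user-side workaround, reusing the variables from the example
# above: restore the fitted rows' outlier results after each transform so
# that only the newly transformed rows get fresh y_proba / y_bool values.
n_total = len(X_orig)
model.transform(X=X_new, update_outlier_params=False)
outliers = model.results['outliers'].copy()
outliers.iloc[:n_total] = outliers_original.to_numpy()  # freeze the original rows
```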