Understanding Drift Detectors with synthetic data #844

TawabG · 2022-02-14T17:27:39Z

Hi all!

Thank you for developing this awesome package!

I have been experimenting with drift detection methods, but I cannot understand why certain things are happening. I followed @jacobmontiel's material from the "Open Source Machine Learning for Data Streams" tutorial at DSAA 2021.

Let me take you through my process:
The data is generated using the AGRAWAL data generator, with 3 gradual drifts at the 5k, 10k, and 15k marks (file: agr_a_20k.csv).

I then used progressive_val_score() to write the accuracy score after every sample to a log file, and parsed that file to collect the scores in a list.

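from river.evaluate import progressive_val_score

# `stream`, `model` and `metric` are defined earlier (not shown here).
# Write the running metric to a log file after every sample.
with open('test-drift.log', 'w') as f:
    metric = progressive_val_score(dataset=stream,
                                   model=model,
                                   metric=metric,
                                   print_every=1,
                                   file=f)

# Parse the logged percentages back into a list of floats.
results = []
with open('test-drift.log') as f:
    for line in f.read().splitlines():
        # Take the last token (e.g. '87.50%'); slicing line[-6:] would cut '100.00%' to '00.00%'
        percentage = line.split()[-1]
        results.append(float(percentage.strip('%')))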
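
The parsing step can be made tolerant of different value widths (`5.00%`, `87.50%`, `100.00%`) by matching the percentage with a regular expression. This is only a minimal sketch: the sample lines are made up to resemble what I see in my log file, and `parse_accuracy` is a hypothetical helper, not part of River.

```python
import re

# Matches a percentage such as "87.50%" or "100.00%" anywhere on a line.
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)%")

def parse_accuracy(line):
    """Return the last percentage found on the line as a float, or None."""
    matches = _PERCENT.findall(line)
    return float(matches[-1]) if matches else None

# Made-up sample lines (assumed layout), including an empty line.
sample_lines = [
    "[1] Accuracy: 0.00%",
    "[2] Accuracy: 50.00%",
    "[3] Accuracy: 100.00%",
    "",
]

# Lines without a percentage are skipped instead of raising an error.
parsed = [v for v in map(parse_accuracy, sample_lines) if v is not None]
print(parsed)
```

Skipping lines that do not match means a stray blank line or header in the log does not abort the whole run.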

I then plotted the results, which gives this graph:

However, when I use ADWIN to find out where the drifts occur, I get these results:

import matplotlib.pyplot as plt
from matplotlib import gridspec
from river import drift
from river.drift import ADWIN

# Auxiliary function to plot the data
def plot_data(dist_a, drifts=None):
    fig = plt.figure(figsize=(7,3), tight_layout=True)
    gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
    ax1, ax2 = plt.subplot(gs[0]), plt.subplot(gs[1])
    ax1.grid()
    ax1.plot(dist_a, label='Stream')   # plot the argument, not the global `results`
    ax2.grid(axis='y')
    ax2.hist(dist_a, label=r'$dist_a$')
    if drifts is not None:
        for drift_detected in drifts:
            ax1.axvline(drift_detected, color='red')
    plt.show()


adwin = ADWIN()   # the detector must be created before it is updated

for i, val in enumerate(results):
    in_drift, in_warning = adwin.update(val)
    if in_drift:
        print(f"Change detected at index {i}, input value: {val}")


drift_detector = drift.ADWIN()
drifts = []

for i, val in enumerate(results):
    drift_detector.update(val)   # Data is processed one sample at a time
    if drift_detector.change_detected:
        # The drift detector indicates after each sample if there is a drift in the data
        print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()   # As a best practice, we reset the detector

plot_data(results, drifts)

I can understand why ADWIN detects so many changes, but I wonder whether there is a way to make it focus on just the gradual drifts at the 5k, 10k, and 15k marks. I really hope somebody can explain this to me!
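
One thing to keep in mind here: the accuracy logged by progressive_val_score() is the running value over the whole stream so far, so a change in the model's error rate shows up only slowly in the logged numbers. Below is a minimal sketch of that effect in plain Python. It uses a made-up 0/1 correctness sequence (not the AGRAWAL data), and `cumulative_accuracy` and `rolling_accuracy` are hypothetical helpers, not River functions.

```python
from collections import deque

def cumulative_accuracy(hits):
    """Accuracy over all samples seen so far, in percent."""
    out, correct = [], 0
    for n, hit in enumerate(hits, start=1):
        correct += hit
        out.append(100.0 * correct / n)
    return out

def rolling_accuracy(hits, window):
    """Accuracy over the last `window` samples only, in percent."""
    out, recent, correct = [], deque(), 0
    for hit in hits:
        recent.append(hit)
        correct += hit
        if len(recent) > window:
            correct -= recent.popleft()   # drop the oldest sample from the count
        out.append(100.0 * correct / len(recent))
    return out

# Toy stream: the model is always right for 1,000 samples, then always wrong.
hits = [1] * 1000 + [0] * 1000
cum = cumulative_accuracy(hits)
roll = rolling_accuracy(hits, window=100)

# Values 100 samples after the change (index 1099):
print(round(cum[1099], 1), roll[1099])   # 90.9 0.0
```

Feeding the detector per-sample correctness (1 for a hit, 0 for a miss) or a windowed score, rather than the running accuracy, may give it a signal that reacts faster. Whether that isolates the three gradual drifts on this dataset is not something this sketch verifies.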

Update 1: When the metric is printed only every 100 steps, it starts to get interesting:

However, this is still not very accurate: we know the gradual drifts are at the 5k, 10k, and 15k marks, which should translate to detections at around indices 50, 100, and 150 (each multiplied by 100). As the graph above shows, that is clearly not the case.
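
For reference, the conversion between log-line indices and stream positions assumed here can be sketched as follows. `to_stream_positions` and `match_drifts` are hypothetical helpers, the detection indices are invented for illustration (they are not my actual results), and the tolerance of 1,000 samples is an arbitrary choice.

```python
def to_stream_positions(indices, print_every):
    """Map 0-based log-line indices to 1-based stream positions.

    Assumes the i-th logged line (0-based) was written after
    (i + 1) * print_every samples.
    """
    return [(i + 1) * print_every for i in indices]

def match_drifts(detected, expected, tolerance):
    """For each expected drift, list the detections within +/- tolerance samples."""
    return {e: [d for d in detected if abs(d - e) <= tolerance] for e in expected}

detected_indices = [31, 52, 118]   # invented example values
positions = to_stream_positions(detected_indices, print_every=100)
print(positions)                   # [3200, 5300, 11900]

# Compare against the known drift positions of the generated stream.
print(match_drifts(positions, expected=[5000, 10000, 15000], tolerance=1000))
```

Because the drifts are gradual, a detection some distance after the nominal drift position is expected, so the tolerance should probably be at least as wide as the drift itself.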

MaxHalford · 2022-02-18T16:24:46Z

MaxHalford
Feb 18, 2022
Maintainer

I'm sorry I don't know enough about drift detection to be of help. @jacobmontiel is a bit busy at the moment but I'm sure he'll pop by at some point.

liucf · 2023-05-01T10:20:20Z

@TawabG Did you figure it out? Why are the two detected drifts in the wrong places when printing every 100 steps?

1 reply

TawabG May 1, 2023
Author

My conclusion was that ADWIN produces many false positives on a large population and is not very accurate on a smaller one.
However, this conclusion could be an artifact of the input data I used.
I did not have time to continue investigating this, but if you have new insights, please share :)
