
CUSUM.py get_chu_stinchcombe_white_statistics speed-up and bug fix #417

Open
jasonharris438 opened this issue Jul 2, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@jasonharris438 commented Jul 2, 2020

There is one minor bug in the existing function that I have fixed in my proposed solution; however, the main purpose here is a speed-up.

Change No. 1:

Bug on line 55. Currently: `s_n_t = 1 / (sigma_sq_t * np.sqrt(integer_index - temp_integer_index)) * values_diff`

Change to: `s_n_t = 1 / (np.sqrt(sigma_sq_t * (integer_index - temp_integer_index))) * values_diff`

This takes the square root of `sigma_sq_t` to obtain $\hat\sigma_t$, which standardizes correctly according to the original paper and Marcos' book:

$$S_{n,t} = (y_t - y_n)\left(\hat\sigma_t \sqrt{t-n}\right)^{-1}, \quad t > n$$

The test statistics are currently disproportionately large compared to the critical values as they are not scaled correctly.
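To make the scaling fix concrete, here is a minimal sketch comparing the two formulas (function and argument names are illustrative, not the library's):

```python
import numpy as np

def s_n_t_corrected(y_t, y_n, sigma_sq_t, t, n):
    """Standardized departure S_{n,t} = (y_t - y_n) / (sigma_hat_t * sqrt(t - n)).

    sigma_sq_t is a variance estimate, so it must go under the square
    root to yield the standard deviation sigma_hat_t.
    """
    return (y_t - y_n) / np.sqrt(sigma_sq_t * (t - n))

def s_n_t_buggy(y_t, y_n, sigma_sq_t, t, n):
    # Shape of the original line 55: the variance is used directly
    # where the standard deviation belongs.
    return 1 / (sigma_sq_t * np.sqrt(t - n)) * (y_t - y_n)
```

Since `sigma_sq_t` is typically well below 1 for log-price increments, the buggy version inflates every statistic by a factor of `1 / np.sqrt(sigma_sq_t)`, which is why the observed statistics dwarf the critical values.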

To Reproduce
The script from the docs is sufficient to see this.

```python
import pandas as pd
import numpy as np
from mlfinlab.structural_breaks import (get_chu_stinchcombe_white_statistics,
                                        get_chow_type_stat, get_sadf)

# Importing price data
bars = pd.read_csv('BARS_PATH', index_col=0, parse_dates=[0])

# Changing to log prices data
log_prices = np.log(bars.close)  # see p.253, 17.4.2.1 Raw vs Log Prices

# Chu-Stinchcombe test (one-sided and two-sided)
one_sided_test = get_chu_stinchcombe_white_statistics(log_prices, test_type='one_sided')
two_sided_test = get_chu_stinchcombe_white_statistics(log_prices, test_type='two_sided')
```

Change No. 2:

I have implemented a modified `_get_s_n_for_t()` function that results in a large time saving. Currently the script takes about 7.8 minutes to run on a simulated dataset of 10,000 points; the proposed changes reduce the running time to 0.45 minutes on the same dataset.

This also removes the need for `_get_values_diff()`, as it adds unnecessary overhead.
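The shape of the speed-up is roughly this: compute the statistic for every candidate start point $n < t$ in one array operation instead of a Python-level inner loop (a sketch with illustrative names, not the exact patched function):

```python
import numpy as np

def s_n_for_t_vectorized(series, t, sigma_sq_t):
    """Sketch of the vectorized inner step: S_{n,t} for all n < t at once.

    series is a 1-D array of log prices. The per-n Python loop (and a
    separate values-diff helper) collapses into array arithmetic.
    """
    n = np.arange(t)                     # every candidate start index n < t
    values_diff = series[t] - series[n]  # y_t - y_n for all n simultaneously
    return values_diff / np.sqrt(sigma_sq_t * (t - n))
```

Replacing per-element Python iteration with a single vectorized call per `t` is consistent with the roughly 17x speed-up reported above.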

Can you please point me to where I can run your unit tests for this? I see no apparent issues in my local environment, but before I create a PR I would like to be certain. I will edit this issue and create a PR once the new script passes.

Thanks guys

@PanPip PanPip added the enhancement New feature or request label Jul 2, 2020
@PanPip (Contributor) commented Jul 2, 2020

@jasonharris438 Thank you for these improvements! I wrote a list of tests to run in the comments to Issue #348.

@PanPip (Contributor) commented Jul 2, 2020

@jasonharris438 Also, please be sure to make your local branch from develop branch and then make a PR also to develop and not master.

@jasonharris438 (Author) commented Jul 5, 2020

Thanks @PanPip, I just sent you a message on Slack.
I have passed the styling requirements, but got hit on line 74 of the test_structural_breaks.py test.
I think there might be an issue with the asserted statistics for the one_sided test, as the value of 3729.001 looks too big to me (as do the others that are >100). It should be on the same scale as the values for the two-sided test.

@Jackal08 (Member) commented Aug 5, 2020

Hi @jasonharris438, has this ticket been resolved?

@jasonharris438 (Author)

Hi @Jackal08, this change requires a change to the unit test script. I'll outline it for you over the weekend.

@jasonharris438 (Author)

Hey @Jackal08, maybe don't worry about this one. I've had far too much on my plate over the last couple of months and have only just caught up on all of the discussion around the source code going private. I've been working and studying full time during stage 4.5 lockdown while searching for a career change, and unfortunately haven't had the time to revisit this library. My fork is now quite out of date, so without being able to see the latest version I would probably struggle to make a PR that maximises the benefit. Effectively, what I have done here requires a change to the unit test, as it was not asserting the correct output.

I've attached my version of the CUSUM.py file. It's very similar to the original, just with the inner loop removed and a fix to the calculation of the denominator on line 55 (it was previously using the variance; this needs to be the standard deviation).

cusum.txt

@hpad06 commented Dec 25, 2020

@jasonharris438 I tried your cusum; it gets a totally different result from the original code. Is this indeed correct?

For example, on the same data, the first chart uses the original code with stat > critical value, and the second uses yours:

[image: statistics vs. critical values, original code]

[image: statistics vs. critical values, modified code]
