Open
Description
Problem
The causalpy/data/simulate_data.py module has some reproducibility issues and could also benefit from a bit of refactoring.
Reproducibility
The module declares a seeded RNG but doesn't use it consistently:
rng = np.random.default_rng(RANDOM_SEED) # Declared on line 27, only used once
# Most functions use unseeded random:
norm(0, 0.25).rvs(N) # scipy.stats uses global numpy state
np.random.choice(2, size=N) # Uses global numpy state
Result: Functions produce different data each run. Generated CSV files cannot be reproduced.
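A minimal sketch of why declaring a seeded generator is not enough. The function names and the seed value here are hypothetical and are not taken from simulate_data.py. The frozen scipy.stats distribution only uses the generator if it is passed through the random_state argument of rvs(); otherwise it falls back to the global NumPy state.

```python
import numpy as np
from scipy.stats import norm

RANDOM_SEED = 8927  # hypothetical value, stands in for the module's constant


def simulate_run_unseeded(n=5):
    # Mimics one "run": the seeded Generator is created but never used,
    # so rvs() draws from the global NumPy random state instead.
    rng = np.random.default_rng(RANDOM_SEED)  # declared, then ignored
    return norm(0, 0.25).rvs(n)


def simulate_run_seeded(n=5):
    # Same draw, but the Generator is handed to scipy explicitly.
    rng = np.random.default_rng(RANDOM_SEED)
    return norm(0, 0.25).rvs(n, random_state=rng)


unseeded_a = simulate_run_unseeded()
unseeded_b = simulate_run_unseeded()
seeded_a = simulate_run_seeded()
seeded_b = simulate_run_seeded()

print("unseeded runs match:", np.array_equal(unseeded_a, unseeded_b))
print("seeded runs match:  ", np.array_equal(seeded_a, seeded_b))
```

Passing random_state is one way to make the existing scipy calls repeatable; replacing them with Generator methods such as rng.normal() is another.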
CSV Usage:
- Many CSVs committed to git
- Cannot regenerate them deterministically
Proposed Solution
- Add seed parameter to all generation functions
- Replace norm().rvs() with rng.normal(), dirichlet().rvs() with rng.dirichlet(), etc.
- Delete generated CSV files; use pytest fixtures instead
- Update tests to generate data dynamically
- Fix bug: create_series() ignores length_scale parameter (line 488)
- Reduce duplication (lines 87-93: repeated function calls)
- Other light-touch refactoring (separation of responsibilities within functions, reducing LOC in _smoothed_gaussian_random_walk (lines 87-5))
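The first two proposals could look roughly like the sketch below. generate_example_data, its columns, and its coefficients are hypothetical and only illustrate the pattern. They are not an existing function in simulate_data.py. Each call builds its own Generator from the seed argument and draws everything from it.

```python
import numpy as np


def generate_example_data(N=100, seed=None):
    """Hypothetical generator illustrating a per-call seed parameter."""
    # One local Generator per call; no module-level or global state involved.
    rng = np.random.default_rng(seed)

    group = rng.choice(2, size=N)                   # replaces np.random.choice(2, size=N)
    noise = rng.normal(0, 0.25, size=N)             # replaces norm(0, 0.25).rvs(N)
    weights = rng.dirichlet(np.ones(3), size=1)[0]  # replaces dirichlet(...).rvs()

    y = 1.0 + 0.5 * group + noise
    return {"group": group, "y": y, "weights": weights}


first = generate_example_data(N=50, seed=42)
second = generate_example_data(N=50, seed=42)

# Same seed, same data: every array matches element for element.
same = all(np.array_equal(first[key], second[key]) for key in first)
print("reproducible:", same)
```

With a function of this shape, a pytest fixture could simply call it with a fixed seed, so tests would no longer depend on CSV files committed to the repository.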