Fix reproducibility, refactor simulate_data.py, use functions in tests #545

@louismagowan

Problem

The causalpy/data/simulate_data.py module seems to have some reproducibility issues, and it could use a bit of refactoring.

Reproducibility

The module declares a seeded RNG but doesn't use it consistently:

rng = np.random.default_rng(RANDOM_SEED)  # Declared on line 27, only used once

# Most functions use unseeded random:
norm(0, 0.25).rvs(N)           # scipy.stats uses global numpy state
np.random.choice(2, size=N)     # Uses global numpy state

Result: Functions produce different data each run. Generated CSV files cannot be reproduced.
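
To illustrate the point, here is a minimal sketch (not the module's actual code) showing that creating a seeded Generator does not seed the global numpy state. The seed value and the helper name draw_with_generator are made up for this example.

```python
import numpy as np

RANDOM_SEED = 1234  # hypothetical value, not necessarily the one used in the module


def draw_with_generator(seed, n=5):
    """Draw from a locally created, seeded Generator: reproducible."""
    rng = np.random.default_rng(seed)
    return rng.normal(0, 0.25, size=n)


# Same seed, same draws.
first = draw_with_generator(RANDOM_SEED)
second = draw_with_generator(RANDOM_SEED)
same_with_generator = np.array_equal(first, second)

# Creating a seeded Generator leaves the global state untouched, so calls
# that go through np.random.* keep advancing the unseeded global stream.
_ = np.random.default_rng(RANDOM_SEED)
global_a = np.random.choice(2, size=1000)
_ = np.random.default_rng(RANDOM_SEED)
global_b = np.random.choice(2, size=1000)
same_with_global = np.array_equal(global_a, global_b)

print(same_with_generator)  # reproducible draws
print(same_with_global)     # almost surely not reproducible
```

As far as I know, scipy.stats rvs() also accepts a random_state argument, so passing the Generator there would be an alternative to switching to rng.normal().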

CSV Usage:

  • Many CSVs committed to git
  • Cannot regenerate them deterministically

Proposed Solution

  1. Add seed parameter to all generation functions
  2. Replace norm().rvs() with rng.normal(), dirichlet().rvs() with rng.dirichlet(), etc.
  3. Delete generated CSV files; use pytest fixtures instead
  4. Update tests to generate data dynamically
  5. Fix bug: create_series() ignores length_scale parameter (line 488)
  6. Reduce duplication (lines 87-93: repeated function calls)
  7. Other light-touch refactoring (separation of responsibilities in functions, reduce LOC in _smoothed_gaussian_random_walk (lines 87-5))
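
Steps 1 to 4 could look roughly like the sketch below. This is only an illustration under assumptions: generate_example_data, its columns, and the fixture name are hypothetical and are not taken from simulate_data.py.

```python
import numpy as np
import pandas as pd
import pytest


def generate_example_data(n=100, seed=None):
    """Hypothetical generation function with an explicit seed parameter."""
    # A local Generator replaces both the module-level rng and the global state.
    rng = np.random.default_rng(seed)
    treated = rng.choice(2, size=n)       # instead of np.random.choice(2, size=n)
    noise = rng.normal(0, 0.25, size=n)   # instead of norm(0, 0.25).rvs(n)
    return pd.DataFrame({"treated": treated, "y": 1.0 + 0.5 * treated + noise})


@pytest.fixture
def example_data():
    """Generate the data inside the test session instead of reading a committed CSV."""
    return generate_example_data(n=50, seed=42)


def test_generation_is_reproducible():
    # The same seed should give identical frames on every run.
    first = generate_example_data(n=50, seed=42)
    second = generate_example_data(n=50, seed=42)
    pd.testing.assert_frame_equal(first, second)
```

Keeping the default at seed=None would preserve the current "random each run" behaviour for callers who want it, while tests and docs can pin a seed.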
