Swap out `patsy` for `formulaic` #463

ksolarski · 2025-04-21T14:52:09Z

Solving issue #386

Starting with DiD; I will continue with the other methods if you agree with the general design, @drbenvincent.

The key practical difference between formulae and patsy seems to be the lack of a `build_design_matrices` method in formulae, so the user then has to provide the formula again.

Edit: After discussion, it was decided that formulaic suits us best.

📚 Documentation preview 📚: https://causalpy--463.org.readthedocs.build/en/463/

drbenvincent · 2025-04-21T15:27:49Z

Cool. Thanks @ksolarski, just a quick reply from my phone...

Don't do this for the synthetic control because I have an in progress PR that will change it. It won't have a formula input.

But can I just get some clarification... does this change the API? Can we get the exact same functionality? If not, let's think again.

Will try to look at the code properly when I can 👍🏻

drbenvincent · 2025-04-21T15:41:21Z

I can't find where I saw it in the patsy docs at this point. But I think one of the things that `build_design_matrices` did was to ensure that predictions on new/out-of-sample data are correct. For example, you could get a situation where the out-of-sample data doesn't contain all levels of a categorical predictor. So I think if you do it naively, you can get silent errors.

I'm not 100% sure that this is a problem, and apologies I can't find the relevant part in the docs. But does my concern make sense?
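To make the concern concrete, here is a minimal sketch. It is not taken from the patsy docs: it uses `pandas.get_dummies` as a stand-in for naive re-encoding, and the column name `group` and the coefficient values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with three levels, and new data missing level "b".
train = pd.DataFrame({"group": ["a", "b", "c"]})
new = pd.DataFrame({"group": ["a", "c"]})

# Naive approach: encode each dataset independently.
X_train = pd.get_dummies(train["group"], dtype=int)  # columns: a, b, c
X_new = pd.get_dummies(new["group"], dtype=int)      # columns: a, c only

# Made-up coefficients, one per training column (a, b, c).
coef = np.array([1.0, 2.0, 3.0])

# Matching columns by position silently applies b's coefficient to level "c".
naive_pred = X_new.to_numpy() @ coef[: X_new.shape[1]]

# Re-using the training columns keeps every coefficient aligned with its level.
X_new_aligned = X_new.reindex(columns=X_train.columns, fill_value=0)
aligned_pred = X_new_aligned.to_numpy() @ coef

print("naive:  ", naive_pred)
print("aligned:", aligned_pred)
```

The naive version runs without any error but gives a different prediction for the `"c"` row, which is the kind of silent failure I'm worried about.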

ksolarski · 2025-04-22T07:38:37Z

You're right, Patsy can preserve the transformation/encoding of variables through the `build_design_matrices` method. There's no equivalent in formulae, so it's certainly not straightforward to replicate the current behaviour with formulae.
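For reference, a minimal sketch of that Patsy behaviour. The data and the formula `center(x) + g` are made up for illustration and are not taken from CausalPy.

```python
import numpy as np
import pandas as pd
from patsy import build_design_matrices, dmatrix

# Hypothetical training data and a single new observation.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "g": ["a", "b", "c"]})
new = pd.DataFrame({"x": [10.0], "g": ["c"]})

# Build the training design matrix; its design_info stores the training mean
# used by center(x) and the levels of g.
X_train = dmatrix("center(x) + g", train)

# Re-use the stored design_info so the new data gets the same encoding.
(X_new,) = build_design_matrices([X_train.design_info], new)

names = X_train.design_info.column_names
row = np.asarray(X_new)[0]
print(dict(zip(names, row)))
```

The new row is centered with the training mean of `x` (2.0), not its own mean, and it still gets a column for every training level of `g` even though only `"c"` appears in the new data.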

However, the Patsy repo suggests migrating to https://github.com/matthewwardrop/formulaic instead, which is capable of "reusing the encoding choices made during conversion of one data-set on other datasets." (see https://matthewwardrop.github.io/formulaic/latest/). There's also a migration guide from Patsy to Formulaic, so the switch should be easy. It also supports many operators: https://matthewwardrop.github.io/formulaic/latest/guides/grammar/

Did you check out this library before? What do you think about using this instead of formulae?

ksolarski · 2025-04-28T07:50:22Z

@drbenvincent any strong opinions about using formulaic instead of formulae package?

drbenvincent · 2025-05-15T18:39:35Z

Sorry for the delayed response @ksolarski. So as far as I understand, formulaic is considered the successor to patsy. Though it is only formulae which allows hierarchical modeling?

Right now there are no use-cases for hierarchical modelling. That might change in the future, though I don't have any specific use cases in mind.

So I guess the only choice at the moment is formulaic. But I'll just ask @tomicapretto whether there's any plan for how formulae handles out-of-sample prediction when not all levels of categorical variables are in the out-of-sample data (see short discussion above).

codecov · 2025-05-15T18:47:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.66%. Comparing base (a39e015) to head (41a0236).
Report is 23 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #463      +/-   ##
==========================================
- Coverage   94.66%   94.66%   -0.01%     
==========================================
  Files          32       32              
  Lines        2195     2194       -1     
==========================================
- Hits         2078     2077       -1     
  Misses        117      117


ksolarski · 2025-05-18T12:29:32Z

@drbenvincent Yes, from the docs it seems that hierarchical models are not supported in formulaic. I think it properly handles the transformation of categorical variables in out-of-sample data. Here's a small example that I generated:

```python
import pandas as pd
from formulaic import model_matrix

# Create a training dataset
train_data = pd.DataFrame(
    {
        "feature1": ["A", "B", "C", "D"],
        "target": [0, 1, 0, 1],
    }
)

# Create a test dataset
test_data = pd.DataFrame(
    {
        "feature1": [
            "A",  # In training
            "D",  # In training
            "E",  # Not in training
        ],
        "target": [0, 1, 0],
    }
)

# Generate the model matrix for the training data
train_matrix = model_matrix("target ~ 0 + feature1", train_data)

# Print the training matrix
print("Training Matrix:")
print(train_matrix)

# Use the same spec to transform the test data
test_matrix = model_matrix(spec=train_matrix.model_spec, data=test_data)

# Print the test matrix - see that columns are properly aligned from the training data transformation
print("\nTest Matrix:")
print(test_matrix)
```

Is that the problem you had in mind or something else?

drbenvincent · 2025-05-26T11:18:35Z

> Is that the problem you had in mind or something else?

@ksolarski Yes, that's pretty much it. Turns out the phrase I was looking for was "stateful transforms", which is pretty much what you described here.

I'm actually wondering - if we don't get the additional functionality of hierarchical modeling, is there much benefit from moving from patsy to formulaic? Not saying it's not worth it, but we should be clear about the rationale. Is it because patsy is no longer maintained and formulaic might see more features, or does it have a richer formula API?

Sorry for the extended conversation on this by the way - but given the formula aspect is a core part of the API it's worth thinking it through :)

tomicapretto · 2025-05-26T13:42:59Z

Hi @ksolarski and @drbenvincent, sorry for the delay in my response. I can add a few pieces of information.

formulae:

- supports stateful transformations
- allows evaluating new data, see https://bambinos.github.io/formulae/notebooks/getting_started.html#Evaluating-new-data
- supports generation of design matrices for unseen categories; this was added in bambinos/formulae#96 ("Allow evaluations on new data with new categories"). Unfortunately, there's no proper documentation for it, but the tests are here: https://github.com/bambinos/formulae/blob/d4ea54f87652a064659df29273b12e872a904b3c/tests/test_eval_new_data.py#L315-L402

With that said, unless you need hierarchical models supported via the `|` operator, I would consider formulaic, as I think it's more popular and better maintained. They have a PR to support mixed-effects models (matthewwardrop/formulaic#34), but it has not been merged yet. I'm not sure if there's anything fundamental that prevents that.

ksolarski · 2025-05-28T19:27:15Z

@drbenvincent I went with formulaic as suggested. It was quite a straightforward switch. I also noticed that the library's README mentions a speed benefit: https://github.com/matthewwardrop/formulaic?tab=readme-ov-file#benchmarks

drbenvincent mentioned this pull request May 15, 2025

API change for the SyntheticControl experiment class #460

Merged

3 tasks

ksolarski added 2 commits May 28, 2025 22:16

DID switch

b06ce8e

Switch to formulaic

0734c8b

ksolarski force-pushed the patsy_to_formulae branch from 23221c2 to 0734c8b Compare May 28, 2025 19:18

ksolarski changed the title ~~Swap out patsy for formulae~~ Swap out patsy for formulaic May 28, 2025

ksolarski marked this pull request as ready for review May 28, 2025 19:20

Remove patsy from pyproject

699c380

ksolarski and others added 3 commits June 16, 2025 10:25

Merge branch 'main' into patsy_to_formulae

35e3cc6

Remove unused code

15dd56a

Fix ruff check

f0bfc5b
