Weighted aggregation. #8
Comments
You mean like rolling functions? We have those. |
Ah, I meant in just plain old aggregation. |
We also have those. Or do you mean as an example here? |
Really? Interesting, in pandas I've seen people resort to apply to calculate those. |
Can you show me a pandas snippet, so I understand? |
This is how you'd calculate a mean in pandas:

import pandas as pd

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating": 5, "weight": 1.5},
    {"group": "b", "rating": 5, "weight": 1.5},
]
pd.DataFrame(data).groupby("group").agg(weighted_mean=("rating", "mean"))

But that's not a weighted mean. Instead you'd like to do something like:

import pandas as pd
import numpy as np

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating": 5, "weight": 1.5},
    {"group": "b", "rating": 5, "weight": 1.5},
]
# Fails: the lambda only receives the "rating" Series, so d["weight"] is not available.
(pd.DataFrame(data)
 .groupby("group")
 .agg(weighted_mean=("rating", lambda d: np.sum(d["rating"] * d["weight"]) / np.sum(d["weight"]))))

But this doesn't work in pandas with a named aggregation, because the function passed to .agg only sees the single column it is applied to. Instead you have to fall back to:

# The lambda now receives the whole sub-frame of each group.
(pd.DataFrame(data)
 .groupby("group")
 .apply(lambda d: np.average(d["rating"], weights=d["weight"])))

But this is using .apply. That's why I think it'd be nice to just have a method that can attach a weighted mean column, but done in a performant way. |
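For reference, the numbers a weighted mean should produce on this sample data can be checked without any dataframe library. The following is a minimal plain-Python sketch; the helper name `weighted_mean_by_group` is made up for illustration and is not part of pandas or polars.

```python
from collections import defaultdict

data = [
    {"group": "a", "rating": 10, "weight": 0.5},
    {"group": "a", "rating": 5, "weight": 1.5},
    {"group": "b", "rating": 5, "weight": 1.5},
]


def weighted_mean_by_group(rows):
    """Return {group: sum(rating * weight) / sum(weight)} (hypothetical helper)."""
    num = defaultdict(float)  # per-group sum of rating * weight
    den = defaultdict(float)  # per-group sum of weight
    for row in rows:
        num[row["group"]] += row["rating"] * row["weight"]
        den[row["group"]] += row["weight"]
    return {group: num[group] / den[group] for group in num}


result = weighted_mean_by_group(data)
print(result)
```

Group a comes out at 6.25 rather than the unweighted 7.5, because the rating of 5 carries three times the weight of the rating of 10.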
Yep, that's why the expressions are awesome! 😄

import polars as pl

# Note: newer polars releases spell this method group_by.
(pl.DataFrame(data)
 .groupby("group")
 .agg([
     (pl.col("rating") * pl.col("weight")).sum() / pl.sum("weight")
 ])
) |
Yep! But that's why this may also be a nice example to add to this repository. Not 100% sure though. It feels like something that's only worth the effort in pandas-land. Less so in polars-country. |
Yes, it shows the power of expressions. One of my arguments is often that the expression API reduces the need to run Python bytecode, which this example shows. So yeah, I think it fits. |
The above example using expressions is great, yet it does not address the many cases where some values are NaN or the weights sum up to 0. I'm currently trying to reproduce the behaviour we have with a much slower pandas aggregation. |
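One way to make those edge cases explicit is sketched below in plain Python. The policy is an assumption, not something pandas or polars prescribes: a pair is skipped when either the rating or the weight is NaN, and a weight sum of 0 yields NaN. The function name `safe_weighted_mean` is hypothetical.

```python
import math


def safe_weighted_mean(values, weights):
    """Weighted mean that skips NaN pairs and returns NaN for a zero weight sum.

    Assumed policy for illustration only; other conventions are possible.
    """
    num = 0.0
    den = 0.0
    for value, weight in zip(values, weights):
        if math.isnan(value) or math.isnan(weight):
            continue  # drop the whole pair if either side is missing
        num += value * weight
        den += weight
    if den == 0:
        return math.nan  # no usable weight: the mean is undefined
    return num / den


regular = safe_weighted_mean([10, 5], [0.5, 1.5])
with_nan = safe_weighted_mean([10, math.nan], [0.5, 1.5])
zero_weights = safe_weighted_mean([10, 5], [0.0, 0.0])
print(regular, with_nan, zero_weights)
```

Whether a missing rating should drop its weight from the denominator, as it does here, is exactly the kind of choice that has to be matched against the existing pandas behaviour.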
Things like weighted mean/sum/std might be good to support.
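To make the request concrete, here is a rough plain-Python sketch of the three statistics for a single group. The definitions are assumptions: the weighted sum is taken as sum(w * x), and the standard deviation uses the population form sqrt(sum(w * (x - mean)^2) / sum(w)) with no bias correction. The function name `weighted_stats` is hypothetical.

```python
import math


def weighted_stats(values, weights):
    """Return weighted sum, mean and population std (hypothetical helper)."""
    total_weight = sum(weights)
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    mean = weighted_sum / total_weight
    # Weighted average of squared deviations from the weighted mean.
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_weight
    return {"sum": weighted_sum, "mean": mean, "std": math.sqrt(variance)}


stats = weighted_stats([10, 5], [0.5, 1.5])
print(stats)
```

Other conventions exist for the weighted standard deviation, so a real implementation would need to state which one it follows.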