-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] Let na.rm
of mean()
support removing NaN
as in base R
#44089
Comments
na.rm
of mean()
support NaN
as base Rna.rm
of mean()
support removing NaN
as in base R
Hi @psychelzh, a change here might require some discussion. My understanding is that different types of systems treat NaNs differently and what you're seeing is a result of arrow having different semantics than R. R and the pandas Python package treat NaNs as NA/None whereas database systems do not and I think arrow is emulating the database system semantics here. For comparison, Python has a similar situation as R: pandas treats NaN as NA/null (and actually ignores them by default so this is slightly different from R): In [1]: pd.Series([1, 2, np.nan]).mean()
Out[1]: 1.5 While PyArrow does not: In [2]: pc.mean(pa.array([1, 2, np.nan]))
Out[2]: <pyarrow.DoubleScalar: nan> However, because compatibility is important, PyArrow supports a In [3]: pc.mean(pa.array([1, 2, np.nan], from_pandas=True))
Out[3]: <pyarrow.DoubleScalar: 1.5> There are a few ways this could be addressed and I think this type of issue has come up before in places like #29060. The R package could add something similar to PyArrow's |
I have mixed feelings - I'm generally opposed to changing how something has worked for a long time given that other users may be depending on it or working around it , but in favour of matching base R when we are emulating their functions. @amoeba I'm a bit rusty with the codebase - if we were to do this, how complex would it be to implement? Note: we allow NaN to be treated as NA on Arrow Datum objects, e.g. library(arrow) my_vec <- c(42, NA, NaN)
is.na(my_vec)
#> [1] FALSE TRUE TRUE my_arrow_array <- Array$create(my_vec)
is.na(my_arrow_array)
#> Array
#> <bool>
#> [
#> false,
#> true,
#> true
#> ] I think, given the above behaviour that we have chosen to implement, if this isn't too complex and someone is willing to implement it, we could totally look into supporting it. Could we perhaps have it as an option? Or something else? Keen to hear what others think though. Some tangentially related PRs: |
Describe the enhancement requested
See the following example. In base R
na.rm
will also removeNaN
values, because of the fact thatis.na(NaN) == TRUE
.Created on 2024-09-12 with reprex v2.1.0
Component(s)
R
The text was updated successfully, but these errors were encountered: