-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add writing functionality for dataframes #45
base: develop
Are you sure you want to change the base?
Changes from 65 commits
3a6664d
153f803
4557559
6eeb992
ffddf74
eb82ff6
1868d8a
8d9cb55
398d1e9
9cdd37c
5084d2d
af0f6fe
1c71a86
8fa951e
b205d8d
963a9bc
61a2ea2
937908b
8eda454
efbb09d
32a2cc6
1f4e8d8
237bc22
6859b8c
5ac49d0
6ad1408
1c458ba
92429ca
5cf678d
f379fc9
ddabf65
9dd2559
8652271
993e2ed
dc1950d
38c80d7
b622c9c
c2728e2
e27217c
9da1a42
3dc0169
d4049ba
cb3c487
76bdc83
87194d4
d9c3be5
22fe43d
3bd285a
80a0c9b
b73d4bd
e88419e
6bb5d5b
fd813e2
c63dfe7
b8b6948
e773101
828f9da
0fe903f
2755b3c
6040bae
7c60566
d450f73
4d3e38f
ac948de
957887c
c3fe17c
74a11ba
5449199
74bf9d8
da9c940
3acc180
4557d99
3e330a3
0d45965
de5a408
5d8187c
59f4269
540a59a
aa7239d
df8b391
1a00c1d
ff6b6a9
daf1e3a
943e697
8a269ae
9718161
a7c7066
5a430aa
87f4c65
25a14af
7984089
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,8 @@ | |
import xarray | ||
from typing_extensions import override | ||
|
||
from rdata.parser._parser import get_altrep_name | ||
|
||
from .. import parser | ||
|
||
ConversionFunction = Callable[[Union[parser.RData, parser.RObject]], Any] | ||
|
@@ -394,20 +396,70 @@ def convert_array( | |
return value # type: ignore [no-any-return] | ||
|
||
|
||
R_INT_MIN = -2**31 | ||
def convert_altrep_to_range( | ||
r_altrep: parser.RObject, | ||
) -> range: | ||
""" | ||
Convert a R altrep to range object. | ||
|
||
Args: | ||
r_altrep: R altrep object | ||
|
||
Returns: | ||
Range object. | ||
""" | ||
if r_altrep.info.type != parser.RObjectType.ALTREP: | ||
msg = "Must receive an altrep object" | ||
raise TypeError(msg) | ||
|
||
info, state, attr = r_altrep.value | ||
assert attr.info.type == parser.RObjectType.NILVALUE | ||
|
||
altrep_name = get_altrep_name(info) | ||
|
||
if altrep_name != b"compact_intseq": | ||
msg = "Only compact integer sequences can be converted to range" | ||
raise NotImplementedError(msg) | ||
|
||
n = int(state.value[0]) | ||
start = int(state.value[1]) | ||
step = int(state.value[2]) | ||
stop = start + (n - 1) * step | ||
return range(start, stop + 1, step) | ||
|
||
|
||
def _dataframe_column_transform(source: Any) -> Any: # noqa: ANN401 | ||
|
||
if isinstance(source, np.ndarray): | ||
dtype: Any | ||
if np.issubdtype(source.dtype, np.integer): | ||
return pd.Series(source, dtype=pd.Int32Dtype()).array | ||
|
||
if np.issubdtype(source.dtype, np.bool_): | ||
return pd.Series(source, dtype=pd.BooleanDtype()).array | ||
dtype = pd.Int32Dtype() | ||
elif np.issubdtype(source.dtype, np.floating): | ||
# We return the numpy array here, which keeps | ||
# R_FLOAT_NA, np.nan, and other NaNs as they were originally in the file. | ||
# Users can then decide if they prefer to interpret | ||
# only R_FLOAT_NA or all NaNs as "missing". | ||
return source | ||
# This would create an array with all NaNs as "missing": | ||
# dtype = pd.Float64Dtype() # noqa: ERA001 | ||
# This would create an array with only R_FLOAT_NA as "missing": | ||
# from rdata.missing import is_na # noqa: ERA001 | ||
# return pd.arrays.FloatingArray(source, is_na(source)) # noqa: ERA001 | ||
elif np.issubdtype(source.dtype, np.complexfloating): | ||
# There seems to be no pandas type for complex array | ||
return source | ||
elif np.issubdtype(source.dtype, np.bool_): | ||
dtype = pd.BooleanDtype() | ||
elif np.issubdtype(source.dtype, np.str_): | ||
dtype = pd.StringDtype() | ||
elif np.issubdtype(source.dtype, np.object_): | ||
for value in source: | ||
assert isinstance(value, str) or value is None | ||
dtype = pd.StringDtype() | ||
else: | ||
return source | ||
|
||
if np.issubdtype(source.dtype, np.str_): | ||
return pd.Series(source, dtype=pd.StringDtype()).array | ||
return pd.Series(source, dtype=dtype).array | ||
|
||
return source | ||
|
||
|
@@ -430,7 +482,7 @@ def dataframe_constructor( | |
and isinstance(row_names, np.ma.MaskedArray) | ||
and row_names.mask[0] | ||
) | ||
else tuple(row_names) | ||
else row_names | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why this change? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason was to keep range object as range object instead of expanding it to a tuple of values (relates to the altrep comment). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd keep this change even though I removed the altrep-to-range conversion. The reason is that if a user would like to create a custom There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would say that such mapping is an error. However, I do not have anything against this particular change, unless the |
||
) | ||
|
||
return pd.DataFrame(obj, columns=obj, index=index) | ||
|
@@ -820,6 +872,9 @@ def _convert_next( # noqa: C901, PLR0912, PLR0915 | |
|
||
value = None | ||
|
||
elif obj.info.type == parser.RObjectType.ALTREP: | ||
value = convert_altrep_to_range(obj) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel that I am missing something here. Altreps should not be present here when There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This relates to the support for writing dataframes with range indices like The reason that this change is in this PR is that it allows testing R-to-Python-to-R roundtrip for such a dataframe (in R But, other than enabling that testing, this change wouldn't really belong to this PR, so it could be removed if you think that's better. In general, I feel that it could be nicer if altreps would be expanded in the conversion stage instead of the parsing stage. The reason is that then the RData object would be a one-to-one representation of the file contents, which is not the case for altreps at the moment (unless There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think altreps are intended to be an internal representation detail. That is why I made the default to expand them in the parser: I assumed that, except for very specific situations, users would prefer to ignore their existence. That said, IMHO the default conversion routines should be able to deal with altreps (in case that For me ranges are a very different beast than NumPy arrays, and returning such different kind of objects for the "same" underlying R object would be confusing. Moreover, there is no equivalent to If we want to convert an array back to an altrep for space savings we could probably try to detect if the array is a sequence (with For the above reasons I think it is preferrable not to consider altreps in the roundtrip comparison (that is, comparing both files with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with your points on range and that it would be confusing to always convert intseq altrep to a range object as it doesn't behave similarly to numpy array (in contrast to behavior in R). So, I removed this altrep-to-range conversion from R-to-Python conversion. In Python-to-R conversion side, I also removed the general conversion of range objects, but changed that logic to apply to pd.RangeIndex only, that is, pd.RangeIndex converts to altrep (if possible; step=1) and pd.Index converts to array. I added a few tests for this case as the roundtrip comparison can't reach this case. Do you think this is ok?
I agree that could be done at some point if needed.
I agree. The R-Python-R roundtrip tests do file checking only for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In principle it is ok to write altreps when possible by default. I am not sure if we should add an option for the |
||
|
||
else: | ||
msg = f"Type {obj.info.type} not implemented" | ||
raise NotImplementedError(msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can they decide?