Add writing functionality for dataframes #45

Open

wants to merge 91 commits into base: develop

Conversation

@trossi (Contributor) commented Sep 16, 2024

References to issues or other PRs

Closes #20.

Describe the proposed changes

This PR adds support for writing pandas dataframes. This turned out to be quite a large change to the current Python-to-R conversion functionality. Overview of the changes:

  • Reorganized the functions in rdata.conversion.to_r and added a class ConverterFromPythonToR to simplify keeping track of references in RData files.
  • Added functionality to rdata.missing for distinguishing R's NA float value from other NaN values.
  • Added more tests on reading and writing dataframes with various dtypes and mixes of NA and NaN values.
  • Added convert_altrep_to_range() (rdata.conversion._conversion), which enables converting a compact_intseq altrep to a range object (e.g., as a dataframe index).
  • Added unparsing of REF and ALTREP.

Additional information

The functions in rdata.missing could be useful for users who want to handle NA values in a particular way: e.g., pd.arrays.FloatingArray(array) would mark all NaNs (including R's NA value) as "missing", while pd.arrays.FloatingArray(array, is_na(array)) can be used to mark only R's NA values as "missing".
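For illustration, a minimal sketch of that usage (the data here is made up; is_na is the helper added in this PR):

    import numpy as np
    import pandas as pd
    from rdata.missing import is_na

    # Made-up data: suppose this array was read from an RData file, with
    # element 1 holding R's NA (a NaN with a special bit pattern) and
    # element 2 holding an ordinary NaN.
    array = np.array([1.0, np.nan, np.nan])

    # Treat every NaN (R's NA and plain NaN alike) as missing:
    all_missing = pd.arrays.FloatingArray(array, np.isnan(array))

    # Treat only R's NA values as missing, keeping plain NaNs as values:
    only_na_missing = pd.arrays.FloatingArray(array, is_na(array))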

Checklist before requesting a review

  • I have performed a self-review of my code
  • The code conforms to the style used in this package (checked with Ruff)
  • The code is fully documented and typed (type-checked with Mypy)
  • I have added thorough tests for the new/changed functionality

@@ -820,6 +872,9 @@ def _convert_next(  # noqa: C901, PLR0912, PLR0915

        value = None

+   elif obj.info.type == parser.RObjectType.ALTREP:
+       value = convert_altrep_to_range(obj)
vnmabus (Owner):

I feel that I am missing something here. Altreps should not be present here when expand_altrep is True (and that is the default). Also, I am not sure that it is a good idea to convert to different Python objects depending on the internal way some R object is stored (AFAIK, altreps are supposed to be an implementation detail).

trossi (Contributor, Author):

This relates to the support for writing dataframes with range indices like pd.DataFrame(..., index=range(2, 5)).

The reason this change is in this PR is that it allows testing the R-to-Python-to-R roundtrip for such a dataframe (in R, data.frame(..., row.names=2:4); this is test_dataframe_range_rownames.rds). Here the row names are a compact_intseq altrep, which by default would get converted to a regular array, which would then fail the roundtrip.

But, other than enabling that testing, this change doesn't really belong in this PR, so it could be removed if you think that's better.

In general, I feel it could be nicer if altreps were expanded in the conversion stage instead of the parsing stage. The reason is that the RData object would then be a one-to-one representation of the file contents, which is currently not the case for altreps (unless expand_altrep=False). I have interpreted the compact_intseq altrep as implementing the same idea as Python's range object, so it would be useful if users had the option to choose whether this altrep is converted to a numpy array or a range object, which seems not possible (or at least not easy) if altreps are already expanded during parsing. This would be a larger change, though (beyond this PR). What do you think in general?
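(For illustration only, not code from the PR: a compact_intseq altrep stores just a length, a start, and a step, which is exactly the information a Python range carries. The helper name below is made up.)

    import numpy as np

    # Hypothetical expansion of a compact_intseq payload:
    def expand_compact_intseq(length, start, step=1):
        return np.arange(start, start + length * step, step, dtype=np.int32)

    # R's 2:4 (the row names in test_dataframe_range_rownames.rds)
    # carries the same information as Python's range(2, 5):
    assert expand_compact_intseq(3, 2).tolist() == list(range(2, 5))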

vnmabus (Owner):

I think altreps are intended to be an internal representation detail. That is why I made the default to expand them in the parser: I assumed that, except for very specific situations, users would prefer to ignore their existence.

That said, IMHO the default conversion routines should be able to deal with altreps (in case expand_altrep is False). I still think they should convert them to the same object as if they weren't altreps, for consistency.

For me, ranges are a very different beast from NumPy arrays, and returning such different kinds of objects for the "same" underlying R object would be confusing. Moreover, there is no equivalent of range for compact sequences of floats.

If we want to convert an array back to an altrep for space savings, we could probably try to detect whether the array is a sequence (with np.diff, a subtraction, and np.all or something like that it would be easy, at least for integer dtype; sketched below). However, that would also convert to altrep arrays that, for whatever reason, were not stored as altreps in the original file. So the choice would be between no altreps at all when writing, or all possible ones.

For the above reasons I think it is preferable not to consider altreps in the roundtrip comparison (that is, comparing both files with expand_altrep=True). Any thoughts on this?
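(A rough sketch of the np.diff-based detection mentioned above; the helper name is made up and integer dtype is assumed:)

    import numpy as np

    def could_be_compact_intseq(array: np.ndarray) -> bool:
        """Check whether an integer array is an evenly spaced sequence."""
        if array.dtype.kind != "i" or array.size < 2:
            return False
        steps = np.diff(array)
        return bool(np.all(steps == steps[0]))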

trossi (Contributor, Author):

I agree with your points on range, and that it would be confusing to always convert the intseq altrep to a range object, as it doesn't behave like a numpy array (in contrast to the behavior in R). So, I removed this altrep-to-range conversion from the R-to-Python conversion.

On the Python-to-R conversion side, I also removed the general conversion of range objects, but changed that logic to apply to pd.RangeIndex only; that is, pd.RangeIndex converts to an altrep (if possible, i.e. step=1) and pd.Index converts to an array. I added a few tests for this case, as the roundtrip comparison can't reach it. Do you think this is ok?
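(To illustrate the distinction with made-up data:)

    import pandas as pd

    # pd.RangeIndex with step 1: written as a compact_intseq altrep
    df1 = pd.DataFrame({"x": [10, 20, 30]}, index=pd.RangeIndex(2, 5))

    # pd.Index with the same values: written as a plain integer array
    df2 = pd.DataFrame({"x": [10, 20, 30]}, index=pd.Index([2, 3, 4]))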

If we want to convert an array back to an altrep for space savings we could probably try to detect if the array is a sequence

I agree that could be done at some point if needed.

For the above reasons I think it is preferable not to consider altreps in the roundtrip comparison (that is, comparing both files with expand_altrep=True). Any thoughts on this?

I agree. The R-Python-R roundtrip tests do file checking only for expand_altrep=False (which skips all files with altreps, as altreps are then unhandled in the R-to-Python conversion).

vnmabus (Owner):

Do you think this is ok?

In principle it is ok to write altreps when possible by default. I am not sure if we should add an option for the Converter not to use altreps, for compatibility with tools that do not understand them.

@@ -430,7 +482,7 @@ def dataframe_constructor(
        and isinstance(row_names, np.ma.MaskedArray)
        and row_names.mask[0]
    )
-   else tuple(row_names)
+   else row_names
vnmabus (Owner):

Why this change?

trossi (Contributor, Author):

The reason was to keep a range object as a range object instead of expanding it into a tuple of values (this relates to the altrep comment).

trossi (Contributor, Author):

I'd keep this change even though I removed the altrep-to-range conversion. The reason is that if a user created a custom altrep_constructor_dict mapping compact_intseq to a range object, that range object wouldn't be expanded here but would be passed to the dataframe as such. What do you think?

vnmabus (Owner):

I would say that such a mapping is an error. However, I do not have anything against this particular change, unless the tuple call was necessary for some reason.

Returns:
Numpy array.
"""
if isinstance(pd_array, pd.arrays.StringArray):
vnmabus (Owner):

So, is there some reason we can't use masked arrays whenever there are NA values, independently of the array type?

trossi (Contributor, Author):

That could be done. I think it would clarify many things, but it would require changes on the parsing side too.

The main issue here could be string arrays, where NA values are currently None. Should parsing of string arrays return a possibly masked array instead? Also in writing: if a user has strings = ["hello", None], this can currently be converted to a valid "string" array with array = np.array(strings) (which is of object dtype, requiring special treatment of object arrays). With a masked array, the user would instead need something like array = np.ma.array(data=["" if s is None else s for s in strings], mask=[s is None for s in strings]), which might be more cumbersome to work with. A benefit, though, would be that this array is of unicode dtype, which would simplify the numpy array handling.
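(Spelling out the two representations with made-up data:)

    import numpy as np

    strings = ["hello", None]

    # Current approach: object dtype, None marks NA
    obj_array = np.array(strings)

    # Masked-array alternative: unicode dtype, mask marks NA
    masked = np.ma.array(
        data=["" if s is None else s for s in strings],
        mask=[s is None for s in strings],
    )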

Do you think it would be useful to start using masked arrays in all these cases? (I feel it could fit better in a different PR, though, due to the changes in parsing.)

vnmabus (Owner):

I think we should probably use the new string dtype in the future, when available. It already supports missing data using sentinel values.

trossi (Contributor, Author):

Good point, I agree. So let's leave string arrays as they are for now.

How about float arrays with missing values? Would it be useful to convert them to masked arrays, similarly to integer arrays? (I think that change could go into another PR though, as it changes parsing behavior.)

vnmabus (Owner):

I think that it makes sense to do it for float arrays (in a different PR).

rdata/missing.py (outdated diff), comment on lines 60 to 61:
raw_dtype = f"V{array.dtype.itemsize}"
return array.view(raw_dtype) == np.array(na).view(raw_dtype) # type: ignore [no-any-return]
vnmabus (Owner):

I am not sure if I understand what is happening here.

trossi (Contributor, Author):

I added a comment to the code (and changed the void type to unsigned int). The reason is that R_FLOAT_NA is a NaN, so it follows NaN logic: R_FLOAT_NA == R_FLOAT_NA is False and R_FLOAT_NA != R_FLOAT_NA is True, but also R_FLOAT_NA == np.nan is False and R_FLOAT_NA != np.nan is True, even though R_FLOAT_NA and np.nan are different values. So, we need to compare values byte by byte to distinguish R_FLOAT_NA from other NaN values.
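(A small demonstration of that logic; to my understanding R's NA_real_ is the NaN carrying the payload 1954 (0x7A2) in its low word, so the constant below is built from that bit pattern:)

    import numpy as np

    # R's NA_real_ as a float64 built from its bit pattern
    R_FLOAT_NA = np.array([0x7FF00000000007A2], dtype=np.uint64).view(np.float64)[0]

    values = np.array([1.0, np.nan, R_FLOAT_NA])
    print(values == R_FLOAT_NA)
    # [False False False]: NaN never compares equal, not even to itself
    print(values.view(np.uint64) == np.array(R_FLOAT_NA).view(np.uint64))
    # [False False  True]: byte-level comparison tells NA apart from plain NaN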

vnmabus (Owner):

I see.

rdata/missing.py (outdated diff), comment on lines 64 to 67:
try:
    return is_na(np.array(array, dtype=np.int32))
except OverflowError:
    return is_na(np.array(array))
vnmabus (Owner):

I also think this deserves some explanation.

trossi (Contributor, Author):

I added a comment to the code. Basically, R seems not to support integers larger than 32-bit, so this attempts to convert Python's int (64-bit or larger) to a 32-bit integer (or proceeds with the larger int and fails later).
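(The behaviour relied on, in isolation, with made-up values:)

    import numpy as np

    np.array([42], dtype=np.int32)  # fits in 32 bits, matching R's integer type

    try:
        np.array([2**40], dtype=np.int32)  # too large for int32
    except OverflowError:
        # Fall back to a wider dtype; writing this as an R integer
        # then fails later with a clearer error.
        array = np.array([2**40])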

vnmabus (Owner):

Understood.

@trossi (Contributor, Author) left a review comment:

@vnmabus Thank you for the review. I have pushed the changes and left some comments/questions for discussion.




if self.df_attr_order is not None:
    attributes = {k: attributes[k] for k in self.df_attr_order}
else:
trossi (Contributor, Author):

Good idea! I restructured these conversions into user-definable constructor functions, similarly to the R-to-Python conversion. I also separated some other conversion logic into functions.

For pd.Categorical this was simple, but for pd.DataFrame it is a bit more complex, as it may contain pd.Categorical or other complex objects. For that reason, the constructor functions also take a parameter convert_to_r_object that points back to ConverterFromPythonToR.convert_to_r_object (which in turn can keep track of references etc.).
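(A hypothetical sketch of that shape, based only on the description above and the ConstructorReturnValue alias discussed below; names and details are illustrative, not the actual code:)

    import pandas as pd
    from rdata.parser import RObjectType  # assumed import path

    def categorical_constructor(data: pd.Categorical, convert_to_r_object):
        # Sketch: an R factor is an integer vector with "levels" and
        # "class" attributes; nested values go back through the converter
        # so that references are tracked in one place. (NA handling omitted.)
        attributes = {
            "levels": convert_to_r_object(data.categories.tolist()),
            "class": convert_to_r_object(["factor"]),
        }
        return RObjectType.INT, data.codes + 1, attributes  # R codes are 1-based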

Do you have a suggestion for a naming convention? Currently there is a dataframe_constructor for both the R-to-Python and Python-to-R conversions.

@trossi mentioned this pull request on Oct 2, 2024.

@@ -40,6 +42,7 @@ def write_rds(
        compression: Compression.
        encoding: Encoding to be used for strings within data.
        format_version: File format version.
+       constructor_dict: Dictionary mapping Python types to R classes.
vnmabus (Owner):

That is not really true, right? It maps Python classes to functions that convert them to R classes (which is more powerful, as a function can choose a different R class depending on the attributes of the object).

@@ -1,4 +1,5 @@
"""Utilities for converting R objects to Python ones."""
vnmabus (Owner):

This docstring should change.


ConstructorReturnValue = tuple[RObjectType, Any, dict[str, Any] | None]
ConstructorFunction1 = Callable[[Any], ConstructorReturnValue]
ConstructorFunction2 = Callable[[Any, Converter], ConstructorReturnValue]
vnmabus (Owner):

For writing, as this is a new API, I think we can force the signature to always include the Converter. For reading we can't do that, as there are existing constructors without that parameter, and thus we would have to add the new signature and deprecate the old one.


ConstructorReturnValue = tuple[RObjectType, Any, dict[str, Any] | None]
vnmabus (Owner):

I am not sure if it wouldn't be better (less ambiguous and more flexible) to return the constructed R objects, even if that breaks symmetry with the reading constructors.

if self.df_attr_order is not None:
    attributes = {k: attributes[k] for k in self.df_attr_order}
else:
vnmabus (Owner):

I would add that as mandatory (at least for writing, as there is no existing code that requires backwards compatibility).



Successfully merging this pull request may close these issues: Write the DF as rds file.