Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add writing functionality for dataframes #45

Open
wants to merge 91 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 66 commits
Commits
Show all changes
91 commits
Select commit Hold shift + click to select a range
3a6664d
Add reference type to unparser
trossi Sep 5, 2024
153f803
Add draft dataframe conversion
trossi Sep 5, 2024
4557559
Add helper function for creating unicode arrays
trossi Sep 5, 2024
6eeb992
Add more pd.Series types
trossi Sep 5, 2024
ffddf74
Fix the order of symbol references
trossi Sep 5, 2024
eb82ff6
Add a converter class for Python-to-R conversion
trossi Sep 10, 2024
1868d8a
Fix masked values in masked array
trossi Sep 10, 2024
8d9cb55
Compare first string representations
trossi Sep 10, 2024
398d1e9
Fix conversion of dataframe columns
trossi Sep 10, 2024
9cdd37c
Add support for dataframe with string index
trossi Sep 10, 2024
5084d2d
Add assertions for strings
trossi Sep 12, 2024
af0f6fe
Add conversion for rangeindex and range
trossi Sep 12, 2024
1c71a86
Add conversion of integer index
trossi Sep 12, 2024
8fa951e
Add unparsing altreps
trossi Sep 12, 2024
b205d8d
Move build_r_data function under converter class
trossi Sep 12, 2024
963a9bc
Convert range to array for old format
trossi Sep 12, 2024
61a2ea2
Fix ruff
trossi Sep 12, 2024
937908b
Set object flag explicitly
trossi Sep 12, 2024
8eda454
Fix mypy
trossi Sep 12, 2024
efbb09d
Add tests for different dataframe index types
trossi Sep 12, 2024
32a2cc6
Test converting expanded altrep
trossi Sep 12, 2024
1f4e8d8
Add only non-nil attributes to expanded altrep
trossi Sep 12, 2024
237bc22
Enable general rangeindex in dataframe
trossi Sep 12, 2024
6859b8c
Test conversion of altreps
trossi Sep 12, 2024
5ac49d0
Change attribute order to match test files
trossi Sep 12, 2024
6ad1408
Add comment about reordering attributes
trossi Sep 12, 2024
1c458ba
Fix ruff and mypy
trossi Sep 12, 2024
92429ca
Add test for dataframe with different dtypes
trossi Sep 12, 2024
5cf678d
Add conversion of boolean pd arrays
trossi Sep 12, 2024
f379fc9
Add test for pandas dtypes
trossi Sep 12, 2024
ddabf65
Add missing conversions
trossi Sep 12, 2024
9dd2559
Set dataframe attribute order file-by-file
trossi Sep 12, 2024
8652271
Add test for dataframe with NAs
trossi Sep 12, 2024
993e2ed
Add dataframe column transformation for more types
trossi Sep 12, 2024
dc1950d
Fix NA values in dataframes
trossi Sep 12, 2024
38c80d7
Fix dataframe attribute order
trossi Sep 12, 2024
b622c9c
Add NA floats to ascii parser
trossi Sep 12, 2024
c2728e2
Add NA floats to ascii unparser
trossi Sep 12, 2024
e27217c
Fix ruff
trossi Sep 12, 2024
9da1a42
Define NA checker function close to the definition
trossi Sep 13, 2024
3dc0169
Fix mypy
trossi Sep 13, 2024
d4049ba
Simplify reference lists
trossi Sep 13, 2024
cb3c487
Simplify creation of R lists
trossi Sep 13, 2024
76bdc83
Rename build_r_sym() to convert_to_r_sym()
trossi Sep 13, 2024
87194d4
Simplify creation of RData object
trossi Sep 13, 2024
d9c3be5
Add helper functions for conversion
trossi Sep 13, 2024
22fe43d
Clarify NA values
trossi Sep 16, 2024
3bd285a
Filter expected warnings
trossi Sep 16, 2024
80a0c9b
Add test for dataframe with NA and NaN floats
trossi Sep 16, 2024
b73d4bd
Do not use pandas floating array
trossi Sep 16, 2024
e88419e
Remove unused R_INT_MIN
trossi Sep 16, 2024
6bb5d5b
Change dataframe default attribute order
trossi Sep 16, 2024
fd813e2
Move NA values and related functions to a new file
trossi Sep 16, 2024
c63dfe7
Add helper functions for handling NA values
trossi Sep 16, 2024
b8b6948
Add comment on setting mask
trossi Sep 16, 2024
e773101
Add tests for missing value functionality
trossi Sep 16, 2024
828f9da
Include checking int and float values
trossi Sep 16, 2024
0fe903f
Include ascii format in testing too large ints
trossi Sep 16, 2024
2755b3c
Fix datatype conversions in ascii unparser
trossi Sep 16, 2024
6040bae
Include testing negative end
trossi Sep 16, 2024
7c60566
Speed up range check
trossi Sep 16, 2024
d450f73
Move duplicated code to a function
trossi Sep 16, 2024
4d3e38f
Speed up range check
trossi Sep 16, 2024
ac948de
Fix docstring
trossi Sep 16, 2024
957887c
Simplify the definition of the NA value
trossi Sep 19, 2024
c3fe17c
Apply suggestions from code review
trossi Oct 2, 2024
74a11ba
Fix indentation
trossi Oct 2, 2024
5449199
Fix mypy
trossi Oct 2, 2024
74bf9d8
Comment code
trossi Oct 2, 2024
da9c940
Raise NotImplementedError for untested code
trossi Oct 2, 2024
3acc180
Separate pandas types to constructor functions
trossi Oct 2, 2024
4557d99
Separate string builders to functions
trossi Oct 2, 2024
3e330a3
Remove unused variable
trossi Oct 2, 2024
0d45965
Convert all built-in types via numpy type
trossi Oct 2, 2024
de5a408
Raise error for non-string dictionary keys
trossi Oct 2, 2024
5d8187c
Raise error for non-string rda variable names
trossi Oct 2, 2024
59f4269
Use shorthand function
trossi Oct 2, 2024
540a59a
Add constructor_dict to helper functions
trossi Oct 2, 2024
aa7239d
Merge branch 'develop' into dataframe-writer
trossi Oct 25, 2024
df8b391
Recreate test files in common attribute order
trossi Oct 25, 2024
1a00c1d
Skip altreps with attributes in test
trossi Oct 25, 2024
ff6b6a9
Fix ruff
trossi Oct 25, 2024
daf1e3a
Filter expected warnings
trossi Oct 25, 2024
943e697
Pass converter object to constructor functions
trossi Oct 25, 2024
8a269ae
Allow constructor functions without converter
trossi Oct 25, 2024
9718161
Convert only pandas rangeindex to altrep
trossi Oct 28, 2024
a7c7066
Use more robust indexing
trossi Oct 28, 2024
5a430aa
Add tests for rangeindex
trossi Oct 28, 2024
87f4c65
Remove conversion of altrep to range
trossi Oct 28, 2024
25a14af
Clarify skip message
trossi Oct 28, 2024
7984089
Fix ruff formatting
trossi Oct 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 8 additions & 19 deletions rdata/_write.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@

from typing import TYPE_CHECKING

from .conversion import build_r_data, convert_to_r_object, convert_to_r_object_for_rda
from .conversion import convert_python_to_r_data
from .conversion.to_r import DEFAULT_FORMAT_VERSION
from .unparser import unparse_file

Expand All @@ -27,10 +27,7 @@ def write_rds(
"""
Write an RDS file.

This is a convenience function that wraps
:func:`rdata.conversion.convert_to_r_object`,
:func:`rdata.conversion.build_r_data`,
and :func:`rdata.unparser.unparse_file`,
This is a convenience function that wraps conversion and unparsing
as it is the common use case.

Args:
Expand All @@ -52,15 +49,12 @@ def write_rds(
>>> data = ["hello", 1, 2.2, 3.3+4.4j]
>>> rdata.write_rds("test.rds", data)
"""
r_object = convert_to_r_object(
r_data = convert_python_to_r_data(
data,
encoding=encoding,
)
r_data = build_r_data(
r_object,
encoding=encoding,
format_version=format_version,
)

unparse_file(
path,
r_data,
Expand All @@ -82,10 +76,7 @@ def write_rda(
"""
Write an RDA or RDATA file.

This is a convenience function that wraps
:func:`rdata.conversion.convert_to_r_object_for_rda`,
:func:`rdata.conversion.build_r_data`,
and :func:`rdata.unparser.unparse_file`,
This is a convenience function that wraps conversion and unparsing
as it is the common use case.

Args:
Expand All @@ -107,15 +98,13 @@ def write_rda(
>>> data = {"name": "hello", "values": [1, 2.2, 3.3+4.4j]}
>>> rdata.write_rda("test.rda", data)
"""
r_object = convert_to_r_object_for_rda(
r_data = convert_python_to_r_data(
data,
encoding=encoding,
)
r_data = build_r_data(
r_object,
encoding=encoding,
format_version=format_version,
file_type="rda",
)

unparse_file(
path,
r_data,
Expand Down
6 changes: 3 additions & 3 deletions rdata/conversion/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
ts_constructor as ts_constructor,
)
from .to_r import (
build_r_data as build_r_data,
convert_to_r_object as convert_to_r_object,
convert_to_r_object_for_rda as convert_to_r_object_for_rda,
ConverterFromPythonToR as ConverterFromPythonToR,
convert_python_to_r_data as convert_python_to_r_data,
convert_python_to_r_object as convert_python_to_r_object,
)
71 changes: 63 additions & 8 deletions rdata/conversion/_conversion.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@
import xarray
from typing_extensions import override

from rdata.parser._parser import get_altrep_name

from .. import parser

ConversionFunction = Callable[[Union[parser.RData, parser.RObject]], Any]
Expand Down Expand Up @@ -394,20 +396,70 @@ def convert_array(
return value # type: ignore [no-any-return]


R_INT_MIN = -2**31
def convert_altrep_to_range(
r_altrep: parser.RObject,
) -> range:
"""
Convert a R altrep to range object.

Args:
r_altrep: R altrep object

Returns:
Range object.
"""
if r_altrep.info.type != parser.RObjectType.ALTREP:
msg = "Must receive an altrep object"
raise TypeError(msg)

info, state, attr = r_altrep.value
assert attr.info.type == parser.RObjectType.NILVALUE

altrep_name = get_altrep_name(info)

if altrep_name != b"compact_intseq":
msg = "Only compact integer sequences can be converted to range"
raise NotImplementedError(msg)

n = int(state.value[0])
start = int(state.value[1])
step = int(state.value[2])
stop = start + (n - 1) * step
return range(start, stop + 1, step)


def _dataframe_column_transform(source: Any) -> Any: # noqa: ANN401

if isinstance(source, np.ndarray):
dtype: Any
if np.issubdtype(source.dtype, np.integer):
return pd.Series(source, dtype=pd.Int32Dtype()).array

if np.issubdtype(source.dtype, np.bool_):
return pd.Series(source, dtype=pd.BooleanDtype()).array
dtype = pd.Int32Dtype()
elif np.issubdtype(source.dtype, np.floating):
# We return the numpy array here, which keeps
# R_FLOAT_NA, np.nan, and other NaNs as they were originally in the file.
# Users can then decide if they prefer to interpret
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can they decide?

# only R_FLOAT_NA or all NaNs as "missing".
return source
# This would create an array with all NaNs as "missing":
# dtype = pd.Float64Dtype() # noqa: ERA001
# This would create an array with only R_FLOAT_NA as "missing":
# from rdata.missing import is_na # noqa: ERA001
# return pd.arrays.FloatingArray(source, is_na(source)) # noqa: ERA001
elif np.issubdtype(source.dtype, np.complexfloating):
# There seems to be no pandas type for complex array
return source
elif np.issubdtype(source.dtype, np.bool_):
dtype = pd.BooleanDtype()
elif np.issubdtype(source.dtype, np.str_):
dtype = pd.StringDtype()
elif np.issubdtype(source.dtype, np.object_):
for value in source:
assert isinstance(value, str) or value is None
dtype = pd.StringDtype()
else:
return source

if np.issubdtype(source.dtype, np.str_):
return pd.Series(source, dtype=pd.StringDtype()).array
return pd.Series(source, dtype=dtype).array

return source

Expand All @@ -430,7 +482,7 @@ def dataframe_constructor(
and isinstance(row_names, np.ma.MaskedArray)
and row_names.mask[0]
)
else tuple(row_names)
else row_names
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason was to keep range object as range object instead of expanding it to a tuple of values (relates to the altrep comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep this change even though I removed the altrep-to-range conversion. The reason is that if a user would like to create a custom altrep_constructor_dict that maps compact_intseq to a range object, then that range object wouldn't be expanded here but it would be passed to dataframe as such. What do you think?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that such mapping is an error. However, I do not have anything against this particular change, unless the tuple call was necessary for some reason.

)

return pd.DataFrame(obj, columns=obj, index=index)
Expand Down Expand Up @@ -820,6 +872,9 @@ def _convert_next( # noqa: C901, PLR0912, PLR0915

value = None

elif obj.info.type == parser.RObjectType.ALTREP:
value = convert_altrep_to_range(obj)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that I am missing something here. Altreps should not be present here when expand_altrep is True (and that is the default). Also, I am not sure that it is a good idea to convert to different Python objects depending on the internal way some R object is stored (AFAIK, altreps are supposed to be an implementation detail).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to the support for writing dataframes with range indices like pd.DataFrame(..., index=range(2, 5)).

The reason that this change is in this PR is that it allows testing R-to-Python-to-R roundtrip for such a dataframe (in R data.frame(..., row.names=2:4), this is test_dataframe_range_rownames.rds). Here row names is compact_intseq altrep, which would get converted to a regular array by default, which would then fail the roundtrip.

But, other than enabling that testing, this change wouldn't really belong to this PR, so it could be removed if you think that's better.

In general, I feel that it could be nicer if altreps would be expanded in the conversion stage instead of the parsing stage. The reason is that then the RData object would be a one-to-one representation of the file contents, which is not the case for altreps at the moment (unless expand_altrep=False). I have interpreted compact_intseq altrep to implement the same idea as Python's range object, so it would be useful if users would have an option to choose whether they want this altrep to be converted to a numpy array or a range object, which seems not possible (or easy?) if altreps are expanded already in parsing. This would be larger change though (beyond this PR). What do you think in general?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think altreps are intended to be an internal representation detail. That is why I made the default to expand them in the parser: I assumed that, except for very specific situations, users would prefer to ignore their existence.

That said, IMHO the default conversion routines should be able to deal with altreps (in case that expand_altreps is False). I still think they should convert it to the same object as if they weren't altreps, for consistency.

For me ranges are a very different beast than NumPy arrays, and returning such different kind of objects for the "same" underlying R object would be confusing. Moreover, there is no equivalent to range for compact sequences of float.

If we want to convert an array back to an altrep for space savings we could probably try to detect if the array is a sequence (with np.diff, a subtraction, and np.all or something like that it would be easy, at least for integer dtype). However that would convert to altrep even arrays that were not stored as altreps in the original file, for whatever reason. So, the choice would be between no altreps at all when writing, or all possible ones.

For the above reasons I think it is preferrable not to consider altreps in the roundtrip comparison (that is, comparing both files with expand_altreps=True. Any thoughts on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your points on range and that it would be confusing to always convert intseq altrep to a range object as it doesn't behave similarly to numpy array (in contrast to behavior in R). So, I removed this altrep-to-range conversion from R-to-Python conversion.

In Python-to-R conversion side, I also removed the general conversion of range objects, but changed that logic to apply to pd.RangeIndex only, that is, pd.RangeIndex converts to altrep (if possible; step=1) and pd.Index converts to array. I added a few tests for this case as the roundtrip comparison can't reach this case. Do you think this is ok?

If we want to convert an array back to an altrep for space savings we could probably try to detect if the array is a sequence

I agree that could be done at some point if needed.

For the above reasons I think it is preferrable not to consider altreps in the roundtrip comparison (that is, comparing both files with expand_altreps=True. Any thoughts on this?

I agree. The R-Python-R roundtrip tests do file checking only for expand_altreps=False (which skips all files with altreps as altreps are then unhandled in R-to-Python conversion).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is ok?

In principle it is ok to write altreps when possible by default. I am not sure if we should add an option for the Converter not to use altreps, for having compatibility with tools that do not understand them.


else:
msg = f"Type {obj.info.type} not implemented"
raise NotImplementedError(msg)
Expand Down
Loading
Loading