Add writing functionality for dataframes #45

Open
wants to merge 91 commits into
base: develop
3a6664d
Add reference type to unparser
trossi Sep 5, 2024
153f803
Add draft dataframe conversion
trossi Sep 5, 2024
4557559
Add helper function for creating unicode arrays
trossi Sep 5, 2024
6eeb992
Add more pd.Series types
trossi Sep 5, 2024
ffddf74
Fix the order of symbol references
trossi Sep 5, 2024
eb82ff6
Add a converter class for Python-to-R conversion
trossi Sep 10, 2024
1868d8a
Fix masked values in masked array
trossi Sep 10, 2024
8d9cb55
Compare first string representations
trossi Sep 10, 2024
398d1e9
Fix conversion of dataframe columns
trossi Sep 10, 2024
9cdd37c
Add support for dataframe with string index
trossi Sep 10, 2024
5084d2d
Add assertions for strings
trossi Sep 12, 2024
af0f6fe
Add conversion for rangeindex and range
trossi Sep 12, 2024
1c71a86
Add conversion of integer index
trossi Sep 12, 2024
8fa951e
Add unparsing altreps
trossi Sep 12, 2024
b205d8d
Move build_r_data function under converter class
trossi Sep 12, 2024
963a9bc
Convert range to array for old format
trossi Sep 12, 2024
61a2ea2
Fix ruff
trossi Sep 12, 2024
937908b
Set object flag explicitly
trossi Sep 12, 2024
8eda454
Fix mypy
trossi Sep 12, 2024
efbb09d
Add tests for different dataframe index types
trossi Sep 12, 2024
32a2cc6
Test converting expanded altrep
trossi Sep 12, 2024
1f4e8d8
Add only non-nil attributes to expanded altrep
trossi Sep 12, 2024
237bc22
Enable general rangeindex in dataframe
trossi Sep 12, 2024
6859b8c
Test conversion of altreps
trossi Sep 12, 2024
5ac49d0
Change attribute order to match test files
trossi Sep 12, 2024
6ad1408
Add comment about reordering attributes
trossi Sep 12, 2024
1c458ba
Fix ruff and mypy
trossi Sep 12, 2024
92429ca
Add test for dataframe with different dtypes
trossi Sep 12, 2024
5cf678d
Add conversion of boolean pd arrays
trossi Sep 12, 2024
f379fc9
Add test for pandas dtypes
trossi Sep 12, 2024
ddabf65
Add missing conversions
trossi Sep 12, 2024
9dd2559
Set dataframe attribute order file-by-file
trossi Sep 12, 2024
8652271
Add test for dataframe with NAs
trossi Sep 12, 2024
993e2ed
Add dataframe column transformation for more types
trossi Sep 12, 2024
dc1950d
Fix NA values in dataframes
trossi Sep 12, 2024
38c80d7
Fix dataframe attribute order
trossi Sep 12, 2024
b622c9c
Add NA floats to ascii parser
trossi Sep 12, 2024
c2728e2
Add NA floats to ascii unparser
trossi Sep 12, 2024
e27217c
Fix ruff
trossi Sep 12, 2024
9da1a42
Define NA checker function close to the definition
trossi Sep 13, 2024
3dc0169
Fix mypy
trossi Sep 13, 2024
d4049ba
Simplify reference lists
trossi Sep 13, 2024
cb3c487
Simplify creation of R lists
trossi Sep 13, 2024
76bdc83
Rename build_r_sym() to convert_to_r_sym()
trossi Sep 13, 2024
87194d4
Simplify creation of RData object
trossi Sep 13, 2024
d9c3be5
Add helper functions for conversion
trossi Sep 13, 2024
22fe43d
Clarify NA values
trossi Sep 16, 2024
3bd285a
Filter expected warnings
trossi Sep 16, 2024
80a0c9b
Add test for dataframe with NA and NaN floats
trossi Sep 16, 2024
b73d4bd
Do not use pandas floating array
trossi Sep 16, 2024
e88419e
Remove unused R_INT_MIN
trossi Sep 16, 2024
6bb5d5b
Change dataframe default attribute order
trossi Sep 16, 2024
fd813e2
Move NA values and related functions to a new file
trossi Sep 16, 2024
c63dfe7
Add helper functions for handling NA values
trossi Sep 16, 2024
b8b6948
Add comment on setting mask
trossi Sep 16, 2024
e773101
Add tests for missing value functionality
trossi Sep 16, 2024
828f9da
Include checking int and float values
trossi Sep 16, 2024
0fe903f
Include ascii format in testing too large ints
trossi Sep 16, 2024
2755b3c
Fix datatype conversions in ascii unparser
trossi Sep 16, 2024
6040bae
Include testing negative end
trossi Sep 16, 2024
7c60566
Speed up range check
trossi Sep 16, 2024
d450f73
Move duplicated code to a function
trossi Sep 16, 2024
4d3e38f
Speed up range check
trossi Sep 16, 2024
ac948de
Fix docstring
trossi Sep 16, 2024
957887c
Simplify the definition of the NA value
trossi Sep 19, 2024
c3fe17c
Apply suggestions from code review
trossi Oct 2, 2024
74a11ba
Fix indentation
trossi Oct 2, 2024
5449199
Fix mypy
trossi Oct 2, 2024
74bf9d8
Comment code
trossi Oct 2, 2024
da9c940
Raise NotImplementedError for untested code
trossi Oct 2, 2024
3acc180
Separate pandas types to constructor functions
trossi Oct 2, 2024
4557d99
Separate string builders to functions
trossi Oct 2, 2024
3e330a3
Remove unused variable
trossi Oct 2, 2024
0d45965
Convert all built-in types via numpy type
trossi Oct 2, 2024
de5a408
Raise error for non-string dictionary keys
trossi Oct 2, 2024
5d8187c
Raise error for non-string rda variable names
trossi Oct 2, 2024
59f4269
Use shorthand function
trossi Oct 2, 2024
540a59a
Add constructor_dict to helper functions
trossi Oct 2, 2024
aa7239d
Merge branch 'develop' into dataframe-writer
trossi Oct 25, 2024
df8b391
Recreate test files in common attribute order
trossi Oct 25, 2024
1a00c1d
Skip altreps with attributes in test
trossi Oct 25, 2024
ff6b6a9
Fix ruff
trossi Oct 25, 2024
daf1e3a
Filter expected warnings
trossi Oct 25, 2024
943e697
Pass converter object to constructor functions
trossi Oct 25, 2024
8a269ae
Allow constructor functions without converter
trossi Oct 25, 2024
9718161
Convert only pandas rangeindex to altrep
trossi Oct 28, 2024
a7c7066
Use more robust indexing
trossi Oct 28, 2024
5a430aa
Add tests for rangeindex
trossi Oct 28, 2024
87f4c65
Remove conversion of altrep to range
trossi Oct 28, 2024
25a14af
Clarify skip message
trossi Oct 28, 2024
7984089
Fix ruff formatting
trossi Oct 28, 2024
41 changes: 20 additions & 21 deletions rdata/_write.py
@@ -1,17 +1,21 @@
"""Functions to perform conversion and unparsing in one step."""

 from __future__ import annotations

 from typing import TYPE_CHECKING

-from .conversion import build_r_data, convert_to_r_object, convert_to_r_object_for_rda
-from .conversion.to_r import DEFAULT_FORMAT_VERSION
+from .conversion import (
+    DEFAULT_CONSTRUCTOR_DICT,
+    DEFAULT_FORMAT_VERSION,
+    convert_python_to_r_data,
+)
 from .unparser import unparse_file

 if TYPE_CHECKING:
     import os
     from typing import Any

-    from .conversion.to_r import Encoding
+    from .conversion.to_r import ConstructorDict, Encoding
     from .unparser import Compression, FileFormat


@@ -23,14 +27,12 @@ def write_rds(
     compression: Compression = "gzip",
     encoding: Encoding = "utf-8",
     format_version: int = DEFAULT_FORMAT_VERSION,
+    constructor_dict: ConstructorDict = DEFAULT_CONSTRUCTOR_DICT,
 ) -> None:
     """
     Write an RDS file.

-    This is a convenience function that wraps
-    :func:`rdata.conversion.convert_to_r_object`,
-    :func:`rdata.conversion.build_r_data`,
-    and :func:`rdata.unparser.unparse_file`,
+    This is a convenience function that wraps conversion and unparsing
     as it is the common use case.

     Args:
@@ -40,6 +42,7 @@ def write_rds(
         compression: Compression.
         encoding: Encoding to be used for strings within data.
         format_version: File format version.
+        constructor_dict: Dictionary mapping Python types to R classes.
Owner

That is not really true, right? It maps Python classes to functions that convert them to R classes (which is more powerful, as a function can choose a different R class depending on the attributes of the object).


     See Also:
         :func:`write_rda`: Similar function that writes an RDA or RDATA file.
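
Editorial sketch for the review comment above: the distinction between "mapping Python types to R classes" and "mapping Python classes to functions that convert them" can be shown with a small stand-alone example. Every name here (`RObjectSketch`, `list_constructor`, `constructor_dict_sketch`, `convert`) is hypothetical and is not rdata's actual API; the only point is that a function value can inspect the object before choosing an R type. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RObjectSketch:
    """Hypothetical stand-in for an R object (not rdata's own class)."""

    r_type: str
    value: Any


def list_constructor(data: list) -> RObjectSketch:
    # Because the dictionary value is a function, it can look at the
    # contents and choose the R type: a non-empty list of floats becomes
    # an atomic real vector, anything else a generic R list.
    if data and all(isinstance(x, float) for x in data):
        return RObjectSketch("REAL", data)
    return RObjectSketch("VEC", data)


# Maps Python classes to converter *functions*, not to R classes.
constructor_dict_sketch: dict[type, Callable[[Any], RObjectSketch]] = {
    list: list_constructor,
}


def convert(data: Any) -> RObjectSketch:
    """Look up the converter registered for the exact type of `data`."""
    constructor = constructor_dict_sketch.get(type(data))
    if constructor is None:
        msg = f"No constructor registered for {type(data).__name__}"
        raise NotImplementedError(msg)
    return constructor(data)
```

With this sketch, `convert([1.0, 2.0])` yields an object tagged `"REAL"`, while `convert([1, "a"])` yields one tagged `"VEC"`. A plain type-to-class mapping could not make that distinction.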
@@ -52,15 +55,13 @@ def write_rds(
         >>> data = ["hello", 1, 2.2, 3.3+4.4j]
         >>> rdata.write_rds("test.rds", data)
     """
-    r_object = convert_to_r_object(
+    r_data = convert_python_to_r_data(
         data,
         encoding=encoding,
-    )
-    r_data = build_r_data(
-        r_object,
-        encoding=encoding,
         format_version=format_version,
+        constructor_dict=constructor_dict,
     )
+
     unparse_file(
         path,
         r_data,
@@ -78,14 +79,12 @@ def write_rda(
     compression: Compression = "gzip",
     encoding: Encoding = "utf-8",
     format_version: int = DEFAULT_FORMAT_VERSION,
+    constructor_dict: ConstructorDict = DEFAULT_CONSTRUCTOR_DICT,
 ) -> None:
     """
     Write an RDA or RDATA file.

-    This is a convenience function that wraps
-    :func:`rdata.conversion.convert_to_r_object_for_rda`,
-    :func:`rdata.conversion.build_r_data`,
-    and :func:`rdata.unparser.unparse_file`,
+    This is a convenience function that wraps conversion and unparsing
     as it is the common use case.

     Args:
@@ -95,6 +94,7 @@ def write_rda(
         compression: Compression.
         encoding: Encoding to be used for strings within data.
         format_version: File format version.
+        constructor_dict: Dictionary mapping Python types to R classes.

     See Also:
         :func:`write_rds`: Similar function that writes an RDS file.
@@ -107,15 +107,14 @@ def write_rda(
         >>> data = {"name": "hello", "values": [1, 2.2, 3.3+4.4j]}
         >>> rdata.write_rda("test.rda", data)
     """
-    r_object = convert_to_r_object_for_rda(
+    r_data = convert_python_to_r_data(
         data,
         encoding=encoding,
-    )
-    r_data = build_r_data(
-        r_object,
-        encoding=encoding,
         format_version=format_version,
+        constructor_dict=constructor_dict,
+        file_type="rda",
     )
+
     unparse_file(
         path,
         r_data,
9 changes: 6 additions & 3 deletions rdata/conversion/__init__.py
@@ -1,4 +1,5 @@
"""Utilities for converting R objects to Python ones."""
Owner

This docstring should change.


 from ._conversion import (
     DEFAULT_CLASS_MAP as DEFAULT_CLASS_MAP,
     Converter as Converter,
@@ -25,7 +26,9 @@
     ts_constructor as ts_constructor,
 )
 from .to_r import (
-    build_r_data as build_r_data,
-    convert_to_r_object as convert_to_r_object,
-    convert_to_r_object_for_rda as convert_to_r_object_for_rda,
+    DEFAULT_CONSTRUCTOR_DICT as DEFAULT_CONSTRUCTOR_DICT,
Owner

I do not think this should be exposed here (at least with that name) as it can be easily confused with DEFAULT_CLASS_MAP. I think it would be better for the users to import the name directly from to_r/to_python submodules, and share the same nomenclature.

+    DEFAULT_FORMAT_VERSION as DEFAULT_FORMAT_VERSION,
Owner

This name is unclear: default for what? If it is the default version used for writing, maybe it should only be exposed in to_r (or maybe in the parser?).

+    ConverterFromPythonToR as ConverterFromPythonToR,
+    convert_python_to_r_data as convert_python_to_r_data,
+    convert_python_to_r_object as convert_python_to_r_object,
 )
38 changes: 28 additions & 10 deletions rdata/conversion/_conversion.py
@@ -394,20 +394,38 @@ def convert_array(
     return value  # type: ignore [no-any-return]


-R_INT_MIN = -2**31
-
-
 def _dataframe_column_transform(source: Any) -> Any:  # noqa: ANN401

     if isinstance(source, np.ndarray):
+        dtype: Any
         if np.issubdtype(source.dtype, np.integer):
-            return pd.Series(source, dtype=pd.Int32Dtype()).array
-
-        if np.issubdtype(source.dtype, np.bool_):
-            return pd.Series(source, dtype=pd.BooleanDtype()).array
+            dtype = pd.Int32Dtype()
+        elif np.issubdtype(source.dtype, np.floating):
+            # We return the numpy array here, which keeps
+            # R_FLOAT_NA, np.nan, and other NaNs as they were originally in the file.
+            # Users can then decide if they prefer to interpret
Owner

How can they decide?

+            # only R_FLOAT_NA or all NaNs as "missing".
+            return source
+            # This would create an array with all NaNs as "missing":
+            # dtype = pd.Float64Dtype()  # noqa: ERA001
+            # This would create an array with only R_FLOAT_NA as "missing":
+            # from rdata.missing import is_na  # noqa: ERA001
+            # return pd.arrays.FloatingArray(source, is_na(source))  # noqa: ERA001
+        elif np.issubdtype(source.dtype, np.complexfloating):
+            # There seems to be no pandas type for complex array
+            return source
+        elif np.issubdtype(source.dtype, np.bool_):
+            dtype = pd.BooleanDtype()
+        elif np.issubdtype(source.dtype, np.str_):
+            dtype = pd.StringDtype()
+        elif np.issubdtype(source.dtype, np.object_):
+            for value in source:
+                assert isinstance(value, str) or value is None
+            dtype = pd.StringDtype()
+        else:
+            return source

-        if np.issubdtype(source.dtype, np.str_):
-            return pd.Series(source, dtype=pd.StringDtype()).array
+        return pd.Series(source, dtype=dtype).array

     return source
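
Editorial sketch for the "How can they decide?" question above: the code comments in this hunk distinguish `R_FLOAT_NA` from other NaNs. One way a user could make that choice after reading is to inspect the raw bit patterns. The sketch below uses only the standard library and assumes that R encodes its floating-point NA as a NaN whose low 32 bits equal 1954. The helper names (`bits_of`, `is_nan_bits`, `is_r_na_bits`, `missing_mask`) are hypothetical and are not the functions in `rdata.missing`.

```python
import math
import struct

# Assumption: R's NA_real_ is a NaN whose low 32 bits equal 1954.
R_NA_LOW_WORD = 1954
R_FLOAT_NA_BITS = 0x7FF0_0000_0000_0000 | R_NA_LOW_WORD


def bits_of(value: float) -> int:
    """Return the raw IEEE 754 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_nan_bits(bits: int) -> bool:
    """True for any NaN: exponent all ones and a non-zero mantissa."""
    exponent_all_ones = ((bits >> 52) & 0x7FF) == 0x7FF
    mantissa_nonzero = (bits & ((1 << 52) - 1)) != 0
    return exponent_all_ones and mantissa_nonzero


def is_r_na_bits(bits: int) -> bool:
    """True only for NaNs that carry the assumed R NA payload."""
    return is_nan_bits(bits) and (bits & 0xFFFF_FFFF) == R_NA_LOW_WORD


def missing_mask(all_bits: list, *, only_r_na: bool) -> list:
    """Build a boolean mask under one of the two interpretations."""
    check = is_r_na_bits if only_r_na else is_nan_bits
    return [check(b) for b in all_bits]


# A column holding an ordinary value, a plain NaN, and an R NA.
column = [bits_of(1.5), bits_of(math.nan), R_FLOAT_NA_BITS]
print(missing_mask(column, only_r_na=True))   # [False, False, True]
print(missing_mask(column, only_r_na=False))  # [False, True, True]
```

A mask like this is what the commented-out `pd.arrays.FloatingArray(source, is_na(source))` line in the hunk would receive as its second argument. The sketch works on bit patterns rather than floats so that the NaN payload cannot be altered in transit.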

@@ -430,7 +448,7 @@ def dataframe_constructor(
             and isinstance(row_names, np.ma.MaskedArray)
             and row_names.mask[0]
         )
-        else tuple(row_names)
+        else row_names
Owner

Why this change?

Contributor Author

The reason was to keep a range object as a range object instead of expanding it to a tuple of values (this relates to the altrep comment).

Contributor Author

I'd keep this change even though I removed the altrep-to-range conversion. The reason is that if a user would like to create a custom altrep_constructor_dict that maps compact_intseq to a range object, then that range object wouldn't be expanded here but it would be passed to dataframe as such. What do you think?

Owner

I would say that such mapping is an error. However, I do not have anything against this particular change, unless the tuple call was necessary for some reason.

     )

     return pd.DataFrame(obj, columns=obj, index=index)
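
Editorial sketch for the `tuple(row_names)` discussion above: the exchange turns on the fact that a `range` still carries its start, stop, and step, while a tuple holds only the expanded values. The stdlib-only example below illustrates that difference. The helper `describe_row_names` is hypothetical and is not part of rdata.

```python
def describe_row_names(row_names):
    """Report whether row names are still a compact sequence."""
    if isinstance(row_names, range):
        # A range keeps its defining parameters, so a consumer could
        # still treat it as a compact integer sequence.
        return ("compact", row_names.start, row_names.stop, row_names.step)
    # Any other sequence exposes only its expanded values.
    return ("expanded", len(row_names))


compact = range(1, 6)
expanded = tuple(compact)

print(describe_row_names(compact))   # ('compact', 1, 6, 1)
print(describe_row_names(expanded))  # ('expanded', 5)
```

Passing `row_names` through unchanged therefore leaves it to the consumer (here, the `pd.DataFrame` call) to decide how a range is represented, whereas `tuple(row_names)` discards that information up front.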