Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing types in sink_csv/write_csv/read_csv/scan_csv. #11600

Open
2 tasks done
svaningelgem opened this issue Oct 8, 2023 · 6 comments
Open
2 tasks done

Changing types in sink_csv/write_csv/read_csv/scan_csv. #11600

svaningelgem opened this issue Oct 8, 2023 · 6 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@svaningelgem
Copy link
Contributor

svaningelgem commented Oct 8, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

As was being discussed in PR #11583 between @svaningelgem & @stijndegooijer :

#11583 (comment)

This "bug" report is to raise the issue and come to a solution (which might be to not do anything of course 😄 ...

What should we do about the types of:

  • eol_char / lineterminator
  • comment_char
  • quote_char
  • separator

My personal opinion would be that on the polars side, this should accept what it should be: a single character for the last 3.
And a single char for the eol_char, and a string for the lineterminator.

So my proposal would be to change these like:

name current (most of the time) proposal
eol_char u8 / &str Option<char>
lineterminator String Option<String>
comment_char Option<&str>/Option<u8> Option<char>
quote_char Option<&str>/Option<u8> Option<char>
separator u8 / &str Option<char>

Installed versions

branch: main

@stinodego
Copy link
Member

stinodego commented Oct 9, 2023

Intended solution

All of these parameters (except line_terminator) accept a single byte. In Python, such a type does not exist. The closest you can get is a bytes type. In Rust, there is the byte literal.

So the Python signature for these should be bytes instead of the current str. That way, specifying non-ASCII characters will result in a SyntaxError, e.g.

read_csv("test.csv", separator=b"日")
# SyntaxError: bytes can only contain ASCII literal characters

In PyO3, the corresponding type is &[u8]:

fn read_csv(
    separator: &[u8],
    ..
) {
    let separator: u8 = parse_single_byte_input(separator);  // util that asserts length 1 and extracts the single byte
    ...
}

The Rust core function should accept u8 for these types, as this is the correct type for working with a byte literal. This allows for example:

.with_separator(b';')

Current state & next steps

The Rust side already uses u8, which is great.

PyO3

The PyO3 bindings are inconsistent. In some places the type is &str, in others it's u8.
This should be standardized to &[u8], with the util to parse it into u8 with assertion that length == 1.

This can be done in a non-breaking way by converting the Python string input to bytes before passing it to PyO3 (we should write a small util for this):

separator = bytes(separator, "ascii")

Python

Python inputs are now strings. This should be changed to bytes. This would be a breaking change.

We can ease the transition by accepting both (if input is string, we convert to bytes using the util specified above) for a while.

Impact

The result will be that our type system better reflects that these parameters accept a single byte as input.

The only drawback I see is that Python users may be used to specifying string inputs, e.g. separator=";", as this is the norm in other packages like pandas. So we may be introducing some additional friction here.

Therefore, we could consider simply accepting both string and bytes, and converting to bytes before passing to PyO3.

@stinodego stinodego added python Related to Python Polars enhancement New feature or an improvement of an existing feature accepted Ready for implementation and removed rust Related to Rust Polars bug Something isn't working labels Oct 9, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 9, 2023
@svaningelgem
Copy link
Contributor Author

Additional thoughts:

read_csv("test.csv", separator=b"日")

  • Single Character Separators: Most CSV files use a single character as a separator, such as a comma (,), semicolon (;), or tab (\t). Single-character separators are widely supported by various applications and programming languages.

  • Multi-Character Separators: While less common, some CSV files use multi-character separators. These could be strings or sequences of characters, and they are typically enclosed in quotes to avoid confusion. For instance, a CSV file might use || as a separator: "value1"||"value2"||"value3".

  • Unicode Characters: CSV files can use Unicode characters as separators. Unicode characters are not limited to single-byte representations; they can be multibyte characters. However, when using multibyte characters, it's important to ensure that the CSV reader/parser used supports the specific encoding being used.

And I'm guessing the same goes for all other mentioned.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 9, 2023

Therefore, we could consider simply accepting both string and bytes, and converting to bytes before passing to PyO3.

This 😉✅💯 We can handle the conversion seamlessly for the caller here without being unnecessarily nitpicky (and while still raising good/clear errors, etc). They shouldn't really be made to think about what the Rust core is doing with types (vs Python) at quite such a low-level.

@stinodego
Copy link
Member

Multi-Character Separators / Unicode Separators

I had thought about this as well. While currently we support only single-byte separators, we might want to expand this in the future.

At that point, we'd have to accept strings everywhere instead of bytes (both in Python and Rust). And we'd probably have to rename some of our parameters (quote_char / comment_char would no longer be appropriate).

But let's worry about that when we get to it?

@stinodego
Copy link
Member

@svaningelgem Do you want to finish this one or should I pick this up?

@svaningelgem
Copy link
Contributor Author

Please feel free. I'm looking into 2 other things right now:

  • generalizing how the reads & writes happen within polars
  • exposing sink_parquet_cloud to python

Especially the first one is pretty frustrating, so I'm now picking up the 2nd and afterwards will start a crash course on rust so I have a bit more basis to continue from. Now it's just an endless trial and error...

@stinodego stinodego self-assigned this Oct 9, 2023
@stinodego stinodego moved this from Ready to In progress in Backlog Oct 9, 2023
@stinodego stinodego moved this from In progress to Ready in Backlog Oct 9, 2023
@stinodego stinodego removed their assignment Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
Status: Ready
Development

No branches or pull requests

3 participants