Unable to round trip some files with some specific column and cell values #128

Open
graingert opened this issue Apr 23, 2021 · 4 comments
Labels
bug (Something isn't working), requires changes in Readstat (waiting for changes in the C library Readstat)

Comments

@graingert
Contributor

Describe the issue
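Writing a DataFrame with pyreadstat.write_sav and reading the file back with pyreadstat.read_sav fails with a ReadstatError for certain combinations of column names and cell values.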

To Reproduce

To trigger the failure, one cell needs 755 ASCII letters followed by a non-ASCII character, one column name needs five letters followed by a 2 (e.g. aaaaa2), and another column name needs to start with an ASCII letter and end in a 0 (e.g. a0).

Here's an example to generate them:

from __future__ import annotations

import pathlib
import os
import io
import tempfile

import pandas as pd
import pyreadstat


"""
numpy==1.20.2
pandas==1.2.4
pyreadstat==1.1.0
python-dateutil==2.8.1
pytz==2021.1
six==1.15.0
"""


def main():
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        dst_path = os.fsdecode(tmp_path / "eg.sav")

        df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
        pyreadstat.write_sav(
            dst_path=dst_path,
            df=df,
            column_labels=["x", "y", "z"],
        )
        pyreadstat.read_sav(dst_path)


if __name__ == "__main__":
    main()

This results in:

Traceback (most recent call last):
  File "foo.py", line 37, in <module>
    main()
  File "foo.py", line 33, in main
    pyreadstat.read_sav(dst_path)
  File "pyreadstat/pyreadstat.pyx", line 342, in pyreadstat.pyreadstat.read_sav
  File "pyreadstat/_readstat_parser.pyx", line 1034, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat/_readstat_parser.pyx", line 845, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat/_readstat_parser.pyx", line 775, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)
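
For reference, here is a rough sketch of what the inline CSV in the script above actually contains, using only the stdlib csv module instead of pandas. The text has a header row, a blank line, and a single data row with just one field. Assuming pandas' default read_csv behaviour (blank lines skipped, missing trailing fields filled with NaN), that means the long string lands in aaaaa2 while y and a0 are all NaN.

```python
import csv
import io

# Same text the script passes to pd.read_csv.
text = 'aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'

# csv.reader yields an empty list for the blank second line.
rows = list(csv.reader(io.StringIO(text)))
header, blank, data = rows

print(header)        # the three column names
print(blank)         # the empty second line
print(len(data))     # number of fields in the only data row
print(len(data[0]))  # number of characters in that field
```

So the resulting DataFrame has one row: a 756-character string and two missing values.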

Expected behavior
I'd expect to be able to round trip it

Setup Information:
How did you install pyreadstat? pip, see pip freeze output above
Platform: Ubuntu 20.04.2 LTS
Python Version Python 3.8.5 (default, Jan 27 2021, 15:41:15)
Using Virtualenv or condaenv? python3.8 -m venv

@graingert changed the title from "Unable to round trip some fileswith some specific column and cell values" to "Unable to round trip some files with some specific column and cell values" on Apr 23, 2021
@ofajardo
Collaborator

Thanks for the reproducible report. The error seems to come from the C library (Readstat), so I filed an issue over there.

@graingert
Contributor Author

It's also odd because a small change like this one (renaming the first column from aaaaa2 to aaaaa3)

-        df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
+        df = pd.read_csv(io.StringIO('aaaaa3,y,a0\n\n"' + ("a" * 755) + 'ü"'))

doesn't cause the failure
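
To make it easier to probe variants like this, here is a small sketch of a builder for the CSV text. make_csv is a hypothetical helper (not part of pandas or pyreadstat); it only parametrises the first column name, the number of ASCII letters, and the trailing character, and leaves everything else as in the original script.

```python
def make_csv(first_column: str, n_ascii: int, tail: str = "ü") -> str:
    """Build CSV text shaped like the one in the report (hypothetical helper)."""
    header = f"{first_column},y,a0"
    cell = '"' + ("a" * n_ascii) + tail + '"'
    # Header, blank line, then a single one-field data row.
    return header + "\n\n" + cell


failing = make_csv("aaaaa2", 755)  # reported above to fail on read_sav
passing = make_csv("aaaaa3", 755)  # reported above to round trip fine

# The two inputs differ in exactly one character of the header.
diff = [i for i, (a, b) in enumerate(zip(failing, passing)) if a != b]
print(diff)  # [5]
```

Either string can be passed to pd.read_csv(io.StringIO(...)) in place of the literal in the original script.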

@ofajardo
Collaborator

Super strange... I will report that in the Readstat issue.

@ofajardo added the bug (Something isn't working) label on May 25, 2021
@ofajardo
Collaborator

It is possible to reproduce this error without any international character (using only 'a's in this example) if the string is at least 757 characters long (in contrast to 756 when the international character is present). Another important condition is that the numerical values must be NaN; if they are, say, 1.0, everything is fine. The issue can be reproduced in pure C code using Readstat, meaning it is not a failure caused by Python or pyreadstat, see this
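
One observation that may be relevant, sketched below: assuming the strings are stored as UTF-8 (an assumption, not something confirmed in this thread), the 756-character string with the ü and the 757-character all-ASCII string have the same encoded length, because ü takes two bytes in UTF-8.

```python
with_umlaut = "a" * 755 + "ü"  # string from the original report
ascii_only = "a" * 757         # shortest all-ASCII string reported to fail

for label, s in [("with ü", with_umlaut), ("ASCII only", ascii_only)]:
    # Character count versus UTF-8 byte count.
    print(label, len(s), len(s.encode("utf-8")))
```

Both come out at 757 bytes, which would be consistent with a threshold on byte length rather than character count, though nothing here proves that is the cause.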

@ofajardo added the requires changes in Readstat (waiting for changes in the C library Readstat) label on Dec 15, 2021