
Set field_size_limit to max long int size on csv plugin #352

Open
wants to merge 1 commit into base: develop
Conversation

humrochagf
Contributor

This pull request addresses issue #306 and fixes it for all platforms without falling back to a hardcoded value.

The overflow was happening because Python's CSV implementation uses a C long to store the maximum field size limit (https://github.com/python/cpython/blob/c88239f864a27f673c0f0a9e62d2488563f9d081/Modules/_csv.c#L21).
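The overflow is easy to reproduce with the stdlib csv module, which shares the same C long limit (a minimal sketch):

```python
import csv

# 2**128 exceeds LONG_MAX on every common platform, so the csv module
# cannot store it in its internal C long and raises OverflowError.
try:
    csv.field_size_limit(2**128)
except OverflowError:
    print("limit does not fit in a C long")
```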

The previous implementation, based on sys.maxsize, has a drawback on 64-bit Windows: its LLP64 data model keeps long as a 32-bit integer to preserve type compatibility, even though pointers are 64 bits wide (see https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models).

So, to get a reliable max long size on the target platform, you can use ctypes.c_ulong(-1), which wraps around to the maximum unsigned long value; integer-dividing that by 2 gives the maximum signed long for the target platform.
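The trick can be sketched like this against the stdlib csv module (the PR applies the same idea to the plugin's limit setup):

```python
import csv
import ctypes

# ctypes.c_ulong(-1) wraps around to ULONG_MAX for the platform's
# unsigned long; integer-dividing by 2 yields LONG_MAX, the largest
# value the csv module's internal C long can hold.
max_long = int(ctypes.c_ulong(-1).value // 2)

# This works on LP64 and LLP64 alike, whereas sys.maxsize would
# overflow the 32-bit long on 64-bit Windows.
csv.field_size_limit(max_long)
```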

Regarding memory impact: the long that holds the limit is already allocated, and it is used only to reject fields larger than the limit; the CSV data itself is allocated on demand.

I ran the memory profiler on a simple load of a big file from brasil.io (https://brasil.io/dataset/genero-nomes/nomes/) and got the same values after the CSV import operation:

Filename: test.py (with the unicodecsv.field_size_limit call commented out to use the default)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)

versus

Filename: test.py (using the max long size)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)
