set_types memory usage with extremely wide data #193
My new running theory is that the problem is that, for every field passed into set_types, validation is run on all of the fields. Edit: maybe that's wrong, as I see now that only the single field that is updated is being validated...
I created this custom processor that seems to fix the issue:
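For reference, a minimal sketch of what a single-pass processor along these lines could look like. This is not the processor attached above; the 'types' parameter shape, the cast table, and the file layout are assumptions made purely for illustration, using the datapackage-pipelines process helper.

```python
# Hypothetical single-pass "set_types"-style processor (not the one attached
# above). It updates all of the requested field types in the datapackage once,
# then casts the values while streaming the rows, instead of adding one
# set_type step per column. The shape of the "types" parameter is an assumption.
from datapackage_pipelines.wrapper import process

CASTERS = {'number': float, 'integer': int, 'string': str}


def modify_datapackage(datapackage, parameters, stats):
    # Rewrite the schema metadata for every requested field in one pass.
    types = parameters.get('types', {})
    for resource in datapackage.get('resources', []):
        for field in resource.get('schema', {}).get('fields', []):
            if field['name'] in types:
                field.update(types[field['name']])
    return datapackage


def process_row(row, row_index, spec, resource_index, parameters, stats):
    # Cast the affected columns while the rows stream through.
    for name, props in parameters.get('types', {}).items():
        value = row.get(name)
        if value not in (None, ''):
            row[name] = CASTERS.get(props.get('type'), str)(value)
    return row


process(modify_datapackage=modify_datapackage, process_row=process_row)
```

In pipeline-spec.yaml this would be wired in like any other custom processor, with a 'types' mapping of field name to field properties as its parameter.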
Hi @cschloer, would you like to PR?
Sure - though it would require a PR in both dataflows and datapackage_pipelines. @akariv is there any reason you only implemented the singular set_type in dataflows instead of a set_types?
I would add an 'update_field' method (to complement 'update_package', 'update_resource' and 'update_schema'). Then your flow would contain multiple 'update_field' calls and a single 'validate' at the end.
Nevertheless, I think that for 2500 consecutive columns (definitely an unusual case :) ) a processor like yours might be a better solution.
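To make the shape of that suggestion concrete, here is a rough sketch. update_field does not exist in dataflows at the time of writing, so a stand-in is defined locally just to keep the example self-contained; the signature, column names and file names are assumptions.

```python
# Sketch of the proposed flow shape: many cheap metadata-only edits, then a
# single validate() pass at the end. update_field is not a real dataflows
# processor yet; the stand-in below only illustrates the idea.
from dataflows import Flow, load, validate, dump_to_path


def update_field(name, **props):
    def step(package):
        for resource in package.pkg.descriptor.get('resources', []):
            for field in resource.get('schema', {}).get('fields', []):
                if field['name'] == name:
                    field.update(props)
        yield package.pkg      # emit the updated datapackage
        yield from package     # pass the row streams through unchanged
    return step


Flow(
    load('Coral ESVs_Galaxaura.xlsx'),
    update_field('ESV_1', type='number'),   # hypothetical column names
    update_field('ESV_2', type='number'),
    # ... one update_field per column, no per-step validation ...
    validate(),                             # single validation pass at the end
    dump_to_path('out'),
).process()
```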
OK - should I hold off on creating a PR then, since this is specific to the unusual case of having 2500 columns?
Hey,
I have a strange dataset here that has 2500 columns and only 60 rows. The set_types processor slowly gobbles up all of the memory when called with all 2500 columns.
Here's the data: Coral%20ESVs_Galaxaura.xlsx
And the pipeline-spec.yaml: pipeline-spec.yaml.txt
Note the recursion limit parameter, which does not exist in the standard load processor - you will probably have to set your Python recursion limit somewhere:
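For anyone hitting the same thing, that workaround amounts to something like the following; where exactly to put it and the value to use are assumptions, the number shown is only a guess.

```python
# Raise Python's recursion limit before running the pipeline. If ~2500 chained
# steps end up recursing, the default limit of 1000 is easily exceeded; the
# exact value needed here is an assumption.
import sys

sys.setrecursionlimit(10000)
```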
I think the main problem is that it is creating ~2500 set_type flows, which for some reason uses up a ton of memory. I'm guessing that if there were a single set_types flow in dataflows then calling it wouldn't run out of memory.
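If that theory is right, the pipeline effectively expands into something like the flow below; the column names and output path are made up for illustration.

```python
# Roughly what ~2500 individual set_type steps look like when chained into a
# single dataflows Flow: each one is its own processor in the chain.
from dataflows import Flow, load, set_type, dump_to_path

columns = ['ESV_%d' % i for i in range(1, 2501)]        # assumed column names
steps = [set_type(name, type='number') for name in columns]

Flow(
    load('Coral ESVs_Galaxaura.xlsx'),
    *steps,
    dump_to_path('out'),
).process()
```

A combined set_types step in dataflows itself, as discussed above, would collapse all of those into a single processor.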