set_types memory usage with extremely wide data #193

Open
cschloer opened this issue Jul 13, 2020 · 7 comments

Comments

@cschloer
Contributor

Hey,

I have a strange dataset here that has 2500 columns and only 60 rows. The set_types processor slowly gobbles up all of the memory when called with all 2500 columns.

Here's the data:
Coral%20ESVs_Galaxaura.xlsx

And the pipeline-spec.yaml:
pipeline-spec.yaml.txt

Note the recursion limit parameter, which does not exist in the standard load processor. You will probably have to set your Python recursion limit somewhere:

import sys

sys.setrecursionlimit(_recursion_limit)
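
A minimal sketch of one way to do that, assuming the limit only needs to be raised while the pipeline runs. The `recursion_limit` helper and the value 10000 are illustrative assumptions, not part of dataflows or datapackage_pipelines:

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit once the block exits.
        sys.setrecursionlimit(previous)


before = sys.getrecursionlimit()
# 10000 is an arbitrary illustrative value, not a recommendation.
with recursion_limit(10000):
    inside = sys.getrecursionlimit()
after = sys.getrecursionlimit()
```

Restoring the previous value afterwards keeps the higher limit from leaking into unrelated code in the same process.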

I think the main problem is that it is creating ~2500 set_type flows, which for some reason uses up a ton of memory. I'm guessing if there was a single set_types flow in dataflows then calling it wouldn't run out of memory.
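
The difference between the two shapes can be sketched with plain generators. This is only an analogy under stated assumptions: `cast_one_column`, `cast_chained` and `cast_single_pass` are hypothetical stand-ins, not dataflows APIs, and the sketch does not reproduce the library's actual memory behaviour. It shows that one stage per column nests one generator per column around the row iterator, while a single stage touches each row once:

```python
def cast_one_column(rows, column, cast):
    """One stage per column: a stand-in for chaining a separate flow per field."""
    for row in rows:
        row = dict(row)  # each stage copies the whole row
        row[column] = cast(row[column])
        yield row


def cast_chained(rows, casts):
    """Wrap the row iterator once per column, nesting one generator per field."""
    for column, cast in casts.items():
        rows = cast_one_column(rows, column, cast)
    return rows


def cast_single_pass(rows, casts):
    """A single stage that casts every listed column in one pass over each row."""
    for row in rows:
        yield {key: casts[key](value) if key in casts else value
               for key, value in row.items()}


# Small illustrative dimensions; the dataset in this issue is 2500 x 60.
width, height = 200, 3
data = [{f"col{i}": str(i + r) for i in range(width)} for r in range(height)]
casts = {f"col{i}": int for i in range(width)}

chained_result = list(cast_chained(iter(data), casts))
single_result = list(cast_single_pass(iter(data), casts))
```

Both variants produce the same rows; the chained one does so through `width` nested generator frames per row, which is also why a deep chain can run into the recursion limit.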

@cschloer
Contributor Author

@roll

@cschloer
Contributor Author

cschloer commented Jul 13, 2020

My new running theory is that, for every field passed into the set_types processor, the entire resource (including every field) is validated by the set_type flow. So you end up with (w * h) * w operations, where w is the number of columns and h the number of rows.

edit: Maybe that's wrong, as I see now that only the single field that is updated is being validated...
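
For scale, the arithmetic behind that estimate, using the dimensions reported at the top of the issue. It only illustrates the hypothesised worst case; as the edit above notes, it may not reflect what set_type actually validates:

```python
width = 2500   # columns, as reported for the dataset in this issue
height = 60    # rows

# One validation per cell if each field is validated once.
cells = width * height

# Hypothesised worst case: the whole resource revalidated once per field.
worst_case = cells * width
```

Under these numbers `cells` is 150,000 and `worst_case` is 375,000,000, a factor of `width` apart.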

@cschloer
Copy link
Contributor Author

I created this custom processor that seems to fix the issue:

from dataflows import Flow, PackageWrapper, schema_validator
from dataflows.helpers.resource_matcher import ResourceMatcher
from datapackage_pipelines.wrapper import ingest
from datapackage_pipelines.utilities.flow_utils import spew_flow
import re


def set_types(parameters, resources=None, regex=None, types=None):
    # Avoid sharing one mutable default dict across calls.
    types = types or {}

    def func(package: PackageWrapper):
        matcher = ResourceMatcher(resources, package.pkg)
        # Update all matching field descriptors in a single pass.
        for resource in package.pkg.descriptor["resources"]:
            if matcher.match(resource["name"]):
                fields = resource["schema"]["fields"]
                for name, options in types.items():
                    if not regex:
                        # Treat the name as a literal string, not a pattern.
                        name = re.escape(name)
                    name = re.compile(f"^{name}$")
                    for field in fields:
                        if name.match(field["name"]):
                            field.update(options)

        yield package.pkg
        # Validate the rows of matching resources in one schema_validator pass.
        for rows in package:
            if matcher.match(rows.res.name):
                yield schema_validator(rows.res, rows)
            else:
                yield rows
        yield from package

    return func


def flow(parameters):
    resources = parameters.get("resources", None)
    regex = parameters.get("regex", True)
    types = parameters.get("types", {})
    return Flow(set_types(parameters, resources=resources, regex=regex, types=types))


if __name__ == "__main__":
    with ingest() as ctx:
        spew_flow(flow(ctx.parameters), ctx)
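
The field-matching step of the processor above can be tried in isolation on a plain list of field descriptors, without dataflows installed. A minimal sketch: `apply_types` and the field names `ESV_1`, `ESV_2` and `site` are hypothetical, and `fullmatch` is used here in place of the `^...$` anchors:

```python
import re


def apply_types(fields, types, regex=True):
    """Update matching field descriptors in place and return the matched names."""
    matched = []
    for name, options in types.items():
        # With regex=False the name is escaped so it only matches literally.
        pattern = re.compile(name if regex else re.escape(name))
        for field in fields:
            if pattern.fullmatch(field["name"]):
                field.update(options)
                matched.append(field["name"])
    return matched


# Hypothetical schema fields, for illustration only.
fields = [
    {"name": "ESV_1", "type": "string"},
    {"name": "ESV_2", "type": "string"},
    {"name": "site", "type": "string"},
]
matched = apply_types(fields, {r"ESV_\d+": {"type": "integer"}})
literal = apply_types(fields, {r"ESV_\d+": {"type": "number"}}, regex=False)
```

With `regex=True` a single pattern covers every matching column, so a very wide schema does not need one entry per field; with `regex=False` the same string matches nothing because no field is literally named `ESV_\d+`.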

@roll
Member

roll commented Jul 13, 2020

Hi @cschloer, would you like to PR?

@cschloer
Contributor Author

Sure, though it would require a PR in both dataflows and datapackage_pipelines. @akariv is there any reason you only implemented the singular set_type in dataflows instead of a set_types?

@akariv
Member

akariv commented Jul 13, 2020 via email

@cschloer
Contributor Author

OK, should I hold off on creating a PR then, since this is specific to the unusual case of having 2500 columns?
