Ungraceful failing when data isn't properly touched #249

Open
yimuchen opened this issue Apr 26, 2023 · 7 comments
Comments

@yimuchen

yimuchen commented Apr 26, 2023

I'm attempting to implement a function over an awkward collection and expose it to dask_awkward. I've noticed that if the correct touching is not implemented (see the commented-out arr.y.layout._touch_data line), the following program segfaults.

import awkward
import dask_awkward
import numpy


def ak_f(arr):
    return arr.x + arr.y


def dak_f(arr):
    class _call_wrap:
        def __call__(self, arr):
            if awkward.backend(arr) == "typetracer":
                # Running the touch function
                arr.x.layout._touch_data(recursive=True)
                # arr.y.layout._touch_data(recursive=True)  # Intentionally disabled for demonstration

                # Getting the length-0 array for evaluation
                x = arr.layout.form.length_zero_array(behavior=arr.behavior)
                out = ak_f(x)
                return awkward.Array(
                    out.layout.to_typetracer(forget_length=True),
                    behavior=out.behavior,
                )
            else:
                return ak_f(arr)

    return dask_awkward.lib.core.map_partitions(_call_wrap(), arr, label="test_dak")


ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)

print(ak_f(ak_arr))

awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(dak_f(dak_arr).compute())

In the shorter term, I think it is important that these failures be handled more gracefully, with a user-visible error message rather than a hard segfault.
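A minimal sketch of what a more graceful failure might look like, in plain Python. This is a toy illustration only, not awkward's or dask-awkward's internals: `PartialRecord` and `UntouchedFieldError` are hypothetical names, and the assumption is that the container knows which columns exist and which were actually read.

```python
class UntouchedFieldError(KeyError):
    """Hypothetical error raised when a field exists but was never loaded."""


class PartialRecord:
    """Toy stand-in for a partition in which only the touched columns were read."""

    def __init__(self, loaded, all_fields):
        self._loaded = dict(loaded)         # columns that were actually read
        self._all_fields = set(all_fields)  # columns that exist in the file

    def __getitem__(self, name):
        if name in self._loaded:
            return self._loaded[name]
        if name in self._all_fields:
            # The column exists but was pruned: fail loudly, with a hint.
            raise UntouchedFieldError(
                f"field {name!r} exists but was not loaded; "
                "was it touched during the typetracer pass?"
            )
        raise KeyError(name)


# Only "x" was read, mirroring the repro where "y" is never touched.
part = PartialRecord({"x": [1.0, 2.0]}, all_fields=["x", "y"])
try:
    part["x"], part["y"]
    message = None
except UntouchedFieldError as err:
    message = err.args[0]
print(message)
```

The point of the sketch is only that the missing column is detected at access time and reported by name, instead of being read as if it were valid data.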

In the longer term, it would also be nice if the touching of collection inputs could be handled automatically.
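One way to picture automatic touching is a proxy that records every field a function reads, so that the required columns fall out of running the function once. The sketch below is a pure-Python toy under that assumption; `FieldRecorder` and `_Placeholder` are hypothetical names and this is not how dask-awkward's typetracer is implemented.

```python
class _Placeholder:
    """Stands in for column data; supports just enough arithmetic for the demo."""

    def __add__(self, other):
        return _Placeholder()

    __radd__ = __add__


class FieldRecorder:
    """Toy proxy that records which fields a function accesses."""

    def __init__(self, fields):
        self._fields = set(fields)
        self.touched = set()

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. candidate field names.
        if name.startswith("_") or name not in self._fields:
            raise AttributeError(name)
        self.touched.add(name)
        return _Placeholder()


def ak_f(arr):
    return arr.x + arr.y


recorder = FieldRecorder(["x", "y", "z"])
ak_f(recorder)
print(sorted(recorder.touched))  # ['x', 'y']
```

This only works when the function can run on the proxy itself. A function that needs concrete buffers cannot be traced this way, and its inputs would still have to be touched explicitly.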

@yimuchen changed the title from "Ungraceful failing when data isn't loaded" to "Ungraceful failing when data isn't properly touched" on Apr 26, 2023

@lgray
Collaborator

lgray commented Apr 28, 2023

Just to comment on this - the functions being called in the real version of this example are ML frameworks where we are forced to go through length_zero_arrays in order to pass through the ML framework and then reassemble the output.

I just want to make sure that no one gets hung up on the simplicity of the function being called in the repro (which could be handled by typetracing automatically).

@martindurant
Collaborator

I can confirm that I see this issue on the current main branch of dask-awkward.

@douglasdavis
Collaborator

Is this still an issue? Things have evolved a lot w.r.t. typetracer arrays and data touching.

@yimuchen
Author

This is still an issue for dask_awkward==2023.10.1 and awkward==2.4.6.

@lgray
Collaborator

lgray commented Oct 24, 2023

I think this needs to be recast in some of the more modern idioms (length_zero_if_typetracer and such). That should hide some of the more touchy bits of the interface and ensure consistency.

You can now achieve the same thing with the code below, with no need for a wrapper:

import awkward
import dask_awkward
import numpy

def ak_f(arr):
    if isinstance(arr, dask_awkward.Array):
        return arr.x + arr.y
    return awkward.typetracer.length_zero_if_typetracer(arr.x) + awkward.typetracer.length_zero_if_typetracer(arr.y)

ak_arr = awkward.zip(
    {
        "x": numpy.random.random(size=(10, 10)),
        "y": numpy.random.random(size=(10, 10)),
    }
)

print(ak_f(ak_arr))

awkward.to_parquet(ak_arr, "test.parquet")
dak_arr = dask_awkward.from_parquet("test.parquet")  # Making a lazy instance
print(ak_f(dak_arr).compute())

@lgray
Collaborator

lgray commented Oct 24, 2023

I think the segfault when data for an array isn't available is a lower level awkward issue. We should handle it more gracefully there?

@agoose77
Collaborator

agoose77 commented Mar 6, 2024

Can we close this?
