Preprocessing files with zero events #1140

Open
alexander-held opened this issue Jul 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@alexander-held
Contributor

Describe the bug
When running preprocessing over files containing zero events, num_entries will be 0. When step_size is None, target_step_size falls back to num_entries and is therefore also 0, so a division in the step size determination fails with ZeroDivisionError here:

n_steps_target = max(round(num_entries / target_step_size), 1)
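The arithmetic can be checked without any ROOT file. Below is a minimal sketch, assuming a hypothetical helper `uniform_steps` that copies only the non-aligned branch of the step calculation quoted in this issue; it is not coffea's actual function.

```python
import math


def uniform_steps(num_entries, step_size=None):
    # Hypothetical helper: mirrors the non-aligned step calculation quoted in this issue.
    target_step_size = num_entries if step_size is None else step_size
    # With num_entries == 0 and step_size None this evaluates 0 / 0.
    n_steps_target = max(round(num_entries / target_step_size), 1)
    actual_step_size = math.ceil(num_entries / n_steps_target)
    return [
        [i * actual_step_size, min((i + 1) * actual_step_size, num_entries)]
        for i in range(n_steps_target)
    ]


def fails_with_zero_division(num_entries, step_size=None):
    # Report whether the step calculation raises ZeroDivisionError.
    try:
        uniform_steps(num_entries, step_size)
    except ZeroDivisionError:
        return True
    return False
```

In this sketch the failure only appears when step_size is None; with an explicit step size the division is 0 / step_size and an empty file yields a single [0, 0] step.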

I assume the solution would be to bypass all of

form_json = None
form_hash = None
if save_form:
    form_str = uproot.dask(
        tree,
        ak_add_doc=True,
        filter_name=no_filter,
        filter_typename=no_filter,
        filter_branch=partial(_remove_not_interpretable, emit_warning=False),
    ).layout.form.to_json()
    # the function cache needs to be popped if present to prevent memory growth
    if hasattr(dask.base, "function_cache"):
        dask.base.function_cache.popitem()
    form_hash = hashlib.md5(form_str.encode("utf-8")).hexdigest()
    form_json = compress_form(form_str)
target_step_size = num_entries if step_size is None else step_size
file_uuid = str(the_file.file.uuid)
out_uuid = arg.uuid
out_steps = arg.steps
if out_uuid != file_uuid or recalculate_steps:
    if align_clusters:
        clusters = tree.common_entry_offsets()
        out = [0]
        for c in clusters:
            if c >= out[-1] + target_step_size:
                out.append(c)
        if clusters[-1] != out[-1]:
            out.append(clusters[-1])
        out = numpy.array(out, dtype="int64")
        out = numpy.stack((out[:-1], out[1:]), axis=1)
        step_mask = (
            out[:, 1] - out[:, 0]
            > (1 + step_size_safety_factor) * target_step_size
        )
        if numpy.any(step_mask):
            warnings.warn(
                f"In file {arg.file}, steps: {out[step_mask]} with align_cluster=True are "
                f"{step_size_safety_factor*100:.0f}% larger than target "
                f"step size: {target_step_size}!"
            )
    else:
        n_steps_target = max(round(num_entries / target_step_size), 1)
        actual_step_size = math.ceil(num_entries / n_steps_target)
        out = numpy.array(
            [
                [
                    i * actual_step_size,
                    min((i + 1) * actual_step_size, num_entries),
                ]
                for i in range(n_steps_target)
            ],
            dtype="int64",
        )
    out_uuid = file_uuid
    out_steps = out.tolist()
if out_steps is not None and len(out_steps) == 0:
    out_steps = [[0, 0]]
array.append(
    {
        "file": arg.file,
        "object_path": arg.object_path,
        "steps": out_steps,
        "num_entries": num_entries,
        "uuid": out_uuid,
        "form": form_json,
        "form_hash_md5": form_hash,
    }
)
in this case and instead run
if len(array) == 0:
    array = awkward.Array(
        [
            {
                "file": "junk",
                "object_path": "junk",
                "steps": [[0, 0]],
                "num_entries": 0,
                "uuid": "junk",
                "form": "junk",
                "form_hash_md5": "junk",
            },
            None,
        ]
    )
    array = awkward.Array(array.layout.form.length_zero_array(highlevel=False))

and return.

To Reproduce

import uproot
import awkward as ak
from coffea import dataset_tools

arr = ak.Array([])
with uproot.recreate("f.root") as f:
    f["tree"] = {"arr": arr}

fileset = {"dummy": {"files": {"f.root": "tree"}}}
dataset_tools.preprocess(fileset)

The actual case where I saw this happen contained a tree without any arrays inside. I'm not sure if uproot supports writing that for a reproducer (I don't think so?) but in either case I think only the .num_entries property matters so the tree content should be irrelevant here.

Expected behavior
Empty files should be handled without a crash.

Output

  File "[...]/lib/python3.11/site-packages/coffea/dataset_tools/preprocess.py", line 127, in get_steps
    n_steps_target = max(round(num_entries / target_step_size), 1)
                               ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero

Desktop (please complete the following information):
n/a

Additional context
cc @sebastien-rettie

@alexander-held added the bug label on Jul 29, 2024
@alexander-held
Contributor Author

I'm happy to PR if the solution makes sense to you.

@sebastien-rettie

Hi @alexander-held, thanks a lot! This makes sense to me.

@alexander-held
Contributor Author

I had a brief look and believe the solution could be this

diff --git a/src/coffea/dataset_tools/preprocess.py b/src/coffea/dataset_tools/preprocess.py
index cef72ad5..69ffe3fa 100644
--- a/src/coffea/dataset_tools/preprocess.py
+++ b/src/coffea/dataset_tools/preprocess.py
@@ -76,6 +76,9 @@ def get_steps(

         num_entries = tree.num_entries

+        if num_entries == 0:
+            continue
+
         form_json = None
         form_hash = None
         if save_form:
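
To illustrate what the early `continue` does to the per-file loop, here is a minimal stand-alone sketch. The function `get_steps_sketch` and its `(file_name, num_entries)` input pairs are hypothetical stand-ins for the opened trees, and only the non-aligned step calculation is reproduced.

```python
import math


def get_steps_sketch(files, step_size=None):
    """files: list of (file_name, num_entries) pairs, stand-ins for opened trees."""
    array = []
    for file_name, num_entries in files:
        if num_entries == 0:
            # Same guard as in the proposed diff: skip before any division happens.
            continue
        target_step_size = num_entries if step_size is None else step_size
        n_steps_target = max(round(num_entries / target_step_size), 1)
        actual_step_size = math.ceil(num_entries / n_steps_target)
        steps = [
            [i * actual_step_size, min((i + 1) * actual_step_size, num_entries)]
            for i in range(n_steps_target)
        ]
        array.append({"file": file_name, "steps": steps, "num_entries": num_entries})
    return array


mixed = get_steps_sketch([("empty.root", 0), ("data.root", 10)])
only_empty = get_steps_sketch([("empty.root", 0)])
```

If every file is empty the list stays empty, which is the situation the existing `if len(array) == 0:` fallback quoted earlier in this issue is written for.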

@lgray
Collaborator

lgray commented Aug 1, 2024

Please make a PR with this change. We should make this configurable by the user (the continue behavior should be on by default).
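
One way the configurable behavior could look, as a rough sketch only: the flag name `skip_empty_files` is hypothetical and not an existing coffea option, and recording a single [0, 0] step when skipping is disabled is an assumption, chosen to match the existing empty-steps fallback in the quoted code.

```python
import math


def preprocess_sketch(files, step_size=None, skip_empty_files=True):
    """files: list of (file_name, num_entries) pairs, stand-ins for opened trees.

    skip_empty_files is a hypothetical flag; True reproduces the `continue` behavior.
    """
    records = []
    for file_name, num_entries in files:
        if num_entries == 0:
            if skip_empty_files:
                continue
            # Assumption: keep the file in the output with a single empty step.
            steps = [[0, 0]]
        else:
            target_step_size = num_entries if step_size is None else step_size
            n_steps_target = max(round(num_entries / target_step_size), 1)
            actual_step_size = math.ceil(num_entries / n_steps_target)
            steps = [
                [i * actual_step_size, min((i + 1) * actual_step_size, num_entries)]
                for i in range(n_steps_target)
            ]
        records.append({"file": file_name, "steps": steps, "num_entries": num_entries})
    return records


files = [("empty.root", 0), ("data.root", 10)]
skipped = preprocess_sketch(files)
kept = preprocess_sketch(files, skip_empty_files=False)
```

Either branch avoids the 0 / 0 division, so the choice is only about whether empty files should still appear in the preprocessed fileset.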

@lgray
Collaborator

lgray commented Aug 26, 2024

No one ever made a PR for this.
