Commit

Merge pull request columnflow#563 from haddadanas/selection_in_hists
Enhance histogram task with selection handling
riga authored Nov 27, 2024
2 parents 5cc3a14 + 55345ff commit 50db85d
Showing 2 changed files with 77 additions and 7 deletions.
32 changes: 25 additions & 7 deletions columnflow/tasks/histograms.py
@@ -148,12 +148,17 @@ def run(self):
  self.config_inst.get_variable(var_name)
  for var_name in law.util.flatten(self.variable_tuples.values())
  )
- for inp in (
+ for inp in ((
  [variable_inst.expression]
  if isinstance(variable_inst.expression, str)
  # for variable_inst with custom expressions, read columns declared via aux key
  else variable_inst.x("inputs", [])
- )
+ ) + (
+ # for variable_inst with selection, read columns declared via aux key
+ variable_inst.x("inputs", [])
+ if variable_inst.selection != "1"
+ else []
+ ))
  }

  # empty float array to use when input files have no entries
@@ -221,18 +226,31 @@ def run(self):
  # enable weights and store it
  histograms[var_key] = h.Weight()

+ # mask events and weights when selection expressions are found
+ masked_events = events
+ masked_weights = weight
+ for variable_inst in variable_insts:
+ sel = variable_inst.selection
+ if sel == "1":
+ continue
+ if not callable(sel):
+ raise ValueError(f"invalid selection '{sel}', for now only callables are supported")
+ mask = sel(masked_events)
+ masked_events = masked_events[mask]
+ masked_weights = masked_weights[mask]
+
  # merge category ids
  category_ids = ak.concatenate(
- [Route(c).apply(events) for c in self.category_id_columns],
+ [Route(c).apply(masked_events) for c in self.category_id_columns],
  axis=-1,
  )

  # broadcast arrays so that each event can be filled for all its categories
  fill_data = {
  "category": category_ids,
- "process": events.process_id,
- "shift": np.ones(len(events), dtype=np.int32) * self.global_shift_inst.id,
- "weight": weight,
+ "process": masked_events.process_id,
+ "shift": np.ones(len(masked_events), dtype=np.int32) * self.global_shift_inst.id,
+ "weight": masked_weights,
  }
  for variable_inst in variable_insts:
  # prepare the expression
@@ -244,7 +262,7 @@ def expr(events, *args, **kwargs):
  return empty_f32
  return route.apply(events, null_value=variable_inst.null_value)
  # apply it
- fill_data[variable_inst.name] = expr(events)
+ fill_data[variable_inst.name] = expr(masked_events)

  # fill it
  fill_hist(
52 changes: 52 additions & 0 deletions docs/user_guide/plotting.md
@@ -363,3 +363,55 @@ An example on how to implement such a plotting function is shown in the followin
:start-at: def my_plot1d_func(
:end-at: return fig, (ax,)
```

## Applying a selection to a variable

In some cases, you might want to apply a selection to a variable before plotting it.
Instead of creating a new column with the selection applied, columnflow can apply a selection to a variable directly when histogramming it.
To do so, add the `selection` parameter to the variable definition in the config.
This may look as follows:

```python
config.add_variable(
name="hh_mass",
expression="hh.mass",
binning=(20, 250, 750.0),
selection=(lambda events: events.hh.pt > 30.0), # Select only events with a pt larger than 30 GeV
null_value=EMPTY_FLOAT, # Set the value of the variable to EMPTY_FLOAT if the selection is not passed
unit="GeV",
x_title=r"$m_{hh}$",
aux={"inputs": ["hh.pt"]}, # Add the needed selection columns to the auxiliary of the variable instance
)
```

It is important to provide the `null_value` parameter when using the `selection` parameter, as the variable will be set to this value if the selection is not passed.
The `selection` parameter only supports functions / lambda expressions for now.
The function itself can be as complex as needed, but its signature needs to match `def my_selection(events: ak.Array) -> ak.Array[bool]`: the events array is passed to the function, and the returned value is a boolean array of the same length as the input array.
The returned array is expected to be a one-dimensional mask applied at event level.
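
As a rough illustration of what an event-level mask does, the sketch below mirrors the masking loop added to `histograms.py` in this commit. It is a toy under stated assumptions: a NumPy structured array stands in for the awkward events array, and `apply_selections` as well as the field names `hh_pt` and `hh_mass` are hypothetical names that do not exist in columnflow.

```python
import numpy as np

# Toy stand-in for the events array: one record per event.
# columnflow itself operates on awkward arrays; the fields are made up.
events = np.array(
    [(20.0, 300.0), (45.0, 410.0), (80.0, 520.0), (35.0, 260.0)],
    dtype=[("hh_pt", "f8"), ("hh_mass", "f8")],
)
weights = np.array([1.0, 0.5, 2.0, 1.5])


def apply_selections(events, weights, selections):
    """Apply each selection in turn, shrinking events and weights together."""
    for sel in selections:
        # "1" denotes "no selection" and is skipped
        if isinstance(sel, str) and sel == "1":
            continue
        if not callable(sel):
            raise ValueError(f"invalid selection '{sel}', only callables are supported")
        # each callable sees the already-masked events and returns a 1D boolean mask
        mask = sel(events)
        events = events[mask]
        weights = weights[mask]
    return events, weights


selections = [
    "1",
    lambda ev: ev["hh_pt"] > 30.0,
    lambda ev: ev["hh_mass"] < 500.0,
]
sel_events, sel_weights = apply_selections(events, weights, selections)
```

Because every mask is computed on the events that survived the previous one, events and weights always stay the same length, which is what allows them to be filled into the histogram together.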

The columns used in the selection function are not automatically added to the required routes of the workflow.
For this reason, it is necessary to add them to the auxiliary data of the variable instance (the `inputs` key of `aux`) and to make sure that they have been produced by the time the histograms are created.
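
The rule for which columns get read can be sketched in plain Python, following the set comprehension changed in `histograms.py` in this commit. This is only an illustration: `ToyVariable` and `columns_to_read` are hypothetical stand-ins, not part of columnflow or order.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a variable instance (not the real class)."""
    name: str
    expression: Union[str, Callable]
    selection: Union[str, Callable] = "1"
    aux: dict = field(default_factory=dict)


def columns_to_read(variables):
    """Collect the columns that need to be loaded for a set of variables."""
    columns = set()
    for var in variables:
        if isinstance(var.expression, str):
            # a string expression names its column directly
            columns.add(var.expression)
        else:
            # custom (callable) expressions declare their columns in aux["inputs"]
            columns.update(var.aux.get("inputs", []))
        if var.selection != "1":
            # a non-trivial selection also needs the columns declared in aux["inputs"]
            columns.update(var.aux.get("inputs", []))
    return columns


hh_mass = ToyVariable(
    name="hh_mass",
    expression="hh.mass",
    selection=lambda events: events.hh.pt > 30.0,
    aux={"inputs": ["hh.pt"]},
)
jet_pt = ToyVariable(name="jet_pt", expression="Jet.pt")
needed = columns_to_read([hh_mass, jet_pt])
```

Here `hh.pt` ends up in `needed` only because it is listed under `inputs`; a column used inside the lambda but not declared there would not be loaded.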

:::{dropdown} Another example with a more complex selection on event level:

```python
def jet_selection(events):
"""select events where the sum of pt of jets with eta < 2.1 is greater than 200 GeV"""
import awkward as ak
eta_mask = events.Jet.eta < 2.1
mask = (ak.sum(events.Jet.pt[eta_mask], axis=-1) > 200)
return mask

config.add_variable(
name="jet_pt",
expression="Jet.pt",
binning=(50, 0, 300.0),
selection=jet_selection,
null_value=EMPTY_FLOAT,
unit="GeV",
x_title=r"all Jet $p_{T}$",
aux={"inputs": ["Jet.pt", "Jet.eta"]},
)
```

:::
