Skip to content

Commit

Permalink
implemented Philip's review
Browse files Browse the repository at this point in the history
  • Loading branch information
haddadanas committed Feb 14, 2024
1 parent 8cbcfa9 commit 3597ca6
Showing 1 changed file with 72 additions and 46 deletions.
118 changes: 72 additions & 46 deletions docs/user_guide/examples.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Examples
# Creating your own task

In this section, additional examples on the use of columnflow are presented.
In this section, additional examples on the usage of columnflow are presented.

## Writing new tasks

Expand All @@ -10,6 +10,7 @@ This section will help you create your first columnflow task and contains some t
Tasks in columnflow are based on {external+law:py:class}`~law.task.base.BaseTask` from the [law package](https://law.readthedocs.io/en/latest/index.html) by Marcel Rieger, which in turn is based on the [luigi package](https://luigi.readthedocs.io/en/stable/tasks.html) by Spotify.
Refer to the documentation of these packages for more information on the task class and a detailed description of the implemented methods.

(example_task_class_section)=
### Example: Task Class

```python showLineNumbers
Expand All @@ -24,32 +25,44 @@ class MyTask(AnalysisTask):

# specify sandbox for the task
sandbox = dev_sandbox(law.config.get("analysis", "my_columnar_sandbox"))

# Define a custom name space with which the task is called
# If not defined cf is usually used as default
task_namespace = "hbt"

# class parameters
param1 = luigi.Parameter(
default="default_value",
description="String parameter with default value",
)
default="default_value",
description="String parameter with default value",
)
param2 = law.CSVParameter(
default=("*",), # parsed similar to wildcard imports
description="Parameter accepting comma-seperated values."
"Default is a tuple with all knows values connected to this attribute."
)
default=("*",), # parsed similar to wildcard imports
description="Parameter accepting comma-seperated values."
"Default is a tuple with all knows values connected to this attribute."
)
param_bool = luigi.BoolParameter(
default=False,
description="Boolean parameter with default value `False`",
)

# container for upstream requirements for convenience
reqs = Requirements()

def requires(self):
# Branch Dependencies on other tasks
return [SomeOtherTask]
# If needed get the requirements from super class (usually defined as dict)
req = super().requires()
# Add branch Dependencies on other tasks
req["SomeOtherTask"] = SomeOtherTask

return req

def workflow_requires(self):
# Workflow dependencies on other workflows needed if pilot is not activated
reqs = super().workflow_requires()
if not self.pilot:
reqs["NewRequirement"] = SomeOtherTask

return
return reqs

def output(self):
# Path to the output file if needed
Expand All @@ -60,9 +73,9 @@ class MyTask(AnalysisTask):
pass
```

### The "super" class:
### The "super" class
Every task is defined by a Python class, which inherits from a task base or another fully functional task.
The most basic base in the framework is {py:class}`~columnflow.tasks.framework.base.Task`.
The most basic base in the framework is {py:class}`~columnflow.tasks.framework.base.BaseTask`.
However, for tasks implemented in an analysis, it is highly recommended to inherit from one of the following derived subclasses:
- {py:class}`~columnflow.tasks.framework.base.AnalysisTask`
- {py:class}`~columnflow.tasks.framework.base.ConfigTask`
Expand All @@ -80,23 +93,23 @@ For example, if you have a task that needs a calibrator or depends on a task tha
After making the choice of the superclass(es), the class is implemented at its core by the following methods (refer to the [luigi task documentation](https://luigi.readthedocs.io/en/stable/tasks.html) for more information on the methods):

- {external+law:py:meth}`~law.task.base.BaseTask.run()`: This method contains the logic for executing the task.
- `requires()`: This method returns a collection of tasks that this task depends on.
- `output()`: This method returns a collection of output targets for the task.

### The sandbox:
- {external+law:py:meth}`~law.tasks.ForestMerge.requires()`: This method returns a collection of tasks that this task depends on.
- {external+law:py:meth}`~law.task.base.WrapperTask.output()`: This method returns a collection of output targets for the task.

#TODO cross reference to Bogdan's documentation
### The sandbox

Should your task require a special software enviroment (specific version of python, special packages, or other specific software), it is recommended to define a custom sandbox for this task.
Should your task require a special software enviroment (special packages, or other specific software), it is recommended to define a custom sandbox for this task.
For the task to use the sandbox, it is to be specified using the class attribute ```sandbox```, which is inherited from {external+law:py:class}`~law.sandbox.base.Sandbox`.
The sandbox can be defined in the task class itself using the syntax shown above, elsewise the default sandbox defined in the analysis config file is used.

More details on using and setting up the sandboxed are givin in the {doc}`sandbox section <sandbox>`.

### The parameters:

### The parameters

Like the arguments of a function, the parameters of a task are the variables that are passed to the task when it is run from the command line (``law run MyTask --param1 foo``), making the task very flexible.
The parameters are defined as class attributes and are identified as such by the class method `get_params()`.
The parameters returned by this method are parsed from the command line by the class method `get_param_values(params, args, kwargs)`.
The parameters returned by this method are parsed from the command line by the class method `get_param_values(params, *args, **kwargs)`.
If a parameter is not provided in the command line, the default value is used.
Finally, the parameters are initialized in the `__init__()` method of the task with the task object itself and can be accessed across all methods of its instance.

Expand All @@ -107,41 +120,45 @@ If needed, one can also define custom parameters.
However, it is important to inherit the parameter class from the package used for the task (luigi or law).
Otherwise, the parameter will not work correctly.

### The requirements:
### The requirements

#TODO explain branches if not explained in law
<!-- #TODO explain branches if not explained in law -->

As noticed in the [code snipped](#example-task-class), a task can define three different requirement keywords, which have different functionality in the framework.
As noticed in the {ref}`code snipped <example_task_class_section>`, a task can define three different requirement keywords, which have different functionality in the framework.

The ```requires()``` methode in a task, which is inherited from {external+law:py:class}`law.task.base.Task`, defines all required outputs for a task to run on a branch level.
The ```requires()``` method in a task, which is inherited from {external+law:py:class}`law.task.base.Task`, defines all required outputs for a task to run on a branch level.
If an output does not exist, the task responsible for this output will be scheduled and executed before the desired task is run.

The {external+law:py:meth}`~law.tasks.ForestMerge.workflow_requires()` is analog to the former methode, however it defines the dependencies on a workflow level, which can be very useful, when working remotely and running jobs on remote clusters.
The {external+law:py:meth}`~law.tasks.ForestMerge.workflow_requires()` is analog to the former method.
However, it defines the dependencies on a workflow level, which can be very useful when working remotely and running jobs on remote clusters.
Every task defines a workflow, which contains all available branches of this task.
When executing the workflow, the task will run in all (specified) branches.
If the desired workflow requires the output of another workflow, the latter will be executed beforehand, which means that all needed branches (with no existing output) from the required workflow will be executed before any of the branches from the current workflow runs (Workflow runs vertically).
This behaviour can be modified, by deactivating the ```pilot``` parameter.
In this case, the workflow first run through the branches of the executed task after each other, where all dependencies of each branch are executed when needed by the branch (workflow propagates horizontally).
This is very useful, when parallelising jobs on remote clusters, since multiple branches can run simultaneously, without the need to run all branches from one task beforehand.

Last, the class attribute ```reqs```, that is a {py:class}`~columnflow.tasks.framework.base.Requirements` object, does not directly have to do with the other two requirements, but is rather for analyses specific configuration.
Strictly speaking, it is not needed for the task to run, however it offers a container for the requirements, which becomes very convenient, when for example a required task should be replaced by a customized subclass of the same task.
Finally, the class attribute ```reqs```, that is a {py:class}`~columnflow.tasks.framework.base.Requirements` object, does not directly have to do with the other two requirements, but is rather for analysis-specific configuration.
Strictly speaking, it is not needed for the task to run.
However it offers a container for the requirements, which becomes very convenient when for example a required task should be replaced by a customized subclass of the same task.
Instead of having to edit the requirements in all the class methods that depends on this requirement, the ```reqs``` attribute can be simply modified with:
```python
MyTask.reqs.RequiredTask = ModifiedRequiredTask
```
Therefore, using the class attribute ```reqs```, when working with the requirements within the class is conveninet.
Therefore, using the class attribute ```reqs```, when working with the requirements within the class is convenient.


![req plot](cf_plot.svg "image")
:::{figure} ../plots/user_guide/cf_plot.pdf
:width: 100%
:::

### RunOnceTask

The {external+law:py:class}`~law.tasks.RunOnceTask` is a special task that is used to run a task only once, which can be very useful in many situations.
For example, when a task is required by multiple tasks or multiple branches of a workflow, but the task itself should only be executed once independent of the branches, or a task does not produce any file output but is needed to be marked as done (e.g. a task only produces text output on the CLI).

This is implemented by (additionally) inheriting from {external+law:py:class}`~law.tasks.RunOnceTask`.
Moreover, to mark the task as complete, the {external=law:py:meth}`~law.tasks.RunOnceTask.mark_complete()` method is to be added at the end of the ```run()``` method after the logic of the task is executed, or the complete```run()``` method is either to be wrapped by the {external=law:py:meth}`~law.tasks.RunOnceTask.complete_on_success` decorator.
Moreover, to mark the task as complete, the {external+law:py:meth}`~law.tasks.RunOnceTask.mark_complete()` method is to be added at the end of the ```run()``` method after the logic of the task is executed, or the complete ```run()``` method is either to be wrapped by the {external+law:py:meth}`~law.tasks.RunOnceTask.complete_on_success` decorator.
One must be careful with the latter, since the decorator will mark the task as complete even if it fails.
Therefore, it is recommended to also add a check at the end and raise an error in case the task fails.

Expand All @@ -166,38 +183,40 @@ class MyPrintingTask(AnalysisTask, law.tasks.RunOnceTask):

### Add the task to the analysis

After creating a new task, make sure to add the new task to the law config file in your analysis folder, in order to run the task.
The new task has to be specified under the ```[outputs]``` section as follows:
After creating a new task, the law config file in your analysis folder may need to be adjusted, in order to run the task.
If not already there, the folder of the new task has not been added to the ```[modules]``` section of the law config file (using python dot path notation).
If the task produces any output, the new task has to be specified under the ```[outputs]``` section together with the saving location as follows:

```ini
[modules]

path.to.the.tasks.folder

[outputs]

lfn_sources: wlcg_fs_desy_store

cf.MyNewTask: wlcg
hbt.MyNewTask: wlcg
...
```
### Executing the Task

after sourcing the law environment, the task can be executed using the command:
After new sourcing the law environment or by running `law index` in a setup enviroment, the task can be executed using the command:
```bash
law run cf.MyTask --param1 foo
```
The parameters can be given using the syntax ```--param1 foo```, where ```param1``` is the name of the parameter and ```foo``` is the value of the parameter.
If the parameter's name include an underscore, the underscore is replaced by a hyphen (```--param_1 foo``` -> ```--param-1 foo```).
If the parameter is a boolean, the parameter can be set to ```True``` by only typing ```--param-boolean```.
If the parameter is a boolean, the parameter can be set to ```True``` by only typing ```--param-bool```.
If the parameter is a list, the values can be given as a comma-separated list (```--param2 foo,bar,baz```).

## Custom law.cfg file for users without grid certificate (CMS specific?)
## Custom law.cfg file for users without grid certificate (CMS specific)

Users, who do not have a grid certificate, are not able to run jobs on the grid or store and fetch data from it.
However, they can still run jobs locally.
For this purpose, a custom law config file should be created in the analysis folder, which specifies a local dcache and the path to the nanoAoDs for every required campaing, and set the output locations for the tasks to local.
It is recommended to create a copy of the default law config file and modify it as needed and then redirect the setup to the custom config file by setting the ```LAW_CONFIG_FILE``` environment variable in the `setup.sh` file to the path of the custom config file as follows:

```bash
export LAW_CONFIG_FILE="${LAW_CONFIG_FILE:-${ANALYSIS_DIR}/law.nocert.cfg}"
```
For this purpose, a custom law config file should be created in the analysis folder, which specifies a local dcache and the path to the nanoAODs for every required campaign, and set the output locations for the tasks to local.
It is recommended to apply modifications on the config file in a new custom law config file and not directly in the default one.
The custom config should inherit from the default config (as shown below) and can be modify as needed by adding or overloading fields.

The costum config file should contain the following sections:
```ini
Expand All @@ -207,7 +226,7 @@ inherit: $HBT_BASE/law.cfg

[local_dcache]
# define path to the local dcache used later for the lfn (Local File Name) sources
base: path/to/local/dcache
base: path/to/different/file/system

[local_campaigns_name]
base: path/to/local/campaigns/nanoAOD/directory
Expand All @@ -233,7 +252,7 @@ selection_modules: analysis.selection.{python_module_name}

[outputs]

lfn_sources: local_desy_dcache
lfn_sources: local_dcache

# output locations per task family
# for local targets : "local[, STORE_PATH]"
Expand All @@ -245,4 +264,11 @@ cf.Task3: /shared/path/to/store/output

```

It is important to redirect the setup to the custom config file by setting the ```LAW_CONFIG_FILE``` environment variable in the `setup.sh` file to the path of the custom config file as follows:

```bash
export LAW_CONFIG_FILE="${LAW_CONFIG_FILE:-${ANALYSIS_DIR}/law.nocert.cfg}"
```

For more details, refer to the [law config documentation](https://law.readthedocs.io/en/latest/config.html) for more information on the law config file.
A template for the custom [config file](https://raw.githubusercontent.com/uhh-cms/hh2bbtautau/dev/law.nocert.cfg) can be found under the dev branch in the repository of the [hh2bbtautau analysis](https://github.com/uhh-cms/hh2bbtautau).

0 comments on commit 3597ca6

Please sign in to comment.