Support postprocessing of instances #116

Open
wants to merge 7 commits into
base: master
31 changes: 28 additions & 3 deletions docs/at_variables_reference.rst
@@ -11,8 +11,6 @@ e.g. ``experiment.py /path/to/instance1``, ``experiment.py /path/to/instance2``,
``@INSTANCE@`` variable that resolves to the respective paths of the instances. Then, we only need to
specify ``experiments.py @INSTANCE@`` as experiment arguments.
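
As a rough illustration of how such a variable gets resolved, the sketch below replaces each ``@NAME@`` token in an argument list with a value from a dictionary. It is not simexpal's actual implementation: the helper name ``substitute_at_variables``, the regular expression and the example path are assumptions made for this example only.

```python
import re

# Matches tokens such as @INSTANCE@ or @INSTANCE:ext@ (illustrative pattern only).
AT_VARIABLE = re.compile(r'@([A-Z_]+(?::[^@]+)?)@')


def substitute_at_variables(args, variables):
    """Return a copy of args with every known @-variable replaced.

    Hypothetical helper: unknown variables are left untouched.
    """
    def replace(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return [AT_VARIABLE.sub(replace, arg) for arg in args]


# One experiment per instance: only the value bound to INSTANCE changes.
resolved = substitute_at_variables(
    ['experiment.py', '@INSTANCE@'],
    {'INSTANCE': '/path/to/instance1'},
)
print(resolved)
```

Leaving unknown tokens untouched is a design choice of this sketch; a real tool might prefer to raise an error instead.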

..
TODO: Add section Instances for generators and INSTANCE_FILENAME variable

Below, we list all @-variables and where they can be used.

@@ -21,9 +19,9 @@ Below, we list all @-variables and where they can be used.
- ``@EXTRA_ARGS@``: extra arguments of all variants and the instance of an experiment
- ``@INSTANCE@``: path of a :ref:`local <LocalInstances>`/:ref:`remote <RemoteInstances>` instance, i.e. ``/instance_directory/<instance_name>``
- ``@INSTANCE_DIR@``: path of the :ref:`InstanceDirectory`
- ``@INSTANCE_FILENAME@``: filename of the instance
- ``@INSTANCE:<ext>@``: path of a :ref:`MultipleExtensions` instance with extension ``<ext>``, i.e. ``/instance_directory/<instance_name>.<ext>``
- ``@INSTANCE:<idx>@``: path of an :ref:`ArbitraryInputFiles` instance with index ``<idx>`` in the ``files`` key, i.e. ``/instance_directory/files[<idx>]``
- ``@INSTANCE_FILENAME@``: filename of the instance
- ``@OUTPUT@``: path to the output file of an experiment
- ``@OUTPUT:<ext>@``: path to the output file with extension ``<ext>`` of an experiment
- ``@OUTPUT_SUBDIR@``: output subdirectory of the experiment where the output and status files are stored, i.e. ``/path_to_experiments_yml/output/``
@@ -148,6 +146,26 @@ Same as for the :ref:`AtVariablesExperimentsArgs` key `without` the ``@EXTRA_ARG
Instances
---------

.. _AtVariablesInstanceArgs:

args
^^^^

The following @-variables can be used in the ``args`` key:


- ``@BASE_DIR@``
- ``@INSTANCE_DIR@``
- ``@INSTANCE@``
- ``@INSTANCE:<ext>@``
- ``@INSTANCE:<idx>@``

environ
^^^^^^^

The values of the ``environ`` key will be substituted and the @-variables are the same as for
the :ref:`AtVariablesInstanceArgs` key.

extra_args
^^^^^^^^^^

@@ -170,6 +188,12 @@ The following @-variables can be used in the ``url`` key:

- ``@INSTANCE_FILENAME@``

workdir
^^^^^^^

Same as for the :ref:`AtVariablesInstanceArgs` key.


Variants
--------

@@ -197,3 +221,4 @@ procs_per_node
^^^^^^^^^^^^^^

Same as for the :ref:`experiments args <AtVariablesExperimentsArgs>` key `without` the ``@EXTRA_ARGS@`` variable.

4 changes: 4 additions & 0 deletions docs/experiments_reference.rst
@@ -22,12 +22,16 @@ Instances
This entry is a list of instances that will be used for experiments. The following keys are
used for specifying instances:

- ``args``: list of postprocessing arguments
- ``environ``: dictionary of (environment variable, value)-pairs
- ``extensions``: list of extensions that the instance has
- ``files``: list of files the instance consists of
- ``items``: list of instances
- ``name``: name of the instance (used when dealing with instances that consist of unrelated files)
- ``postprocess``: list or string of postprocessing arguments
- ``repo``: source of instances
- ``set``: list of sets the instance belongs to
- ``workdir``: path of the working directory

For detailed usage examples, see the :ref:`Instances` page.

109 changes: 106 additions & 3 deletions docs/instances.rst
@@ -13,7 +13,7 @@ list local instances that consist of zero or more files. More over simexpal can
remote instances from the `SNAP <https://snap.stanford.edu/data/>`_ repository, Git repositories
and arbitrary URLs. It is also possible to assign instances to instance sets that enable a more
efficient usage of the :ref:`command line interface <CommandLineReference>` and are useful when
defining the run matrix.
defining the run matrix. Furthermore, you can add extra arguments to instances and postprocess them.

.. _InstanceDirectory:

@@ -71,8 +71,9 @@ to download the instances into the instance directory.
.. note::
1st December 2020: It is no longer possible to automatically download `KONECT <http://konect.cc>`_
instances as the website is no longer publicly available. It is still possible to list them and
execute supported actions, e.g, transforming the instances to edgelist format via
``simex instances run-transform --transform='to_edgelist'`` if you already have them saved locally.
execute supported actions, e.g., transforming the instances to edge list format via
``simex instances run-transform --transform='to_edgelist'`` or :ref:`postprocess <PostprocessInstances>`
them if you already have them saved locally.
Review comment (Contributor):
I would move "if you already have them saved locally" to the beginning of the sentence. Otherwise it currently reads as if postprocessing were only available when the instances are locally available (while the transform works all the time).


Instances From SNAP
^^^^^^^^^^^^^^^^^^^
@@ -467,6 +468,108 @@ and ``set2``, which contains ``instance2`` and ``instance3``.
Instance sets will also be useful when using the :ref:`command line interface <CommandLineReference>` of
simexpal and when defining the :ref:`RunMatrix`.

.. _PostprocessInstances:

Postprocessing
--------------

There might be cases where you need to process the instances after installing or downloading them, before they
are ready to be used in the experiments. In order to do so, you can use the

- ``postprocess``: list or string of postprocessing arguments

key. Afterwards, you can install and postprocess the instances by calling

.. code-block:: bash

   $ simex instances install

in the terminal.

Before postprocessing an instance, simexpal copies the contents of each file belonging to an instance into separate
``<filename>.original`` files. After postprocessing an instance, simexpal creates an ``<instance_name>.postprocessed``
file, signalling the successful postprocessing of an instance. If an error occurs during the postprocessing of an
instance, the original instance files will be restored and the postprocessing will be skipped.
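
The backup, marker and restore behaviour described above can be sketched roughly as follows. This is a simplified illustration, not simexpal's implementation: it assumes an instance that consists of a single file, and the function ``postprocess_with_backup`` as well as the two example steps are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def postprocess_with_backup(instance_file, step):
    """Run step(path) on a file, keeping a .original copy and writing a marker.

    Hypothetical helper; returns True if the instance is postprocessed.
    """
    path = Path(instance_file)
    original = path.with_name(path.name + '.original')
    marker = path.with_name(path.name + '.postprocessed')

    if marker.exists():
        return True  # already postprocessed, nothing to do

    shutil.copyfile(path, original)  # back up the contents first
    try:
        step(path)
    except Exception:
        shutil.copyfile(original, path)  # restore the original contents
        return False
    marker.touch()  # signal successful postprocessing
    return True


def uppercase(path):
    # Example step that succeeds.
    path.write_text(path.read_text().upper())


def failing(path):
    # Example step that damages the file and then fails.
    path.write_text('garbage')
    raise RuntimeError('postprocessing failed')


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'instance1'
    good.write_text('a b\n')
    ok = postprocess_with_backup(good, uppercase)
    good_text = good.read_text()
    good_marker = (Path(tmp) / 'instance1.postprocessed').exists()

    bad = Path(tmp) / 'instance2'
    bad.write_text('c d\n')
    failed_ok = postprocess_with_backup(bad, failing)
    bad_text = bad.read_text()
    bad_marker = (Path(tmp) / 'instance2.postprocessed').exists()
```

In this sketch the marker file is only written after the step returns without an exception, so a failed step leaves the instance in its original state and without a marker.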

Arbitrary Postprocessing
^^^^^^^^^^^^^^^^^^^^^^^^

You can define arbitrary postprocessing steps by setting the ``postprocess`` key to a list of dictionaries
containing the
Review comment (Contributor):
... to a list of dictionaries containing the keys:


- ``args``: list of postprocessing arguments
- ``environ``: dictionary of (environment variable, value)-pairs
- ``workdir``: path of the working directory

keys.
Review comment (Contributor):
Delete this line then.


Assume you want to postprocess the ``facebook_combined`` and ``cit-HepTh`` networks from
`SNAP <https://snap.stanford.edu/data/>`_ using two executables ``postprocess1`` and ``postprocess2``, which
take the path of the instance as a parameter. Also, you have to prepend the path for ``postprocess1`` to the
Review comment (Contributor):
Why does only postprocess1 need the PATH variable?

Review comment (Collaborator Author):
This is just to demonstrate the possibility/necessity of adding the appropriate PATH variable in simexpal in order to find the executable for the postprocessing. There might be cases where the PATH was added beforehand, so that it isn't necessary to do so anymore.

``PATH`` environment variable. Then, your ``experiments.yml`` file could look as follows:

.. code-block:: YAML
   :linenos:
   :caption: How to arbitrarily postprocess instances in the experiments.yml file.

   instances:
     - repo: snap
       items:
         - 'facebook_combined'
         - 'cit-HepTh'
       postprocess:
         - args: ['postprocess1', '@INSTANCE@']
           environ:
             'PATH': '/path/to/postprocess1'
         - args: ['postprocess2', '@INSTANCE@']

When executing the postprocessing arguments, the :ref:`@-variable <AtVariables>` ``@INSTANCE@`` will resolve
to the respective path of the instances. For instances with :ref:`MultipleExtensions` or
:ref:`ArbitraryInputFiles`, use the @-variables ``@INSTANCE:<ext>@`` and ``@INSTANCE:<idx>@`` respectively.
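
To make the resolution of these variables more concrete, here is a minimal sketch that builds the variable table for one instance and prepares the command line and environment of a postprocessing step. It is not simexpal's code: the helpers ``instance_variables``, ``resolve`` and ``step_environment`` are hypothetical, ``<idx>`` is assumed to be a zero-based index into ``files``, and prepending the ``PATH`` entry to the inherited value is an assumption of this example.

```python
import os


def instance_variables(instance_dir, name, extensions=(), files=()):
    """Map @-variable names to paths for one instance (illustrative only)."""
    variables = {'INSTANCE': f'{instance_dir}/{name}'}
    # @INSTANCE:<ext>@ -> /instance_directory/<instance_name>.<ext>
    for ext in extensions:
        variables[f'INSTANCE:{ext}'] = f'{instance_dir}/{name}.{ext}'
    # @INSTANCE:<idx>@ -> /instance_directory/files[<idx>]
    # (assumes zero-based indices into the ``files`` list)
    for idx, filename in enumerate(files):
        variables[f'INSTANCE:{idx}'] = f'{instance_dir}/{filename}'
    return variables


def resolve(args, variables):
    """Replace every @NAME@ token in args by its value."""
    resolved = []
    for arg in args:
        for name, value in variables.items():
            arg = arg.replace(f'@{name}@', value)
        resolved.append(arg)
    return resolved


def step_environment(environ, base):
    """Copy base and apply a step's environ; PATH is prepended (assumption)."""
    env = dict(base)
    for name, value in environ.items():
        if name == 'PATH' and env.get('PATH'):
            env['PATH'] = value + os.pathsep + env['PATH']
        else:
            env[name] = value
    return env


variables = instance_variables('/instance_directory', 'graph',
                               extensions=['nodes', 'edges'])
argv = resolve(['postprocess1', '@INSTANCE:nodes@', '@INSTANCE:edges@'], variables)
env = step_environment({'PATH': '/path/to/postprocess1'}, base={'PATH': '/usr/bin'})
print(argv)
# A runner could now call subprocess.run(argv, env=env, check=True).
```

The instance name ``graph`` and its extensions are made up for the example; the paths follow the patterns listed in the @-variables reference.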

.. warning::
   Make sure to use the :ref:`AtVariables` to access paths of instance files and to maintain the names and
   locations of each file belonging to an instance (as passed by the @-variable) after every postprocessing step.
   Simexpal temporarily renames instance files while postprocessing them. Manually renaming instance files might
   break the postprocessing.

Converting to Edge List Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To convert instances from `SNAP <https://snap.stanford.edu/data/>`_ or `KONECT <http://konect.cc>`_, we can set
``postprocess: to_edgelist`` as follows:

.. code-block:: YAML
   :linenos:
   :caption: How to convert SNAP/KONECT instances to edge list format in the experiments.yml file.

   instances:
     - repo: snap
       items:
         - facebook_combined
         - cit-HepTh
       postprocess: to_edgelist
     - repo: konect
       items:
         - dolphins
         - ucidata-zachary
       postprocess: to_edgelist

In this way, simexpal will use its internal mechanism to convert the instances to edge list
format after downloading them.

Re-Postprocessing
^^^^^^^^^^^^^^^^^

To re-postprocess instances, you can simply delete the respective ``<instance_name>.postprocessed`` files
Review comment (Contributor):
Better to mention once more that instances are saved in inst_dir.

before calling

.. code-block:: bash

   $ simex instances install

in the terminal.
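
If many marker files have to be removed, the deletion can be scripted. The following sketch assumes that the ``<instance_name>.postprocessed`` files are stored next to the instances in the instance directory; the helper ``clear_postprocessed_markers`` is hypothetical and not part of simexpal.

```python
import tempfile
from pathlib import Path


def clear_postprocessed_markers(instance_dir, instance_names=None):
    """Delete <instance_name>.postprocessed marker files; return removed names.

    Hypothetical helper. With instance_names=None every marker is removed.
    """
    removed = []
    for marker in sorted(Path(instance_dir).glob('*.postprocessed')):
        name = marker.name[:-len('.postprocessed')]
        if instance_names is None or name in instance_names:
            marker.unlink()
            removed.append(name)
    return removed


# Demonstration in a temporary directory standing in for the instance directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('facebook_combined', 'cit-HepTh'):
        (Path(tmp) / name).write_text('0 1\n')
        (Path(tmp) / (name + '.postprocessed')).touch()
    removed = clear_postprocessed_markers(tmp, {'cit-HepTh'})
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Only the marker files are deleted here; the instance files themselves stay in place, so a subsequent ``simex instances install`` can postprocess them again.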

Next
----

5 changes: 4 additions & 1 deletion scripts/simex
@@ -302,7 +302,10 @@ def do_instances_install(args):

     for instance in cfg.all_instances():
         if args.overwrite:
-            util.try_rmfile(os.path.join(cfg.instance_dir(), instance.unique_filename))
+            fullpath = instance.fullpath
+            util.try_rmfile(fullpath)
+            util.try_rmfile(fullpath + '.postprocessed')
+            util.try_rmfile(fullpath + '.original')
Review comment (Contributor):
What about .post0 and .post1?

         instance.install()

instances_install_parser = instances_subcmds.add_parser('install')