Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support postprocessing of instances #116

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

duylethanh
Copy link
Collaborator

@duylethanh duylethanh commented Jul 2, 2021

This PR contains the following:

  • support arbitrary postprocessing of instances by allowing a list of dictionaries containing the args, workdir and environ keys in the postprocess key (like the build arguments in compile, configure, ...)
  • documentation on postprocessing of instances
  • tests for postprocessing of instances
  • remove assert file_size > 0 while testing the installation process (so we can have empty (dummy) instances; also it should suffice to only check for the existence of instance files)

@duylethanh duylethanh marked this pull request as draft July 2, 2021 18:55
@duylethanh duylethanh force-pushed the enhancement/postprocess_instances branch from b7190ff to dc31070 Compare July 4, 2021 16:40
@duylethanh duylethanh marked this pull request as ready for review July 4, 2021 17:33
@duylethanh
Copy link
Collaborator Author

PR ready for review

Comment on lines +775 to +818
print("simexpal: Restoring original instance files")
os.rename(partial_path + '.post0', partial_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, if we restore the instance file to the post0 file, how do we know if any given instance has already been postprocessed or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old logic here was: the post0 files are obtained through download, and the instance files are always already postprocessed, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, if we restore the instance file to the post0 file, how do we know if any given instance has already been postprocessed or not?

For each instance we create a .postprocessed file (lines 781 and 868) when the postprocessing was successful. Also, we print out status messages in the terminal. Although we probably shouldn't assume that it is enough. Maybe I could add an additional check in Instance.check_available() which looks for the .postprocessed file for instances that are supposed to be postprocessed. In this way, experiments can't be started on unprocessed instances that were supposed to be postprocessed.

The old logic here was: the post0 files are obtained through download, and the instance files are always already postprocessed, right?

Yes. With this PR .post0 files are still obtained through download and instance files are postprocessed depending on the existence of the .postprocessed file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning .postprocessed and Instance.check_available(): I think it is enough to inform the user when postprocessing has not been successful. Making successful postprocessing mandatory might lead to unexpected results for users (for example if the experiment does not really rely on it in practice).

simexpal/base.py Show resolved Hide resolved
duylethanh and others added 7 commits August 31, 2022 15:45
…en postprocessing instances

The .postprocessed file signals the successful postprocessing of an instance. The .original file is a copy of the instances before postprocessing
… types

Also 'to_edgelist' was removed from instances where we do not support edge list conversion anyway
Right now it should suffice to check the existence of the files, which is done in check_available()
@duylethanh duylethanh force-pushed the enhancement/postprocess_instances branch from 1dba381 to a83a4b2 Compare August 31, 2022 14:50
@duylethanh duylethanh marked this pull request as draft August 31, 2022 14:52
@duylethanh
Copy link
Collaborator Author

Fixed the merge conflicts for now. When the remarks above are resolved or ok with you I will mark this PR as ready for review again.

@duylethanh duylethanh marked this pull request as ready for review November 15, 2022 14:24
^^^^^^^^^^^^^^^^^^^^^^^^

You can define arbitrary postprocessing steps by setting the ``postprocess`` key to a list of dictionaries
containing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... to a list of dictionaries containing the keys:

- ``environ``: dictionary of (environment variable, value)-pairs
- ``workdir``: path of the working directory

keys.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this line then.


Assume you want to postprocess the ``facebook_combined`` and ``cit-HepTh`` network from
`SNAP <https://snap.stanford.edu/data/>`_ using two executables ``postprocess1`` and ``postprocess2``, which
take the path of the instance as parameter. Also, you have to prepend the path for ``postprocess1`` to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does only postprocess1 needs the PATH variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to demonstrate the possibility/necessity of adding the appropriate PATH-variable in simexpal in order to find the executable for the postprocessing. There might be cases where the PATH was added beforehand, so that is isn't necessary to do so anymore.

Re-Postprocessing
^^^^^^^^^^^^^^^^^

To re-postprocess instances, you can simply delete the respective ``<instance_name>.postprocessed`` files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better name/reference one more time that instances are saved in inst_dir.

``simex instances run-transform --transform='to_edgelist'`` if you already have them saved locally.
execute supported actions, e.g, transforming the instances to edge list format via
``simex instances run-transform --transform='to_edgelist'`` or :ref:`postprocess <PostprocessInstances>`
them if you already have them saved locally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put if you already have them saved locally to the beginning of the sentence. Otherwise it currently reads like: postprocessing is available if the instances are locally available (while the transform works all the time).

Comment on lines +775 to +818
print("simexpal: Restoring original instance files")
os.rename(partial_path + '.post0', partial_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning .postprocessed and Instance.check_available(): I think it is enough to inform the user when postprocessing has not been successful. Making successful postprocessing mandatory might lead to unexpected results for users (for example if the experiment does not really rely on it in practice).

fullpath = instance.fullpath
util.try_rmfile(fullpath)
util.try_rmfile(fullpath + '.postprocessed')
util.try_rmfile(fullpath + '.original')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is with .post0 and .post1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants