Add logging documentation, including vipin-created logs (#199)
- *Category*:  documentation
- *JIRA issue*: [MIC-4757](https://jira.ihme.washington.edu/browse/MIC-4757)

Changes and notes
- Adds a logging page to the docs, describes all the logs in general, with sections for each "area" of logs.
mattkappel authored Dec 19, 2023
1 parent e8e6b73 commit a222429
Showing 5 changed files with 51 additions and 3 deletions.
2 changes: 1 addition & 1 deletion .zenodo.json
@@ -39,5 +39,5 @@
}
],
"access_right": "open",
"description": "Archival release of Vivarium Cluster Tools, a Python package that makes running ``vivarium`` simulations at scale on a Univa Grid Engine cluster easy."
"description": "Archival release of Vivarium Cluster Tools, a Python package that makes running ``vivarium`` simulations at scale on a Slurm cluster easy."
}
4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -1,3 +1,7 @@
**1.5.1 - 12/15/23**

- Add logging documentation for psimulate

**1.5.0 - 10/27/23**

- Remove default results directory for 'psimulate run'
2 changes: 1 addition & 1 deletion README.rst
@@ -13,7 +13,7 @@ Vivarium Cluster Tools
:alt: Documentation Status

Vivarium cluster tools is a python package that makes running ``vivarium``
simulations at scale on a Univa Grid Engine cluster easy.
simulations at scale on a Slurm cluster easy.

Installation
------------
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -7,7 +7,7 @@
Vivarium Cluster Tools Documentation
====================================
Vivarium cluster tools is a python package that makes running ``vivarium``
simulations at scale on a Univa Grid Engine cluster easy.
simulations at scale on a Slurm cluster easy.

.. toctree::
   :maxdepth: 2
@@ -16,5 +16,6 @@ simulations at scale on a Univa Grid Engine cluster easy.
   distributed_runner
   yaml_basics
   branch
   logging
   api_reference/index
   glossary
43 changes: 43 additions & 0 deletions docs/source/logging.rst
@@ -0,0 +1,43 @@
.. toctree::
   :maxdepth: 2
   :caption: Contents:

Logging
============

Sometimes, even with perfect code, things can go wrong at sufficient scale.
When they do, it's useful to look to the logs to see what happened. ``psimulate``
logs to the results directory, in a subdirectory called ``logs``. Inside that directory,
there will be a directory for each simulation run or restart. If neither
``psimulate restart`` nor ``psimulate expand`` was
ever used for the run, there will be only one directory for the run.
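
The layout described above can be inspected with a few lines of Python. This is
a minimal sketch, not part of ``psimulate``: the helper name ``list_run_log_dirs``
is hypothetical, and it assumes only that ``logs`` sits directly under the results
directory and holds one subdirectory per run or restart.

```python
from pathlib import Path


def list_run_log_dirs(results_dir):
    """Return the per-run log directories under <results_dir>/logs, sorted by name.

    Hypothetical helper: it relies only on the layout described in the text
    (a ``logs`` subdirectory containing one directory per run or restart).
    """
    logs_root = Path(results_dir) / "logs"
    if not logs_root.is_dir():
        return []
    # Keep directories only; stray files under logs/ are ignored.
    return sorted(p for p in logs_root.iterdir() if p.is_dir())
```

If the returned list has more than one entry, the run was restarted or expanded
at least once.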

Top-level logs
----------------
At the top level of the directory, there will be text and JSON-formatted main log files.
These are the log files for the runner process. There will also be a log file for each
Redis database process, which will be named ``redis.p<port>.log``. Per-worker logs are
in ``cluster_logs`` and ``worker_logs`` directories, described below.
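
Because the Redis log names follow the ``redis.p<port>.log`` pattern, they can be
matched back to their ports programmatically. The sketch below is illustrative
only: ``find_redis_logs`` is a hypothetical helper, and it assumes that ``<port>``
is a plain decimal number.

```python
import re
from pathlib import Path

# Assumption: <port> in ``redis.p<port>.log`` is a decimal port number.
REDIS_LOG_PATTERN = re.compile(r"^redis\.p(\d+)\.log$")


def find_redis_logs(run_log_dir):
    """Map each Redis port to its log file in a single run's log directory."""
    logs_by_port = {}
    for path in Path(run_log_dir).iterdir():
        match = REDIS_LOG_PATTERN.match(path.name)
        if match and path.is_file():
            logs_by_port[int(match.group(1))] = path
    return logs_by_port
```

Files that do not match the pattern, such as the main log files, are skipped.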

Cluster logs
-------------
The ``cluster_logs`` directory contains logs from the array job processes. Each worker job
has its own file. These logs are a superset of what you will find in the ``worker_logs``
directory: in addition to the worker output, they contain information about Redis
heartbeats and other cluster-related information.

Worker logs
-------------
The ``worker_logs`` directory contains logs from the worker processes as they relate
to running simulations. Additionally, this directory contains performance logs that
are described in the next section.

Performance logs
-----------------
As part of the VIPIN (VIvarium Performance INformation) feature, ``psimulate`` gathers
per-worker performance information. This information is summarized at the end of the parallel
runs and stored in the ``worker_logs`` directory as ``log_summary.csv``. This file
contains metadata identifying the run and the worker host, execution timing information, and CPU,
disk, and network performance counters. The intent of this logging is to let users understand the
performance characteristics of their simulations and, when performance looks suspicious,
to correlate outlier performance characteristics with cluster and hardware events.
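
One way to look for outliers in ``log_summary.csv`` is to flag rows that sit far
from the mean of a chosen numeric column. The following is a rough sketch using
only the standard library. The helper name ``find_outlier_rows`` is hypothetical,
and the column to inspect is passed in by the caller because the actual column
names in the file are not listed here; check the CSV header before using it.

```python
import csv
import statistics
from pathlib import Path


def find_outlier_rows(summary_path, column, threshold=3.0):
    """Return rows whose value in `column` lies more than `threshold`
    sample standard deviations from the column mean.

    `column` must name a numeric column that exists in the CSV header.
    """
    with Path(summary_path).open(newline="") as f:
        rows = list(csv.DictReader(f))
    values = [float(row[column]) for row in rows]
    if len(values) < 2:
        return []  # not enough data to estimate spread
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all workers reported the same value
    return [row for row, v in zip(rows, values) if abs(v - mean) / stdev > threshold]
```

A call might look like ``find_outlier_rows("results/logs/<run>/worker_logs/log_summary.csv", "execution_time")``,
where both the path and the ``execution_time`` column name are placeholders. The rows
returned include the metadata columns, which can then be compared against cluster
and hardware events.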
