
LDMS Data Facilitates Analysis

oceandlr edited this page Oct 22, 2018 · 7 revisions

Why is high frequency data necessary?


IO wait profiles from LDMS data for a 64-node job: 1 sec interval (top), 60 sec interval (bottom). Higher-fidelity sampling is needed to resolve details. Each line is a single node’s data (legend suppressed). The gray background shows times pre- and post-job. From "Toward Rapid Understanding of Production HPC Applications and Systems" @ IEEE Cluster 2015.
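The effect in the figure can be reproduced with a small arithmetic sketch. The numbers below are made up for illustration and are not taken from the paper. The sketch also assumes that a 60 sec value behaves like an average over the interval, as a rate derived from a cumulative counter would.

```python
def block_means(samples, block):
    """Average consecutive blocks of `block` samples (a stand-in for coarser sampling)."""
    means = []
    for start in range(0, len(samples), block):
        chunk = samples[start:start + block]
        means.append(sum(chunk) / len(chunk))
    return means

# Hypothetical 1 sec IO wait percentages over two minutes:
# idle everywhere except a 5 second burst at 90%.
fine = [0.0] * 120
for t in range(30, 35):
    fine[t] = 90.0

# The same data seen at a 60 sec interval.
coarse = block_means(fine, 60)

print(max(fine))  # 90.0
print(coarse)     # [7.5, 0.0]
```

At the 1 sec interval the burst is visible at its full height. At the 60 sec interval it appears as a 7.5% value that is hard to tell apart from background activity.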

Why is whole system data necessary?

Full system data makes it possible to investigate conditions whose effects cannot be understood from the data visible to an application alone. For example, in shared networks, conditions along the communication routes affect the application's performance, but data about those routes is not available from within the application's allocation.

Why is synchronized data necessary?

To get a coherent picture of conditions, the data must be collected at effectively the same time across possibly tens of thousands of disparate components.


Lustre opens per node over a day on NCSA's Blue Waters. The arrow indicates a significant number of opens occurring across the system at the same time. Horizontal lines indicate a significant and sustained level of opens from a few nodes. From "The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications" @ SC 2014.
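One general way to collect at effectively the same time on many nodes is to align each sample to a shared wall-clock boundary instead of to the moment the collector happened to start. The following is a minimal sketch of that idea only. It is not LDMS's implementation, the function name `next_sample_time_us` is hypothetical, and it assumes the node clocks are already synchronized (for example by NTP).

```python
def next_sample_time_us(now_us, interval_us, offset_us=0):
    """Return the first time strictly after `now_us` that lies on the
    grid offset_us + k * interval_us (all values in microseconds)."""
    ticks_elapsed = (now_us - offset_us) // interval_us
    return (ticks_elapsed + 1) * interval_us + offset_us

# Two nodes that wake up at different moments within the same second
# still schedule their next sample for the same instant.
node_a = next_sample_time_us(10_200_000, 1_000_000)
node_b = next_sample_time_us(10_900_000, 1_000_000)
print(node_a, node_b)
```

Integer microseconds are used rather than floating-point seconds so that the grid arithmetic is exact.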

Why are run time data collection and transport necessary?

Run time data collection and transport enable analysis while applications are running and while the system is experiencing conditions of interest. Problems can therefore be discovered early, and remedial action can be taken. Post-processing analysis does not solve problems as they occur and is rarely performed in practice. LDMS supports streaming analysis either within the store plugins or on the output of a store (e.g., a store feeding a named pipe). Analysis can also be performed on a database while LDMS data is being fed into it.
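As a minimal sketch of analysis on the output of a store, the function below reads comma-separated rows from any text stream and reports values above a threshold as they arrive. The column names `timestamp`, `node`, and `iowait`, the threshold, and the sample rows are all assumptions made for illustration. They are not the actual layout written by any LDMS store plugin.

```python
import csv
import io

def flag_rows(stream, metric, threshold):
    """Yield (node, timestamp, value) for each row whose `metric` exceeds
    `threshold`. Rows are handled one at a time, so the stream can be a
    file, a named pipe, or anything else that yields lines of text."""
    for row in csv.DictReader(stream):
        value = float(row[metric])
        if value > threshold:
            yield row["node"], row["timestamp"], value

# Stand-in for live data; the header and rows are hypothetical.
sample = io.StringIO(
    "timestamp,node,iowait\n"
    "100,n1,2.0\n"
    "100,n2,85.0\n"
    "101,n1,3.0\n"
)

alerts = list(flag_rows(sample, "iowait", 50.0))
print(alerts)  # [('n2', '100', 85.0)]
```

To run this against live data, the `io.StringIO` object would be replaced by an opened named pipe or file that the store writes to. Because the function is a generator, each alert is produced as soon as its row is read rather than after the run ends.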
