Tx2mon sampler development notes

Sara W edited this page Mar 15, 2021 · 3 revisions

MMAP Structure

  1. Earlier versions mapped the structure into user space with mmap

  2. Previously tried mmap die to kernel to user space mmap but system nodes were "panicking"

    a. Could be handlers, regressions, etc.

    b. Possibly revisit this and try again

Keep schema integrity

  1. Data structure from memory to ldms
  2. Anything that has to be maintained by hand is a problem
  3. Maintaining data structure read from sysdevice

Dynamically build a schema in ldms

  1. Goal: Bridge the gap between structure and ldms - set of functions that build the schema on the fly
  2. Building schema programmatically - understand tx2mon data structure (how many cores, etc)
  3. Add items to the schema - individual fields/items (no names)
  4. Helper functions that take each object in the structure, create and remember its schema entry, and build a table that uses the appropriate helper function (based on type: uint16, etc.)
  5. Need variable length arrays (how many cores are active)

Build a work table that gets executed every time sampler runs

  1. Each entry in the list to work through holds: the helper function (handler), the offsets into the data structure (2 of them), and the # of valid entries in an array.
  2. Build a worklist to execute with minimum overhead
  3. Helper function embodies data types
  4. Sampler/dstat/dstat.c and helpers are a possible model to follow
  5. get_throttling_cause: the cause (temp, power, or external) is implicit in the code and must be converted to a string. "External" means a message such as proc-hot from the chassis, or some such request to reduce consumption.
  6. Sampler needs to produce multiple sets (as with ibnet) possibly one per socket.
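
The work-table idea above can be sketched in plain C without any LDMS calls. Everything named here is a hypothetical stand-in: `struct fake_region` replaces the real tx2mon structure, `struct work_item` is one possible row layout, and the helpers write into a flat `uint64_t` array where a real sampler would set metric values. The notes mention two offsets per entry; this sketch keeps only the source offset and derives the destination from a running index.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CORES 32

/* Hypothetical stand-in for the structure read from a node file. */
struct fake_region {
        uint32_t counter;
        uint16_t freq_cpu[MAX_CORES];
        uint16_t tmon_cpu[MAX_CORES];
};

/* A helper copies `count` values of one C type from `src` into `out`
 * and returns the number of output slots it filled. */
typedef size_t (*helper_fn)(const void *src, size_t count, uint64_t *out);

static size_t copy_u16(const void *src, size_t count, uint64_t *out)
{
        size_t i;
        uint16_t v;

        for (i = 0; i < count; i++) {
                memcpy(&v, (const char *)src + i * sizeof(v), sizeof(v));
                out[i] = v;
        }
        return count;
}

static size_t copy_u32(const void *src, size_t count, uint64_t *out)
{
        size_t i;
        uint32_t v;

        for (i = 0; i < count; i++) {
                memcpy(&v, (const char *)src + i * sizeof(v), sizeof(v));
                out[i] = v;
        }
        return count;
}

/* One work-table row: which helper to run, where the field lives in the
 * source structure, and how many array entries are valid. */
struct work_item {
        helper_fn helper;
        size_t offset;
        size_t count;
};

/* Build the table once, after the number of active cores is known.
 * `wl` must have room for 3 rows. Returns the number of rows filled. */
static size_t build_worklist(struct work_item *wl, size_t n_core)
{
        size_t n = 0;

        wl[n++] = (struct work_item){ copy_u32,
                offsetof(struct fake_region, counter), 1 };
        wl[n++] = (struct work_item){ copy_u16,
                offsetof(struct fake_region, freq_cpu), n_core };
        wl[n++] = (struct work_item){ copy_u16,
                offsetof(struct fake_region, tmon_cpu), n_core };
        return n;
}

/* Run on every sample: walk the rows in order. Returns slots written. */
static size_t run_worklist(const struct work_item *wl, size_t n_items,
                           const struct fake_region *r, uint64_t *out)
{
        size_t i, used = 0;

        for (i = 0; i < n_items; i++)
                used += wl[i].helper((const char *)r + wl[i].offset,
                                     wl[i].count, out + used);
        return used;
}
```

The per-type decision is made once in `build_worklist`, so the per-sample loop does no type checks and only touches the first `n_core` entries of each array.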

Retest whether re-reading tx2mon data needs a reopen or just a seek.

  1. Check in the tx2mon C code whether it seeks, reopens, or closes and opens

Additional Notes

  1. Core values stay at the same index for the duration of a particular boot (double-check online)
  2. tx2mon limit is 32 cores. Marvell cancelled all future tx2 parts
  3. Array of 6 where only the first 3 matter (core, temp, freq). Tx2mon is not aware of which ones are important
  4. freq_cpu may be a scalar instead of per-core: (do min/max?)
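
If freq_cpu ends up reported as a min/max pair rather than per core, the reduction is a single pass over the active cores. This is a minimal sketch under assumptions: `struct freq_range` and `freq_min_max` are hypothetical names, and the `unsigned int` element type is taken from the return type of `cpu_freq`, not from the tx2mon header.

```c
#include <stddef.h>

struct freq_range {
        unsigned int min;
        unsigned int max;
};

/* Reduce the first `n_core` entries of `freq` to a min/max pair.
 * The caller must pass n_core >= 1. */
static struct freq_range freq_min_max(const unsigned int *freq, size_t n_core)
{
        struct freq_range r = { freq[0], freq[0] };
        size_t i;

        for (i = 1; i < n_core; i++) {
                if (freq[i] < r.min)
                        r.min = freq[i];
                if (freq[i] > r.max)
                        r.max = freq[i];
        }
        return r;
}
```

Entries past `n_core` are never read, so stale values for inactive cores do not affect the result.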
  5. Email Andy Warner or Kevin Pedretti for additional information about tx2mon

Functionality

TX2MON - Original Code

  • Needs the socinfo, node0 and node1 raw files. Open questions: what is in these node# raw files, and where do they come from?
  • Files are located in /sys/devices/platform/tx2mon/ directory
  • Main function checks for user arguments:
      • -h - Help/usage
      • -d - Delay between samples taken (0.0001-.9999)
      • -f - Output to csv file. Must provide output file name.
      • -T - Display throttling
      • -x - Display extra parameters
      • -q - Only report the information once. No updating
      • -t - Show the timestamp of the sample
      • default - Display data at a certain interval in interactive mode (does not save to csv file)
  • Uses lseek() to read through the node# files.

dstat.c

  • Provides overview of ldmsd
  • Finds and creates the metric sets and schemas by grabbing data from three structs: stat, io and statm
  • Also reports a count of actively used mmalloc bytes.

dstat.h

  • Defines the list of structs (stat, io and statm) as well as the functions to parse each one

dstat_parse.c

  • Parses each file and uses the structs defined in dstat.h

Code used for building and testing Tx2mon Sampler

Function for parsing the data files

The code below is pulled from the read_node() function in tx2mon.c. This function reads the node file(s) using lseek and populates the data struct(s) with the metrics it finds. The "throttling_available" variable is used only when debugging the code in the sampler log output.

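        /* Rewind to the start of the node file before each read. */
        rv = lseek(s->fd, 0, SEEK_SET);
        if (rv < 0)
                return rv;
        /* Read the whole structure; stop on a short read. */
        rv = read(s->fd, op, sizeof(*op));
        if (rv < sizeof(*op))
                return rv;
        /* Nothing to report if the data is not marked ready. */
        if (CMD_STATUS_READY(op->cmd_status) == 0)
                return 0;
        /* A version above 0 means throttling data is available. */
        if (CMD_VERSION(op->cmd_status) > 0)
                s->throttling_available = 1;
        else
                s->throttling_available = 0;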

Functions for converting metrics

The following functions are used to convert the raw metrics to temperature, voltage and power. The functions cpu_temp and cpu_freq are used when printing a table similar to the one from the tx2mon program (used for debugging).

static inline double cpu_temp(struct cpu_info *d, int c)
{
        return to_c(d->mcp.tmon_cpu[c]);
}

static inline unsigned int cpu_freq(struct cpu_info *d, int c)
{
        return d->mcp.freq_cpu[c];
}

static inline double to_v(int mv)
{
        return mv/1000.0;
}

static inline double to_w(int mw)
{
        return mw/1000.0;
}

Debugging

Variable assignments and definitions

Below is the list of variables pulled from tx2mon.c and defined in the tx2mon sampler header file (tx2mon.h). All of them are used for debugging the tx2mon sampler in the log output.

static struct termios *ts_saved;
static int display_extra = 0;
static int display_throttling = 1;
FILE    *fileout;
int     samples;
unsigned int throttling_available:1;

Functions for outputting data to log file

Additionally, the functions listed below were pulled from the tx2mon.c code for debugging purposes. These functions allow the programmer to debug the sampler by dumping the metric values into a table format identical to the one provided by the tx2mon program.

static void term_init_save(void)
{
        static struct termios nts;

        if (!isatty(1)) {
                term_seq.cl = "";
                term_seq.nl = "\n";
                return;
        }
        ts_saved = malloc(sizeof(*ts_saved));
        if (!ts_saved)
                goto fail;
        if (tcgetattr(0, ts_saved) < 0)
                goto fail;

        nts = *ts_saved;
        nts.c_lflag &= ~(ICANON | ECHO);
        nts.c_cc[VMIN] = 1;
        nts.c_cc[VTIME] = 0;
        if (tcsetattr (0, TCSANOW, &nts) < 0)
                goto fail;

        term_seq.nl = "\r\n";
        return;
fail:
        if (ts_saved) {
                free(ts_saved);
                ts_saved = NULL;
        }
        msglog(LDMSD_LERROR, SAMP ": Failed to set up terminal %i", errno);
}

static void dump_cpu_info(struct cpu_info *s)
{
        struct mc_oper_region *op = &s->mcp;
        struct term_seq *t = &term_seq;
        int i, c, n;
        char buf[64];

        printf("Node: %d  Snapshot: %u%s", s->node, op->counter, t->nl);
        printf("Freq (Min/Max): %u/%u MHz     Temp Thresh (Soft/Max): %6.2f/%6.2f C%s",
                op->freq_min, op->freq_max, to_c(op->temp_soft_thresh),
                to_c(op->temp_abs_max), t->nl);
        printf("%s", t->nl);
        n = tx2mon->n_core < CORES_PER_ROW ? tx2mon->n_core : CORES_PER_ROW;
        for (i = 0; i < n; i++)
                printf("|Core  Temp   Freq ");
        printf("|%s", t->nl);
        for (i = 0; i < n; i++)
                printf("+------------------");
        printf("+%s", t->nl);
        for (c = 0;  c < tx2mon->n_core; ) {
                for (i = 0; i < CORES_PER_ROW && c < tx2mon->n_core; i++, c++)
                        printf("|%3d: %6.2f %5d ", c,
                                        cpu_temp(s, c), cpu_freq(s, c));
                printf("|%s", t->nl);
        }
        printf("%s", t->nl);
        printf("SOC Center Temp: %6.2f C%s\n", to_c(op->tmon_soc_avg), t->nl);
        printf("Voltage    Core: %6.2f V, SRAM: %5.2f V,  Mem: %5.2f V, SOC: %5.2f V%s",
                to_v(op->v_core), to_v(op->v_sram), to_v(op->v_mem),
                to_v(op->v_soc), t->nl);
        printf("Power      Core: %6.2f W, SRAM: %5.2f W,  Mem: %5.2f W, SOC: %5.2f W%s",
                to_w(op->pwr_core), to_w(op->pwr_sram), to_w(op->pwr_mem),
                to_w(op->pwr_soc), t->nl);
        printf("Frequency    Memnet: %4d MHz", op->freq_mem_net);
        if (display_extra)
                printf(", SOCS: %4d MHz, SOCN: %4d MHz", op->freq_socs, op->freq_socn);
        printf("%s%s", t->nl, t->nl);
        if (!display_throttling)
                return;

        if (s->throttling_available) {
                printf("%s", t->nl);
                printf("Throttling Active Events: %s%s",
                         get_throttling_cause(op->active_evt, ",", buf, sizeof(buf)), t->nl);
                printf("Throttle Events     Temp: %6d,    Power: %6d,    External: %6d%s",
                                op->temp_evt_cnt, op->pwr_evt_cnt, op->ext_evt_cnt, t->nl);
                printf("Throttle Durations  Temp: %6d ms, Power: %6d ms, External: %6d ms%s",
                                op->temp_throttle_ms, op->pwr_throttle_ms,
                                op->ext_throttle_ms, t->nl);
        } else {
                printf("Throttling events not supported.%s", t->nl);
        }
        printf("%s", t->nl);
}

static char *get_throttling_cause(unsigned int active_event, const char *sep, char *buf, int bufsz)
{
        const char *causes[] = { "Temperature", "Power", "External", "Unk3", "Unk4", "Unk5"};
        const int ncauses = sizeof(causes)/sizeof(causes[0]);
        int i, sz, events;
        char *rbuf;

        rbuf = buf;
        if (active_event == 0) {
                snprintf(buf, bufsz, "None");
                return rbuf;
        }

        for (i = 0, events = 0; i < ncauses && bufsz > 0; i++) {
                if ((active_event & (1 << i)) == 0)
                        continue;
                sz = snprintf(buf, bufsz, "%s%s", events ? sep : "", causes[i]);
                bufsz -= sz;
                buf += sz;
                ++events;
        }
        return rbuf;
}
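
One edge case in the function above: when the caller's buffer is too small, `snprintf` returns the length it would have written, so `buf += sz` can step past the end of the buffer, and a mask with none of the six known bits set leaves the buffer unwritten. The variant below is a hedged sketch, not part of tx2mon; the name `throttling_cause_str` is hypothetical. It stops at the first truncated piece and always leaves a terminated string.

```c
#include <stdio.h>

static char *throttling_cause_str(unsigned int active_event, const char *sep,
                                  char *buf, size_t bufsz)
{
        static const char *causes[] = { "Temperature", "Power", "External",
                                        "Unk3", "Unk4", "Unk5" };
        const size_t ncauses = sizeof(causes) / sizeof(causes[0]);
        size_t i, used = 0;
        int sz, events = 0;

        if (bufsz == 0)
                return buf;
        if (active_event == 0) {
                snprintf(buf, bufsz, "None");
                return buf;
        }
        buf[0] = '\0';  /* defined result even if no known bit is set */
        for (i = 0; i < ncauses; i++) {
                if ((active_event & (1u << i)) == 0)
                        continue;
                sz = snprintf(buf + used, bufsz - used, "%s%s",
                              events ? sep : "", causes[i]);
                if (sz < 0 || (size_t)sz >= bufsz - used)
                        break;  /* this piece was truncated; stop here */
                used += (size_t)sz;
                ++events;
        }
        return buf;
}
```

With a 64-byte buffer and "," as the separator, a mask of 0x5 yields `Temperature,External` and a mask of 0 yields `None`.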
