In many cases, manufacturer-specific IPMI Platform Events are stored in binary form in the System Event Log, which makes the platform state difficult to understand. This document specifies a solution for presenting manufacturer-specific IPMI Platform Events in human-readable form by defining a generic framework for parsing events and defining new messages in an easy and scalable way. Events originating from the Intel Management Engine (ME) are used as a case study. The general design of the solution is followed by a tailored-down implementation for OpenBMC, described in detail.
- IPMI - Intelligent Platform Management Interface; standardized binary protocol of communication between endpoints in datacenter [1]
- Platform Event - specific type of IPMI binary payload, used for encoding and sending asynchronous one-way messages to recipient [1]-29.3
- ME - Intel Management Engine, autonomous subsystem used for remote datacenter management [5]
- Redfish - modern datacenter management protocol, built around REST protocol and JSON format [2]
- OpenBMC - open-source BMC implementation with Redfish-oriented interface [3]
IPMI is designed to be a compact and efficient binary format for data exchanged between entities in a datacenter. The recipient is responsible for receiving the data and for properly analyzing, parsing and translating the binary representation into a human-readable format. IPMI Platform Events are one type of these messages, used to inform the recipient about the occurrence of a particular, well-defined situation.
Some IPMI Platform Events are standardized, described in the specification and
already have an open-source implementation [6]; however, this is only part of
the spectrum. The increasing complexity of datacenter systems has multiplied the
possible sources of events, which are defined by manufacturer-specific
extensions to platform event data. One of these sources is Intel ME, which is
able to deliver information about its own state of operation and, in some cases,
notify about certain erroneous system-wide conditions, such as interface errors.
These OEM-specific messages lack support in existing open-source
implementations. They require manual, documentation-based [5] implementation,
which has historically been the source of many interpretation errors. Any
document update requires manual code modification according to the specific
changes, which is neither efficient nor scalable. Furthermore, the documentation
is not always clear on event severity or possible resolution actions.
A generic, OEM-agnostic algorithm is proposed to produce human-readable output from a binary IPMI Platform Event.
In general, each event consists of a predefined payload:
[GeneratorID][SensorNumber][EventType][EventData[2]]
where:
- GeneratorID - used to determine source of the event,
- SensorNumber - generator-specific unique sensor number,
- EventType - sensor-specific group of events,
- EventData - array with detailed event data.
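As a minimal sketch, the payload layout above could be modelled as a plain structure filled from raw bytes. The PlatformEvent type and the unpackEvent helper are hypothetical names introduced for illustration, and the sketch assumes that every field occupies exactly one byte in the order shown; a real IPMI request carries additional fields.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical in-memory form of the payload described above.
struct PlatformEvent
{
    uint8_t generatorId;              // source of the event
    uint8_t sensorNumber;             // generator-specific sensor number
    uint8_t eventType;                // sensor-specific group of events
    std::array<uint8_t, 2> eventData; // detailed event data
};

// Assumption: one byte per field, in the order shown above.
// Returns std::nullopt when the buffer is too short to hold all fields.
inline std::optional<PlatformEvent> unpackEvent(const std::vector<uint8_t>& raw)
{
    constexpr std::size_t payloadSize = 5;
    if (raw.size() < payloadSize)
    {
        return std::nullopt;
    }
    return PlatformEvent{raw[0], raw[1], raw[2], {raw[3], raw[4]}};
}
```

Returning std::optional keeps the "payload too short" case explicit, so later parsing stages never have to index past the end of the buffer.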
One might observe that each consecutive event field narrows down the domain of
event interpretations, starting with GeneratorID at the top and ending with
EventData at the end of a decision tree. Software should be able to determine
the meaning of the event by using the divide and conquer approach on a
predefined list of well-known event definitions. Note that such a decision tree
might also be needed to break down EventData, as is the case in many
OEM-specific IPMI implementations.
The implementation should therefore be a series of filters with increasing specialization on each level. A recursive algorithm for this looks as follows:
[Diagram: the event is split into the currently analyzed chunk and the remainder. Step 1: analyze the chunk and choose the proper 'subtree' parser. Step 2: 'cut' the remainder and go back to Step 1 with it.]
The described process is repeated until there is nothing left to break down and
a single, unique event interpretation (an EventId) is determined.
Not all event data is a decision point - certain chunks of data should be kept
as-is or formatted in a certain way, to be introduced in the human-readable
Message. Parser operation should therefore also include logic for extracting
Parameters during the traversal process.
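The traversal described above can be sketched as a small recursive function over a tree of byte values. This is only an illustration under stated assumptions: Node, ParsedEvent, parse and exampleTree are hypothetical names, every level is assumed to inspect exactly one byte, and Parameters are assumed to be the bytes that follow the leaf, stored in decimal form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical decision-tree node. Each level of the tree inspects one byte.
struct Node
{
    std::string eventId;              // non-empty on a leaf: final EventId
    std::size_t parameterBytes = 0;   // leaf only: bytes kept as Parameters
    std::map<uint8_t, Node> children; // subtree chosen by the analyzed byte
};

struct ParsedEvent
{
    std::string eventId;
    std::vector<std::string> parameters;
};

inline bool parse(const Node& node, const std::vector<uint8_t>& bytes,
                  std::size_t pos, ParsedEvent& out)
{
    assert(pos <= bytes.size());
    if (!node.eventId.empty())
    {
        // Nothing left to break down: the following bytes are Parameters.
        if (bytes.size() - pos < node.parameterBytes)
        {
            return false;
        }
        out.eventId = node.eventId;
        for (std::size_t i = 0; i < node.parameterBytes; ++i)
        {
            out.parameters.push_back(std::to_string(bytes[pos + i]));
        }
        return true;
    }
    if (pos == bytes.size())
    {
        return false; // ran out of data before reaching a leaf
    }
    // Step 1: analyze the current chunk and choose the subtree parser.
    const auto subtree = node.children.find(bytes[pos]);
    if (subtree == node.children.end())
    {
        return false; // unknown value at this level
    }
    // Step 2: cut the remainder and go back to Step 1.
    return parse(subtree->second, bytes, pos + 1, out);
}

// Tree holding only the FlashWearoutWarning path used in this document.
inline Node exampleTree()
{
    Node leaf;
    leaf.eventId = "FlashWearoutWarning";
    leaf.parameterBytes = 1; // EventData[1]: percentage
    Node root;
    root.children[0x2C].children[0x17].children[0x00].children[0x0A] = leaf;
    return root;
}
```

Calling parse(exampleTree(), bytes, 0, result) on the bytes 0x2C 0x17 0x00 0x0A 0x32 walks GeneratorID, SensorNumber, EventType and EventData[0] in turn, then stores EventData[1] as a decimal Parameter. Keeping the tree as data rather than code is one possible design choice; it means new events can be added without touching the traversal.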
Effectively, both the EventId and an optional collection of Parameters should
then be used as input for a lookup mechanism to generate the final Event
Message. Each message consists of the following entries:
- EventId - associated unique event,
- Severity - determines how severely this particular event might affect usual datacenter operation,
- Resolution - suggested steps to mitigate possible problem,
- Message - human-readable message, possibly with predefined placeholders for Parameters.
An example of such a message parsing process is shown below:
+-------------+
|[GeneratorId]|
|0x2C (ME)    |
+------+------+
       |
+------v---------+
|[SensorNumber]  |
|0x17 (ME Health)|
+------+---------+
       |
+------v---------+
|[EventType]     |
|0x00 (FW Status)|
+------+---------+
       |
+------v-------------------+
|[EventData[0]]            |        +----------------------------------------------+
|0x0A (FlashWearoutWarning)+------+ |ParsedEvent|                                  |
+------+-------------------+      | +-----------+                                  |
       |                          +---->'EventId' = FlashWearoutWarning            |
+------v----------+               +---->'Parameters' = [ toDecimal(EventData[1]) ] |
|[EventData[1]]   |               | |                                              |
|0x## (Percentage)+---------------+ +----------------------------------------------+
+-----------------+
The determined ParsedEvent can then be passed to a lookup mechanism, which
contains human-readable information for each EventId:
+------------------------------------------------+
|+------------------------------------------------+
||+------------------------------------------------+
||| EventId: FlashWearoutWarning                   |
||| Severity: Warning                              |
||| Resolution: No immediate repair action needed  |
||| Message: Warning threshold for number of flash |
|||          operations has been exceeded. Current |
|||          percentage of write operations        |
+||          capacity: %1                          |
 +|                                                |
  +------------------------------------------------+
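The lookup step described above can be sketched as a map keyed by EventId plus a placeholder substitution pass. MessageEntry, fillPlaceholders and renderMessage are hypothetical names used only for this illustration, and the sketch assumes placeholders of the form %1, %2, ... numbered from one, as in the example entry.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical registry entry mirroring the fields listed above.
struct MessageEntry
{
    std::string severity;
    std::string resolution;
    std::string message; // may contain %1, %2, ... placeholders
};

// Replaces %N with parameters[N - 1]. Higher indices are handled first so
// that the token "%1" is not substituted inside an existing "%10".
inline std::string fillPlaceholders(std::string text,
                                    const std::vector<std::string>& parameters)
{
    for (std::size_t i = parameters.size(); i > 0; --i)
    {
        const std::string token = "%" + std::to_string(i);
        std::size_t pos = 0;
        while ((pos = text.find(token, pos)) != std::string::npos)
        {
            text.replace(pos, token.size(), parameters[i - 1]);
            pos += parameters[i - 1].size(); // skip the inserted text
        }
    }
    return text;
}

// Looks up the EventId and renders its message, or returns std::nullopt
// when the registry has no entry for it.
inline std::optional<std::string>
    renderMessage(const std::map<std::string, MessageEntry>& registry,
                  const std::string& eventId,
                  const std::vector<std::string>& parameters)
{
    const auto entry = registry.find(eventId);
    if (entry == registry.end())
    {
        return std::nullopt;
    }
    return fillPlaceholders(entry->second.message, parameters);
}
```

With a registry entry whose message ends in "capacity: %1" and a ParsedEvent carrying the single Parameter "50", the rendered text ends in "capacity: 50". Severity and Resolution are carried alongside the message so that a caller can present them together.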
The proposed algorithm is delivered as part of the open-source OpenBMC project
[3]. As this software stack is built with a micro-service architecture in mind,
the implementation had to be divided into multiple parts:
- IPMI Platform Event payload unpacking ([7])
  - openbmc/intel-ipmi-oem/src/sensorcommands.cpp
  - openbmc/intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp
- Intel ME event parsing
  - openbmc/intel-ipmi-oem/src/me_to_redfish_hooks.cpp
- Detected events storage ([4])
  - systemd journal
- Human-readable message lookup ([2], [8])
  - MessageRegistry in bmcweb
  - openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp
- IPMI driver notifies intel-ipmi-oem about an incoming Platform Event (NetFn=0x4, Cmd=0x2)
  - Proper command handler in intel-ipmi-oem/src/sensorcommands.cpp is notified
- Message is forwarded to intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp as a call to sel::checkRedfishHooks
  - sel::checkRedfishHooks analyzes the data; BIOS events are handled in-place, while ME events are delegated to intel-ipmi-oem/src/me_to_redfish_hooks.cpp
- me::messageHook is called with the payload. The parsing algorithm determines the final EventId and Parameters
- me::utils::storeRedfishEvent(EventId, Parameters) is called; it stores the event securely in the system journal
Each IPMI Platform Event is parsed using the aforementioned me::messageHook
handler. The implementation of the proposed algorithm is as follows:
Based on EventType, the proper designated handler is called.
namespace me {
// Dispatches the event to the handler designated for its event type.
static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    const HealthEventType healthEventType =
        static_cast<HealthEventType>(selData.offset);
    switch (healthEventType)
    {
        case HealthEventType::FirmwareStatus:
            return fw_status::messageHook(selData, eventId, parameters);
        case HealthEventType::SmbusLinkFailure:
            return smbus_failure::messageHook(selData, eventId, parameters);
    }
    // Unknown event type: no handler is available
    return false;
}
} // namespace me
An example handler for FirmwareStatus, tailored down to the essential
distinctive use cases:
namespace fw_status {
// Maps EventData[1] to specified message.
// Declared before messageHook, which refers to it.
static const boost::container::flat_map<uint8_t, std::string>
    manufacturingError = {
        {0x00, "Generic error"},
        {0x01, "Wrong or missing VSCC table"}};

static bool messageHook(const SELData& selData, std::string& eventId,
                        std::vector<std::string>& parameters)
{
    // Maps EventData[0] to either a resolution or further action
    static const boost::container::flat_map<
        uint8_t,
        std::pair<std::string, std::optional<std::variant<utils::ParserFunc,
                                                          utils::MessageMap>>>>
        eventMap = {
            // EventData[0]=0
            // > MessageId=MERecoveryGpioForced
            {0x00, {"MERecoveryGpioForced", {}}},
            // EventData[0]=3
            // > call specific handler to determine MessageId and Parameters
            {0x03, {{}, flash_state::messageHook}},
            // EventData[0]=7
            // > MessageId=MEManufacturingError
            // > Use manufacturingError map to translate EventData[1] to string
            //   and add it to Parameters collection
            {0x07, {"MEManufacturingError", manufacturingError}},
            // EventData[0]=9
            // > MessageId=MEFirmwareException
            // > Use a function to log specified byte of payload as Parameter
            //   in chosen format. Here it stores 2-nd byte in hex format.
            {0x09, {"MEFirmwareException", utils::logByteHex<2>}}};

    return utils::genericMessageHook(eventMap, selData, eventId, parameters);
}
} // namespace fw_status
Cascading calls of functions, logging utilities and map resolutions result in
populating both std::string& eventId and std::vector<std::string>& parameters.
This data is then used to form a valid system log entry, which is stored in the
system journal.
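To illustrate how a map like the eventMap above could be consumed, here is a simplified, self-contained sketch. It is not the intel-ipmi-oem implementation: EventData, Action, EventMap, hexParameter, genericHook and exampleMap are hypothetical stand-ins, std::map replaces the Boost container, the payload is reduced to two bytes, and hexParameter takes a zero-based index into that array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-ins for the project types.
using EventData = std::array<uint8_t, 2>;
using Parameters = std::vector<std::string>;
using ParserFunc =
    std::function<bool(const EventData&, std::string&, Parameters&)>;
using MessageMap = std::map<uint8_t, std::string>;
// std::monostate means "no further action, the EventId is final".
using Action = std::variant<std::monostate, ParserFunc, MessageMap>;
using EventMap = std::map<uint8_t, std::pair<std::string, Action>>;

// Stores EventData[Index] (zero-based) as a hexadecimal Parameter.
template <std::size_t Index>
bool hexParameter(const EventData& data, std::string& /*eventId*/,
                  Parameters& parameters)
{
    static_assert(Index < 2, "EventData holds two bytes");
    char text[8];
    std::snprintf(text, sizeof(text), "0x%02X",
                  static_cast<unsigned>(data[Index]));
    parameters.emplace_back(text);
    return true;
}

// Selects an entry by EventData[0], then applies the attached action.
inline bool genericHook(const EventMap& eventMap, const EventData& data,
                        std::string& eventId, Parameters& parameters)
{
    const auto entry = eventMap.find(data[0]);
    if (entry == eventMap.end())
    {
        return false; // unknown event
    }
    const auto& [id, action] = entry->second;
    eventId = id;
    if (const auto* parser = std::get_if<ParserFunc>(&action))
    {
        // Delegate: the parser may refine eventId and add Parameters.
        return (*parser)(data, eventId, parameters);
    }
    if (const auto* names = std::get_if<MessageMap>(&action))
    {
        // Translate EventData[1] to a string Parameter.
        const auto name = names->find(data[1]);
        if (name == names->end())
        {
            return false;
        }
        parameters.push_back(name->second);
    }
    return true;
}

// Example map with three of the cases discussed above.
inline EventMap exampleMap()
{
    const MessageMap manufacturingError = {
        {0x00, "Generic error"}, {0x01, "Wrong or missing VSCC table"}};
    return {
        {0x00, {"MERecoveryGpioForced", Action{}}},
        {0x07, {"MEManufacturingError", Action{manufacturingError}}},
        {0x09, {"MEFirmwareException", Action{ParserFunc{hexParameter<1>}}}},
    };
}
```

Using a variant for the action keeps each table row declarative: a row either names the final event, points at a translation table, or hands the payload to a more specialized parser, which matches the "series of filters" idea from the design section.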
Event data is accessible as Redfish resources in two places:
- MessageRegistry - stores all event 'metadata' (severity, resolution notes, messageId)
- EventLog - lists all detected events in the system in processed, human-readable form
The implementation of the bmcweb MessageRegistry contents can be found at
openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp.
Intel-specific events have a proper prefix in MessageId: either 'BIOS' or 'ME'.
The registry can be read by the user by calling GET on the Redfish resource
/redfish/v1/Registries/OpenBMC/OpenBMC. It contains a JSON array of entries in
standard Redfish format, like so:
"MEFlashWearOutWarning": {
    "Description": "Indicates that Intel ME has reached certain threshold of flash write operations.",
    "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: %1",
    "NumberOfArgs": 1,
    "ParamTypes": [
        "number"
    ],
    "Resolution": "No immediate repair action needed.",
    "Severity": "Warning"
}
The system-wide EventLog is implemented in bmcweb at
openbmc/bmcweb/redfish-core/lib/log_services.hpp.
It can be read by the user by calling GET on the Redfish resource
/redfish/v1/Systems/system/LogServices/EventLog. It contains a JSON array of log
entries in standard Redfish format, like so:
{
    "@odata.id": "/redfish/v1/Systems/system/LogServices/EventLog/Entries/37331",
    "@odata.type": "#LogEntry.v1_4_0.LogEntry",
    "Created": "1970-01-01T10:22:11+00:00",
    "EntryType": "Event",
    "Id": "37331",
    "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: 50",
    "MessageArgs": ["50"],
    "MessageId": "OpenBMC.0.1.MEFlashWearOutWarning",
    "Name": "System Event Log Entry",
    "Severity": "Warning"
}