diff --git a/reri_contributors.adoc b/reri_contributors.adoc index 455c443..d9e92c9 100644 --- a/reri_contributors.adoc +++ b/reri_contributors.adoc @@ -3,4 +3,4 @@ This RISC-V specification has been contributed to directly or indirectly by (in alphabetical order): [%hardbreaks] -Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Nicasio Canino, Petar Radojkovic, Shubu Mukherjee, Vedvyas Shanbhogue, Xiaohan Ma +Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Mark Hill, Nicasio Canino, Paul Donahue, Petar Radojkovic, Shubu Mukherjee, Vedvyas Shanbhogue, Xiaohan Ma diff --git a/reri_err_reporting.adoc b/reri_err_reporting.adoc index a275578..210c02e 100644 --- a/reri_err_reporting.adoc +++ b/reri_err_reporting.adoc @@ -1,6 +1,6 @@ == Error Reporting -Components (e.g., a RISC-V hart, a memory controller, etc.) in a system that +Components, such as a RISC-V hart or a memory controller, in a system that support error detection may implement one or more banks of error records. Each error bank may implement one or more error records. Each error record corresponds to one or more hardware units of the component and reports errors @@ -301,15 +301,15 @@ is recommended that implementations continue performing error correction even when error reporting is disabled. It is recommended that a hardware component continue to produce error detection -and correction codes on data generated by or stored in the hardware component even -when error reporting is disabled. It is recommended hardware components +and correction codes on data generated by or stored in the hardware component +even when error reporting is disabled. 
It is recommended hardware components continue to use containment techniques like data poisoning even when error reporting is disabled. ==== -The `ces`, `ueds`, and `uecs` are WARL fields used to enable signaling of CE, UED, -and UEC respectively when they are logged (i.e. when `else` is 1). Enables for -unsupported classes of errors may be hardwired to 0. The encodings of these +The `ces`, `ueds`, and `uecs` are WARL fields used to enable signaling of CE, +UED, and UEC respectively when they are logged (i.e. when `else` is 1). Enables +for unsupported classes of errors may be hardwired to 0. The encodings of these fields are specified in <>. [[ERR_SIG_ENABLES]] @@ -333,8 +333,8 @@ information carried by the signal is `UNSPECIFIED` by this specification. ==== The error signaling enables typically default to 0 - disabled - at reset to allow a RAS handler an opportunity to initialize itself for handling RAS signals and to -initialize the hardware units that generate the RAS signals before error reporting -is enabled. +initialize the hardware units that generate the RAS signals before error +reporting is enabled. The signal generated by the error record may in addition to causing an interrupt/event notification be also used to carry additional information to aid @@ -383,8 +383,8 @@ the `rdip` (read-in-progress) bit of the associated `status_i` register to be set. The `srdp` field always returns 0 on read. The `rdip` field in the `status_i` register is set to 1 by hardware when an error is recorded in an invalid error record causing the `v` field to change from 0 to 1. The `rdip` -field is cleared to 0 by hardware when a new error updates any field of a valid (`v=1`) -error record. +field is cleared to 0 by hardware when a new error updates any field of a valid +(`v=1`) error record. 
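The `srdp`/`rdip` behavior described in the paragraph above can be sketched as a small behavioral model. This is an illustrative sketch only, not part of the specification: the class `ErrorRecordModel` and its method names are hypothetical, and it assumes every new error that arrives while `v=1` updates at least one field of the record (the overwrite rules may in practice leave the record unchanged).

```python
from dataclasses import dataclass


@dataclass
class ErrorRecordModel:
    """Toy model of the `v` and `rdip` bits of one error record (hypothetical)."""

    v: int = 0     # models status_i.v (valid)
    rdip: int = 0  # models status_i.rdip (read-in-progress)

    def hw_log_error(self) -> None:
        """Hardware records an error.

        Assumption: an error arriving while v=1 always updates some field
        of the record; real hardware applies the overwrite rules first.
        """
        if self.v == 0:
            # Error recorded in an invalid record: v changes 0 -> 1, rdip is set.
            self.v = 1
            self.rdip = 1
        else:
            # New error updates a valid record: rdip is cleared.
            self.rdip = 0

    def sw_write_srdp(self, value: int) -> None:
        """Software writes control_i.srdp; writing 1 sets rdip."""
        if value == 1:
            self.rdip = 1

    def sw_read_srdp(self) -> int:
        """control_i.srdp always returns 0 on read."""
        return 0


rec = ErrorRecordModel()
rec.hw_log_error()    # first error: v=1, rdip=1
rec.hw_log_error()    # second error updates the valid record: rdip=0
rec.sw_write_srdp(1)  # software sets rdip again before re-reading the record
```

In this model, a cleared `rdip` on a valid record is what lets software notice that the record changed after it was first logged.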
The status-register-invalidate (`sinv`) bit, when written with a value of 1, causes the `v` (valid) field of the associated `status_i` register to be @@ -397,21 +397,19 @@ while reading of the error record is in progress. If the `sinv` and `srdp` are both written to 1 together then the `rdip` bit is set and the `v` bit is cleared to 0. -<<< - [NOTE] ==== Software may determine if the error record was read atomically by first reading the registers of the error record, then clearing the valid in `status_i` by writing 1 to `control_i.sinv` and then reading the `status_i` register again to determine if the `v` field was cleared to 0. If the `v` field is still 1 but -the `rdip` field is 0 then it is indicative of an overwrite that may have occurred -during the process of reading the error record. If the `v` field is 1 and the -`rdip` is also 1 then it indicates a new error was recorded after the `v` field -was cleared; but the read of the error record to collect the previous error was -atomic. If an overwrite occurred during the process of reading the error record -then the process may be repeated, after setting the `rdip` field, to read the -latest reported error. +the `rdip` field is 0 then it is indicative of an overwrite that may have +occurred during the process of reading the error record. If the `v` field is 1 +and the `rdip` is also 1 then it indicates a new error was recorded after the +`v` field was cleared; but the read of the error record to collect the previous +error was atomic. If an overwrite occurred during the process of reading the +error record then the process may be repeated, after setting the `rdip` field, +to read the latest reported error. ==== The error-injection-delay (`eid`) is a WARL field used to control error record @@ -490,8 +488,8 @@ hardwired to 0. If the bits corresponding to more than one error class are set to 1 then the error record holds information about the highest severity error class among the bits set. 
The error record may be used to provide an informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and -`uec` bits to 0. Such informational updates are lower severity than a CE but are signaled using the signal -configured in `control_i.ces`. +`uec` bits to 0. Such informational updates are lower severity than a CE but are +signaled using the signal configured in `control_i.ces`. When `v` is 1, if more errors of the same class as the error currently logged in the error record occur then the multiple-occurrence (`mo`) bit is set to indicate @@ -635,8 +633,8 @@ a fabric component to implicitly access a routing table data structure. <<< If the detected error reports additional information in the `info_i` register -then the information-valid (`iv`) field is set to 1. If the detected error reports -additional supplemental information in the `suppl_info_i` register then +then the information-valid (`iv`) field is set to 1. If the detected error +reports additional supplemental information in the `suppl_info_i` register then supplemental-information-valid (`siv`) field is set to 1. The `iv` and/or `siv` fields may be hardwired to 0 if the error record does not provide information in `info_i` and/or `suppl_info_i` registers. When `iv` is 0, the value in `info_i` @@ -668,10 +666,10 @@ differentiate them from the standard encodings. The read-in-progress (`rdip`) field is set to 1 by hardware when a new error is recorded in an invalid status register and is cleared to 0 by hardware when a -valid status register is overwritten. When the `control_i.sinv` field is written to -1, the `v` field is cleared to 0 only if the `rdip` field is 1. Gating the clearing -of the `v` field by the `rdip` field being 1 allows software to detect an -overwrite that may occur while it is in process of reading an error record. +valid status register is overwritten. When the `control_i.sinv` field is written +to 1, the `v` field is cleared to 0 only if the `rdip` field is 1. 
Gating the +clearing of the `v` field by the `rdip` field being 1 allows software to detect +an overwrite that may occur while it is in process of reading an error record. An error record that supports the 1 setting of the `cece` field in `control_i`, implements a corrected-error-counter in the `cec` field. The `cec` is a WARL @@ -711,8 +709,8 @@ writing a new error into the record and setting the `v` field to 1, then softwar should repeat this process. ==== -When an UEC or UED error is logged in an error record, the `cec` and `ceco` fields -of the error record are not modified and retain their values. +When an UEC or UED error is logged in an error record, the `cec` and `ceco` +fields of the error record are not modified and retain their values. ==== Address-or-Information Register (`addr_info_i`) @@ -796,13 +794,13 @@ detected but unprocessed error, the decision to overwrite the error record with new error information is determined by the new error's severity and/or priority. The overwrite rules allow a higher severity error to overwrite a lower severity -error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two -errors have the same severity the priority of the errors (as determined by -`status_i.pri`) is used to determine if the error record is overwritten. Higher -priority errors overwrite the lower priority errors. When an error record is -overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by -UEC/UED), the status bits indicating the severity of the older errors are -retained (i.e., are sticky). +error. UEC has the highest severity, followed by UED, then CE, and finally, +informational. When the two errors have the same severity the priority of the +errors (as determined by `status_i.pri`) is used to determine if the error +record is overwritten. Higher priority errors overwrite the lower priority +errors. 
When an error record is overwritten by a higher severity error (UED/CE +by UEC, UED by UEC, or CE by UEC/UED), the status bits indicating the severity +of the older errors are retained (i.e., are sticky). When an error writes or overwrites an error record, the `status_i.cec` and `status_i.ceco` fields update from CEs and retain value for errors of other diff --git a/reri_intro.adoc b/reri_intro.adoc index 5d19ed7..b59cb57 100644 --- a/reri_intro.adoc +++ b/reri_intro.adoc @@ -9,11 +9,11 @@ interface to enable error reporting, provide the facility to log the detected errors (including their severity, nature, and location), and configuring means to signal the error to a RAS handler component. The RAS handler may use this information to determine suitable recovery actions that may include terminating -the computation (e.g., terminating a process, etc.), restarting parts or all of +the computation (e.g., terminating a process), restarting parts or all of the system, etc. to recover from the errors. Additionally, this specification shall support software-initiated error logging, reporting, and testing of RAS handlers. Lastly, this specification shall provide maximal flexibility to -implement error handling and shall co-exist with RAS frameworks defined by other +implement error handling and coexists with RAS frameworks defined by other standards such as PCIe cite:[PCI] and CXL cite:[CXL]. A system is an entity that interacts with other entities such as other systems,