Edit updates #50

Merged
merged 2 commits into from
Apr 5, 2024
2 changes: 1 addition & 1 deletion reri_contributors.adoc
@@ -3,4 +3,4 @@
This RISC-V specification has been contributed to directly or indirectly by (in alphabetical order):

[%hardbreaks]
-Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Nicasio Canino, Petar Radojkovic, Shubu Mukherjee, Vedvyas Shanbhogue, Xiaohan Ma
+Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Mark Hill, Nicasio Canino, Paul Donahue, Petar Radojkovic, Shubu Mukherjee, Vedvyas Shanbhogue, Xiaohan Ma
70 changes: 34 additions & 36 deletions reri_err_reporting.adoc
@@ -1,6 +1,6 @@
== Error Reporting

-Components (e.g., a RISC-V hart, a memory controller, etc.) in a system that
+Components, such as a RISC-V hart or a memory controller, in a system that
support error detection may implement one or more banks of error records. Each
error bank may implement one or more error records. Each error record
corresponds to one or more hardware units of the component and reports errors
@@ -301,15 +301,15 @@ is recommended that implementations continue performing error correction even
when error reporting is disabled.

It is recommended that a hardware component continue to produce error detection
-and correction codes on data generated by or stored in the hardware component even
-when error reporting is disabled. It is recommended hardware components
+and correction codes on data generated by or stored in the hardware component
+even when error reporting is disabled. It is recommended hardware components
continue to use containment techniques like data poisoning even when error
reporting is disabled.
====

-The `ces`, `ueds`, and `uecs` are WARL fields used to enable signaling of CE, UED,
-and UEC respectively when they are logged (i.e. when `else` is 1). Enables for
-unsupported classes of errors may be hardwired to 0. The encodings of these
+The `ces`, `ueds`, and `uecs` are WARL fields used to enable signaling of CE,
+UED, and UEC respectively when they are logged (i.e. when `else` is 1). Enables
+for unsupported classes of errors may be hardwired to 0. The encodings of these
fields are specified in <<ERR_SIG_ENABLES>>.

[[ERR_SIG_ENABLES]]
@@ -333,8 +333,8 @@ information carried by the signal is `UNSPECIFIED` by this specification.
====
The error signaling enables typically default to 0 - disabled - at reset to allow
a RAS handler an opportunity to initialize itself for handling RAS signals and to
-initialize the hardware units that generate the RAS signals before error reporting
-is enabled.
+initialize the hardware units that generate the RAS signals before error
+reporting is enabled.

The signal generated by the error record may in addition to causing an
interrupt/event notification be also used to carry additional information to aid
@@ -383,8 +383,8 @@ the `rdip` (read-in-progress) bit of the associated `status_i` register to be
set. The `srdp` field always returns 0 on read. The `rdip` field in the
`status_i` register is set to 1 by hardware when an error is recorded in an
invalid error record causing the `v` field to change from 0 to 1. The `rdip`
-field is cleared to 0 by hardware when a new error updates any field of a valid (`v=1`)
-error record.
+field is cleared to 0 by hardware when a new error updates any field of a valid
+(`v=1`) error record.

The status-register-invalidate (`sinv`) bit, when written with a value of 1,
causes the `v` (valid) field of the associated `status_i` register to be
@@ -397,21 +397,19 @@ while reading of the error record is in progress. If the `sinv` and `srdp` are
both written to 1 together then the `rdip` bit is set and the `v` bit is cleared
to 0.

-<<<
-
[NOTE]
====
Software may determine if the error record was read atomically by first reading
the registers of the error record, then clearing the valid in `status_i` by
writing 1 to `control_i.sinv` and then reading the `status_i` register again to
determine if the `v` field was cleared to 0. If the `v` field is still 1 but
-the `rdip` field is 0 then it is indicative of an overwrite that may have occurred
-during the process of reading the error record. If the `v` field is 1 and the
-`rdip` is also 1 then it indicates a new error was recorded after the `v` field
-was cleared; but the read of the error record to collect the previous error was
-atomic. If an overwrite occurred during the process of reading the error record
-then the process may be repeated, after setting the `rdip` field, to read the
-latest reported error.
+the `rdip` field is 0 then it is indicative of an overwrite that may have
+occurred during the process of reading the error record. If the `v` field is 1
+and the `rdip` is also 1 then it indicates a new error was recorded after the
+`v` field was cleared; but the read of the error record to collect the previous
+error was atomic. If an overwrite occurred during the process of reading the
+error record then the process may be repeated, after setting the `rdip` field,
+to read the latest reported error.
====
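
The read procedure in the note above can be sketched in C against a small software model of one error record. This is an illustrative sketch, not part of the specification: the `err_record_model` type, the `payload` field, and all function names are hypothetical, and the hardware model assumes every new error updates a valid record (it ignores the overwrite rules). Only the behavior of `v`, `rdip`, `sinv`, and `srdp` is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one error record. Only v, rdip, sinv, and
 * srdp come from the specification text; everything else is illustrative. */
typedef struct {
    bool v;            /* status_i.v */
    bool rdip;         /* status_i.rdip */
    uint64_t payload;  /* stand-in for the remaining record registers */
} err_record_model;

/* Model of hardware logging an error. Simplification: every new error is
 * assumed to update a valid record (the overwrite rules are ignored). */
static void hw_log_error(err_record_model *r, uint64_t payload)
{
    assert(r != NULL);
    if (!r->v) {
        r->v = true;
        r->rdip = true;   /* error recorded in an invalid record */
    } else {
        r->rdip = false;  /* a valid record was updated */
    }
    r->payload = payload;
}

/* Model of a software write to control_i with the given sinv/srdp values. */
static void sw_write_control(err_record_model *r, bool sinv, bool srdp)
{
    assert(r != NULL);
    if (srdp)
        r->rdip = true;
    if (sinv && r->rdip)  /* clearing v is gated by rdip being 1 */
        r->v = false;
}

typedef enum {
    READ_ATOMIC,            /* v == 0: record was read and invalidated */
    READ_ATOMIC_NEW_ERROR,  /* v == 1, rdip == 1: new error after v cleared */
    READ_OVERWRITTEN        /* v == 1, rdip == 0: record may have changed */
} read_outcome;

/* Classify status_i as read back after writing 1 to control_i.sinv. */
static read_outcome classify_after_sinv(const err_record_model *r)
{
    assert(r != NULL);
    if (!r->v)
        return READ_ATOMIC;
    return r->rdip ? READ_ATOMIC_NEW_ERROR : READ_OVERWRITTEN;
}
```

On `READ_OVERWRITTEN`, software would set `rdip` again (here `sw_write_control(r, false, true)`), re-read the record registers, and repeat the invalidate-and-check step.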

The error-injection-delay (`eid`) is a WARL field used to control error record
@@ -490,8 +488,8 @@ hardwired to 0. If the bits corresponding to more than one error class are set
to 1 then the error record holds information about the highest severity error
class among the bits set. The error record may be used to provide an
informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and
-`uec` bits to 0. Such informational updates are lower severity than a CE but are signaled using the signal
-configured in `control_i.ces`.
+`uec` bits to 0. Such informational updates are lower severity than a CE but are
+signaled using the signal configured in `control_i.ces`.

When `v` is 1, if more errors of the same class as the error currently logged in
the error record occur then the multiple-occurrence (`mo`) bit is set to indicate
@@ -635,8 +633,8 @@ a fabric component to implicitly access a routing table data structure.
<<<

If the detected error reports additional information in the `info_i` register
-then the information-valid (`iv`) field is set to 1. If the detected error reports
-additional supplemental information in the `suppl_info_i` register then
+then the information-valid (`iv`) field is set to 1. If the detected error
+reports additional supplemental information in the `suppl_info_i` register then
supplemental-information-valid (`siv`) field is set to 1. The `iv` and/or `siv`
fields may be hardwired to 0 if the error record does not provide information in
`info_i` and/or `suppl_info_i` registers. When `iv` is 0, the value in `info_i`
@@ -668,10 +666,10 @@ differentiate them from the standard encodings.

The read-in-progress (`rdip`) field is set to 1 by hardware when a new error is
recorded in an invalid status register and is cleared to 0 by hardware when a
-valid status register is overwritten. When the `control_i.sinv` field is written to
-1, the `v` field is cleared to 0 only if the `rdip` field is 1. Gating the clearing
-of the `v` field by the `rdip` field being 1 allows software to detect an
-overwrite that may occur while it is in process of reading an error record.
+valid status register is overwritten. When the `control_i.sinv` field is written
+to 1, the `v` field is cleared to 0 only if the `rdip` field is 1. Gating the
+clearing of the `v` field by the `rdip` field being 1 allows software to detect
+an overwrite that may occur while it is in process of reading an error record.

An error record that supports the 1 setting of the `cece` field in `control_i`,
implements a corrected-error-counter in the `cec` field. The `cec` is a WARL
@@ -711,8 +709,8 @@ writing a new error into the record and setting the `v` field to 1, then softwar
should repeat this process.
====

-When an UEC or UED error is logged in an error record, the `cec` and `ceco` fields
-of the error record are not modified and retain their values.
+When an UEC or UED error is logged in an error record, the `cec` and `ceco`
+fields of the error record are not modified and retain their values.

==== Address-or-Information Register (`addr_info_i`)

@@ -796,13 +794,13 @@ detected but unprocessed error, the decision to overwrite the error record with
new error information is determined by the new error's severity and/or priority.

The overwrite rules allow a higher severity error to overwrite a lower severity
-error. UEC has the highest severity, followed by UED, then CE, and finally, informational. When the two
-errors have the same severity the priority of the errors (as determined by
-`status_i.pri`) is used to determine if the error record is overwritten. Higher
-priority errors overwrite the lower priority errors. When an error record is
-overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by
-UEC/UED), the status bits indicating the severity of the older errors are
-retained (i.e., are sticky).
+error. UEC has the highest severity, followed by UED, then CE, and finally,
+informational. When the two errors have the same severity the priority of the
+errors (as determined by `status_i.pri`) is used to determine if the error
+record is overwritten. Higher priority errors overwrite the lower priority
+errors. When an error record is overwritten by a higher severity error (UED/CE
+by UEC, UED by UEC, or CE by UEC/UED), the status bits indicating the severity
+of the older errors are retained (i.e., are sticky).
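
The overwrite rules in this paragraph can be sketched as a small C decision function. This is a hedged illustration under stated assumptions, not the specification's algorithm: the `status_model` type and function names are hypothetical, a numerically larger `pri` is assumed to mean higher priority, and an error of equal severity and equal priority is assumed not to overwrite. The excerpt does not state either point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Severity order from the text: informational < CE < UED < UEC. */
typedef enum { SEV_INFO = 0, SEV_CE = 1, SEV_UED = 2, SEV_UEC = 3 } severity;

/* Hypothetical model of the status_i fields used in the overwrite decision. */
typedef struct {
    bool v, ce, ued, uec;
    unsigned pri;  /* assumption: larger value means higher priority */
} status_model;

/* Highest severity class among the bits currently set in the record. */
static severity logged_severity(const status_model *s)
{
    assert(s != NULL);
    if (s->uec) return SEV_UEC;
    if (s->ued) return SEV_UED;
    if (s->ce)  return SEV_CE;
    return SEV_INFO;
}

/* Decide whether a new error may write or overwrite the record. */
static bool should_overwrite(const status_model *s, severity sev, unsigned pri)
{
    assert(s != NULL);
    if (!s->v)
        return true;                 /* invalid record: always written */
    severity cur = logged_severity(s);
    if (sev != cur)
        return sev > cur;            /* higher severity wins */
    return pri > s->pri;             /* assumption: ties do not overwrite */
}

/* Apply the decision; older severity bits stay set (sticky). */
static bool log_error(status_model *s, severity sev, unsigned pri)
{
    assert(s != NULL);
    if (!should_overwrite(s, sev, pri))
        return false;
    if (!s->v)
        s->ce = s->ued = s->uec = false;  /* fresh record in this model */
    s->v = true;
    s->pri = pri;
    if (sev == SEV_CE)  s->ce = true;
    if (sev == SEV_UED) s->ued = true;
    if (sev == SEV_UEC) s->uec = true;
    return true;
}
```

For example, logging a CE and then a UED leaves both `ce` and `ued` set, with `logged_severity` reporting UED.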

When an error writes or overwrites an error record, the `status_i.cec` and
`status_i.ceco` fields update from CEs and retain value for errors of other
4 changes: 2 additions & 2 deletions reri_intro.adoc
@@ -9,11 +9,11 @@ interface to enable error reporting, provide the facility to log the detected
errors (including their severity, nature, and location), and configuring means
to signal the error to a RAS handler component. The RAS handler may use this
information to determine suitable recovery actions that may include terminating
-the computation (e.g., terminating a process, etc.), restarting parts or all of
+the computation (e.g., terminating a process), restarting parts or all of
the system, etc. to recover from the errors. Additionally, this specification
shall support software-initiated error logging, reporting, and testing of RAS
handlers. Lastly, this specification shall provide maximal flexibility to
-implement error handling and shall co-exist with RAS frameworks defined by other
+implement error handling and coexists with RAS frameworks defined by other
standards such as PCIe cite:[PCI] and CXL cite:[CXL].

A system is an entity that interacts with other entities such as other systems,