Create Linux_Memory_Management_Essentials.md #38

Status: Open. Wants to merge 23 commits into base: main.
Conversation

igor-stoppa (Collaborator) commented:
Description of internal memory management in Linux, from safety perspective.

igor-stoppa and others added 2 commits July 12, 2024 21:31
Co-authored-by: Paul Albertella <[email protected]>
Co-authored-by: Daniel Weingaertner <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
1. Because of the way pages, and fractions and multiples of them, are allocated, freed, cached, and recovered, there is a complex interaction between system components at various layers.
2. Even using cgroups, it is not possible to segregate the low-level interaction between components with different levels of safety qualification (e.g. a QM container can, and most likely will, affect the recirculation of pages related to an ASIL one).
3. Because of the nature of memory management, it must be expected that memory management mechanisms will interfere with safe processes, either due to a bug or due to interference with the metadata they rely on. For example, the memory management might hand over to a requesting entity a memory page that is currently already in use, either by a device driver or by a userspace process playing a role in a safety use case.
4. Still due to the complex interaction between processes, kernel drivers and other kernel code, it is practically impossible to qualify the kernel as safe through positive testing alone, because it is impossible to validate all the possible combinations, and it is equally impossible to assess the overall test coverage and the risk associated with not reaching 100%. The only reliable way to test is to use negative testing (simulating a certain type of interference) and confirming that the system response is consistent with expectations (e.g. detecting the interference, in the case of ASIL B requirements). And even then, the only credible claim that can be made is that, given the simulated type of interference, on the type of target employed, the reaction is aligned with the requirements. Other types of targets will require further ad-hoc negative testing.
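The negative-testing idea in point 4 can be sketched in miniature: deliberately inject an interference (a rogue write into data owned by a safety-relevant component) and confirm that it is detected, rather than trying to show through positive tests alone that such interference never happens. This is an illustrative sketch only; the `ProtectedRegion` class and the CRC-based integrity check are invented for the example and are not part of the proposal.

```python
# Minimal sketch of a negative test for interference detection.
# All names are illustrative, not from the kernel or the PR.
import zlib

class ProtectedRegion:
    """Safety-relevant data guarded by a CRC32 integrity check."""
    def __init__(self, payload: bytes):
        self.payload = bytearray(payload)
        self.crc = zlib.crc32(self.payload)

    def is_intact(self) -> bool:
        # Detection mechanism: recompute and compare the checksum.
        return zlib.crc32(self.payload) == self.crc

def inject_rogue_write(region: ProtectedRegion, offset: int, value: int):
    """Simulated interference: a stray write bypassing the data's owner."""
    region.payload[offset] = value

region = ProtectedRegion(b"ASIL-relevant state")
assert region.is_intact()            # positive test: nominal behaviour
inject_rogue_write(region, 3, 0xFF)  # negative test: simulate interference
assert not region.is_intact()        # the detection mechanism must trigger
```

The claim verified here is deliberately narrow, matching the text above: for this simulated type of interference, on this target, the reaction matches expectations; nothing more.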
@paolonig (Contributor) commented, Jul 18, 2024:
There are a few points that are not clear above:

  1. Why are you associating positive testing with "all the possible combinations"? (The expectation is to cover the appropriate equivalence classes.)
  2. Your definition of negative testing should be further contextualised. For example, I could interpret negative testing as testing the behaviour of kmalloc in case of error (e.g. when no memory is available).

In general I do not agree with your statement, which mixes the MM systematic capability claim and the interference from other kernel components. I also do not like the fact that you impose runtime safety measures as the only way to claim FFI within the kernel.

What I would say is that:
a) the memory management subsystem is very complex, and defining a set of FuSa qualification activities (including comprehensive testing) to achieve a target systematic capability is very costly;
b) even if a systematic capability claim is achieved for the memory management, it does not cover interference from other components of the kernel. In order to claim freedom from such interferences, dedicated runtime safety measures and/or dedicated verification activities shall be defined. The runtime safety measures shall be verified by injecting interference modes and testing their effectiveness.
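The alternative reading of "negative testing" mentioned in point 2 above, exercising an allocator's error path, can also be sketched briefly. The kernel analogue would be `kmalloc()` returning `NULL` under memory pressure; the Python names below are invented purely for illustration.

```python
# Sketch of error-path ("negative") testing of an allocation API.
# Hypothetical names; the kernel analogue is kmalloc() returning NULL.

class AllocationError(Exception):
    pass

def make_buffer(size: int, alloc=bytearray):
    """Caller that must degrade gracefully when allocation fails."""
    try:
        return alloc(size)
    except MemoryError:
        # Error path under test: fail deliberately, not silently.
        raise AllocationError(f"could not allocate {size} bytes") from None

def failing_alloc(size: int):
    """Fault injection: an allocator that always reports exhaustion."""
    raise MemoryError

# positive test: nominal allocation succeeds
assert len(make_buffer(16)) == 16
# negative test: the error path is actually reached and handled
try:
    make_buffer(16, alloc=failing_alloc)
    raise AssertionError("error path not exercised")
except AllocationError:
    pass
```

Passing the injected allocator as a parameter is one simple way to reach an error path that almost never triggers naturally on a well-provisioned test machine.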

igor-stoppa (Collaborator, Author) replied:

  1. Equivalence classes tend to be subjective and prone to errors. If you go down the path of equivalence classes, you have to prove both the correctness and the completeness of those classes, which seems intractable.
  2. If I am talking about interference, isn't it clear that it refers to whatever interference is being analysed?

I'm sorry you do not like my statement about runtime FFI. Perhaps you can prove that other approaches will yield the same results?

Regarding "costly", I'm not sure that is the right word, because it implies that one might achieve the result by spending enough. But even if money were not a concern, time would make it impossible.

And I do not believe that verification activity will be enough, because you always have the problem of proving completeness. If you cannot say what has been left out, how can you say that what was done is sufficient?

A Collaborator commented:

I think the main substance of this point can be made without some of the potentially contentious references. What about:

Suggested change (to point 4 above):

4. These complex interactions between processes, kernel drivers and other kernel code mean that it is practically impossible to qualify the kernel through positive testing alone, because the specification task and the set of required test cases would be enormous.
1. We can reduce the scope by specifying a kernel version, configuration, target system, software integration, and set of application use cases, but the number of required test cases would still be very large.
2. However, the use of negative tests in combination with positive tests could make a verification-based approach viable for a specific integration, target system and use case, by specifying a set of failure modes for which the behaviour has been verified.
3. This might be achieved by e.g. simulating an identified type of interference for a range of positive test cases, and confirming that the overall system response is consistent with specified expectations (e.g. interference is detected and a mitigation is triggered).

igor-stoppa (Collaborator, Author) replied:

In this case "enormous" is not enormous enough :-D
Only negative testing provides valid evidence, because using positive testing alone is equivalent to claiming that the code is devoid of a class of bugs (interference bugs).

With negative testing, it is easy to verify whatever claim might be made. Without it, there is a completeness problem.

Whoever thinks they can prove completeness (or even give a verifiable measurement of confidence, like "reliable at 95% because of X, Y and Z") is welcome to come forward. Until then, the completeness problem remains unsolved.

igor-stoppa (Collaborator, Author) commented:

@reiterative if you want to have another stab at it, while keeping the points I listed in the previous comment, we might be close to a resolution.

A Collaborator commented:

Second attempt:

Suggested change (to point 4 above):

4. These complex interactions between processes, kernel drivers and other kernel code mean that it is practically impossible to qualify the kernel through positive testing alone.
1. Specifying requirements and implementing a credible set of tests to cover all of the kernel's functions for the general case is certainly infeasible (and arguably impossible), because the range of potential applications and integrations for Linux is too broad.
2. We can constrain this scope by specifying a kernel version and configuration, for a given set of target systems and software integrations, and by specifying the set of functions the kernel is intended to provide. However, this would still not be sufficient to assert that the kernel is devoid of certain classes of bug (e.g. bugs caused by interference).
3. Negative tests derived from credible analysis of the kernel could be used to address this, by verifying its behaviour (and/or mitigations provided by other components of a target system) for a documented set of failure modes.
4. This might be achieved, for example, by simulating an identified type of interference for a range of positive test cases, and confirming that the overall integrated system's response is consistent with a specified set of expectations (e.g. the interference is detected and a kernel- or system-level mitigation is triggered).
5. This, in combination with requirements-based functional testing, could be a viable approach for qualifying a specific integration and configuration of Linux, for a given set of target systems and use cases.

igor-stoppa (Collaborator, Author) replied:

The whole point is to say that it must be possible to confirm that the claim is correct.

Given the nature of the problem (one or more classes of bugs), it is not realistic to say that positive testing can cover them all, period; it is the same as claiming that the code is bug-free for those classes.

The system subject to testing can also comprise components external to the kernel, but there needs to be something that can prove kernel errors will be detected, or else we are saying that we think it is possible to completely remove certain classes of errors from the kernel, which would be preposterous.

A Collaborator commented:

Third attempt:

Suggested change (to point 4 above):

4. These complex interactions between processes, kernel drivers and other kernel code mean that it is not feasible to qualify the kernel through positive, requirements-based testing alone.
1. Safety-related claims made for the kernel must be specified and verified with a credible set of tests, which must necessarily include negative tests to demonstrate the effectiveness of any detection, mitigation and exception-handling mechanisms provided in support of these claims.
2. Credible analysis of the kernel could be used to devise a suitable set of tests for this, to verify the kernel's behaviour (and/or mitigations provided by other components of a target system) for a documented set of failure modes.
3. Specifying requirements and implementing a credible set of tests to cover all of the kernel's functions for the general case is certainly infeasible (and arguably impossible), because the range of potential applications and integrations for Linux is too broad.
4. Demonstrating that claims are supported with a combination of analysis and testing does not allow us to argue that the kernel is devoid of a certain class of bug, only that our verification mechanisms are sufficient to detect bugs affecting the specified claims, and for the identified failure modes.
5. Claims must therefore be verified for each iteration of the kernel intended for use in a safety-related context, and for the specific configuration and system integration of a target context; the sufficiency of the set of failure modes considered, and of the analysis informing these, must also be confirmed for the target context.

The same Collaborator added:

Note that, while I'm happy to keep iterating on this, it might also be prudent to consider doing as Daniel suggested and land the safety considerations as a distinct set of changes from the (less contentious) technical analysis...

2. some limited in extent, but hard or even practically impossible to detect, like a rogue write to a process's physical memory
3. some of a systemic nature, like some form of use-after-free, where a process page is accidentally in use also by another component
4. some of an indirect nature, for example when the page table of the process address space is somehow corrupted
4. again, because of the extremely complex nature of the system, positive testing is not sufficient, but it needs to
@paolonig (Contributor) commented:

Above I would say "if a non-interference claim cannot be made, due to the extremely complex ...."

igor-stoppa (Collaborator, Author) replied:

I intentionally made a stronger statement.
Also to make it easier to prove me wrong :-)

@dweingaertner (Collaborator) left a comment:

My general comment is that we should separate out what is knowledge about the kernel memory management system, providing it with reference anchors, and discuss the safety implications separately, as these are not of the same nature.

Contributions/Linux_Memory_Management_Essentials.md: five review threads marked outdated and resolved.
9. All of the mechanisms described above for memory management are not only memory providers, but memory users as well, because they rely on metadata that cannot be pre-allocated and must be adjusted according to the memory transactions happening as the system evolves over time; therefore they also consume memory, generating an overhead, and are themselves exposed to interference.
10. The Linux kernel provides means to limit certain requests a process might make; for example, with cgroups it is possible to create logical memory "bubbles" that cannot grow beyond a set size, and to associate processes with them, which then share the collective limit within the bubble. But this does not do much toward separating how the kernel might use the underlying memory, beyond setting the constraint as described.
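Concretely, the "bubble" mechanism in point 10 boils down to a couple of writes under the cgroup v2 filesystem (assuming it is mounted at the usual `/sys/fs/cgroup`): a byte limit into the group's `memory.max`, and member PIDs into its `cgroup.procs`. The sketch below only composes the writes rather than performing them, since actually applying them requires root; the group name and PIDs are invented for the example.

```python
# Illustrative sketch of the cgroup v2 "memory bubble" setup described
# above. Composes the (path, value) writes without performing them.

def cgroup_memory_limit(name: str, limit_bytes: int, pids):
    """Return the filesystem writes that would cap memory for a cgroup
    and place the given processes inside it (cgroup v2 interface)."""
    base = f"/sys/fs/cgroup/{name}"
    writes = [(f"{base}/memory.max", str(limit_bytes))]
    writes += [(f"{base}/cgroup.procs", str(pid)) for pid in pids]
    return writes

# hypothetical group and PIDs, capped at 256 MiB
writes = cgroup_memory_limit("asil_bubble", 256 * 1024 * 1024, [1234, 1235])
# writes[0] == ("/sys/fs/cgroup/asil_bubble/memory.max", "268435456")
```

Note that, exactly as point 10 says, this limits what the processes in the bubble may request; it does not constrain which underlying pages the kernel uses to satisfy those requests.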

#### **Safety-Oriented consideration**
A Collaborator commented:

Up to this point, you give an objective explanation of the way memory management works in Linux. One could say these are the premises on which we can build our deductive reasoning about the MM in Linux. Perhaps we should number them in order to be able to refer to them later, as I think we will have to get a bit more formal if we want to proceed.

I would suggest this new chapter be separated from the Essentials, as the safety considerations you make are intended to be deductive, and thus must be based on premises understood and agreed to be true.

I do not think we will all agree on the conclusions below without going the painful way of detailing every premise. And I think this is a good exercise we should do, because it will force us to think of less complex assertions, easier to prove, and then build the argument up, until we can eventually state something like "it must be expected that memory management mechanisms will interfere with safe processes".

The idea of this working group is to help the reader address the safety of a Linux-based system. If we can build a deductive inference, then the reader might possibly choose which premises to address in order to avoid the dreaded conclusion.

igor-stoppa (Collaborator, Author) replied:

I'm open to debate on this idea, but I would prefer to first exhaust the comments received so far, by keeping this as one single doc.

reading more pages than requested, under the assumption that more requests might be coming soon


#### **Safety-Oriented consideration**
A Collaborator commented:

Same comment as above. We should separate the premises from these considerations, and explicitly state the premises each conclusion builds on. Take for example argument 1 below: uncertainty in the timing might be a problem, or might not, depending on the requirements or on the size of the uncertainty.

igor-stoppa (Collaborator, Author) replied:

Agreed to reconsider the overall posture of the document later.
My motivation is to try to keep together statements which are fairly obvious with others that might be less obvious, but derivable.
We could also consider, in general, introducing the notion of which aspects are likely to be unavoidable, and which ones might be relevant only for some use-case scenarios.

igor-stoppa and others added 8 commits September 19, 2024 15:16
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
added PXA, PXN examples

Signed-off-by: Igor Stoppa <[email protected]>
monolithic -> boot image

Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
igor-stoppa and others added 9 commits September 20, 2024 15:34
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Daniel Weingaertner <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Daniel Weingaertner <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Daniel Weingaertner <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
Co-authored-by: Paul Albertella <[email protected]>
Signed-off-by: Igor Stoppa <[email protected]>
4 participants