Alloca for Rust #1808
Closed · wants to merge 8 commits
147 changes: 147 additions & 0 deletions text/0000-alloca.md
- Feature Name: alloca
- Start Date: 2016-12-01
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

Add a builtin `fn core::mem::reserve<'a, T>(elements: usize) -> StackSlice<'a, T>` that reserves space for the given
number of elements on the stack and returns a `StackSlice<'a, T>` to it which derefs to `&'a [T]`.

# Motivation
[motivation]: #motivation

Some algorithms (e.g. sorting, regular expression search) need a one-time backing store for a number of elements only
known at runtime. Reserving space on the heap always takes a performance hit, and the resulting deallocation can
increase memory fragmentation, possibly slightly degrading allocation performance further down the road.

If Rust included this zero-cost abstraction, more of these algorithms could run at full speed – and would be available
on systems without an allocator, e.g. embedded, soft-real-time systems. The option of using a fixed slice up to a
certain size and using a heap-allocated slice otherwise (as afforded by
[SmallVec](https://crates.io/crates/smallvec)-like classes) has the drawback of decreasing memory locality if only a
small part of the fixed-size allocation is used – and even those implementations could potentially benefit from the
increased memory locality.
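The fixed-or-heap fallback described above can already be sketched without new language support. A minimal illustration, with all names mine rather than from any crate: a fixed stack buffer is used when the requested size fits, and a heap allocation otherwise.

```Rust
// Sketch of the SmallVec-style fallback: stack buffer up to a fixed
// capacity, heap allocation beyond it. Illustrative names only.
fn sum_with_scratch(n: usize) -> u64 {
    const STACK_CAP: usize = 64;
    let mut stack_buf = [0u64; STACK_CAP];
    let mut heap_buf;
    // Pick the backing store at runtime.
    let scratch: &mut [u64] = if n <= STACK_CAP {
        &mut stack_buf[..n]
    } else {
        heap_buf = vec![0u64; n];
        &mut heap_buf[..]
    };
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as u64;
    }
    scratch.iter().sum()
}

fn main() {
    assert_eq!(sum_with_scratch(4), 6);      // 0 + 1 + 2 + 3
    assert_eq!(sum_with_scratch(100), 4950); // falls back to the heap
}
```

Note that this is exactly the pattern whose locality the RFC criticizes: when `n` is small, most of the 64-element stack buffer sits unused between the two live regions of the frame.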

As a (flawed) benchmark, consider the following C program:

```C
#include <stdlib.h>

int main(int argc, char **argv) {
    int n = argc > 1 ? atoi(argv[1]) : 1;
    int x = 1;
    char foo[n]; /* variable-length array on the stack */
    foo[n - 1] = 1;
    return 0;
}
```

Running `time nice -n 20 ionice ./dynalloc 1` returns almost instantly (0.0001s), whereas `time nice -n 20 ionice
./dynalloc 200000` takes 0.033 seconds. It thus appears that merely forcing the second write further away from the
first slows the program down. (This benchmark is admittedly quite unfair: by reducing the process's priority, we
invite the kernel to schedule a different process instead, which is very probably the major cause of the slowdown
here.)

Still, even with the flaws in this benchmark,
[The Myth of RAM](http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-part-i) argues quite convincingly for the
benefits of memory frugality.


# Detailed design
[design]: #detailed-design

The standard library function can simply `panic!(..)` in its body, as calls to `reserve` will be replaced during
translation to MIR. The `StackSlice` type can be implemented as follows:

```Rust
use core::ops::Deref;

/// A slice of data on the stack
pub struct StackSlice<'a, T: 'a> {
    slice: &'a [T],
}

impl<'a, T: 'a> Deref for StackSlice<'a, T> {
    type Target = [T];

    fn deref(&self) -> &[T] {
        self.slice
    }
}
```
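The wrapper type itself compiles on today's Rust; only the `reserve` intrinsic is new. As a self-contained sketch, one can exercise the `Deref` impl by constructing a `StackSlice` from an ordinary borrow (the literal construction here is for illustration only; under the RFC, `reserve` would be the sole entry point):

```Rust
use std::ops::Deref;

/// A slice of data on the stack (as proposed above).
pub struct StackSlice<'a, T: 'a> {
    slice: &'a [T],
}

impl<'a, T: 'a> Deref for StackSlice<'a, T> {
    type Target = [T];

    fn deref(&self) -> &[T] {
        self.slice
    }
}

fn main() {
    // Illustration only: borrow a fixed-size array instead of calling
    // the proposed `reserve` intrinsic, which does not exist yet.
    let backing = [1u32, 2, 3];
    let s = StackSlice { slice: &backing };
    // Deref coercion lets the wrapper be used like a plain slice.
    assert_eq!(s.len(), 3);
    assert_eq!(s[0], 1);
}
```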

`StackSlice`'s embedded lifetime ensures that the stack allocation may never leave its scope. Thus the borrow checker
can uphold the contract that LLVM's `alloca` requires.

I don't quite see why the type of `reserve` would prevent the `StackSlice` from escaping the scope where `reserve` is called. More precisely, it seems like the user could define:

```Rust
fn reserve_and_zero<'a>(elements: usize) -> &'a [u32] {
    let s = *reserve(elements);
    for x in s.iter_mut() { *x = 0 }
    return s
}
```

which would be invalid if it is not inlined.

@hanna-kruppe hanna-kruppe Dec 7, 2016

Yes, the signature proposed here is completely unsound. Since `'a` is a lifetime parameter not constrained by anything, any call to `reserve` can simply pick `'a = 'static`.
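The problem the reviewer describes can be demonstrated on today's Rust with any signature whose lifetime parameter appears only in the return type. A stand-in sketch (no actual alloca involved; `PhantomData` models the proposed return type, and `reserve_like` is a hypothetical name of mine):

```Rust
use std::marker::PhantomData;

// Stand-in for the proposed `reserve`: the lifetime 'a is not
// constrained by any argument, so the caller may choose it freely.
fn reserve_like<'a, T>(_elements: usize) -> PhantomData<&'a [T]> {
    PhantomData
}

fn main() {
    // Nothing stops the caller from picking 'a = 'static, which is
    // exactly why this signature cannot tie the resulting slice to
    // the current stack frame.
    let escaped: PhantomData<&'static [u32]> = reserve_like(8);
    let _ = escaped;
}
```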


MIR level: we need a way to represent the dynamic stack allocation, carrying both the number of elements and the
concrete element type. While building the MIR, we then replace the corresponding `Call`s from HIR with this new
construct.

Low level: LLVM has the `alloca` instruction to allocate memory on the stack. We simply need to extend trans to emit it
with a dynamic `<NumElements>` argument when encountering the aforementioned MIR.

With an LLVM extension to un-allocate the stack slice, we could even restrict the stack-space reservation to the
lifetime of the allocated value, thus increasing locality over C code that uses `alloca` (which is so far suboptimally
implemented by some compilers, especially with regard to inlining).

# How to teach this

Add the following documentation to libcore:

```
*** WARNING *** stay away from this feature unless you absolutely need it.
Using it will destroy your ability to statically reason about stack size.

Apart from that, this works much like an unboxed array, except the size is
determined at runtime. Since the memory resides on the stack, be careful
not to exceed the stack limit (which depends on your operating system),
otherwise the resulting stack overflow will at best kill your program. You
have been warned.

Valid uses for this are mostly within embedded systems without heap allocation.
```

Also add an example, perhaps a sort algorithm that uses some scratch space which is heap-allocated with `std` and
stack-allocated with `#[no_std]`, noting that the function would not be available on no-std systems at all were it not
for this feature.
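The kind of example suggested above might look like the following merge step. A heap-backed scratch buffer stands in where the proposed feature would use the stack, so the sketch runs on today's Rust; the function name is mine:

```Rust
// Illustrative merge of two sorted slices using a scratch buffer.
// With the proposed feature, `scratch` could live on the stack; a
// Vec stands in here so the sketch compiles today.
fn merge_sorted(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut scratch = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            scratch.push(a[i]);
            i += 1;
        } else {
            scratch.push(b[j]);
            j += 1;
        }
    }
    // Append whichever side still has elements left.
    scratch.extend_from_slice(&a[i..]);
    scratch.extend_from_slice(&b[j..]);
    scratch
}

fn main() {
    assert_eq!(merge_sorted(&[1, 3, 5], &[2, 4, 6]), vec![1, 2, 3, 4, 5, 6]);
}
```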

Do not `pub use` it from `std::mem` to drive the point home.

# Drawbacks
[drawbacks]: #drawbacks

- Even more stack usage means the dreaded stack limit will probably be reached even sooner. Overflowing the stack space
leads to segfaults at best and undefined behavior at worst. On unices, the stack can usually be extended at runtime,
whereas on Windows the stack size is set at link time (defaulting to 1 MB).

Contributor: On the flip side, this might reduce stack usage for users of `ArrayVec` and for those manually allocating
overly large arrays on the stack (I occasionally do this when reading small files).

Member: Will we not be able to correctly probe the stack for space when alloca-ing?

It should be possible to do stack probes. If they aren't used, though, this must be marked `unsafe`, as it's trivial to
corrupt memory without them. Speaking of which, are stack probes still not actually in place for regular stack
allocations? I just tried compiling a test program (one that allocates a large array on the stack) with rustc nightly
on my system and didn't see any probes in the asm output.

> It should be possible to do stack probes. If they aren't used, though, this must be marked `unsafe`, as it's trivial
> to corrupt memory without them.

Can't one already overflow the stack with recursion?

Member: Right, which is why stack probes are inserted: so that the runtime can detect the overflow and abort with a
`fatal runtime error: stack overflow` message rather than just run off into whatever memory is below.

Except they actually aren't, since they haven't been implemented yet (or maybe they only work on Windows?). OSes
normally provide a 4 kB guard page below the stack, so most stack overflows will crash anyway (and trigger that error,
which comes from a signal handler), but a function with a big enough stack frame can skip past that page, and I think
it may actually be possible in practice to exploit some Rust programs that way... I should try.

Member: I am working on integrating stack probes, which are required for robustness on our embedded system (ARTIQ). I
expect to implement them in a way generic enough to be used on all bare-metal platforms, including no-MMU, as well as
to allow easy shimming on currently unsupported OSes.

- Adding this will increase implementation complexity and require support from possible alternative implementations /
backends (e.g. Cretonne, WebAssembly).

# Alternatives
[alternatives]: #alternatives

- Do nothing. Rust works well without it (there's the issue mentioned in the "Motivation" section though). `SmallVec`s
work well enough and have the added benefit of limiting stack usage.

- `mem::with_alloc<T, U, F: Fn(&mut [T]) -> U>(elems: usize, code: F) -> U`: this has the benefits of reducing API
surface and of introducing rightward drift, which makes overuse less likely. However, it needs to be monomorphized for
each calling function (instead of only for each target type), which will increase compile times.

Contributor: I think this is the only solution that will really work, since otherwise you can always use the `reserve`
intrinsic to create a slice with a `'static` lifetime.
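The scoped shape of this alternative can be prototyped today with a heap-backed stand-in; the allocation strategy, not the API shape, is what the real feature would change. All names here are illustrative:

```Rust
// Heap-backed stand-in for the proposed scoped API: the closure
// receives a scratch slice that cannot outlive the call, mirroring
// how an alloca-backed version would confine the allocation to one
// stack frame.
fn with_alloc<T: Default + Clone, U, F: FnOnce(&mut [T]) -> U>(elems: usize, code: F) -> U {
    let mut buf = vec![T::default(); elems];
    code(&mut buf)
}

fn main() {
    let total: u32 = with_alloc(5, |scratch: &mut [u32]| {
        for (i, slot) in scratch.iter_mut().enumerate() {
            *slot = i as u32 * 2;
        }
        scratch.iter().sum()
    });
    assert_eq!(total, 20); // 0 + 2 + 4 + 6 + 8
}
```

Because the slice is only ever lent to the closure, the borrow checker rejects any attempt to return it, which is how this shape sidesteps the `'static` escape described earlier.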


- Dynamically sized arrays are a potential solution; however, those would need a length type that is only fully known
at runtime, requiring complex type-system gymnastics.

Contributor: The generalization of this feature is to allow unsized locals (which simply allocate stack space for the
size given by the memory they are moved out of). To prevent accidental usage of this, a language item
`StackBox<T: ?Sized>` could be created, which `alloca`s all its memory depending on `size_of_val`. The problem is that
this would need to run code when moving, which is unprecedented in Rust.


- Use a macro instead of a function (analogous to `print!(..)`), which could insert the LLVM `alloca` builtin.

- Mark the function as `unsafe` due to the potential stack-overflow problem.

- Copy the design of C's `alloca()`, possibly wrapping it later.

- Use escape analysis to determine which allocations could be moved to the stack. This could potentially benefit even
more programs, because they would gain allocation speed without any code changes. The deal-breaker here is that we
would lose the control needed to avoid the drawback listed above, making programs crash without recourse. Also, the
compiler would become somewhat more complex (though a simple, incomplete escape analysis implementation already exists
in [clippy](https://github.com/Manishearth/rust-clippy)).

# Unresolved questions
[unresolved]: #unresolved-questions

Member: C's `alloca` can't be used inside a function argument list; would we need the same restriction, or would we
handle that properly?

My understanding is that that limitation exists primarily to allow naive single-pass compilers to exist (along with
some "interesting" ways of implementing `alloca()`). I don't think that concern would apply to Rust.

- Could we return the slice directly (reducing visible complexity)?
Member: I may have missed it, but I'm not sure I understand why we wouldn't be able to just return the slice?

@codyps codyps Dec 7, 2016

My guess: because the size of the returned value could be variable and would need to be copied (or preserved on the
stack). I'm not sure Rust has anything that does that right now. Supposing it could be done, it could just become a
detail of the calling convention. The caller can't allocate unless it knows how much memory to allocate, and making it
aware of the memory needed by the callee could complicate things (and would likely be impossible to do in some cases
without running the callee twice).

If these stack allocations were parameterized with type-level numbers, it would be fairly straightforward (ignoring
all the general complexity of type-level numbers), but this RFC doesn't propose that.

- Bikeshedding: Can we find a better name?