Matt Basta edited this page Feb 17, 2015 · 2 revisions

Memory is managed automatically in BType, but because the language is compiled statically, many memory operations can be inferred at compile time and optimized accordingly. Specifically, malloc calls are generated at appropriate locations, and free calls are generated by the garbage collection dereferencing routine.

Garbage collection is outlined in GC, so it will not be discussed in great detail here.

Allocators

There are two allocators included with BType. The first is a buddy allocator, which uses a buddy system to allocate memory. The second is a more traditional free-list allocator (the chain allocator), which uses a linked list of free memory blocks to perform bookkeeping.

The default allocator is the chain allocator. This decision was made to improve the performance of applications that perform a large number of allocations at initialization.

Buddy Allocator

malloc() and free(), the two critical memory management operations, are built using custom implementations in BType. They use a buddy allocation approach. The custom implementation stores all bookkeeping information related to memory allocation outside of the allocatable heap area. This is done by designating a portion of the heap provided by the operating system (or, in some cases, the browser) and storing buddy and allocation information in that chunk of space.

When malloc() is called, the bookkeeping chunk of memory is recursively traversed to find the smallest possible unallocated block of memory that satisfies the request. malloc() always prefers to return the first block of memory that is available sequentially. That is, a block of memory that satisfies the requirements of the request that is physically located at a lower pointer value will always be preferred to another equally satisfactory block of memory at a higher pointer value. If a block of memory is found that is at least twice as large as the requested block of memory, it will be split and the process will be repeated.

A limit exists for the smallest block of memory that can be allocated. This is done in order to minimize the amount of space occupied by the bookkeeping chunk of memory.

When free() is called, the block of memory being freed is marked as unallocated. If its buddy is also free, the block and its buddy are coalesced back into a single block (un-split). This is repeated with the newly coalesced block and its buddy recursively until the buddy is found to be split (meaning it contains allocated memory), the buddy is itself allocated, or the entire heap has been coalesced back into a single free block.
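The splitting and coalescing behavior described above can be sketched with a small simulation. This is a hypothetical Python model of the algorithm, not BType's actual implementation; the dictionary-based bookkeeping here stands in for the bit-pair scheme described below.

```python
# Minimal buddy-allocator sketch: a hypothetical model of the
# split/coalesce behavior described above, not BType's implementation.

HEAP_SIZE = 1024   # total heap bytes (power of two)
MIN_BLOCK = 64     # smallest allocatable block (illustrative limit)

# Track block state keyed by (offset, size): "free", "allocated", or "split".
blocks = {(0, HEAP_SIZE): "free"}

def malloc(size, offset=0, block=HEAP_SIZE):
    state = blocks.get((offset, block))
    if state == "allocated":
        return None
    if state == "split":
        # Prefer the lower-addressed buddy, as described above.
        half = block // 2
        return (malloc(size, offset, half)
                or malloc(size, offset + half, half))
    # Free block: split while it is at least twice the request.
    if size > block:
        return None
    if block > MIN_BLOCK and size <= block // 2:
        half = block // 2
        blocks[(offset, block)] = "split"
        blocks[(offset, half)] = "free"
        blocks[(offset + half, half)] = "free"
        return malloc(size, offset, half)
    blocks[(offset, block)] = "allocated"
    return offset

def free(ptr):
    # Mark the allocated block at ptr free, then coalesce with its
    # buddy repeatedly while the buddy is also free.
    size = next(s for (o, s), st in blocks.items()
                if o == ptr and st == "allocated")
    blocks[(ptr, size)] = "free"
    while size < HEAP_SIZE:
        buddy = ptr ^ size  # buddy offset differs in exactly one bit
        if blocks.get((buddy, size)) != "free":
            break
        del blocks[(ptr, size)], blocks[(buddy, size)]
        ptr = min(ptr, buddy)
        size *= 2
        blocks[(ptr, size)] = "free"

p = malloc(100)   # splits 1024 -> 512 -> 256 -> 128; returns offset 0
q = malloc(100)   # next 128-byte block: offset 128
free(p)           # buddy at 128 is allocated, so no coalescing yet
free(q)           # coalescing restores a single free 1024-byte heap
```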

Bookkeeping

The bookkeeping block contains no wasted bytes. Each byte represents four possible memory locations (two bits each). The second bit (bit 1) represents whether a block of memory was previously allocated at a pointer represented by that bit pair. The first bit (bit 0) represents whether the block whose midpoint exists at the pointer represented by that pair has been split. That is, bit 0 for the heap_size / 2 bit pair will be set to true if the heap has been split into two buddies (one memory allocation has occurred).

Downsides

Unfortunately, this approach is very slow for large numbers of memory allocations, especially in browsers that do not specifically optimize for the "use asm" pragma. The reason for this is the large number of recursive function calls that are made by the malloc() function. The approach also requires a significant number of memory lookups.

Additionally, the current approach has poor cache locality: the block of memory that stores allocation information is arranged in the actual order of memory in the heap. That is, the bit pairs are ordered in the following way, where each number represents the value of the corresponding possible pointer:

0 1 2 ... N

These bit pairs are accessed as 8-bit unsigned integers, meaning four are fetched at a time. The actual memory groupings look like this:

0-3 4-7 8-11 ... (N-4)-(N-1)

If there are 1024 possible pointers given a particular heap size, that means the lookups to this block of memory would work like this for an empty heap (the value representing the "Nth pointer" rather than "pointer N" for the sake of simplicity):

512 = position 64
256 = position 32
128 = position 16
64 = position 8
32 = position 4
16 = position 2
8 = position 1
4 = position 1
2 = position 1
1 = position 1

Notice how each lookup (which corresponds with a recursive call) hits a different byte in memory. The only time a memory lookup might be cached is at the end of the recursive chain, where all of the lookups hit a single byte of memory. Moreover, this particular example assumes that the minimum amount of memory is requested. If a larger block of memory is requested, it's even less likely that the same byte of memory will be requested, further decreasing cache efficiency.

A more appropriate approach for the future would be to instead organize these lookups such that the bit pairs favor cache locality:

512 = position 1
256 = position 1
128 = position 1
64 = position 2
32 = position 2
16 = position 4
...

This is done by organizing the memory region logarithmically, so that the first bit pair is the middle-most pointer, the second is the pointer at one quarter of the heap, the third is the pointer at three quarters of the heap, the fourth is the pointer at one eighth of the heap, and so on. This ordering ensures that the most commonly accessed bits are stored in the same byte of memory, which is a significant improvement for cache locality.
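One way to sketch this reordering (a hypothetical model, not anything currently implemented) is to index bit pairs in breadth-first order over the implicit split tree, the same indexing scheme used by array-backed binary heaps:

```python
# Map a pointer's position in the split tree to a BFS (heap-style)
# index, so that the most frequently visited pairs -- the top levels
# of the tree -- share the first few bytes of the bookkeeping block.
# Hypothetical sketch of the reordering described above.

def bfs_index(depth, position):
    """Index of the `position`-th midpoint at `depth` splits deep.
    Depth 0 is the heap's midpoint; depth 1 holds the one-quarter and
    three-quarter pointers; depth 2 holds the eighths; and so on."""
    return (1 << depth) - 1 + position

# Every malloc() walk starts at depth 0, 1, 2, ... and so touches
# indices 0, 1-2, 3-6: all within the first two bytes of the
# bookkeeping block (four bit pairs per byte).
assert bfs_index(0, 0) == 0   # heap_size / 2
assert bfs_index(1, 0) == 1   # heap_size / 4
assert bfs_index(1, 1) == 2   # 3 * heap_size / 4
assert bfs_index(2, 0) == 3   # heap_size / 8
```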

Chain Allocator

The default memory management functions use a very different approach: a naive linked list of free blocks. It works in the following way:

  • At the start of the application, the first four bytes of the heap are used as a pointer to point at the second four bytes of the heap.
  • The second four bytes of the heap are set to an unsigned 32-bit integer representing the length of the heap in bytes. The third four bytes of the heap are set to a null pointer (zero).
  • When malloc() is called, the pointer at the start of the heap is read. The unsigned integer at that pointer is compared to the requested amount of memory. If the requested amount of memory is less than or equal to the unsigned integer, the last accessed "free-pointer" (be it the first four bytes of the heap or the second four bytes of the previously iterated block of free memory) is set to one of the following:
    • If the requested amount of memory is greater than or equal to the size of the free block minus 32 bytes, the pointer is set to the value of the currently iterated free block's pointer (the first four bytes).
    • If the requested amount of memory is less than the size of the free block minus 32 bytes, the pointer is set to the byte following the last allocated byte of memory for the currently allocated block. Additionally, the first four bytes after the current block are set to point at the currently iterated free block's pointer (the first four bytes) and the second four bytes after the current block are set to the size of the currently iterated free block minus the size of the currently allocated block minus eight.
  • If malloc() encounters a null pointer when attempting to access the next free block, it should return null, as there is no remaining unallocated memory that satisfies the requested number of bytes.
  • When free() is called on a block of memory, the program should start at the first four bytes of the heap and iterate free blocks.
    • If the next pointer is greater than the current pointer, set the previously iterated pointer to the pointer being freed. If the currently freeing pointer plus the size of the currently freeing block is equal to the next free pointer, set the pointer of the currently freeing block to the value of the pointer of the next free block and combine the sizes to form a single large free block.
    • If a null pointer is encountered when accessing the next free block, create a new free block by setting the pointer of the currently iterating block to the freeing block, and set the free pointer of the freeing block to null.
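The steps above can be sketched as a simplified model of the first-fit free-list walk. This is hypothetical Python that keeps free blocks as an address-ordered list of (offset, size) pairs and tracks allocations in a side dictionary, rather than reproducing BType's exact in-heap pointer-and-size headers; the 8-byte header constant mirrors the two four-byte fields described above.

```python
# Simplified chain-allocator sketch: first-fit search, splitting of
# oversized free blocks, and address-ordered coalescing on free().
# A hypothetical model, not BType's byte-exact heap layout.

HEAP_SIZE = 1024
HEADER = 8                     # pointer field + size field, 4 bytes each
free_list = [(0, HEAP_SIZE)]   # address-ordered (offset, length) pairs
allocated = {}                 # offset -> total bytes consumed

def malloc(size):
    for i, (offset, length) in enumerate(free_list):
        if size + HEADER > length:
            continue
        if size >= length - 4 * HEADER:
            # Block is barely big enough: hand over the whole thing.
            del free_list[i]
            allocated[offset] = length
        else:
            # Split: the remainder becomes a smaller free block.
            free_list[i] = (offset + HEADER + size,
                            length - HEADER - size)
            allocated[offset] = HEADER + size
        return offset
    return None  # null: no free block satisfies the request

def free(offset):
    # Insert in address order, then coalesce adjacent neighbors.
    free_list.append((offset, allocated.pop(offset)))
    free_list.sort()
    merged = [free_list[0]]
    for off, length in free_list[1:]:
        prev_off, prev_len = merged[-1]
        if prev_off + prev_len == off:
            merged[-1] = (prev_off, prev_len + length)
        else:
            merged.append((off, length))
    free_list[:] = merged

p = malloc(100)   # first fit: offset 0
q = malloc(200)   # next fit: offset 108, past the first block's header
free(p)
free(q)           # coalescing restores a single free block
```

Note that, unlike the buddy allocator, malloc() here is a single linear walk with no recursion, which is the property that makes the chain allocator faster for allocation-heavy startup workloads.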

Choosing an Allocator

Both memory allocation methods have strengths and weaknesses, and both have room for substantial optimization. Either allocator is suitable for most code, but certain workloads may benefit from the properties of one allocator over the other.

To satisfy these use cases, both allocators are available. A command line argument to the btype script selects between them:

btype src/main.bt --target=asmjs --asmjs-memory=foo > target.js

The value of asmjs-memory may be either buddy, which uses the legacy memory allocator, or chain (the default) to use the new memory allocator. In the future, more memory allocators may be made available to satisfy additional use cases.