1.1 Abstract
1.1.1 FPGAs are complex; programming at a higher level of abstraction is desirable
1.1.2 First complete GC in hardware, not hardware-assisted
1.1.3 Reasonable performance
- Never more than 1% of resources on most high-end FPGAs
- Uses completely concurrent snapshot algorithm
- Single-cycle access to the heap; the mutator is not stalled for even a single cycle
- Can achieve 100% MMU with heaps as small as 1.01-1.04x the absolute minimum
1.2 Introduction
- Raising level of abstraction for FPGA usage highly desirable
- This collector is capable of collecting a heap of uniform objects (called a miniheap in this paper)
- Single-cycle memory access, actually allows multiple simultaneous mutations per cycle
- Can be used with VHDL/Verilog or programs cross-compiled from a higher-level language
1.3 Memory Architecture
1.3.1 MiniHeaps
- Heap is divided into miniheaps, each of which has objects of the same size and same internal pointer/data layout
- Same design as the big-bag-of-pages
- Paper confined to 1-to-2-pointer objects
1.3.2 MiniHeaps with Malloc/Free
- Memory requests are broken down into Block RAMs (BRAMs), each allocated as efficiently as possible
- Uses one BRAM set for each field type in an object
- Ex: An object with 3 fields has 3 bram sets
- Can exploit the fact that if a miniheap is of size N, the pointer bit size is log2(N)
- Need only use exactly the amount of memory needed for pointers, since custom hardware
- Downside: Larger heaps mean larger pointers
- Allocation for each BRAM set
- Allocates from a stack of free cells
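A minimal software sketch of the design above (class and field names are my own, not the paper's HDL): one array per field type stands in for a BRAM set, pointer width is derived from the miniheap size, and allocation pops a stack of free cell indices. I assume index 0 is reserved as the null pointer.

```python
from math import ceil, log2

class MiniHeap:
    """Toy model of one miniheap with two pointer fields per object."""
    def __init__(self, n):
        self.ptr_bits = ceil(log2(n))          # pointer width for an N-cell heap
        self.ptr0 = [0] * n                    # "BRAM set" for first pointer field
        self.ptr1 = [0] * n                    # "BRAM set" for second pointer field
        # Stack of free cell indices; cell 0 reserved as null (assumption).
        self.free = list(range(n - 1, 0, -1))

    def malloc(self):
        return self.free.pop()                 # a single pop: one cycle in hardware

    def free_cell(self, i):
        self.free.append(i)

h = MiniHeap(1024)                             # 10-bit pointers suffice for 1024 cells
```

Because the hardware is custom, the pointer datapaths can be exactly `ptr_bits` wide rather than rounded up to a machine word, which is the "exactly the amount of memory needed" point, but it also means doubling the heap widens every pointer by one bit.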
1.4 Garbage Collector Design
- Stop-the-world and concurrent collection architectures are not as different on FPGAs as they are in software
1.4.1 Snapshot at the beginning
- No need to pause to take snapshot because of hardware parallelism
- Uses a GC signal to mark start
- When on, registers copy themselves into 'shadow registers'
- Stack top is marked; since the mutator can't pop the stack faster than the collector can read it, correctness is guaranteed
- If multiple stacks were needed, a shadow stack would also be needed
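A sketch of the snapshot step, under the assumption that the roots are a register file plus one stack (names are illustrative). In hardware all the copies happen in parallel in the cycle the GC signal rises, so no pause is needed:

```python
class Mutator:
    def __init__(self):
        self.regs = [0, 0, 0, 0]    # architected pointer registers
        self.shadow = [0, 0, 0, 0]  # shadow registers read by the collector
        self.stack = []
        self.snap_top = 0           # stack height recorded at snapshot time

    def gc_signal(self):
        # In hardware every copy below occurs simultaneously in one cycle.
        self.shadow = list(self.regs)
        self.snap_top = len(self.stack)

m = Mutator()
m.regs = [5, 0, 7, 0]
m.stack = [3, 9]
m.gc_signal()
m.regs[0] = 0                       # mutator keeps running; snapshot is unaffected
```

The collector then scans `shadow` and the stack entries below `snap_top`, which is safe precisely because the mutator cannot pop entries faster than the collector reads them.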
1.4.2 Tracing
- Mark queue is a FIFO
- Objects have mark bit to tell if already traversed
- Write/Read barriers implemented in hardware steps of memory allocation
- Because 3 BRAMs are used in marking, it takes 3 cycles to process one object
- This is only a latency cost because it is pipelined. One pointer per cycle throughput.
- For objects with two pointers, one tracing engine used for each pointer, hardware parallelism.
- Mark queues are MUXed together and balanced, so will never overflow/run out of work
1.4.3 Sweeping
- Sweep engine also handles allocation requests
- Linear scan; on each cycle the mark and used maps are read at the address held in the sweep-pointer register
- If free, adds to free stack
- Allocates white, not black
- Not a problem, since freeing and allocating into the freed region doesn't happen concurrently with tracing
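The sweep itself can be sketched as a linear pass (illustrative names; the allocation side of the sweep engine is omitted here). Each "cycle" reads one cell's mark and used bits; a used-but-unmarked cell is dead and goes onto the free stack, and marks are cleared for the next collection:

```python
def sweep(mark, used, free_stack):
    """Linear sweep: push dead cells onto the free stack, clear marks."""
    for addr in range(len(mark)):          # addr plays the sweep-pointer register
        if used[addr] and not mark[addr]:  # allocated but unreached: garbage
            used[addr] = False
            free_stack.append(addr)
        mark[addr] = False                 # reset for the next collection
    return free_stack
```

Allocating white during this phase is safe, as the notes say, because tracing has already finished, so a freshly allocated unmarked cell cannot be mistaken for garbage within the same collection.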
1.5 Evaluation
1.5.1 Logic usage
- GC requires more logic than malloc, but is so small that it's a non-issue
1.5.2 Memory usage
- Fragmentation on par with malloc
- Memory usage overhead high for small heaps, but reasonable for larger heaps.
1.5.3 Throughput
- Realtime is "faster, more deterministic, and requires less heap space" than stop the world
1.5.4 Energy
- Malloc < Realtime < STW
- Highly application dependent, as the program becomes more complex the differences shrink.