Tuesday, December 2, 2014

Garbage Collection for FPGAs

1.1 Abstract

1.1.1 FPGAs are complex, programming at a higher level of abstraction is desirable

1.1.2 First complete GC in hardware, not hardware-assisted

1.1.3 Reasonable performance

  • Never more than 1% of resources on most high-end FPGAs
  • Uses a completely concurrent snapshot algorithm
  • Single-cycle access to the heap; the mutator is not stalled for even a single cycle
  • Can achieve 100% MMU with heaps as small as 1.01–1.04× the absolute minimum

1.2 Introduction

  • Raising level of abstraction for FPGA usage highly desirable
  • This collector is capable of collecting a heap of uniform objects (called a miniheap in this paper)
  • Single-cycle memory access, actually allows multiple simultaneous mutations per cycle
  • Can be used with VHDL/Verilog or programs cross-compiled from a higher-level language

1.3 Memory Architecture

1.3.1 MiniHeaps

  • Heap is divided into miniheaps, each of which has objects of the same size and the same internal pointer/data layout
  • Same design as the big-bag-of-pages approach
  • Paper is confined to 1- and 2-pointer objects

1.3.2 MiniHeaps with Malloc/Free

  • Memory requests are broken down into Block RAMs (BRAMs), each allocated as efficiently as possible
    • Uses one BRAM set for each field type in an object
    • Ex: an object with 3 fields has 3 BRAM sets
  • Can exploit the fact that if a miniheap is of size N, the pointer bit width is log2(N)
    • Need only use exactly the amount of memory needed for pointers, since the hardware is custom
    • Downside: larger heaps mean larger pointers
  • Allocation for each BRAM set
    • Allocates from a stack of free cells
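The miniheap layout above can be modeled in software. This is a minimal sketch, not the paper's implementation: each object field gets its own array (modeling one BRAM set per field), and allocation pops cell indices off a free stack; all names here are illustrative.

```python
class MiniHeap:
    def __init__(self, n_cells, n_fields):
        # One array per field, modeling one BRAM set per field type.
        self.fields = [[0] * n_cells for _ in range(n_fields)]
        # Pointers are just cell indices, so log2(n_cells) bits suffice in hardware.
        self.pointer_bits = (n_cells - 1).bit_length()
        # Free cells kept on a stack; allocation pops, freeing pushes.
        self.free_stack = list(range(n_cells - 1, -1, -1))

    def alloc(self):
        # In hardware this is a single-cycle stack pop.
        return self.free_stack.pop()

    def free(self, cell):
        self.free_stack.append(cell)

heap = MiniHeap(n_cells=8, n_fields=3)   # 3 fields -> 3 BRAM sets
a = heap.alloc()
b = heap.alloc()
heap.fields[0][a] = b   # store a "pointer" (cell index) in field 0 of a
heap.free(b)
```

Note how the downside mentioned above shows up directly: doubling `n_cells` adds a bit to `pointer_bits`, widening every pointer field.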

1.4 Garbage Collector Design

  • Stop the world and concurrent collection architectures not as different on FPGAs

1.4.1 Snapshot at the beginning

  • No need to pause to take snapshot because of hardware parallelism
  • Uses a GC signal to mark the start of collection
  • When the signal is raised, registers copy themselves into 'shadow registers'
  • The stack top is marked. Since the mutator can't pop the stack faster than the collector can read it, correctness is guaranteed
  • If multiple stacks were needed, a shadow stack would also be required
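A rough software model of the root capture described above (in hardware, all shadow copies happen in the same cycle the GC signal is raised; this sketch only shows the logic, and the class/field names are invented for illustration):

```python
class Roots:
    def __init__(self):
        self.registers = [None, None]   # mutator root registers
        self.shadow = [None, None]      # shadow registers read by the collector
        self.stack = []                 # mutator stack
        self.snapshot_top = 0           # stack depth latched at GC start

    def gc_signal(self):
        # On the GC signal, registers copy into shadows and the stack top
        # is latched; the mutator is never paused.
        self.shadow = list(self.registers)
        self.snapshot_top = len(self.stack)

    def collector_roots(self):
        # Collector scans the shadows plus the stack up to the latched top.
        return [r for r in self.shadow if r is not None] + \
               self.stack[:self.snapshot_top]
```

Mutations after `gc_signal()` do not disturb the snapshot, which is why no pause is needed.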

1.4.2 Tracing

  • Mark queue is a FIFO
  • Objects have mark bit to tell if already traversed
  • Write/read barriers are implemented in hardware as part of the memory-operation steps
  • Because 3 BRAMs are used in marking, it takes 3 cycles to process one object
    • This is only a latency cost because it is pipelined: one pointer per cycle throughput
  • For objects with two pointers, one tracing engine is used per pointer (hardware parallelism)
  • Mark queues are MUXed together and load-balanced, so they will never overflow or run out of work
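The tracing loop above can be sketched as follows. This is a sequential software model, assuming a heap of two-pointer objects stored as parallel arrays (`ptr0`/`ptr1`, names invented here) with a mark bit per cell and `None` as null; the real hardware pipelines this and runs one engine per pointer field.

```python
from collections import deque

def trace(roots, ptr0, ptr1, mark):
    queue = deque(roots)            # the mark queue is a FIFO
    while queue:
        cell = queue.popleft()
        if cell is None or mark[cell]:
            continue                # null, or already traversed
        mark[cell] = True
        queue.append(ptr0[cell])    # in hardware: one engine per pointer
        queue.append(ptr1[cell])
    return mark
```

The mark bit makes cycles safe: a cell already traversed is dropped instead of re-enqueued.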

1.4.3 Sweeping

  • Sweep engine also handles allocation requests
  • Linear scan; on each cycle, the mark and used maps are addressed by the register holding the sweep pointer
  • If free, adds to free stack
  • Allocates white, not black
    • Not a problem, since freeing and allocating into the freed region don't happen concurrently with tracing
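A minimal sketch of the linear sweep, assuming per-cell `mark` and `used` bitmaps (each loop iteration models one cycle of the hardware scan; the function name and signature are invented for illustration):

```python
def sweep(mark, used, free_stack):
    for addr in range(len(mark)):          # one address per cycle
        if used[addr] and not mark[addr]:
            used[addr] = False
            free_stack.append(addr)        # reclaimed cell goes on the free stack
        mark[addr] = False                 # clear mark for the next collection
    return free_stack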

1.5 Evaluation

1.5.1 Logic usage

  • GC requires more logic than malloc, but is so small that it's a non-issue

1.5.2 Memory usage

  • Fragmentation on par with malloc
  • Memory usage overhead high for small heaps, but reasonable for larger heaps.

1.5.3 Throughput

  • Realtime is "faster, more deterministic, and requires less heap space" than stop the world

1.5.4 Energy

  • Malloc < Realtime < STW
  • Highly application dependent, as the program becomes more complex the differences shrink.

No comments:

Post a Comment