Tuesday, December 2, 2014

Garbage Collection for FPGAs

1.1 Abstract

1.1.1 FPGAs are complex, programming at a higher level of abstraction is desirable

1.1.2 First complete GC in hardware, not hardware-assisted

1.1.3 Reasonable performance

  • Never more than 1% of resources on most high-end FPGAs
  • Uses a completely concurrent snapshot algorithm
  • Single-cycle access to the heap; the mutator is not stalled for even a single cycle
  • Can achieve 100% MMU with heaps as small as 1.01–1.04× the absolute minimum

1.2 Introduction

  • Raising level of abstraction for FPGA usage highly desirable
  • This collector is capable of collecting a heap of uniform objects (called a miniheap in this paper)
  • Single-cycle memory access, actually allows multiple simultaneous mutations per cycle
  • Can be used with VHDL/Verilog or programs cross-compiled from a higher-level language

1.3 Memory Architecture

1.3.1 MiniHeaps

  • Heap is divided into miniheaps, each of which has objects of the same size and the same internal pointer/data layout
  • Same design as the big-bag-of-pages approach
  • Paper is confined to 1- and 2-pointer objects

1.3.2 MiniHeaps with Malloc/Free

  • Memory requests are broken down into Block RAMs (BRAMs), each allocated as efficiently as possible
    • Uses one BRAM set for each field type in an object
    • Ex: an object with 3 fields has 3 BRAM sets
  • Can exploit the fact that if a miniheap is of size N, the pointer bit width is log2(N)
    • Need only use exactly the amount of memory needed for pointers, since the hardware is custom
    • Downside: larger heaps mean larger pointers
  • Allocation for each BRAM set
    • Allocates from a stack of free cells
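The miniheap layout above can be modeled in software. This is a minimal sketch, not the paper's implementation: each object field gets its own array (modeling one BRAM set per field), and allocation pops cell indices off a free stack; all names here are illustrative.

```python
class MiniHeap:
    def __init__(self, n_cells, n_fields):
        # One array per field, modeling one BRAM set per field type.
        self.fields = [[0] * n_cells for _ in range(n_fields)]
        # Pointers are just cell indices, so log2(n_cells) bits suffice in hardware.
        self.pointer_bits = (n_cells - 1).bit_length()
        # Free cells kept on a stack; allocation pops, freeing pushes.
        self.free_stack = list(range(n_cells - 1, -1, -1))

    def alloc(self):
        # In hardware this is a single-cycle stack pop.
        return self.free_stack.pop()

    def free(self, cell):
        self.free_stack.append(cell)

heap = MiniHeap(n_cells=8, n_fields=3)   # 3 fields -> 3 BRAM sets
a = heap.alloc()
b = heap.alloc()
heap.fields[0][a] = b   # store a "pointer" (cell index) in field 0 of a
heap.free(b)
```

Note how the downside mentioned above shows up directly: doubling `n_cells` adds a bit to `pointer_bits`, widening every pointer field.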

1.4 Garbage Collector Design

  • Stop the world and concurrent collection architectures not as different on FPGAs

1.4.1 Snapshot at the beginning

  • No need to pause to take snapshot because of hardware parallelism
  • Uses a GC signal to mark the start of collection
  • When the signal is raised, registers copy themselves into 'shadow registers'
  • The stack top is marked. Since the mutator can't pop the stack faster than the collector can read it, correctness is guaranteed
  • If multiple stacks were needed, a shadow stack would also be required
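A rough software model of the root capture described above (in hardware, all shadow copies happen in the same cycle the GC signal is raised; this sketch only shows the logic, and the class/field names are invented for illustration):

```python
class Roots:
    def __init__(self):
        self.registers = [None, None]   # mutator root registers
        self.shadow = [None, None]      # shadow registers read by the collector
        self.stack = []                 # mutator stack
        self.snapshot_top = 0           # stack depth latched at GC start

    def gc_signal(self):
        # On the GC signal, registers copy into shadows and the stack top
        # is latched; the mutator is never paused.
        self.shadow = list(self.registers)
        self.snapshot_top = len(self.stack)

    def collector_roots(self):
        # Collector scans the shadows plus the stack up to the latched top.
        return [r for r in self.shadow if r is not None] + \
               self.stack[:self.snapshot_top]
```

Mutations after `gc_signal()` do not disturb the snapshot, which is why no pause is needed.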

1.4.2 Tracing

  • Mark queue is a FIFO
  • Objects have mark bit to tell if already traversed
  • Write/read barriers are implemented in hardware as part of the memory-operation steps
  • Because 3 BRAMs are used in marking, it takes 3 cycles to process one object
    • This is only a latency cost because it is pipelined: one pointer per cycle throughput
  • For objects with two pointers, one tracing engine is used per pointer (hardware parallelism)
  • Mark queues are MUXed together and load-balanced, so they will never overflow or run out of work
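The tracing loop above can be sketched as follows. This is a sequential software model, assuming a heap of two-pointer objects stored as parallel arrays (`ptr0`/`ptr1`, names invented here) with a mark bit per cell and `None` as null; the real hardware pipelines this and runs one engine per pointer field.

```python
from collections import deque

def trace(roots, ptr0, ptr1, mark):
    queue = deque(roots)            # the mark queue is a FIFO
    while queue:
        cell = queue.popleft()
        if cell is None or mark[cell]:
            continue                # null, or already traversed
        mark[cell] = True
        queue.append(ptr0[cell])    # in hardware: one engine per pointer
        queue.append(ptr1[cell])
    return mark
```

The mark bit makes cycles safe: a cell already traversed is dropped instead of re-enqueued.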

1.4.3 Sweeping

  • Sweep engine also handles allocation requests
  • Linear scan; on each cycle, the mark and used maps are addressed by the register holding the sweep pointer
  • If free, adds to free stack
  • Allocates white, not black
    • Not a problem, since freeing and allocating into the freed region don't happen concurrently with tracing
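A minimal sketch of the linear sweep, assuming per-cell `mark` and `used` bitmaps (each loop iteration models one cycle of the hardware scan; the function name and signature are invented for illustration):

```python
def sweep(mark, used, free_stack):
    for addr in range(len(mark)):          # one address per cycle
        if used[addr] and not mark[addr]:
            used[addr] = False
            free_stack.append(addr)        # reclaimed cell goes on the free stack
        mark[addr] = False                 # clear mark for the next collection
    return free_stack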

1.5 Evaluation

1.5.1 Logic usage

  • GC requires more logic than malloc, but is so small that it's a non-issue

1.5.2 Memory usage

  • Fragmentation on par with malloc
  • Memory usage overhead high for small heaps, but reasonable for larger heaps.

1.5.3 Throughput

  • Realtime is "faster, more deterministic, and requires less heap space" than stop the world

1.5.4 Energy

  • Malloc < Realtime < STW
  • Highly application dependent, as the program becomes more complex the differences shrink.

No comments:

Post a Comment