Tuesday, December 2, 2014

Schism Real-Time Collection

1.1 Abstract

1.1.1 Provides fragmentation tolerance with worst case time/space guarantees

1.1.2 Implementation in the Fiji VM, a JVM for mission-critical systems

1.2 Introduction

1.2.1 Knowing and bounding worst-case execution time is a top priority for real-time systems

1.2.2 Real-time programming is moving to managed languages as code bases grow larger

  • Managed languages need to be predictable for both space and time to be accepted

1.2.3 Collectors must not cause enough fragmentation to violate space/time requirements

1.2.4 The paper introduces a new collector that tolerates external fragmentation and is concurrent and preemptible

1.2.5 Main idea is to embrace fragmentation by combining fragmented allocation with concurrent replication

1.3 Schism and Fragmentation

1.3.1 Without consideration for fragmentation, worst-case space utilization cannot be guaranteed

1.3.2 Three main approaches to handle this:

  • Fragmented Allocation - Allocate objects as combinations of fixed-size fragments.
    • Problem: large objects suffer access overhead
    • Array access typically goes through a trie, requiring up to 10 levels of indirection in the worst case for what should be fast, O(1) array access (see the trie sketch after this list).
  • Replication - Perform semispace copying while mutator runs, write barrier keeps copies in sync
    • Attractive when writes rare, or when coherence not required
    • Entails 2x space overhead, as it is a copying collector.
  • Defragmentation - When needed, move objects in a separate defragmentation phase. Usually requires expensive operations to ensure lock freedom
    • Introduced in the Metronome collector
    • Requires stopping all threads during defragmentation, can cause a hard deadline to be missed
    • Assumes fragmentation is rare. While true for typical programs, the worst case still incurs 2x space overhead
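
To make the trie cost concrete, here is a minimal C sketch of trie-based array access; the fanout, fragment size, and names are my own assumptions for illustration, not any particular VM's layout:

    /* Illustrative sketch of trie-based array access (layout and names are
     * assumptions). Elements live in fixed-size leaf fragments reached
     * through a trie of pointer nodes; every level adds one dependent load,
     * so a deep trie can cost ~10 indirections where contiguous indexing
     * would cost one. */
    #include <stddef.h>
    #include <stdint.h>

    #define FANOUT_BITS 3
    #define FANOUT      (1u << FANOUT_BITS)   /* 8 children per inner node */
    #define LEAF_ELEMS  8                     /* elements per leaf fragment */

    typedef struct TrieNode {
        struct TrieNode *child[FANOUT];       /* leaves store element arrays */
    } TrieNode;

    int32_t trie_array_get(TrieNode *root, size_t i, int depth) {
        size_t leaf = i / LEAF_ELEMS;         /* which leaf fragment */
        TrieNode *node = root;
        for (int level = depth - 1; level > 0; level--)  /* one load per level */
            node = node->child[(leaf >> (level * FANOUT_BITS)) & (FANOUT - 1)];
        int32_t *frag = (int32_t *)node->child[leaf & (FANOUT - 1)];
        return frag[i % LEAF_ELEMS];
    }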

1.3.3 Schism combines fragmented allocation with replication for array metadata

  • Fragments never move and are of fixed size.
  • Concurrent mark-sweep for non-moving fragments
  • Separate replicated semispace collector for array metadata
  • Constant-time heap access is achieved
    • Uses arraylets rather than tries
    • Arrays become a spine (a contiguous array) of pointers to array fragments (see the indexing sketch after this list).
      • Worst-case access cost is at most twice that of standard array indexing
    • Objects are represented as linked lists of fixed-size fragments; this works because objects are usually small and of statically-known size
  • Since spines vary in size, and can thus suffer external fragmentation, a copying collector is used for them
    • Replication needs no synchronization, since spines contain only pointers to non-moving fragments and thus never change
    • Space overhead is small because spines themselves are small
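
A minimal C sketch of arraylet indexing (the 32-byte fragment size is from the paper; the struct and names are illustrative). It shows why access is constant time: exactly two dependent loads regardless of array size:

    /* Minimal sketch of arraylet-style indexing. The spine is a contiguous
     * array of fragment pointers, so any element is reachable in two
     * dependent loads -- at most twice the cost of ordinary indexing. */
    #include <stddef.h>
    #include <stdint.h>

    #define FRAGMENT_BYTES 32
    #define ELEMS_PER_FRAG (FRAGMENT_BYTES / sizeof(int32_t))   /* 8 ints */

    typedef struct {
        size_t length;       /* element count */
        int32_t **spine;     /* contiguous array of pointers to fragments */
    } ArrayletArray;

    int32_t arraylet_get(const ArrayletArray *a, size_t i) {
        int32_t *frag = a->spine[i / ELEMS_PER_FRAG];   /* load 1: spine entry */
        return frag[i % ELEMS_PER_FRAG];                /* load 2: the element */
    }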

1.4 Fiji VM Implementation

1.4.1 Designed to bring the JVM to embedded devices

1.4.2 Compiles Java to C

  • Provides accurate stack maps

1.4.3 Designed specifically to support real-time collection

  • Safepoints
    • Threads check whether they should do work on the collector's behalf
  • Ragged Safepoints
    • An asynchronous request that each thread perform an action; threads are never stopped (a sketch of both follows this list).
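
A hedged sketch of how such polling might look (all names are hypothetical, not Fiji VM's actual code):

    /* At compiler-inserted poll sites a mutator checks whether the collector
     * wants help; a "ragged" safepoint merely waits until every thread has
     * passed a poll since the request, so no thread is ever suspended. */
    #include <stdatomic.h>

    atomic_bool gc_work_requested;   /* set by the collector thread */
    atomic_long safepoint_epoch;     /* bumped to request a ragged safepoint */

    typedef struct {
        long seen_epoch;             /* last epoch this thread acknowledged */
    } MutatorThread;

    void safepoint_poll(MutatorThread *self) {
        if (atomic_load(&gc_work_requested)) {
            /* do a bounded increment of work on the collector's behalf,
             * e.g. scan this thread's own stack */
        }
        /* Passing the poll acknowledges any pending ragged safepoint. */
        self->seen_epoch = atomic_load(&safepoint_epoch);
    }

    /* Collector side: a ragged safepoint completes once every thread's
     * seen_epoch has caught up to the current epoch. */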

1.4.4 The only pauses are allocation/barrier slow paths, plus synchronous collections when the mutator outpaces the collector

1.5 Schism Concurrent Mark-Region RTGC

1.5.1 Concurrent Mark-Region GC

  • Based on Immix, but doesn't perform opportunistic defragmentation
  • Idea is to allocate and reclaim memory in contiguous regions at coarse granularity
  • Scheduling is mostly slack-based. GC thread has lower priority than critical run-time threads.
  • Allocation is bump-pointer, like Immix.
    • Sweep phase finds lines of free bytes in pages
    • An allocation request finds the first page with enough free space (first-fit) and then the smallest line within it that fits (best-fit)
    • Bump pointer continues along a line until a new line is needed.
      • Most allocations take the bump-pointer path unless memory is very constrained.
  • Fast; the only pauses needed are stack scans when low-priority threads run (see the allocation sketch after this list)
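
A sketch of the allocation path as described above; the types and names are assumptions for illustration, not Fiji VM's code:

    /* The sweep records free "lines" per page; an allocation takes the first
     * page with enough space (first-fit), the smallest line on it that fits
     * (best-fit), then bump-allocates along that line until it is exhausted. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_LINES 64

    typedef struct { uintptr_t start; size_t bytes; } Line;

    typedef struct Page {
        struct Page *next;
        size_t free_bytes;          /* total free bytes left on this page */
        Line lines[MAX_LINES];      /* free lines recorded by the sweep */
        int nlines;
    } Page;

    static uintptr_t bump, limit;   /* the current allocation line */

    void *cmr_alloc(Page *pages, size_t size) {
        if (bump + size <= limit) {                /* fast path: bump pointer */
            void *p = (void *)bump;
            bump += size;
            return p;
        }
        for (Page *pg = pages; pg; pg = pg->next) {        /* first-fit page */
            if (pg->free_bytes < size) continue;
            Line *best = NULL;
            for (int i = 0; i < pg->nlines; i++)           /* best-fit line */
                if (pg->lines[i].bytes >= size &&
                    (best == NULL || pg->lines[i].bytes < best->bytes))
                    best = &pg->lines[i];
            if (best == NULL) continue;
            bump  = best->start;                   /* start a new bump line */
            limit = best->start + best->bytes;
            pg->free_bytes -= best->bytes;
            best->bytes = 0;
            void *p = (void *)bump;
            bump += size;
            return p;
        }
        return NULL;   /* no line fits: fall back to fragmented allocation */
    }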

1.5.2 Adding Schism

  • Both collector and object representation modified
  • Heap is split into CMR space and pair of spine semi-spaces
    • Semi-spaces are 30% the size of the CMR space.
    • Copying is lockless because spines are immutable
  • If contiguous allocation fails, spine reference and array header put in CMR space
  • Accessing an object field at byte offset m takes roughly m/32 linked-list steps (a sketch follows)
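
A hedged sketch of that linked-list access (layout assumed, not the paper's exact representation):

    /* Each fixed-size fragment begins with a pointer to the next one, so a
     * field at byte offset m is reached after roughly m/32 pointer hops
     * with 32-byte fragments -- fine for typical small objects. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAGMENT_BYTES 32
    #define PAYLOAD_BYTES  (FRAGMENT_BYTES - sizeof(void *))

    typedef struct Fragment {
        struct Fragment *next;                 /* fragmentation header */
        uint8_t payload[PAYLOAD_BYTES];
    } Fragment;

    int32_t read_field(Fragment *obj, size_t m) {
        while (m >= PAYLOAD_BYTES) {           /* one hop per fragment */
            obj = obj->next;
            m -= PAYLOAD_BYTES;
        }
        int32_t v;
        memcpy(&v, &obj->payload[m], sizeof v);  /* field may be unaligned */
        return v;
    }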

1.5.3 Fragmented Allocation

  • Object header has 3 words
    • Fragmentation header - points to spine or next fragment
    • GC word for marking
    • Type header that holds type and locking information
  • Arrays additionally store their length in the header if contiguous
    • If fragmented, the stored length is 0, so every bounds check fails the fast path and falls back to the spine
  • Remaining fragments only have fragmentation header
  • If the array or spine is small enough, either may be placed inside the first object (see the header sketch below)
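
A C sketch of this header layout with made-up field names; the point is how the length-0 trick makes the fragmented case cost nothing on the contiguous fast path:

    /* A fragmented array stores length 0, so the ordinary fast-path bounds
     * check always fails and control falls through to the spine-based slow
     * path, with no extra test on the fast path. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t frag_header;   /* points to the spine or the next fragment */
        uintptr_t gc_word;       /* mark state for concurrent mark-sweep */
        uintptr_t type_header;   /* type and locking information */
        uintptr_t length;        /* real length if contiguous, 0 if fragmented */
    } ArrayHeader;

    bool bounds_check_fast(const ArrayHeader *a, uintptr_t i) {
        return i < a->length;    /* always false for fragmented arrays */
    }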

1.5.4 Updating pointers to spines

  • There is no from-space invariant: CMR space points into from-space, so those pointers must be updated somehow
  • Want to avoid another CMR trace.
  • Chose to add an extra layer of indirection: an arraylet sentinel fragment in CMR space.
    • The mutator never holds spine references on its stack, only references to the CMR forwarding pointer
    • Removes the need to update roots in mutator threads.
    • NOTE: no global stop-the-world is ever needed to fix mutator references.
    • Only one heap trace per collection cycle
  • Algorithm (sketched in code after this list)
    • Mutator threads switch to allocating in to-space
    • The CMR trace allocates to-space spines but defers copying: when a sentinel is found, a spine is allocated and zeroed, and the sentinel is placed on a live-sentinel list.
    • Ragged safepoint: every thread acknowledges it now writes new arraylet pointers to both spaces
    • Spines are copied
    • Ragged safepoint: every thread acknowledges it no longer needs to write to both.
    • From-space is zeroed.
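
The cycle as an outline in C; every helper name here is hypothetical, but the ordering matches the steps above:

    /* The two ragged safepoints bracket the window in which mutators
     * double-write arraylet pointers; note there is no global
     * stop-the-world anywhere in the cycle. */
    void switch_spine_allocation_to_tospace(void);
    void cmr_trace(void);          /* also allocates+zeroes to-space spines and
                                      collects live sentinels; copying deferred */
    void ragged_safepoint(void);   /* returns once every thread acknowledges */
    void copy_live_spines(void);   /* lock-free: spine entries point only at
                                      non-moving fragments, so they never change */
    void zero_fromspace(void);

    void spine_collection_cycle(void) {
        switch_spine_allocation_to_tospace();
        cmr_trace();
        ragged_safepoint();        /* threads now write to both semispaces */
        copy_live_spines();
        ragged_safepoint();        /* threads back to single-space writes */
        zero_fromspace();
    }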

1.5.5 Predictability

  • Can be told to optimize for throughput, predictability, or the worst case (adds the cost of fast paths failing)
  • Converge to constant, reasonable overheads

1.6 State of the Art in RTGC

1.6.1 <Discussion of how work-based and time-based scheduling operate in general>

1.6.2 Concurrent Collection

  • If there are spare CPUs to perform collection, and no work is offloaded to mutators, there is no need to modify the algorithm.
  • Requires knowing allocation rates, collection rates, and object lifetimes to keep mutators from outpacing the collector (a back-of-the-envelope condition is sketched below)
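
As a back-of-the-envelope version of that requirement (my formulation, not the paper's model): with heap size H, maximum live data L, mutator allocation rate a, and collector trace rate c, a cycle traces at most L bytes and must finish before new allocation exhausts the H - L of headroom:

    % hedged back-of-the-envelope bound, not the paper's exact model
    \[
      \frac{L}{c} \;\le\; \frac{H - L}{a}
      \quad\Longleftrightarrow\quad
      H \;\ge\; L\left(1 + \frac{a}{c}\right)
    \]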

1.6.3 WebSphere SRT performs well, but cannot bound space usage as aggressively.

1.6.4 Java RTS and Jamaica are fragmentation-tolerant, but have O(log(heap size)) access times in the worst case

1.6.5 Azul is efficient due to specialized hardware.

  • Azul may need 100% copy reserve in the worst case, but has the smallest object headers (one word), so it is better for small objects.

1.7 Evaluation

1.7.1 Fragmentation

  • Benchmarks confirm the earlier observation of tight space bounds.

1.7.2 Throughput

  • Performance is competitive with other JVMs
  • Some benchmarks run faster with Schism, others without

1.7.3 Predictability

  • Execution time is identical on every run of the demo application (collision detection for air traffic control)
  • Plots of execution times on a different application match those of a C implementation, showing that the differences are algorithmic and not due to GC
  • Worst-case execution time is lowest for Schism on the benchmark with many planes
Paper
