Wednesday, October 22, 2014

Read Barrier Elimination Through Concurrency

Table of Contents

1 Eliminating Read Barriers through Procrasination and Cleanliness

1.1 Abstract

1.1.1 Goal is to avoid need for a read barrier

1.1.2 Use concurrency and stalling to avoid need for immediate copy

1.1.3 Can avoid need to stall, use language semantics to allow references in fromspace or copy when small set of references to object are known

1.2 Introduction

1.2.1 Use of parallelism and thread-local heaps with single shared heap uses write barriers to enforce

  1. No inter-local-heap pointers
  2. No pointers from shared heap to local heap

1.2.2 Terminology: exporting writes = writes which incur a write barrier and export object to shared heap and set up forwarding pointer.

1.2.3 Needs a read barrier because reference might refer to newly-exported object.

1.2.4 Read barrier cost is high, but removal is nontrivial.

1.2.5 Idea: Delay operation

  1. Read barriers only need to exist if there are forwarding pointers which exist between move and collection
  2. Can delay operations that create them
  3. We can stall threads which need to cause copy to shared heap, procrastinating the move
  4. GC is informed of these procrasinated operations at start of collection
  5. As long as there are enough threads, cost of procrastination simply cost of context switch.

1.2.6 Idea: Avoid stalling using language semantics

  1. Terminology: Cleanliness = A clean object can be copied without creating a forwarding pointer without interfering with language semantics.
  2. Ex:
    1. Immutable object
    2. Objects only referenced by the stack in creating thread
    3. Objects with known set of incoming references which can inexpensively be fixed on copy.
      1. Specialized write barrier can be used to make this less costly.
  3. Effectiveness depends on
    1. Most object being clean
    2. Rarely have many popular, mutable objects
    3. Sufficient concurrency to make stalling acceptable.
  4. Implementation: MultiMlton. Doesn't require programmer intervention and change can be isolated to runtime.

1.3 Motivation

1.3.1 Implementation considered is multimlton.

1.3.2 Has userspace concurrency, many green threads mapped to single kernel thread.

1.3.3 Each core has local heap, and there's a single shared heap.

1.3.4 Max heap set at 3x min heap.

1.3.5 Extensive optimization to avoid heap usage

1.3.6 On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.

1.3.7 If object has been moved, read barrier returns new address. Therefore conditional read barrier.

1.3.8 Mutator spends ~20% of execution time in read barrier.

1.3.9 Brooks-style unconditional barrier would reduce time, but since most objects 3 words or less, forwarding pointer would have significant overhead.

1.3.10 Less than 1/100 percent of reads end up being to forwarding pointers. Therefore read barrier significant cost.

1.4 GC Design and Implementation

1.4.1 Threading system

  1. Makes one thread per core, pins it to it.
  2. Lightweight threads have core affinity.
  3. Threads yield at GC safe point if preempted.

1.4.2 Baseline collector(stop the world)

  1. Single shared heap
  2. Tlabs
  3. GC requires thread synchronization on barrier
  4. Cheny two-space copying, but switches to mark-compact if ends up using more than half of heap.
  5. Uses appel-style generational collection
  6. Runtime switches to generational if amount of live objects drops below a threshold.

1.4.3 Local collector(split heap)

  1. Similar to baseline collection, but doesn't need to synchronize for local GC
  2. Shared heap collected by single core after sychronization
  3. Shared-heap collections infrequent because shared objects tend to live much longer than a tenured object in a generational collection scheme

1.4.4 Remembered stacks

  1. Threads communicate over message-passing channels
  2. If no reciever, sender blocks on channel
  3. If channels in shared heap, would eventually pull most objects into shared heap. Undesirable
  4. Channel stacks never moved to shared heap, so references in objects exported stored in a set of remembered stacks.

1.5 Cleanliness Analysis

1.5.1 Heap Session

  1. Local allocation is bump-a-pointer
  2. We have a reference into the heap called sessionStart. Before sessionStart is previous session. After sessionStart is current session.
  3. Only need to scan current session in cleanliness analysis.
  4. Sessions start on context switch, local GC, or export.

1.5.2 Reference count

  1. We steal 4 bits in each object header to record number of references within the current session.
  2. 4 bits encode zero, one, localmany, and globalmany
  3. Zero => Only references from stacks, globalmany => at least one reference outside current session.
  4. Implemented as part of write barrier. Non-decreasing, so represents a maximum.

1.5.3 Cleanliness

  1. An object is clean if:
    1. It is immutable
      1. Trivially safe to copy
    2. Object is a root, has zero references
    3. Object is not root, has one reference
    4. Object not root, in current session, and has localmany
      1. Can't know number of references without walking local heap(expensive) so we ask if all pointers originate from current session(LOCALMANY).
  2. Lifting and fixing references in transitive closure in current session cheaper than walking whole local heap.

1.6 Write barrier

1.6.1 If not primtive, and exporting, and clean, copy transitive closure to shared heap.

1.6.2 Delaying writes

  1. If source of exporting write not clean, context switch to another thread.
  2. Source is added to queue of objects to lift
  3. If all threads associated with core blocked, trigger local GC and perform all lifting and reference fixing
  4. Mutator never encounters forward reference.

1.6.3 Lifting objects to shared heap

  1. Triggers shared heap GC if lacks memory for copy
  2. Immutable objects copied, location added to immutable set, mutable objects forwarded
  3. Stack walked to fix references
  4. If localmany set for an object, current session walked to fix reference.

1.6.4 Remote spawns

  1. Spawning a new thread can cause function closures to escape local heaps
  2. Environment of function moved to shared heap, function closure added to target core's scheduler.
  3. If closure clean, we move it immediately. Else, we procrasinate.

1.6.5 Barrer Implementation

  1. Two local collectors implemented. One with read barriers, one which instead uses cleanliness/procrasination.

1.7 Target Architectures

1.7.1 Used 16-core linux machine, 48-core intel single-chip cloud computer, and 864-core azul vega 3.

1.8 Results

1.8.1 Benchmarks

  1. Input size and threads tuned to ensure adequate core utilization

1.8.2 Performance

  1. Ported Bohem-Demers-Weiser conservative GC to test against
  2. Consecutive tests done by shrinking maximum allowed heap size and recording running time and actual heap utilization.
  3. Read-barrier-less had better performance, but had higher minimum heap size.
  4. Stop-the-world gc has faster mutator time than alternatives, but impacted scalability because collection was serial when world stopped.
  5. Bohem GC slow.

1.8.3 Impact of cleanliness

  1. Without cleanliness:
    1. Had significantly more stalls/preemptions when making write
    2. 49% of local collections were forced when concurrency exhausted, as opposed to 1% otherwise.
    3. 28.2% slower execution time

1.8.4 Impact of Immutability

  1. Benchmark done with all objects treated as mutable
  2. 11.4% negative performance impact

1.8.5 Impact of heap session

  1. Without the local heap session(treat all localmany as reason to stall), significant increase in objects being tagged as non-clean

1.8.6 Related work

  1. GHC - Multicore GC with local heaps - SPJ
    1. Export only part of transitive closure to shared heap.
    2. Mandates need for read barrier, but GHC already has read barrier to test for thunk evaluation.
  2. Manticore

Created: 2014-10-22 Wed 17:05

Emacs 24.4.1 (Org mode 8.2.10)

Validate

No comments:

Post a Comment