Adventures In Garbage Collection: Read Barrier Elimination Through Concurrency

1. Eliminating Read Barriers through Procrasination and Cleanliness

1 Eliminating Read Barriers through Procrasination and Cleanliness

https://www.cs.purdue.edu/homes/suresh/papers/ismm12.pdf

1.1 Abstract

1.1.1 Goal is to avoid need for a read barrier

1.1.2 Use concurrency and stalling to avoid need for immediate copy

1.1.3 Can avoid need to stall, use language semantics to allow references in fromspace or copy when small set of references to object are known

1.2 Introduction

1.2.1 Use of parallelism and thread-local heaps with single shared heap uses write barriers to enforce

No inter-local-heap pointers
No pointers from shared heap to local heap

1.2.2 Terminology: exporting writes = writes which incur a write barrier and export object to shared heap and set up forwarding pointer.

1.2.3 Needs a read barrier because reference might refer to newly-exported object.

1.2.4 Read barrier cost is high, but removal is nontrivial.

1.2.5 Idea: Delay operation

Read barriers only need to exist if there are forwarding pointers which exist between move and collection
Can delay operations that create them
We can stall threads which need to cause copy to shared heap, procrastinating the move
GC is informed of these procrasinated operations at start of collection
As long as there are enough threads, cost of procrastination simply cost of context switch.

1.2.6 Idea: Avoid stalling using language semantics

Terminology: Cleanliness = A clean object can be copied without creating a forwarding pointer without interfering with language semantics.
Ex:
1. Immutable object
2. Objects only referenced by the stack in creating thread
3. Objects with known set of incoming references which can inexpensively be fixed on copy.
  1. Specialized write barrier can be used to make this less costly.
Effectiveness depends on
1. Most object being clean
2. Rarely have many popular, mutable objects
3. Sufficient concurrency to make stalling acceptable.
Implementation: MultiMlton. Doesn't require programmer intervention and change can be isolated to runtime.

1.3 Motivation

1.3.1 Implementation considered is multimlton.

1.3.2 Has userspace concurrency, many green threads mapped to single kernel thread.

1.3.3 Each core has local heap, and there's a single shared heap.

1.3.4 Max heap set at 3x min heap.

1.3.5 Extensive optimization to avoid heap usage

1.3.6 On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.

1.3.7 If object has been moved, read barrier returns new address. Therefore conditional read barrier.

1.3.8 Mutator spends ~20% of execution time in read barrier.

1.3.9 Brooks-style unconditional barrier would reduce time, but since most objects 3 words or less, forwarding pointer would have significant overhead.

1.3.10 Less than 1/100 percent of reads end up being to forwarding pointers. Therefore read barrier significant cost.

1.4 GC Design and Implementation

1.4.1 Threading system

Makes one thread per core, pins it to it.
Lightweight threads have core affinity.
Threads yield at GC safe point if preempted.

1.4.2 Baseline collector(stop the world)

Single shared heap
Tlabs
GC requires thread synchronization on barrier
Cheny two-space copying, but switches to mark-compact if ends up using more than half of heap.
Uses appel-style generational collection
Runtime switches to generational if amount of live objects drops below a threshold.

1.4.3 Local collector(split heap)

Similar to baseline collection, but doesn't need to synchronize for local GC
Shared heap collected by single core after sychronization
Shared-heap collections infrequent because shared objects tend to live much longer than a tenured object in a generational collection scheme

1.4.4 Remembered stacks

Threads communicate over message-passing channels
If no reciever, sender blocks on channel
If channels in shared heap, would eventually pull most objects into shared heap. Undesirable
Channel stacks never moved to shared heap, so references in objects exported stored in a set of remembered stacks.

1.5 Cleanliness Analysis

1.5.1 Heap Session

Local allocation is bump-a-pointer
We have a reference into the heap called sessionStart. Before sessionStart is previous session. After sessionStart is current session.
Only need to scan current session in cleanliness analysis.
Sessions start on context switch, local GC, or export.

1.5.2 Reference count

We steal 4 bits in each object header to record number of references within the current session.
4 bits encode zero, one, local_many, and global_many
Zero => Only references from stacks, global_many => at least one reference outside current session.
Implemented as part of write barrier. Non-decreasing, so represents a maximum.

1.5.3 Cleanliness

An object is clean if:
1. It is immutable
  1. Trivially safe to copy
2. Object is a root, has zero references
3. Object is not root, has one reference
4. Object not root, in current session, and has local_many
  1. Can't know number of references without walking local heap(expensive) so we ask if all pointers originate from current session(LOCAL_MANY).
Lifting and fixing references in transitive closure in current session cheaper than walking whole local heap.

1.6 Write barrier

1.6.1 If not primtive, and exporting, and clean, copy transitive closure to shared heap.

1.6.2 Delaying writes

If source of exporting write not clean, context switch to another thread.
Source is added to queue of objects to lift
If all threads associated with core blocked, trigger local GC and perform all lifting and reference fixing
Mutator never encounters forward reference.

1.6.3 Lifting objects to shared heap

Triggers shared heap GC if lacks memory for copy
Immutable objects copied, location added to immutable set, mutable objects forwarded
Stack walked to fix references
If local_many set for an object, current session walked to fix reference.

1.6.4 Remote spawns

Spawning a new thread can cause function closures to escape local heaps
Environment of function moved to shared heap, function closure added to target core's scheduler.
If closure clean, we move it immediately. Else, we procrasinate.

1.6.5 Barrer Implementation

Two local collectors implemented. One with read barriers, one which instead uses cleanliness/procrasination.

1.7 Target Architectures

1.7.1 Used 16-core linux machine, 48-core intel single-chip cloud computer, and 864-core azul vega 3.

1.8 Results

1.8.1 Benchmarks

Input size and threads tuned to ensure adequate core utilization

1.8.2 Performance

Ported Bohem-Demers-Weiser conservative GC to test against
Consecutive tests done by shrinking maximum allowed heap size and recording running time and actual heap utilization.
Read-barrier-less had better performance, but had higher minimum heap size.
Stop-the-world gc has faster mutator time than alternatives, but impacted scalability because collection was serial when world stopped.
Bohem GC slow.

1.8.3 Impact of cleanliness

Without cleanliness:
1. Had significantly more stalls/preemptions when making write
2. 49% of local collections were forced when concurrency exhausted, as opposed to 1% otherwise.
3. 28.2% slower execution time

1.8.4 Impact of Immutability

Benchmark done with all objects treated as mutable
11.4% negative performance impact

1.8.5 Impact of heap session

Without the local heap session(treat all local_many as reason to stall), significant increase in objects being tagged as non-clean

1.8.6 Related work

GHC - Multicore GC with local heaps - SPJ
1. Export only part of transitive closure to shared heap.
2. Mandates need for read barrier, but GHC already has read barrier to test for thunk evaluation.
Manticore

Created: 2014-10-22 Wed 17:05

Emacs 24.4.1 (Org mode 8.2.10)

Validate

Wednesday, October 22, 2014

Read Barrier Elimination Through Concurrency

Table of Contents

1 Eliminating Read Barriers through Procrasination and Cleanliness

1.1 Abstract

1.1.1 Goal is to avoid need for a read barrier

1.1.2 Use concurrency and stalling to avoid need for immediate copy

1.1.3 Can avoid need to stall, use language semantics to allow references in fromspace or copy when small set of references to object are known

1.2 Introduction

1.2.1 Use of parallelism and thread-local heaps with single shared heap uses write barriers to enforce

1.2.2 Terminology: exporting writes = writes which incur a write barrier and export object to shared heap and set up forwarding pointer.

1.2.3 Needs a read barrier because reference might refer to newly-exported object.

1.2.4 Read barrier cost is high, but removal is nontrivial.

1.2.5 Idea: Delay operation

1.2.6 Idea: Avoid stalling using language semantics

1.3 Motivation

1.3.1 Implementation considered is multimlton.

1.3.2 Has userspace concurrency, many green threads mapped to single kernel thread.

1.3.3 Each core has local heap, and there's a single shared heap.

1.3.4 Max heap set at 3x min heap.

1.3.5 Extensive optimization to avoid heap usage

1.3.6 On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.

1.3.7 If object has been moved, read barrier returns new address. Therefore conditional read barrier.

1.3.8 Mutator spends ~20% of execution time in read barrier.

1.3.9 Brooks-style unconditional barrier would reduce time, but since most objects 3 words or less, forwarding pointer would have significant overhead.

1.3.10 Less than 1/100 percent of reads end up being to forwarding pointers. Therefore read barrier significant cost.

1.4 GC Design and Implementation

1.4.1 Threading system

1.4.2 Baseline collector(stop the world)

1.4.3 Local collector(split heap)

1.4.4 Remembered stacks

1.5 Cleanliness Analysis

1.5.1 Heap Session

1.5.2 Reference count

1.5.3 Cleanliness

1.6 Write barrier

1.6.1 If not primtive, and exporting, and clean, copy transitive closure to shared heap.

1.6.2 Delaying writes

1.6.3 Lifting objects to shared heap

1.6.4 Remote spawns

1.6.5 Barrer Implementation

1.7 Target Architectures

1.7.1 Used 16-core linux machine, 48-core intel single-chip cloud computer, and 864-core azul vega 3.

1.8 Results

1.8.1 Benchmarks

1.8.2 Performance

1.8.3 Impact of cleanliness

1.8.4 Impact of Immutability

1.8.5 Impact of heap session

1.8.6 Related work

No comments:

Post a Comment