Sunday, September 28, 2014

Chapters 13-19

Chapter 13: Concurrency Preliminaries

13.1 Hardware



Threads

  • Thread states = ready, blocked, or running
  • Threads can move processors when descheduled
    • Can use processor affinity
  • Simultaneous multithreading or hyperthreading (multiple hardware threads sharing the same execution pipeline) is not actual parallel threading, but similarly named
  • Multiprocessor = More than one processor
  • Multicore/CMP/Many-core = many processors on same integrated circuit
  • Multithreaded = software whose threads run concurrently on a multiprocessor
  • Multiprogrammed = software executing multiple threads on a single processor

Interconnect

  • Shared memory differentiates local concurrent software from distributed systems
  • Access mediated by bus or interconnect
    • Bus = single point that all requests pass through
    • Granularity of memory access a tradeoff between bandwidth and number of busses
    • Busses a bottleneck on 8-16+ core systems
  • SMP vs NUMA
    • SMP = All processors have equal access time to memory
    • NUMA = Each processor has a small bank of memory to which it has faster access

Memory

  • Because of concurrent access to memory, hard to describe it as being in a single global state
  • Acceptable because we only ever access one unit of memory at a time, and each such access is consistent

Caches

  • Memory access is slow relative to processors, so we use faster and smaller caches
  • If needed address in cache, called a cache hit
  • If needed address not in cache, called a cache miss
    • On cache miss, we bring memory's values into cache
    • If cache full, we must evict a cache line. Evicted line called victim.
  • If data is written to memory immediately on every write, the cache is called write-through. If modified lines are written back only when flushed, it is called write-back.
  • Cache Organization
    • Cache Location
      • Fully Associative
        • Lines from any set of memory addresses can reside in the cache together at the same time
      • Direct-Mapped
        • Each address has a specific place in cache it must reside
      • k-way Set Associative Caches
        • Each address has k possible cache locations in which it may reside
        • Can evict whichever of the k lines it needs
      • Victim caches
        • On eviction, a line is put in a fully associative victim cache
        • Allows victims to be recalled more easily
    • Cache level relationship
      • Strictly inclusive if a line held by a lower level must also be held by the higher level
      • Strictly exclusive if a line held by a lower level may not be held by the higher level
      • Can be neither, may allow anything

Coherence

  • Multiple levels of cache may get out of sync with one another, holding different values for the same address
  • Common cache coherency protocol: MESI
    | State     | Description                                                                                                      |
    | Modified  | This cache is the only one holding a copy of the line; it has been modified but not written back to main memory. |
    | Exclusive | Only this cache holds the address; the line is not modified.                                                     |
    | Shared    | Other caches may hold the line; all hold the same value.                                                          |
    | Invalid   | The cache does not hold the line.                                                                                 |
    • To satisfy a read, the cache must hold the line in M, E, or S state. How to satisfy a write depends on the state of the line.
    • Point is that there is only one writer for a given line, and no caches hold disagreeing values.
    • Does not scale well to many processors.
  • False sharing = if multiple processors frequently access and update different data in the same cache line, the line will be moved and invalidated between processors frequently. "Ping-ponging"
  • Performance example: Spin locks
    • Poor because threads on different processors will contend for line with lock and invalidate it frequently.
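
A minimal Java sketch of the ping-ponging described above (class and constants are mine; assumes 64-byte cache lines): two threads increment adjacent array slots sharing a cache line, then slots padded a full line apart.

  import java.util.concurrent.atomic.AtomicLongArray;

  public class FalseSharingSketch {
      static final int PADDED_STRIDE = 8;  // 8 longs = 64 bytes, one cache line on most machines

      public static void main(String[] args) throws InterruptedException {
          run(1);              // slots 0 and 1: same cache line, heavy invalidation traffic
          run(PADDED_STRIDE);  // slots 0 and 8: separate lines, no false sharing
      }

      static void run(int stride) throws InterruptedException {
          AtomicLongArray counters = new AtomicLongArray(2 * PADDED_STRIDE);
          Runnable incFirst = () -> { for (int i = 0; i < 50_000_000; i++) counters.incrementAndGet(0); };
          Runnable incSecond = () -> { for (int i = 0; i < 50_000_000; i++) counters.incrementAndGet(stride); };
          long start = System.nanoTime();
          Thread a = new Thread(incFirst), b = new Thread(incSecond);
          a.start(); b.start(); a.join(); b.join();
          System.out.printf("stride %d: %d ms%n", stride, (System.nanoTime() - start) / 1_000_000);
      }
  }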

13.2 Hardware Memory Consistency

Prologue

  • Writes to a given address have a total order, and processors won't see a state change without a write having happened between both accesses.
  • Order of writes to more than one address do not correspond to actual changes in memory or cache
    • Ex: Write buffer that coalesces multiple writes by keeping only the later value rather than writing both earlier and later values. A read will fetch the value from the write buffer if it's in there.
    • Ex: Compilers may reorder instructions
    • Almost universally done for performance gains
  • Non-atomic, relaxed ordering can break program invariants
  • Three access types: read, write, atomic.
    • Atomic = read-modify-write operations such as test-and-set

Fences and happens-before

  • Fences prevent certain types of access from being reordered across a point in the program
  • Alternative is an acquire-release pair, which prevents accesses from moving before the acquire or after the release.
    • Useful for protecting critical sections
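
A sketch of an acquire-release pair using Java's VarHandle API (class and field names are mine): the release store cannot be reordered with the earlier write to data, and the acquire load cannot be reordered with the later read, which is exactly the critical-section protection described above.

  import java.lang.invoke.MethodHandles;
  import java.lang.invoke.VarHandle;

  public class AcquireReleaseSketch {
      static final VarHandle READY;
      static {
          try {
              READY = MethodHandles.lookup()
                  .findStaticVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
          } catch (ReflectiveOperationException e) {
              throw new ExceptionInInitializerError(e);
          }
      }
      static boolean ready;
      static int data;

      static void producer() {
          data = 42;               // plain write, ordered before the release store
          READY.setRelease(true);  // release: publishes data to any acquiring reader
      }

      static void consumer() {
          while (!(boolean) READY.getAcquire()) Thread.onSpinWait();
          assert data == 42;       // acquire guarantees the write to data is visible
      }
  }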

Consistency models

  • Strict consistency => All operations have a total ordering everywhere in system
  • Sequential consistency => All operations appear in a single total order that is consistent with each processor's program order.
  • Weak consistency => All atomic operations are total fences
  • Causal consistency => Enforces that reads not be reordered across writes that may alter the read's value.

13.3 Hardware Primitives

AtomicExchange and Test-and-set are both fairly simple. They set a value unconditionally and return the old value. This can be seen as setting a value conditionally if the semantics are "only proceed if a certain value was returned."
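
A Java sketch of that idiom (my code, not the book's pseudocode): AtomicBoolean.getAndSet is an unconditional exchange, and the lock is taken only when the returned old value was false. The inner plain-read loop is the test-and-test-and-set refinement, which spins on the locally cached line rather than ping-ponging it (see the spin lock discussion in 13.1).

  import java.util.concurrent.atomic.AtomicBoolean;

  class SpinLock {
      private final AtomicBoolean locked = new AtomicBoolean(false);

      void lock() {
          while (true) {
              // spin on a plain read until the lock looks free
              while (locked.get()) Thread.onSpinWait();
              // unconditional exchange; proceed only if the old value was false
              if (!locked.getAndSet(true)) return;
          }
      }

      void unlock() { locked.set(false); }
  }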

Compare and Set

  • Takes an old and new value and if the old value is equal to the value currently in the address, write the new value. Returns whether or not updated.

Compare and Swap = Same as compare and set but returns old value

  • Trap is that the value might have changed multiple times and is now equal to the older value being checked for
  • Ex: A pointer is freed and its memory reused for an int that happens to have the same bit pattern; a compare-and-swap sees equal bits and wrongly proceeds.
  • Called the ABA problem
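
Java's AtomicStampedReference is a standard-library answer to ABA: it pairs the reference with a version stamp, so a compare-and-set fails if the value was changed away and back. A minimal sketch (wrapper class is mine):

  import java.util.concurrent.atomic.AtomicStampedReference;

  class StampedCell<T> {
      private final AtomicStampedReference<T> cell;

      StampedCell(T initial) { cell = new AtomicStampedReference<>(initial, 0); }

      boolean update(T expected, T replacement) {
          int[] stampHolder = new int[1];
          T current = cell.get(stampHolder);       // read value and stamp together
          if (current != expected) return false;
          // Fails if the reference OR the stamp changed in the interim, even if
          // the reference is once again equal to `expected` (the ABA case).
          return cell.compareAndSet(expected, replacement, stampHolder[0], stampHolder[0] + 1);
      }
  }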

Load-Linked / Store-Conditional

  • Use processor's cache coherence mechanism to detect changes at a memory address.
  • Solves the ABA problem very well

Atomic Arithmetic: Atomic increments and decrements

  • Not as powerful as other primitives. Have finite consensus numbers
  • A primitive with a consensus number of N allows N threads to correctly agree on (reach consensus about) a single value

More powerful primitives

  • Multi-word compare-and-swap can use a counter in the second word to avoid the ABA problem
  • More useful to update two non-adjacent words atomically
    • Some chips like 88000 have such primitives

Overheads of atomic primitives

  • The expense of atomic primitives leads programmers to avoid them, and thence to error
  • Expensive for two reasons:
    • Cache Coherence
      • Must acquire exclusive access to the line
    • Memory fences sacrifice the optimizations that reordering allows

13.4 Progress Guarantees

Important to guarantee that the threads contending will make progress, bound wait times

Guarantees:

  • Wait-free: Every thread can always make progress regardless of actions of other threads.
    • Typically involves threads helping each other along
    • If a thread's action would make another thread that already holds resources unable to proceed, the first thread waits until the other leaves its critical section
  • Obstruction-free: If given a period of isolated execution, each thread will complete in a finite number of steps.
    • Easier to achieve than wait-freedom, but may require scheduler cooperation
    • Threads can tell if they are contending and will back off
  • Lock-free: Infinitely often, some thread will finish its operation in a finite number of steps
    • Requires only that at least one contender makes progress at any given time.

Guarantees and Concurrent Collection

  • Parallel collectors use multiple threads to collect, but stop mutation threads to collect.
  • Concurrent collectors perform some actions while mutation threads run
  • Concurrent collection has implementation edge cases
    • When objects are freed or become garbage during collection
    • Proof of termination
      • Mutators may need to be throttled back to avoid making more work than the collector threads can complete
      • Need to prevent mutators from using all memory before collection finishes.
  • May offer different guarantees for different portions.

13.5 Notation Used for Concurrent Algorithms: Not relevant. Used for describing pseudocode.

13.6 Mutual Exclusion

Need to balance progress guarantees and mutual exclusion

  • Peterson's Algorithm a good tool (see the sketch below)
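
A sketch of Peterson's algorithm for two threads in Java (my rendition; the volatile fields supply the sequentially consistent accesses the algorithm requires, per the consistency models in 13.2):

  class Peterson {
      private volatile boolean wants0, wants1;
      private volatile int turn;

      void lock(int me) {  // me is 0 or 1
          if (me == 0) {
              wants0 = true; turn = 1;                        // announce intent, defer to other
              while (wants1 && turn == 1) Thread.onSpinWait();
          } else {
              wants1 = true; turn = 0;
              while (wants0 && turn == 0) Thread.onSpinWait();
          }
      }

      void unlock(int me) { if (me == 0) wants0 = false; else wants1 = false; }
  }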

13.7 Work Sharing and Termination Detection

Parallel algorithms need a way to detect termination of parallel algorithm

Simple if processing a finite amount of data, but sometimes threads create more work to process

  • Can push work to other threads or pull from other threads. Pulling called work-stealing
  • No work may be lost
  • Counters may become bottlenecks, cause cache contention

May use a separate thread whose only job is to detect termination

  • Workers set a busy flag; the detector scans the flags.
  • Needs to track jobs pushed or pulled so that it can atomically scan all flags.

Rendezvous Barriers

  • Common to need all threads to hit same phase of collection.
  • May simply need all threads to reach a given point, called a rendezvous barrier
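
java.util.concurrent's CyclicBarrier is exactly such a rendezvous barrier; a sketch (class and phase names are mine) of GC workers rendezvousing between phases:

  import java.util.concurrent.BrokenBarrierException;
  import java.util.concurrent.CyclicBarrier;

  class PhasedWorker implements Runnable {
      private final CyclicBarrier rendezvous;   // shared by all worker threads
      private final Runnable markPhase, sweepPhase;

      PhasedWorker(CyclicBarrier rendezvous, Runnable mark, Runnable sweep) {
          this.rendezvous = rendezvous; this.markPhase = mark; this.sweepPhase = sweep;
      }

      public void run() {
          try {
              markPhase.run();
              rendezvous.await();   // no worker sweeps until every worker finished marking
              sweepPhase.run();
              rendezvous.await();
          } catch (InterruptedException | BrokenBarrierException e) {
              Thread.currentThread().interrupt();
          }
      }
  }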

13.8 Concurrent Data Structures

If structure accessed rarely, can simply lock it entirely with a mutex, but won't scale

Concurrent structures exist that allow an interleaving of certain actions on the structure.

  • Is linearizable if overlapping actions have the same effect as some sequential ordering of them.
  • Each action has a point at which it can be observed to have occurred. Called the linearization point.
  • Variety of locking patterns that can be used:
    | Name                   | Description |
    | Coarse-grained locking | One large lock for the whole structure |
    | Fine-grained locking   | Lock subsets or individual elements. Locking order must be kept consistent or deadlock will result. |
    | Optimistic locking     | Search the structure without locks, then lock and verify that the structure hasn't changed since searching and that the right elements are locked. |
    | Lazy update            | Use metadata to mark an element as being removed or altered, release the lock, then acquire a lock to make the change. Prevents blocking reading threads. |
    | Non-blocking           | Use atomic primitives to remove the need for locking. |

Concurrent Stacks
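
A minimal Java sketch of Treiber's classic lock-free stack (my rendition, not the book's pseudocode). The successful compare-and-set on the head is each operation's linearization point; note that Java's GC sidesteps the ABA problem a manual-memory version would face when a popped node's address is reused.

  import java.util.concurrent.atomic.AtomicReference;

  class TreiberStack<T> {
      private static final class Node<T> {
          final T value; final Node<T> next;
          Node(T value, Node<T> next) { this.value = value; this.next = next; }
      }

      private final AtomicReference<Node<T>> head = new AtomicReference<>();

      void push(T value) {
          Node<T> oldHead;
          do {
              oldHead = head.get();
          } while (!head.compareAndSet(oldHead, new Node<>(value, oldHead)));
      }

      T pop() {
          Node<T> oldHead;
          do {
              oldHead = head.get();
              if (oldHead == null) return null;   // empty stack
          } while (!head.compareAndSet(oldHead, oldHead.next));
          return oldHead.value;
      }
  }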

13.9 Transactional Memory

Implementation:

  • All reads and writes in a transaction should appear to not interleave with any other threads'.
  • Should allow over many independent words.
  • Attempts to commit the transaction may fail. If succeeds, changes seen. If fails, software may respond by retrying or doing other steps.
    • Allows transactions to execute speculatively.
  • Atomic, Consistent, and Isolated.
  • If two transactions will interfere with each other, one must be killed to allow the other to proceed.
  • Atomicity methods
    • Undoing
      • Creates an undo log, actually makes writes to main memory.
      • On failure, undoes instructions in log.
    • Buffering
      • Create a scratch buffer and store all writes to it. On success, write to memory. On failure, discard.
      • More work for successful commits, and success is common case.
  • Conflict detection
    • Can use eager or lazy methods
    • Eager:
      • Checks all accesses for conflicts
    • Lazy:
      • Checks for conflicts only when a transaction attempts to commit
Use for collection

  • Caveats:
    • STM has a significant overhead
    • Hardware transactional memory may have gotchas, not obvious how code maps to transactions, may have strange cache behavior.
    • May interfere with guarantees because the retry behavior is less understandable than hand-written locking
    • May require very careful performance tuning.

Interaction with languages with transaction memory

  • Memory manager may cause more transactions to fail, cause many retries
  • May have collector and mutator fight for transaction, constantly restart
    • Harder to handle with hardware transactional memory
  • Transaction logs need to be part of the root set, or garbage created during a failed transaction would cause premature freeing
  • Undoing allocation may require a lock on allocation pointers or free-list metadata, undesirable
  • If non-transaction actions allowed during transaction, need to make sure that allocated memory is initialized.

13.10 Issues to Consider

Hard to get right

What guarantees are actually necessary?

What atomic primitives are supported on our architectures?

Chapter 14: Parallel Garbage Collection

14.1 Is There Sufficient Work to Parallelize?

Goal is to reduce overhead by exploiting hardware

In stop-the-world, will shorten pause times

In concurrent schemes, will shorten GC cycle time

Must be worth it in the tradeoff of the overhead of synchronization primitives

In practice, not many stalls because the object graph is not very deep. This is amenable to parallelism.

Harder to parallelize tracing, easier to parallelize "sweep" activities like fixing references.

14.2 Load Balancing

Needs to properly allocate work to threads, or will be mostly sequential execution

Has coordination cost

Methods

  • May schedule all work ahead of time, such as with a sweep that works by dividing by region. Called static balancing
  • Most tasks need dynamic balancing. Use a heuristic to divide labor.
    • Good solution is to over-partition the work into more pieces than threads and allow threads to take them off a queue.
      • More accurately partitions based on real work than heuristics

14.3 Synchronization

Over-dividing work into too-small granules incurs synchronization overhead, despite better processor use

Need to prevent double-copying because of problems like future updates only affecting one version, inconsistent state.

Synchronization may be removed where actions are idempotent and repeating work not too costly.

14.4 Taxonomy

Processor-centric algorithms

  • Gets work granules of different sizes
  • Usually by work-stealing
  • No consideration of locality

Memory-centric algorithms

  • Take locality into account.

14.5 Parallel Marking

Prologue

  • Marking is 3 actions: Getting an object, marking it, and adding children to a work-list
  • All known algorithms are processor-centric
    • No synchronization required if work-list is thread-local
    • Marking can be idempotent (setting a bit)
    • Synchronization required for work-stealing
  • If marks are stored in bitmap, access needs synchronization.
  • May need to move work from thread-local store
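
A sketch of those three marking actions with marks in a shared bitmap (all names are mine; childrenOf stands in for whatever object-graph accessor the collector uses). The compare-and-set on the bitmap word makes concurrent marking of the same object safe and idempotent, while the work-list stays thread-local and synchronization-free:

  import java.util.ArrayDeque;
  import java.util.concurrent.atomic.AtomicLongArray;
  import java.util.function.IntFunction;

  class Marker {
      private final AtomicLongArray markBits;                          // one bit per object slot
      private final ArrayDeque<Integer> workList = new ArrayDeque<>(); // thread-local

      Marker(int heapSlots) { markBits = new AtomicLongArray((heapSlots + 63) / 64); }

      // Returns true only for the thread that wins the race to mark the object.
      boolean setMarked(int slot) {
          int word = slot >>> 6;
          long bit = 1L << (slot & 63);
          while (true) {
              long old = markBits.get(word);
              if ((old & bit) != 0) return false;                      // already marked
              if (markBits.compareAndSet(word, old, old | bit)) return true;
          }
      }

      void mark(int root, IntFunction<int[]> childrenOf) {
          workList.push(root);
          while (!workList.isEmpty()) {
              int obj = workList.pop();                                // 1. get an object
              if (setMarked(obj))                                      // 2. mark it
                  for (int child : childrenOf.apply(obj))
                      workList.push(child);                            // 3. add its children
          }
      }
  }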

Processor-centric Techniques

  • Work stealing
    • If its work-list is empty, a thread tries to steal from other threads' work-lists
    • Needs to lock, but can give up if unable to acquire the lock and move to the next thread's queue
    • Problem with processing large array of pointers. Local work stack will overflow.
    • Solution is to push arrays onto global work queue as smaller lists of pointers
    • Only requires synchronization on taking of work, faster than other algorithms which always use synchronization
    • Termination for work stealing
      • Use of an atomic counter possible, but contention can lead to serialization
      • Thread-local counters allow for introspection without locks
      • Detection involves setting a "no threads stole work" flag, trying to steal work, and on failure checking whether the flag has been unset by a thread acquiring work. If it is still set, terminate.
      • May be worthwhile to have lock for termination detection
  • Grey Packets
    • Divide work into packets and have threads contend for work packets
    • Each thread has an output packet, and steals work to add to input packet
      • Adds children of marked object to output packet
      • Called grey packets because input and output packets are both grey in tricolor abstraction
      • Allows for packet object prefetching.
      • Only require synchronization to get input packet and put output packet
      • Termination state is that the total number of empty packets (outputs with no grey elements) equals the total number of packets.
  • Channel
    • Create a channel (a queue with a single reader and writer) from each thread to each other thread
    • Store channels in circular buffer
    • Push children of marked object onto other thread's queue
      • If no children, take work from global pool and push instead.
    • No synchronization needed, and performs better than work stealing on computers with many processors

14.6 Parallel Copying

Differs from marking in that may only copy once

Processor-Centric Techniques

  • Work dividing
    • Each copying thread given a stack of work to do.
      • Stacks offer easier synchronization
    • Synchronization works with "rooms," divided into pushing rooms and popping rooms
      • Must always have one empty pushing and popping room
      • A new thread moves to the non-empty popping room and pops work out of it, putting generated work onto its stack. Leaves the pop room when done.
      • Once all threads leave the pop room, the pop room is locked to future thread entrance and each thread empties its queue into the push room's shared stack, leaving the push room when done.
      • Last thread to leave push room unlocks pop room.
    • Problem with waiting. Because work time varies between threads, may have significant wait time.
      • Can fix by making threads leave pop room as soon as they get work, since working doesn't depend on shared stack.
    • Terminates when the shared stack is empty as the last thread leaves the push room, and no objects are being worked on outside of the rooms.
  • Copying in parallel
    • If all allocation done in the same region, need to synchronize on allocation.
    • May race to write "busy" pointer into object's forwarding slot.
      • If use thread-local buffers for allocation, can race to write forwarding location directly.
        • May need to un-allocate if another thread has already copied in the interim.
    • If local allocation buffer bounded, NUMA architectures may suffer due to allocation on another processor
      • NUMA awareness should be baked into the allocator

Memory-Centric Techniques

  • Per-thread fromspace and to-space
    • Risks poor load-balancing and space overflow
  • Block-structured heaps
    • Over-partition fromspace and tospace and have threads compete for granules of fromspace
    • Small blocks a good idea
    • A good way to fight wasted space due to fragmentation in each subset of tospace is to use big-bag-of-pages.
    • Naive implementation uses breadth-first copying.
      • Separates parents from children, poor locality
  • Channels
    • Removes need for atomic synchronization primitives
    • Designed for NUMA
    • Over-partition heap, consider each subset a work-list
    • If a thread sees a reference in its own partition, adds it to its work list
    • Otherwise sends to other thread through channel
  • Card Tables
    • Difficult to balance load with card tables
    • Straightforward partition has problems because cards tend to have uneven amount of live data, work
    • Solution is to overpartition

14.7 Parallel Sweeping

Embarrassingly parallel

Either partition heap statically or over-partition

  • Adding garbage to free list becomes bottleneck
  • Segregated fits a good solution, also thread-local allocation

Giving each thread multiple blocks a good way to reduce contention

14.8 Parallel Compaction

Easy to parallelize since after marking we know the future location of each live object (since we know the amount of live data seen before it during marking).

Multiple sweeps required:

  • 1. Calculate forwarding addresses and insert into headers
  • 2. Update all references in objects
  • 3. Copy objects

Can do each in parallel, but need all threads to reach a rendezvous before the next phase starts.
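
A sketch of phase 1 for a single thread's region (the tables and names are mine, standing in for the mark data a real collector keeps): a live object's forwarding address is just a running prefix sum of live-object sizes added to the region base.

  class ForwardingCalc {
      // live[i] and size[i] describe object slot i; returns each live slot's new address.
      static long[] computeForwarding(boolean[] live, int[] size, long regionBase) {
          long[] forward = new long[live.length];
          long next = regionBase;                // next free address in the target region
          for (int i = 0; i < live.length; i++) {
              if (live[i]) {
                  forward[i] = next;             // destination = base + live bytes before it
                  next += size[i];
              }
          }
          return forward;
      }
  }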

Chapter 15: Concurrent Garbage Collection

15.1 Correctness Of Concurrent Collection

Must be able to retain all reachable objects, and must eventually complete collection cycle

Tricolor Abstraction

  • Easy to reason about correctness with tricolor abstraction
  • Collection can be seen as turning the pool of grey (reached but unscanned) objects into black (scanned, live) ones, leaving unreached objects white (dead).
  • Concurrent mutation can make mutator and collector's views of world inconsistent
  • If mutator changes white or grey objects it is fine because collector still needs to scan
  • If mutator changes black(already scanned) objects, is alright as long as new references already known as reachable.

Can still cause problems by freeing live objects

Lost Object problem

  • Two failure scenarios
    • A mutator can hide a white object directly reachable from a grey object by inserting a reference to it into a field of an already-scanned object and deleting the link from the grey object
    • Can hide a transitively reachable object by inserting a pointer from the chain into an already-scanned object, and then breaking the pointer chain in unscanned objects.
  • Only possible to lose objects if two conditions hold:
    • Mutator stores a pointer to white object in black object
    • All paths from any grey objects to the white object are destroyed
  • Strong and weak tricolor invariants:
    • Weak tricolor invariant: All white objects pointed to by a black object are grey protected (reachable from some grey object).
      • To prevent both from holding, we must make sure that all references from black objects to white objects are tracked
      • Non-copying collectors ensure this because of traversal pattern, all references from black objects are black
    • Strong tricolor invariant: There are no pointers from black objects to white objects.
    • Strong implies weak, but not the converse
    • Ensuring strong is sufficient for copying collector
  • Precision
    • Varying precision of live object tracking mean holding onto some garbage for a duration
    • Stop the world has maximal precision, concurrent schemes typically trade precision for concurrency
  • Mutator color
    • Useful to consider mutator thread as if it were an object, and roots like references.
    • If a mutator is allowed to become grey, its roots need to be rescanned
    • May be multiple mutator threads, need to handle
  • Allocation color
    • Need to consider which color to color new allocations
    • Must not let mutator's references to allocated cell break tricolor invariant.
  • Incremental update solutions
    • Need a write barrier to inform collector of mutator actions which would violate the invariants
  • Snapshot-at-the-beginning solutions
    • If a reference to an object removed during collection, need to conservatively treat as root
    • Requires write barrier

15.2 Barrier Techniques for Concurrent Collection

Collector can respond to concurrent mutation in only three useful ways

  1. Add to the wavefront (the grey black/white divide) by shading an object grey if it was white (essentially adding a root)
  2. Advance wavefront by scanning object to make black
  3. Retreat the wavefront by marking black object grey ('unscanning')
  • Any other action (reverting an object to white, or shading it black without scanning) would break the invariants.

Grey Mutator Barrier Techniques

  • Assumes a grey mutator
  • Use an insertion barrier to prevent storing white pointers in black objects
  • Because grey, no read barrier needed
  • Techniques:
    • On writing a pointer into a black object, make object grey to require re-traversal. (Steele)
    • Can mark any object with a pointer written into it grey, since doing so is idempotent if already grey and harmless if white. Removes the need to check the state of the object being modified. (Boehm)
    • Always shade the target of an inserted pointer, regardless of whether the reference is later removed. Ensures progress of concurrent threads. (Dijkstra; see the sketch below)
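
A sketch of the Dijkstra-style insertion barrier (all types and helpers here are hypothetical stand-ins, not the book's code): shading the target of every inserted pointer means the mutator can never install a black-to-white pointer, preserving the strong invariant.

  class DijkstraBarrier {
      static final int WHITE = 0, GREY = 1, BLACK = 2;

      static class Obj { volatile int colour = WHITE; Obj[] fields = new Obj[4]; }

      static volatile boolean marking;   // true while a collection cycle is running

      static void shade(Obj o) {
          if (o.colour == WHITE) o.colour = GREY;   // plus: enqueue o on the collector's work list
      }

      static void writePointer(Obj src, int i, Obj target) {
          if (marking && target != null) shade(target);  // shade the new target unconditionally
          src.fields[i] = target;                        // the actual pointer store
      }
  }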

Black Mutator Techniques

  • Can use a read barrier to mark any white objects grey as soon as their reference is read, written, or deleted.
  • Can use virtual memory hardware to implement access barrier of black fromspace.
  • Can do the converse and shade the targets of deleted references grey, to prevent removal of the last observed pointer from freeing live objects.

Completeness of mutator techniques

  • No other organization schemes can preserve the tricolor invariant under concurrent collection (Pirinen 1998)
  • Can make variations by coarsening granularity and promptness of these techniques.
    • Ex: Mark targets of altered references grey and re-scan
    • May cause difficulties in ensuring termination
  • Strong invariant techniques need only to ensure that one of the two conditions cannot hold, don't need to do any more
  • Weak invariant / snapshot techniques need black mutator

Concurrent write barrier mechanisms

  • Prologue
    • Mutator barriers need to store references to grey objects in a structure, which collector needs to use to color relevant objects
    • Structures need to be safe for both mutator-collector concurrency and mutator-mutator concurrency
    • Can use a log to record grey objects, also card tables.
    • Card tables
      • Dirties a bit to record the grey color of an object in a range of addresses
      • What counts as grey (targets of interesting pointers, as well as simply transitive references through the stack's roots) varies with collection technique
      • Card table becomes collector's work list
        • Mutator can re-dirty cards so collector needs to scan repeatedly
      • One level card tables
        • Card is in one of three states: dirty, refining, or clean
        • Mutator barriers set to dirty using store instruction, no expensive atomic primitives
        • Collector sets to refining without atomic instruction, starts cleaning
        • Collector tries to set it to clean with an atomic primitive (a compare-and-set that checks it is still refining). If unable to write, a mutator must have set it to dirty while the collector was refining, and it must be re-scanned.
      • Two level card tables
        • Overhead of card table is in scanning cost, since most will be clean.
        • Cheaper to also keep a card table which records the location of dirtied cards at a much coarser (2^n) granularity.
        • Concurrency issue: Mutator dirties fine grained and then coarse-grained cards while collector reads coarse-grained and then fine-grained. May consider all clean in race. Preserving atomicity may need primitives with overhead in mutator.
    • Reducing Work
      • Don't need to recursively mark children in mutator if we know that during collection we will have to trace through children on the dirty card.
        • Avoids traversing graph twice
      • Reduce number of dirty cards by only marking dirty if untraced.
        • Can keep a card table of traced objects
        • Can use local allocation buffer to record allocations in a buffer to trace. When the buffer overflows, trace all and mark all as clean.
          • Leaves some garbage for next cycle, but faster and ensures progress of threads.
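
A sketch of the one-level dirty/refining/clean protocol described above (card size and types are assumptions): the mutator barrier is a cheap unconditional store, and only the collector's final refining-to-clean transition needs an atomic instruction.

  import java.util.concurrent.atomic.AtomicIntegerArray;

  class CardTable {
      static final int LOG_CARD_SIZE = 9;      // 512-byte cards (an assumption)
      static final int CLEAN = 0, REFINING = 1, DIRTY = 2;

      final AtomicIntegerArray cards;

      CardTable(long heapBytes) { cards = new AtomicIntegerArray((int) (heapBytes >>> LOG_CARD_SIZE)); }

      // Mutator write barrier: a plain (ordered, non-atomic) store, no expensive primitives.
      void dirty(long address) { cards.lazySet((int) (address >>> LOG_CARD_SIZE), DIRTY); }

      // Collector: returns false if a mutator re-dirtied the card while we were
      // refining, in which case the card must be scanned again.
      boolean clean(int card, Runnable scanCard) {
          cards.set(card, REFINING);           // no atomic instruction needed here
          scanCard.run();                      // trace the objects on this card
          return cards.compareAndSet(card, REFINING, CLEAN);
      }
  }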

Issues to consider

In some systems, mutator barriers harm throughput less than the shorter collection times help

  • Ex: A multi-phase payment system where a GC pause means a timeout, which requires reprocessing.

Costs vary by collection scheme.

  • Ex: Reference counting = high, while mark-sweep is low

Large impact on pause time is whether precision must be high, or if we can allow garbage to "float" from one sweep into the next, where it will be freed in the next sweep.

Very useful for large heaps where the amount of live data is very high; even with a parallel collector, pause times would be long.

Downsides:

  • Needs more headroom because allocating during collection.

Variants:

  • Generational partially-concurrent hybrid collector
    • Manage older generation frequently
    • Since most objects die young, nursery collection is quick enough
    • Avoids mutator barrier to younger objects because stop-the-world
    • Biggest performance loss in concurrent collectors is that they always mark objects allocated during collection black, though most objects die young
      • Hybrid is solution

Chapter 16: Concurrent Mark-sweep

Prologue

Mark-sweep very easy to parallelize

Also fairly performant because don't require read barrier.

  • Write barrier is 2-3% overhead in benchmarks. Read barrier is 20% overhead.

16.1 Initialization

Rather than running at exhaustion, collectors can run concurrently with allocation

When to trigger new mark phase critical

  • Must not run out of memory before collection finished

Simple implementation is to require all allocation to do a bit of collection work (hijacking the mutator thread)

Needs to stop the world to get mutator roots usually

  • Can use grey reference access barrier to avoid stopping all mutators, incrementally add work as mutators yield to collection work, but at least one mutator needs to be scanned to start work list.

16.2 Termination

If all mutator threads black, can terminate since no grey pointers held, so anything not scanned is garbage.

Grey mutator termination requires scanning mutator roots to ensure that no new references introduced to unmarked objects.

16.3 Allocation

New objects colored according to mutator

Since a black mutator could not have obtained a white/grey object reference, we allocate objects black.

Grey mutator has options:

  • Can simply color black during marking and add to roots, and white otherwise
  • If objects are initialized with all their fields at creation (immutable languages), we can color based on the references: if none are white, can allocate black; otherwise must allocate white.
  • Allocating white may lead to high allocation rate in mutator preventing prompt termination

16.4 Concurrent Marking and Sweeping

Need way to differentiate between newly allocated data and marked garbage.

Solution = new color for abstraction: purple

  • After marking, mark unmarked objects as pending freedom
  • Allocate new objects without this mark
  • On sweeping, free only objects marked as pending
  • Problem is that this requires touching all objects in the heap before the paused-mutator phase can end
  • Solution is to interpret the same colors differently during different collection cycles
    • Easy solution is an 'even-odd' scheme: the mark sense flips each cycle. Live data carries the current sense, allocation creates objects with the current sense, and sweep frees objects with the old sense.

16.5 On-The-Fly Marking

Prologue

  • Alternative to stopping world to scan roots is to scan each mutator independently
  • Uses handshake to ask each mutator to stop at next safe point
  • Scans roots of mutator, and then sends on way
  • Can use stack barrier to limit mutator's actions to lower frames, so that collector can scan higher roots
    • Makes handshake and pause much shorter

Write barriers for on-the-fly collection

  • The existence of intermixed white and black mutator threads allows a deletion by a white mutator plus an insertion by a black mutator to leave the black mutator holding the only reference to a white object, violating the strong tricolor invariant

Doligez-Leroy-Gonthier

  • Leveraged ML immutability to allocate immutable data locally and mutable data in shared heap
  • Use on-the-fly marking with series of handshakes for phases of collection.
  • Java Implementation
    • No thread-local heaps, single shared heap
    • Used queues between mutators and collectors into which grey objects are inserted, to avoid rescanning threads' roots

Sliding views

  • Takes sliding views: a thread-local buffer records all fields of an object before the mutator first modifies it after marking has started.
  • All mutators handshake on phase change, and then disable buffers.

16.7 Issues to Consider

Much more difficult than non-concurrent collection usually

While generational collection improves expected pause times, can't help with worst-case pause times. Concurrent collectors offer shorter and more predictable pause times.

To truly bound pause times, must be both concurrent and on-the-fly.

Concurrent mark-sweep has same problems with fragmentation and slower free-list allocation that serial mark-sweep has

Mark sweep has easier heap coherency model

Has many possible variations of implementation, each choice with performance impacts

Chapter 17: Concurrent Copying & Compaction

Prologue

Must protect collector and mutator threads from each other

Black mutator tospace invariant: Black mutator must hold only tospace pointers

  • Otherwise we'd never scan them

In the absence of a read barrier, a grey mutator cannot find pointers to tospace objects in fromspace objects (since the copying collector doesn't alter fromspace fields)

  • Called grey mutator fromspace invariant
  • At collection cycle termination, we must only have tospace pointers, so roots must be forwarded somehow.

Mutator updates to fromspace objects must be propagated to tospace.

17.1 Mostly-Concurrent copying: Baker's algorithm

Maintaining a tospace invariant requires only replacing roots with addresses of tospace objects

  • Roots now black, but point to grey objects with fromspace references.

Introduce a black mutator read barrier to prevent a mutator accessing a fromspace pointer

  • On access, atomically copy target of pointer and return forward reference.

Mostly concurrent, mostly-copying

  • Mostly-copying collectors fit well with mostly-concurrent collection; we can pin objects referenced by ambiguous roots in mutator threads.
  • Used historically with page protection to allow concurrent copying for compilers without stack maps
  • The read barrier only needs to be atomic when actually forwarding, so synchronization costs can be avoided until the last minute.

17.2 Brooks's Indirection Barrier

Hold forwarding reference in every object header. Points to either self or tospace copy.

Uses write barrier to prevent modification of object when copying.

Since mutator now grey, needs to scan stack for fromspace references and replace before swapping semispaces.

17.3 Self-Erasing Read Barriers

Read barriers more costly than write because more common

Read barriers conditional, forward object if not in tospace

Have each closure contain a code pointer, and compile two copies of the code: one does a Baker-style copy while the other directly chases references.

When collecting, set bit in closures to execute baker code. When not, don't.

Beauty is that if the object has already been copied, the executed code never enters the read barrier.

Downside: Duplication of code. 25% larger binaries in GHC.

17.4 Replication Copying

Mutator references old objects, never pays the instruction/cache/space overhead of Brooks's indirection pointers

When copied, write barrier logs modifications of the old object, modifies new

When all of tospace scanned, mutator's roots scanned, and write log empty, terminates

Requires pausing to ensure termination conditions hold for all threads when termination signalled.

  • Also needs to stop to rewrite pointers in roots.

17.5 Multi-Version Copying

Both previous versions must still pause world

Processors have own fromspace and tospace subset.

One tospace, multiple fromspaces. Accumulate old, invalid copies in fromspaces.

  • Barrier on access, so mutators must search for new reference and update root.

Secret is to decouple fromspace reclamation from the phase change, or 'flip'.

Each processor scans tospace for fromspace references and fixes.

A processor flips when its tospace is full. Can't free a fromspace until all references have left it.

  • Upon collecting, a processor checks bits set by other processors to determine if they've flipped
  • If so, it updates all of the fromspace references in its to/from spaces to point to the new tospace
  • When all processors have done so, fromspace discarded.
  • This has the effect of ensuring termination because if the mutator outpaced the collector, the mutators would all scan and eventually flip until enough fromspace was released.

17.6 Sapphire

Prologue

  • If from/tospace divided equally between threads/cores, may end up with multiple threads unable to allocate until one thread frees.
    • Largest problem on non-parallel systems.
  • Solution is to place collection work asymmetrically on one or more collector threads.
    • Can modify priority to set mutator-collector balance
  • New objects allocated as black in a separate newspace that isn't scanned
    • Has write barrier to track references written into it.

Collector Phases

  • MarkCopy
    • Overview
      • All reads are from fromspace, all writes are duplicated
        • Buffered changes, eventually consistent
      • At application-level thread synchronization points, all made consistent
  • Flip
    • Sets all roots to point to tospace copies
    • Until this point, all mutators can refer to tospace/fromspace interchangeably
      • Need barrier for all pointer comparison operations.
  • Volatile Fields (Java, etc.)
    • Require synchronization at every read/write between fromspace and tospace

17.7 Concurrent Compaction

Compressor

  • Scan roots, mark objects, and compute new compaction address of all live objects from knowledge of live data size and offset of object
  • Set mutator roots to fromspace and unpause
  • Use an access barrier on tospace to copy an object if it is not already there.
  • Concurrently copy fromspace to tospace, and update references, but don't copy children. Don't need to in order to know their new reference, since we know all future addresses.
  • Extensive use of double mapping and virtual memory hardware, which is expensive so batching required

Pauseless

  • More intelligently only protects objects actually being copied, rather than all tospace pages.
  • Protect them incrementally while copying and updating references(known without having to copy children)

17.8 Issues to Consider

Pauses/Poor real-time behavior of baker

  • With a Baker-style access barrier on tospace, any access may take an unknowable amount of time, because it may have to recursively copy an object and the references through it.

Chapter 18: Concurrent Reference Counting

Prologue

Previous reference counting optimizations (including elimination of cycles) involved stopping the world for tracing or batching changes of references.

18.1 Simple Reference Counting Revisited

Difficult to do concurrently because increment/decrement of references might interleave in the mutator even if reads/writes atomic

One solution to lock each object before altering fields.

Lockless solution would be nice

  • Primitives for single memory location not enough because another mutator can drop reference to zero and free between pointer load and reference increment. Attempt to increment might corrupt another object allocated in that location.
  • If we can atomically alter two words, that's sufficient, because we can operate on the address and the count together.

18.2 Buffered Reference Counting

All reference count changes placed in a buffer that a single thread scans and executes.

  • All increments executed before decrements to prevent premature freeing

Doesn't fix problem of making increment/decrement and pointer load/store atomic.

Can do so by giving each thread a local buffer and an epoch count and using ragged epochs. When a local buffer is full, get a new one and increment the epoch. When all threads' epochs reach a level, the local buffers of that level and below are processed together (all increments first).

Variant of deferred reference counting.

18.3 Concurrent, Cyclic Reference Counting

Trial deletion has problems

  1. Can't guarantee that the subgraph it retraces hasn't changed since last trial deletion.
  2. Pointer deletions may separate the graph
  3. Reference counts may go out of date

Only current good solution is to run the reference counting concurrently, set a mark, and then free the object next cycle if it's still garbage.

In theory, may leave garbage around indefinitely

18.4 Taking a Snapshot of the Heap

Refinement of coalesced reference counting

Problem with the previous description: while we can restart threads once we get their local buffers, if we're going to decrement children's counts before freeing an object, we need the state of the object at the time the change happened, not what the mutator has since left in it.

Solution:

  • If the object is clean, the old reference is still valid
  • If not, the mutator's log buffer holds the change, so we can find the children by reading the object's state before the reference change from the buffer.

18.5 Sliding views

Don't need to stop all mutator threads.

Only collect states of a subset of objects each time, fix subset's references each time

Applications:

  • Age-based
    • Use sliding reference counting for long-lived data and mark-sweep for nursery

With only a subset of the heap in view, we can use trial deletion on a graph made from local allocation buffers, preventing the graph from changing under its feet.

18.6 Issues to Consider

A large literature exists on ways to safely reclaim memory for insertion/deletion with certain data structures; might want to specialize for the language.

Chapter 19: Real-Time Garbage Collection

Prologue

All earlier algorithms not truly real-time because they didn't guarantee progress of mutator

Thread scheduling may starve mutator threads when there are many collector threads

19.1 Real-Time Systems

Impose operational deadline on tasks

May be soft: example is video. Pause degrades performance

  • Should use goal of maximum collector time, time slice, and failure rate

May be hard: Pause causes failure. Example: ignition system

  • Performance less important than predictability of performance
  • WCET: Worst case execution time of a task
  • Embedded realtime systems have tight space constraints

Needs a specific collection scheme, different from the previous concurrent GC schemes.

19.2 Scheduling Real-Time Collection

Incremental collectors will impose cost on all mutation

Concurrent collectors as before

To maintain steady-state space usage, needs to free and recycle objects at the rate of allocation.

  • Must avoid fragmentation to provide constant space as well
  • Moving has costs

Alternative methods

| Name        | Description |
| Work-based  | Imposes collection cost on mutator actions |
| Slack-based | Runs collection work during slack portions of the real-time schedule; slack can be significant in periodic real-time systems. |
| Time-based  | Schedules a constant time slice of collection while the world is stopped |

19.3 Work-Based Real-Time Collection

Classic baker algorithm

  • Fixed-size objects assumed
  • Read barrier
    • Reading copies into tospace. Bounded by object size.
    • Also marks a bounded number of objects
  • Has specific bounds:
    • Space = 2R(1 + 1/K)
    • R = reachable space
    • K = Adjustable time bound, defined as number of fields scanned each allocation
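
Illustrative arithmetic (numbers are mine, not Baker's): with R = 64 MB of reachable data and K = 4 fields scanned per allocation, the bound is 2 · 64 MB · (1 + 1/4) = 160 MB. Raising K shrinks the required heap but makes each allocation do more scanning work.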

Parallel, concurrent replication

  • Adds concurrent and parallel collector
  • Because work-based, can suffer from variation in distribution of pauses.
    • Difficult to obtain bounded mutation times
      • Solutions exist; see 19.5
  • Algorithm
    • On read, copy to tospace. Also copy concurrently.
    • All copies to tospace leave a forwarding address in the header of the old object. All future mutator reads go through this address
    • Fromspace and tospace divided between threads
    • When collecting, all allocation allocates into both fromspace and tospace, with a forwarding pointer.

Uneven work and its impact on work-based scheduling

  • If we use worst-case execution time, we have the problem that the worst-case is much worse than the expected case
    • An array of references is a pathological worst case
  • We can bound worst case by breaking down objects and arrays into fixed-size units
    • Places overhead on access, space
  • Considered a nail in coffin of work-based scheduling
    • Opinion is that it overly starves the mutator
  • Problem is that even when non-copying and fairly constant, allocation storms cause massive variance in time of GC
    • Response has been to treat mutator work as something that must be scheduled like any other real-time job.
    • Leads to better worst-case and expected-case ratios

19.4 Slack-Based Real-Time Collection

Implementation: Henriksson

  • Forbids collection during high-priority tasks
  • GC work delayed until no high-priority tasks executing
    • Also called the slack-period
  • Low-priority tasks do collector work during allocation
  • During high-priority tasks, record omitted GC work and perform it during slack
  • Threads for collecting high-priority tasks have lower priority than high-priority work but higher priority than low-priority work
  • Incremental copying is done by low-priority work in standard concurrent copying collector way
    • All from-space accesses copy into tospace. All allocation done in tospace.
    • Enforces constraint that no tospace object refers to a fromspace object.
  • High priority tasks do less mutation cost
    • Objects evacuated lazily: allocate in fromspace but don't actually forward through the fromspace header
    • High-priority collection threads during slack time do copying
    • Uses unconditional read indirection barrier and points to fromspace header through forwarding address, avoids mutator overhead
  • Collector works as coroutine through well-defined yield points.

Scheduling the collector work

  • Let Wmax be the maximum amount of work that must be done starting at the start of slack time to allow the next high-priority interval to not run out of memory
  • Let Fmin be the amount of memory that the high-priority interval will need
  • Minimum GC ratio = GCRmin = Wmax / Fmin
  • Current GC ratio = GCR = Performed GC work / Amount of new objects in fromspace = W / A
    • Allocation increases A, collection increases W
    • The allocator throttles by yielding to GC work until GCR >= GCRmin
    • Ability to arbitrarily enter a high-priority period means that space overhead may be required to ensure that all tospace allocation is satisfiable.
      • Programmer must know worst-case allocation needed by high-priority task.
      • Not as bad as it sounds because high-performance, hard real-time tasks typically use little memory
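
Illustrative arithmetic (numbers are mine): if the worst-case high-priority burst needs Fmin = 1 MB of fresh memory and Wmax = 4 ms of GC work must be finished before it starts, then GCRmin = 4 ms/MB. A low-priority task that has allocated A = 0.5 MB must then yield to collection until W >= GCRmin · A = 2 ms of GC work has been performed.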

Execution overheads

  • Since caches alter memory access time, instruction count not accurate measure of time
    • Needs to be tested repeatedly to know predictable cache behavior, or assume worst-case time
    • Collection in mutation has synchronization overhead.
    • Forwarding pointer overhead on reads.
    • Allocation is simple bump-a-pointer allocation.

Programmer input

  • Programmer must inform GC about worst-case allocation requirements and period in which it runs.

19.5 Time-Based Real-Time Collection: Metronome

Prologue

  • Treats minimum mutator utilisation as an input
  • For Java
  • Concurrent incremental mark-sweep, with occasional compaction to avoid fragmentation
  • Uses deletion write barrier to enforce weak tricolor abstraction
  • Uses Brooks-style forwarding pointers

Mutator utilisation

  • Ensures a minimum mutator utilisation. All other time is at the collector's discretion; it can give it to the mutator or use it.
  • Is a trade-off between space and mutator utilisation.
  • Split time into quanta, ensure that for a given window the MMU ratio is maintained
  • Scheduler will usually schedule collection until MMU about to be violated and then back off
  • When considered over 'windows' of time between deadlines, leads to mini-pauses or jitter, where it collects for a while and then backs off.

Supporting Predictability

  • Fragmentation a threat to predictability. Many considerations
    | Name                       | Description |
    | Arraylets                  | Break arrays and variable-sized objects into fixed sizes to fight fragmentation. |
    | Read barrier               | To avoid the performance cost of a Brooks-style read barrier, eagerly change references in roots and on the stack to point to tospace when an object is copied. |
    | Scheduling collector       | Use an 'alarm thread' to check whether it should switch to a collection quantum; if so, tell mutator threads to reach a safe point and yield. The collector thread then collects. |
    | Suspending mutator threads | On reaching a safe point at a pause, mutator threads store all references on the stack/in roots in a specific location. On reload they read them back in, to allow faster changes. |
    | Ragged root scanning       | Deeply nested stacks are a problem: they may not allow the collector to finish scanning during its quanta. Solution is concurrent scanning plus a write barrier into already-scanned stacks to prevent overlooking references. |

Analysis

  • Metronome gave formalized model for concurrent, realtime GC
  • Parameters
    • A(t) is instantaneous allocation rate at time t
    • G(t) is instantaneous garbage generation rate at time t
    • P is GC processing rate as function of live data
    • Memory allocated and garbage created over an interval are simple integrals of these rates
    • Mutator quantum and collector quantum are the times each can run without yielding
    • Only works if parameters have accurate estimate.
  • Minimum mutator utilization over an elapsed time Δt, with mutator quantum m and collector quantum c:
    MMU(Δt) = (m · ⌊Δt / (m + c)⌋ + x) / Δt
    • x = size of the remaining partial mutator quantum in the window, x = max(0, Δt − (m + c) · ⌊Δt / (m + c)⌋ − c)
  • Can be used to analyze work-based and other scheduling approaches
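
Worked example (numbers are mine): with mutator quantum m = 10 ms, collector quantum c = 5 ms, and a window of Δt = 40 ms, the window holds ⌊40 / 15⌋ = 2 whole quantum pairs, and the worst-case tail contributes x = max(0, 40 − 2·15 − 5) = 5 ms of mutator time, giving MMU = (2·10 + 5) / 40 = 62.5%.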

Robustness

  • If parameters are poorly known, only graceful way to degrade is to slow down allocation rate
  • One way to slow down the allocation rate is to use a generational scheme
    • Nursery filters into primary heap
    • Removes short-lived data, reduces total floating memory
    • Unacceptable in naive approach due to variation in pause time when nursery overfills during mutator slice
    • Syncopation a solution
      • Always copy out of nursery during collector slice
      • Use the nursery survival rate as a parameter to size the nursery so that only one evacuation is needed

19.6 Combining Scheduling Approaches: Tax-and-Spend

Prologue

  • Metronome is best on uniprocessors because of the need to stop all mutation
  • Work-based collectors have order of magnitude worse latency than time-based
  • Slack-based is fragile; it breaks when there is no slack, i.e. under load
  • Tax-and-spend combines all of them.
  • Performant; cut Metronome latencies down to a third of their previous values
  • Mutators must do an amount of collection work dependent on desired MMU.
  • Slack time used by collector to give MMU 'credit'.
  • Taxation can be scheduled during any safe point, but always checks when thread-local allocation buffer exhausted.

Tax-and-spend scheduling

  • MMU is superior to maximum pause time because it allows for computation of the clustering of pauses that would cause a missed deadline.
  • MMU can vary by thread to prevent missing deadlines with tax-and-spend
  • Can also use other cores to offload collection work
  • Per-thread scheduling
    • Must track mutator activity to determine tax to be paid
    • Must avoid letting tax get too high, such that a deadline would be missed.
    • By interleaving collection and mutation in same thread, you prevent the OS scheduler from placing some other mutator or an unrelated process on the core after the mutator yields.
    • Allowing MMU to vary between threads important when allocation rates vary
  • Tax-based vs slack-based scheduling
    • Slack-based degrades with no slack
      • Poor for interactive real-time systems with sustained peaks of use
    • Time-based scheduling may introduce jitter, because it may collect until the MMU is almost violated, such as at the start of a new window of time under consideration.
      • Means that for time periods smaller than a window, mutator responsiveness is inconsistent
      • Tax-based more consistently realtime
  • Combining tax-based and slack-based scheduling
    • Tax-and-spend combines all the other scheduling approaches by using an economic model.
    • Mutator threads pay back tax by collecting incrementally during allocation
    • Dedicated mutator threads with low priority exist to work during slack
      • Deposit work done as a 'credit' in a global pool for mutators to take from
      • One per processor usually
      • Must sleep intermittently to allow newly runnable mutator threads to be scheduled over them
    • Tax over all threads must be high enough to finish collection before memory runs out
    • When tax is due, a thread tries to withdraw from the global credit bank
      • If can, then collector is keeping up with mutator so doesn't need to pay tax.
      • Treats slack on uniprocessor and unexploited parallelism on a multiprocessor the same.

Tax-and-spend prerequisites

  • Needs an underlying collector that's both incremental and concurrent. Should be parallel on multiprocessors.
  • Metronome changes for synchronization without halting mutator threads:
    NameDescription
    Raggeed epochs for global consensusNeeds global states to hold (such as change of fromspace/tospace pointers in read barrier) for all threads for phases of collection to change. Uses an epoch number. Threads that change state during yielding to collection increment their epoch. Once the lowest epoch among any threads is value of phase change started, can assume all mutators have hit that phase.
    Phase agreement with 'last one out'Global phase. When mutator thread starts to pay tax, increments worker count. If no work being done, and if worker count = 0, then it atomically changes phase and decrements worker count.
    Per-thread callbacksAll threads are requested to make change(flush thread-local buffer, copy out roots from mutator thread stack) by global thread whenever collection being done and hits step that demands it. They're given a callback to perform to do that step. If blocking on IO or executing native code, prevented from returning and action performed on their behalf.
    Priority boosting to ensure progressIf higher-priority work threads saturate processor, above steps can be starved. Need to boost threads requesting actions from other threads.

19.7 Controlling Fragmentation

Prologue

  • Fragmentation can cause violation of real-time space bound.
  • Can compact to avoid fragmentation
  • Can also use arraylets and object-lets to get fixed-size allocation blocks, at the cost of some internal fragmentation and higher memory usage
  • Concurrent replicating copying collectors have problem of keeping both copies of object consistent. Locking doesn't scale, and can cause mutator thread starvation.
  • Compressor and Pauseless algorithms both rely on traps and page-level synchronization, and so suffer from poor MMU.
    • Work-based, and causes a trap storm at phase changes (interferes with all threads)
  • Lock-freedom very desirable in a realtime system. Ways to obtain it follow:

Incremental Compaction in Metronome

  • Metronome non-copying, compacts while threads are stopped
  • Tax-and-spend version doesn't perform any compaction
  • Bacon (2003) uses segregated-fits and incrementally defragments size classes in the slow/infrequent path of the allocator
    1. Sort pages by number of unused objects per page from dense to sparse
    2. Set the allocation page to the densest non-full page in resulting list.
    3. Set the page to evacuate to the sparsest page in same size class
    4. While target number of pages to evacuate in size class not met, and page to evacuate does not equal page in which to allocate, evacuate from sparsest to next allocation page, moving to next page when allocation page filled.
  • Moves minimal number of objects and frees maximal number of pages
  • Always choosing the densest page can lead to hopping between pages and poor locality; can require a minimum number of free cells in the allocation page

Incremental Compaction on Uniprocessors

  • To obtain atomicity on a uniprocessor can simply disable scheduler interrupts or only allow thread-switching during a gc safe-point.
  • If brooks forwarding pointers used, can optimize access.
    • In replication, forwarding pointer of each copy points to other copy. If only one, pointer points to self.
    • Mutator uses whichever address it has, no synchronization needed.
    • On write, write to both the original and the copy. (Kalibera 2009, Java)
    • Double write is cheap, and makes reads much faster.
    • Many more reads than writes.
    • Doesn't harm the cache by needing both lines in memory for each read.

Incremental Compaction on Multiprocessors

  • Stopless: lock-free garbage collection
    • Ensures lock-freedom
    • Doesn't double write
    • On start of collection, threads race to allocate a "wide" version of the object and set the redirection pointer to it.
      • The wide copy has a state for each field: 'in original', 'in wide copy', or 'in copy'. All are set to 'in original' at first.
      • On write, write to the wide copy and set the field's state to 'in wide copy'.
        • Also done by the copying collector when it copies the value of that field
      • On read, read from the original or the wide copy based on the state.
      • Once all fields are 'in wide copy', the copying collector makes a 'narrow' version in tospace and installs its address as the forwarding reference
      • It then copies the fields to the narrow copy, setting each wide field's status to 'in copy'
    • Downsides:
      • Doesn't work on double-word fields if double-word writes not atomic on architecture
      • Has space overhead of 3 copies of object for a period
        • Can use space filled by copies of objects emptied from pages during incremental compaction to hold copy.
  • Staccato: best-effort compaction with mutator-wait freedom
    • Doesn't require atomic instructions, but can use memory fence at GC safe points to ensure up-to-date references
    • Collector reserves bit in forwarding address to state that object is currently being copied.
    • Algorithm:
      1. Set the copying bit using compare-and-set. Mutators access the forwarding pointer without atomic operations, so this change takes time to propagate. If a mutator writes to the object, it unsets the copying bit atomically with compare-and-set; if that fails, copying has finished and the mutator must go through the new reference.
      2. Wait for ragged synchronization at a read memory fence at a safe point.
      3. Perform a read fence of object to ensure that changes made by mutators between setting bit and ragged synchronization are seen by collection thread.
      4. Allocate copy, and copy fields
      5. Set read fence so mutators see field changes.
      6. Once ragged synchronization happens, make the copy refer to itself through its forwarding address, and unset the copying bit using compare-and-swap. If this fails, a mutator accessed the old object and unset the bit, so abort the copy.
    • Can amortize memory fence overhead since many objects usually evacuated from/to same page.
    • If collector tries to incrementally copy objects with high mutator locality, the mutator threads will cause an "abort storm".
      • Unlikely because only sparsely populated pages moved so unlikely to move objects all allocated together recently.
  • Chicken: best-effort compaction with mutator-wait freedom for x86
    • Relies on the x86 memory model's strong read ordering; only needs to abort copies on writes
  • Clover: Guaranteed compaction with probabilistic mutator lock-freedom
    • Each move selects a random value alpha and establishes the invariant that the mutator never writes that value into the heap
      • If it tries to, block the mutator.
      • Best to choose a value illegal for the program, such as a pointer outside the memory range or a NaN variant not used by the hosted language.
    • Once a field is copied, it is replaced with alpha via compare-and-swap.
    • If a mutator reads an alpha, it redirects the read/write to the second copy.
  • Stopless vs Chicken vs Clover
    • Stopless can't guarantee progress to collector because of potentially infinite writes to wide copy's fields.
    • Chicken guarantees progress, but aborts copies
    • Clover may be unable to finish if a mutator repeatedly changes a field: each change invalidates the pending compare-and-swap, so the collector must keep re-copying.
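
A much-simplified sketch of Stopless's per-field states using C11 atomics. Names are hypothetical, and the real protocol updates value and status together with a wide CAS and routes accesses to the narrow copy once a field is 'in copy'; this only illustrates the state machine:

    #include <stdatomic.h>
    #include <stdint.h>

    enum { IN_ORIGINAL, IN_WIDE, IN_COPY };

    typedef struct {
        _Atomic intptr_t value;    /* the field's value in the wide copy */
        _Atomic int      status;   /* which copy currently holds the field */
    } WideField;

    /* Mutator read: the status word says which copy to consult. */
    intptr_t stopless_read(intptr_t *orig_slot, WideField *wide) {
        if (atomic_load(&wide->status) == IN_ORIGINAL)
            return *orig_slot;
        return atomic_load(&wide->value);   /* 'in copy' case omitted */
    }

    /* Mutator write once a wide copy exists: publish into the wide copy,
     * then advance the field's status past 'in original'. */
    void stopless_write(WideField *wide, intptr_t v) {
        atomic_store(&wide->value, v);
        int expected = IN_ORIGINAL;
        atomic_compare_exchange_strong(&wide->status, &expected, IN_WIDE);
    }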

Fragmented Allocation

  • All compaction techniques have overhead, so process must trade throughput for fragmentation
  • Solution may be to only allocate in fixed-size granules with no constraints on locality.
  • Arrays split into a trie of array fragments
    • Problem is that access time is variable, not constant now
    • Java allows arrays with no maximum length, so can't prove deadline on memory reads.
    • Solution: Schism by Pizlo
      • Uses a spine of arraylets, borrowed from Metronome.
      • Access is O(1): field locations are computed from values in a 'sentinel fragment' in the header at the start of the object or array, adding one layer of indirection. The sentinel serves as the standard reference. (See the indexing sketch below.)
      • Problem of dynamic arrays
      • Memory needed bounded by 1.3104 * size of live set.
      • When there is space for sequential allocation of all arraylets, can mark that in the sentinel fragment and use conventional access techniques.
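
A sketch of constant-time arraylet indexing in C, assuming fixed-size fragments reached through a spine of fragment pointers (sizes and names hypothetical, not Schism's actual layout):

    #include <stdint.h>

    #define FRAGMENT_BYTES 128
    #define ELEMS_PER_FRAG (FRAGMENT_BYTES / sizeof(int32_t))

    typedef struct {
        int32_t   length;
        int32_t **spine;   /* pointers to fixed-size fragments */
    } ArrayletArray;

    /* O(1): one divide/modulo plus one extra indirection per access. */
    static int32_t arraylet_get(const ArrayletArray *a, int32_t i) {
        return a->spine[i / ELEMS_PER_FRAG][i % ELEMS_PER_FRAG];
    }

    static void arraylet_set(ArrayletArray *a, int32_t i, int32_t v) {
        a->spine[i / ELEMS_PER_FRAG][i % ELEMS_PER_FRAG] = v;
    }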

19.8 Issues to Consider

Real-time systems almost always require statistics about the expected application to be collected by the developer.

The difficulty lies in ensuring that measurements actually exercise the true worst case by forcing slow paths to execute.


Chapters 9-12

Chapter 9: Generational Garbage Collection

Prologue

Tracing collectors have best case when heap is mostly garbage, but long-lived objects tend to exhibit pathologically bad behavior for copying and mark-sweep

  • Ex: Mark-sweep accumulates a dense prefix of long-lived data near the bottom of the heap; copying moves that same data again and again unnecessarily

Solution: Segment the heap into generations and collect the youngest generation (the nursery) most frequently. Tenure into older generations after surviving a number of sweeps.

  • Most of the time use copying
  • Has good mark/cons ratio, so should be faster for young generation
  • Can tune pause times / frequency of pauses by altering nursery size

Only improves average-case time. Worst case still needs to sweep whole heap.

Downsides:

  • Garbage sticks around in older generation longer
  • Cross-generational pointers need bookkeeping, will likely impose mutator overhead
  • Hard to tune perfectly

9.1 Example

Minor collection = Collection of only the nursery

Need to decide how/when to promote objects

  • Collector needs a way to measure the age of objects to decide if they've survived enough minor collections to be tenured
  • If a collector only sweeps one generation at a time, must record inter-generational pointers
    • Inter-generational pointers are created both by mutator writes in the program and by promoting objects between generations. Both must be tracked.
      • Requires a mutator write barrier and a copy barrier; both must check for creation of inter-generational pointers (see the barrier sketch below).
    • Inter-generational pointers can add false roots to collections. Until the older generation is collected, garbage in the old generation can keep young objects alive, and can even lead to promotion. Called "nepotism".
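
A minimal write-barrier sketch in C: remember the slot only when the store creates an old-to-young pointer. in_nursery and remember_slot are hypothetical helpers standing in for an address-range test and a remembered-set insert:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Object Object;

    extern bool in_nursery(const void *p);      /* address-range test */
    extern void remember_slot(Object **slot);   /* remembered-set insert */

    /* Mutator write barrier: young-to-old stores need no bookkeeping. */
    void write_ref(Object **slot, Object *value) {
        if (value != NULL && in_nursery(value) && !in_nursery(slot))
            remember_slot(slot);
        *slot = value;   /* the actual pointer store */
    }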

9.2 Measuring Time

Two choices: Bytes allocated or seconds elapsed

Seconds Elapsed

  • Considers the distribution of pause time
  • Good for ensuring requirements such as maximum gc pause time

Bytes Allocated

  • For reasoning about lifetimes, bytes allocated over an object's life give more information than wall-clock time
  • Largely machine/environment independent
  • Tricky in multithreaded systems

Reality

  • Most of the time, the number of collections survived is used as a proxy for bytes allocated

9.3 Generational Hypotheses

Weak Generational Hypothesis: Most objects die very young

  • Appears valid across many languages/domains
  • Common for most data to die within one cycle, with the remaining minority of objects being effectively immortal.

Strong Generational Hypothesis: Younger objects are more likely to die than older ones, even among objects that are not in the youngest generation

  • Not decidedly correct
  • Evidence to contrary:
    • Tend to see many objects die in clumps because programs operate in clumps
    • Immortal objects can be created at arbitrary locations in program flow

9.4 Generations and the Heap Layout

Can treat generation management in many different ways

  • Ex:
    • Fixed size of nursery vs ratio of heap
    • Homogeneous generations or generations segmented by age
    • Shared large-object subspace or per-generational one

Must decide sizes of generations

  • Benefits of small collection space:
    • Fast collection times because small generation
  • Downsides to small generations:
    • More frequent collections
    • Less time for objects to die; will have to copy objects between semispaces (or even to the older generation) and waste time.
    • Older generation will fill too fast because of a higher promotion rate
      • Larger older generation negates many benefits of generational collection.
      • Larger older generation also causes nepotism because objects promoted too soon likely to both become garbage and refer to young objects
      • Younger objects are modified more frequently. Tenured objects require a mutator overhead on writes for inter-generational references. By promoting too early, we will spend a lot more time in this barrier.
      • Will "dilute working set" by messing up assumptions about behaviors of spaces.

9.5 Multiple Generations

Adding more generations is one solution to the tension between short nursery pause times and infrequent major collections.

Problem: If the collector promoted all live objects during minor collection, then it promoted objects made right before collection

  • Objects made right before collection haven't had time to become garbage, which they likely will
  • Solution: Multiple Generations. Can sweep the intermediate generation with frequency between nursery and oldest generation. Will free these freshly made objects much sooner than if they were tenured.
    • Helps prevent nepotism
    • Downsides:
      • Most systems collect younger generations whenever an older one is collected, so only old-to-young pointers (which are less frequent) need a write barrier
      • Ensures that pause time will be between nursery and major collection. Longer than nursery, but more frequent than oldest.
      • Increasing number of generations increases number of inter-generational pointers which increases the time spent in the mutator's barrier
      • Increases the size of root set, and may have pointers from garbage into nursery's root set keeping large garbage structure alive.
  • Most collectors only have 2 generations
    • Clever control of promotion rates used instead of multiple generations

9.6 Age Recording

En Masse Promotion

  • Easiest strategy is to have each generation but oldest have a single semispace and copy live objects "en masse" to the next generation on collection.
    • Can lead to a 50% to 100% higher promotion rate than if we require that an object survive at least one collection before promotion from the nursery
      • Increasing survival requirements for copying above 2 has diminishing returns.

Aging Semispaces

  • Create a generation partitioned into semispaces and copy objects back and forth between semispaces a number of times before promoting semispace as a whole.
  • Either all are copied to another internal semispace or all are copied to next generation, depending on generation's age as a whole.
    • Allows oldest objects in the generation time to die, but the youngest in the generation may still be prematurely promoted. Can avoid with bookkeeping.
  • Bucket Brigade
    • Idea is to create n buckets per generation and to copy from younger bucket to progressively older buckets. The oldest bucket is copied into the next generation.
    • Important: All buckets in a generation collected at same time, inter-bucket pointers not an issue.
      • Use of steps in youngest generation can solve problem of mutator overhead with intermediate generations

Survivor Spaces and Flexibility

  • En Masse / Aging Semispaces both wasteful because need to reserve twice as much space as data for copying.
  • Solution is to use an eden + 2 survivor spaces structure
    • Youngest generation is one large creation space (eden).
    • On collection, live data in eden evacuated to younger survivor tospace.
    • All objects in younger survivor fromspace evacuated to either older survivor tospace or to eden depending on age
    • Works well because eden is very large (consistent with the weak generational hypothesis). Ex: HotSpot uses a 32:1 ratio between eden and the survivor semispaces; a pure semispace scheme would need twice this space as copy reserve. (Promotion logic sketched below.)
  • Can combine with aging semispaces by creating buckets in the survivor spaces. Can use to keep age bookkeeping easy.
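
A sketch of the per-object promotion decision in C, assuming an age counter in each header and a fixed tenuring threshold (all names hypothetical):

    typedef struct Obj { int age; /* ... */ } Obj;

    enum { TENURE_THRESHOLD = 4 };

    extern Obj *copy_to_survivor(Obj *o);   /* evacuate to survivor tospace */
    extern Obj *copy_to_old(Obj *o);        /* promote to old generation */

    /* Called for each live object found in eden or survivor fromspace. */
    Obj *evacuate(Obj *o) {
        if (++o->age >= TENURE_THRESHOLD)
            return copy_to_old(o);       /* old enough: tenure it */
        return copy_to_survivor(o);      /* give it more time to die */
    }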

9.7 Adapting to Program Behavior

Prologue

  • If program does not behave according to weak generational hypothesis, a generational garbage collector's performance will suffer
  • Might be managed with a way to control tenuring.
  • Desirable to alter behavior of generations (ex: nursery size) to match behavior of mutator, to avoid premature tenuring and to avoid holding on to excessive garbage.
    • Consider the behavior of allocating frequently and then freeing in clumps.

Appel-style Garbage Collection

  • Three spaces: reserve, nursery, and old generation. On collection, all live data is moved to the old generation and the rest of the space is split between nursery and reserve (sketched below). If free space falls below a threshold, full collection.
  • Reserve size is dynamically set this way
  • Promotes all inter-generational garbage into older generation, which is then freed at next major collection.
  • Can also work in block-structured heap or by using shrinking nurseries with mark-sweep/other noncopying algorithm of the older generation.
  • Pathological worst case: rapid allocation with a very low promotion rate leads to a shrinking nursery and heap thrashing with minor collections, while a major collection never occurs because of the low promotion rate.
    • Solution: Major collection whenever young heap falls below certain size.
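
A sketch of the space recomputation in C, assuming one contiguous heap with the old generation growing up from the bottom and old_top marking the top of the survivors (names hypothetical):

    #include <stddef.h>

    extern char *heap_start, *heap_end;   /* the whole generational heap */
    extern char *old_top;                 /* survivors live below here   */
    extern void major_collection(void);

    /* After a minor collection, split the remaining space evenly between
     * a new nursery and the copy reserve; force a major collection when
     * free space drops below a threshold to avoid thrashing. */
    void recompute_spaces(char **nursery, size_t *nursery_size) {
        if ((size_t)(heap_end - old_top) < (size_t)(heap_end - heap_start) / 8)
            major_collection();
        size_t free_bytes = (size_t)(heap_end - old_top);
        *nursery      = old_top;
        *nursery_size = free_bytes / 2;   /* the other half is the reserve */
    }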

Feedback Controlled Promotion

  • Mostly related to pause time reduction
  • Demographic feedback-mediated tenuring
    • Fights pause time caused by collective death of many promoted objects
    • Volume of objects promoted used as predictor for length of next collection. Sets rate of promotion.
      • If the volume of survivors exceeds a threshold, the survivor count is looked up in a table giving the age threshold for promotion at the next collection.
    • Can't move old objects into nursery.
      • Threatening-boundary solutions to this exist.
        • Move threshold between two generations, can move in either direction.
        • Must track more pointers because can't know at allocation time which generation things may lie in because of boundary motion.
  • Ergonomics from HotSpot
    • Focuses on three soft goals. In order of priority:
      1. Maximum pause time
      2. Throughput
      3. Memory footprint
    • Pause time addressed by shrinking generations, one with longest observed collection time first
    • Throughput addressed by increasing generations in proportion to collection time
    • Vengerov gives the ThruMax algorithm for reasoning about HotSpot's tuning information.
      • Focuses on ratio of free space in old collection after major collection to amount of space used during promotion
      • ThruMax invoked after steady-state of survivor spaces reached.
  • Ergonomics has many tunable options for end-user, most of which are interdependent.

9.8 Inter-Generational Pointers

Prologue

  • Roots for each generation contain pointers in registers, stacks, globals AND all references into the generation from spaces not being collected.
    • Includes older generations and spaces outside of the generational heap(large object, immortal, code)
  • Can be created three ways:
    1. Object creation
    2. Mutator updates to pointer slots
    3. Promotion
  • Call these pointers "interesting pointers" because they must be kept track of.
  • Problem of "boot-image" objects, ones alive when system starts. Must keep track of references in objects, but costly to scan. Can use tracing, scanning, or remembered sets.

Remembered Sets

  • Holds the location that a potentially interesting pointer could be stored.
  • Stores the source, not the destination.
    • Allows a copying collector to update the reference
    • Goal is to track as few pointers as possible
    • Needs write barrier to update reference by mutator and not collector

Pointer Direction

  • All stores do not need to be recorded
    • Ex: Some languages store activation record on the heap. Can be parsed to determine current pointers in objects.
  • If young generation always collected whenever the older is, don't need to store pointers from young to old.
  • Most mutation is initialization, initialization must refer to older objects.
    • Ex: Purely functional languages will only refer to older objects, except for thunk replacement.
    • Ex: ML requires mutable variables be annotated. They're the only source of old-to-young pointers.
    • Exception is OO languages like Java.
      • Many of these languages are written in an initialization-heavy way.
      • Programs vary heavily. Many of them will reassign a field very frequently. Can change policies for remembered set implementation.
  • In systems with multiple, independently collected heaps, may want to store mutation info to guess at which spaces hold the most garbage. In this case, need write barriers in both directions.

9.9 Space Management

Minor GC is short, fast. Major GC slow because also sweeps younger generations.

Major GC may use copying for older generation, but bad choice because the space requirements for copying will make nursery allocation thrash smaller heap more often. Also moves long-lived data.

Choice of mark-sweep can be good for older generation

  • Allocation is slower for free-list, but cost deferred to collector since mutator allocates out of nursery.
  • Fragmentation can be controlled by infrequent mark-compact run
    • Can ignore "dense prefix" of very long-lived data.

Copying collection usually used for nursery

  • Cost of tospace overhead not bad in average-case because few objects survive first collection

Space usage can be improved by switching to compacting collector "on the fly" during a collection

9.10 Older-first Garbage Collection

Other possibilities for age-based collection exist

  • Youngest-only (generational): Only collect the youngest objects in the heap
  • Oldest-only: Only collect the oldest objects, on the premise that they are likely to die. Probably ineffective due to the weak generational hypothesis
  • Older-first: Focuses on middle-aged objects. Gives them time to die without wasting time on "sediment"
  • Renewal older-first: Considers the age of an object to be the time since it was last collected or created. Mixes up object order in the heap. Mutator overhead.
  • Deferred older-first: Considers a sliding window of the heap. Moves survivors to one side of the window and marches down toward the younger side, starting over at the oldest side. Mutator overhead.

9.11 Beltway

Beltway collector exploits five generational insights:

  1. Most objects die young
  2. Avoid repeatedly collecting old objects
  3. Exploit incrementality (bound size of spaces, use small nurseries, etc.)
  4. Small nurseries with sequential allocation are good for spatial locality
  5. Objects need sufficient time to die(avoid pretenuring)

Can be configured to behave like any other copying collector

Beltway breaks collection into units called "increments"

Increments are ordered into queues called belts

Increments selected FIFO for collection

  • Usually selects oldest increment on youngest belt
  • Configuration specifies whether survivors move into an increment on the same belt or are promoted to a higher belt

Not inherently generational. Collects increment by increment, doesn't treat belt like generation and collect all of each belt.

Ex:

  • Simple semispace: One belt, two increments
  • Generational: One belt per generation
    • Nursery increment on youngest belt may be of fixed size or allowed to grow (Appel-style)
    • Mutator must track references from older to younger increments and from older to younger belts

Because it is so configurable, its attributes depend heavily on configuration

9.12 Analytic Support For Generational Collection

Generational Collectors handle young objects well but old objects poorly.

Problems:

  • Collection of garbage in older generations not prompt
  • Long-lived objects must be copied between generations/spaces until they reach oldest generations

Solutions:

  • Can improve with collection of metrics to determine what to pretenure.
    • Machine Learning possibly used here
  • Can analyze source to determine which objects likely to be created/changed/die together and keep objects in same space
    • Good improvements seen. Doesn't need to be perfect.
  • Functional languages can know that immutable objects can only point to older objects.

9.13 Issues to Consider

If weak generational hypothesis does not hold for a program, generational GC will perform poorly.

Only improves expected pause times, not worst-case pause times because must do major collections eventually

Offers many tunable parameters which may be confusing/interrelated.

  • Number of generations. Most often two, plus an immortal generation. Intermediate generations can reduce premature promotion, but there are better ways to do that.
  • Promotion rate depends on nursery size. May set fixed size or vary at runtime to meet throughput goals.
  • Promotion can be controlled by setting the number of collection cycles an object must survive before promotion.
    • En masse = 1 cycle.
    • Alternatives are aging semispaces and eden/survivor spaces

Mutator cost dependent on organization

  • Higher promotion rates can place a higher burden on the mutator
  • If we demand a collection of younger generations whenever older generations are collected, only need to track old-to-young pointers.

9.14 Abstract Generational Garbage Collection

In the same way that we can view tracing as deferred reference counting, generational collection is a form of deferred reference counting that keeps a remembered set and typically tries to collect the nursery before defaulting to a full collection

Chapter 10: Other Partition Schemes

10.1 Large Object Spaces

Can use actual size or size relative to allocation blocks

Large objects are expensive to allocate and likely to cause fragmentation (Ch. 8 criteria)

Usually don't move because of associated cost, but eventually might need to

Common techniques include

  • Mark-sweep, and freelists
  • Allocating directly into old generation
  • Making a header that is treated as a regular object and points to non-moving data body, which is freed when header is.

Treadmill Collector

  • Makes circular, doubly linked list of objects with black, white, grey, and freed segments. Uses tricolour abstraction.
  • Allocates until free-buffer region of doubly linked list is empty
  • Collects like a copying collector, but "copies" by relinking pointers rather than moving data (see the sketch below).
  • Allocation and copying are cheap, constant-time.
  • Pointer policy (which end of the grey segment objects are taken from) determines breadth- or depth-first traversal.
  • Overhead of the list pointers, but removes the need for a separate tospace.
  • Problem of allocation of different object sizes
    • For large objects, not a problem since the actual data is kept on a separate page.
    • May store the pointers on the page, or separately. Good argument for separately, for cache reasons.
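
A sketch of the treadmill's pointer-only "copy" in C: a reachable white object is unlinked from the white segment and relinked at the grey end, with no data motion (list discipline simplified; names hypothetical):

    typedef struct Node {
        struct Node *prev, *next;
        /* payload, or a pointer to off-page data for large objects */
    } Node;

    static void unlink_node(Node *n) {
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }

    static void insert_after(Node *pos, Node *n) {
        n->next = pos->next;
        n->prev = pos;
        pos->next->prev = n;
        pos->next = n;
    }

    /* "Copying" a reachable white object is two constant-time list ops. */
    void treadmill_copy(Node *white_obj, Node *grey_end) {
        unlink_node(white_obj);
        insert_after(grey_end, white_obj);
    }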

Moving objects with OS support

  • Can simply change virtual address that page is mapped to
  • Also able to lazily initialize data pages, not load until referenced.

10.2 Topological Collectors

Mature Object Space GC / Train Collector

  • Collection of the oldest generation has unbounded pause time (a function of live data in the heap)
  • Solution is to manage this oldest generation("MOS" or mature object space) as a set of areas of memory and collect/evacuate one region each collection
  • Each area is a "car", cars are stacked in a "train" queue.
    • Important that this moves cyclical garbage into the same car, so can free all at once.
    • Algorithm:
      1. Select lowest-numbered car from lowest-numbered train. Call it fromCar.
      2. If no remember set to train, and no roots to train, collect entire train.
      3. Otherwise if referenced by root, copy to higher-numbered car toCar in higher-numbered train, making one if needed.
      4. Copy everything in fromCar that is reachable from toCar to toCar, creating cars for overflow on the same train.
      5. Move promoted objects to the train holding references to them.
      6. Move every object in the remembered set reachable from another train to that train.
      7. Copy any objects reachable from other objects in current train to the end of the current train.
    • Benefits:
      • Incremental
      • Bounds amount of copying to size of car
      • Moves objects close to other objects which refer to them
      • Only need to track higher-to-lower train pointers
    • Downsides:
      • Garbage cycles which span multiple cars may end up in situations where they are never collected.
      • Does not bound remembered sets, which may be huge for popular objects.

Connectivity-Based Collection

  • Lifetimes of objects correlated with connections.
  • Ex:
    • Stack-only references are freed quickly
    • Globals tend to be immortal or near-immortal
    • Objects connected by chains of pointers die together
  • Idea: Create partitions based on connectivity. Chain partitions in DAG. Everything in partition can only be referred to by other things in partitions or by partition's predecessor in DAG.
  • Does not require write-barriers or remembered sets
  • Did not perform as well as hoped in practice. Evidence suggests better information about connectivity could improve performance.

Thread-local Collection

  • If objects local to thread, can perform stop-the-world on single thread to collect this "heaplet" of data.
    • Must still synchronize in case of shared, mutable data.
  • Usually partitions into heaplets and single, shared space.
  • Objects in heaplet may reference other objects in heaplet or shared heap. Shared heap may never reference objects in heaplet. Thread may never reference other thread's data.
  • Need to perform accurate escape analysis to determine where allocation should be done, and if reference was legal.
    • Can use mutator barriers to accomplish this
  • In case of immutable local data and shared mutable data
    • Can simply place forwarding address in shared object's header. Future references will use the new object through the forwarding address, and existing uses will see the old value.
    • Can use for generational as well

Stack Allocation

  • Allocating on the stack where possible has opportunities.
  • Requires escape analysis to ensure that nothing with a longer lifespan references data on the stack
    • Read/write barriers would need to move such data to the heap in this case.
      • Must be rare to be worth it because costly
      • Used by Azul Systems' hardware JVM appliances using a hardware trap.
  • Stack allocation excels in same environment where generational collection excels.

Region Inferencing

  • Objects are allocated into regions, which are freed as soon as nothing in the region is needed by the program.
  • Ex: Real Time Specification for Java
  • Requires either programmer annotation or a significantly intelligent compiler
  • Compiler-directed escape analysis and region inferencing is difficult and slow. Best example = ML Kit Algorithm.

10.3 Hybrid Mark-Sweep, Copying Collectors

Prologue

  • Collectors can be described by choice of how much data to consider "enough" to evacuate a block, and the expected allocation portion of a block
  • Uncompacted heaps can eventually run into performance problems from fragmentation
  • Solution: Combine mark-sweep with copying
    • Divide heap into k+1 windows, keep one empty
    • During collection, choose one window to be a fromspace, and use free window as tospace. Use mark-sweep for rest of heap.
    • Will take k collections to compact the heap, with a space overhead of 1/k.

Garbage First

  • Goal is that no more than x milliseconds every y time slice are used for GC
  • Used in hotspot
  • Divides heap into windows of same size. Allocates from current allocation window.
  • Allows for arbitrary window number
    • Imposes higher write barrier, many more interesting pointers.
  • Can be combined with generational.

Immix and Others

  • Dimpsey / IBM Parallel Mark/Sweep GC
    • Used thread-local allocation buffers
    • Found that the dense prefix was slowing down allocation, so made a separate freelist for allocating buffer-sized memory.
      • Great performance improvements. Didn't need to scan dense prefix for large enough block. Also shortened freelists.
  • Immix Collector
    • Uses liveness estimate from previous collection cycle to determine whether to use copying collection or marking on a block
    • Allocates into either free or partially-filled blocks using a linear, next-fit strategy.
    • Allows allocation into smaller "lines" in blocks
      • A line is 128 bytes, roughly the size of a cache line.
    • When marking, can "implicitly" mark lines because expects empty lines after last line in sequence of lines to free.
    • Distinguishes "medium-sized" objects from small ones, and bump-allocates them into an empty block.

Copying Collection in Constrained Memory

  • Idea is to split oldest generation into many block-sized windows and to copying-collect oldest generation one block at a time, during the same collection period.
  • Uses mark phase to create remembered sets
  • Lower space usage = fewer collections
  • Downsides:
    • Each full collection touches every object twice: to mark and to copy
    • Still has overhead: requires space for bookkeeping data structures for marking/collection

MC-squared collector

  • Relaxes the need for blocks to be laid out sequentially by numbering them.
  • Allows for oldest generation to avoid copying full blocks by simply modifying numbers.
  • Marks incrementally.
  • Segregates popular objects into special block without remembered sets. (Reversibly marks as immortal)

10.4 Bookmarking Garbage Collection

Page faults have incredibly high performance overhead. Bookmarking collector prevents paging during collection and mitigates overhead for mutator.

Works with OS to choose page to evict.

Default choice likely to be least-used page, which for a copying collector is likely to be in the next tospace. Worst possible choice, since we will allocate from it soon. Fromspace page better choice

If the runtime must choose a page to evict, we "bookmark" live references in it before evicting it. We use this bookmark in collection, and then erase it when we next load the page.

Sets empty pages to be first to free, otherwise chooses victim and marks outgoing references.

Bookmarked objects never moved to avoid having to update pointers on non-resident page.

10.5 Ulterior Reference Counting

Segregates objects based on mutation rate.

Tracing great for mutation-high, allocation-high, short-lived data. Reference counting great for data with low mutation rate. Both algorithms are at worst-case for other input. Makes sense to use right algorithm for right data.

Uses a bounded-sized nursery with a copying collector. Survivors are promoted to the reference counted area. Pointers from the RC area to the tracing area are tracked.

Collects cycles when free heap space falls below a limit.

10.6 Things to Consider

Specialized behavior good for large objects, and objects without pointers.

Objects that are part of large pointer cycles should all be stored together, so the whole cycle can be freed by freeing a single page.

Tends to allow for very-specialized collection schemes for objects known to have special traits

Chapter 11: Run-Time Interface

11.1 Interface To Allocation

Prologue

  • Languages require that an object be both allocated and initialized (to a varying degree)
  • Three steps to allocation/initialization:
    1. Allocation: allocate a cell of the correct size and alignment. Overseen by the allocation subsystem of the memory manager.
    2. System initialization: set the fields required for the object to be legal (header bits, method dispatch).
    3. Secondary initialization: set fields after the object has been returned from the allocation subsystem.
    • Languages which require complete initialization in steps 1 + 2 and forbid step 3 (Haskell, ML) must pass an "initialization closure" to the allocator to retrieve values from. Stems from the problem of needing to specialize allocation to each type.
      • Alternative (Haskell) is to inline all initialization and call the collector on exhaustion. Possible because allocation is a simple "bump a pointer"
    • Complete initialization allows for optimizations.
      • A generational write barrier isn't necessary because all references must be to older objects.
  • Paramount that steps 1 + 2 be atomic with respect to collector/system. Otherwise may try to interact with an object lacking headers.
  • Possible variations on divisions of labor between steps.
  • Parameters to allocation(Step 1) may include
    1. The size requested
    2. An alignment constraint
    3. Kind of object to allocate (for header fields, size, space to allocate into)
    4. Type of object (object type, including fields to initialize)
  • Post conditions for step 1: May or may not zero, object header may be filled or not, full or partial initialization
  • Must be atomic if inlined.

Speeding Allocation

  • Solution may be to inline the "common case" of success and call out to the less common cases (fast path sketched below)
  • Sequential allocation fast, especially if inlined
    • If pointer to bump in register, assumes per-thread allocation areas
  • May make segregated fits as fast if we know the list to allocate from at compile time
    • Can be as simple as "bump a pointer with checks" in sequential allocation.
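
A sketch of the inlined fast path in C, assuming a per-thread allocation buffer and an out-of-line slow path that refills the buffer or triggers collection (names hypothetical):

    #include <stddef.h>

    typedef struct { char *bump, *limit; } AllocBuf;

    extern void *alloc_slow(AllocBuf *b, size_t size);  /* refill / collect */

    /* Fast path: one add and one compare -- small enough to inline. */
    static inline void *alloc(AllocBuf *b, size_t size) {
        char *p = b->bump;
        if (p + size > b->limit)
            return alloc_slow(b, size);   /* uncommon case, out of line */
        b->bump = p + size;
        return p;
    }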

Zeroing

  • Some languages require zeroing, others do not
    • May be used for debugging, but not the standard
  • Zeroing in bulk, using optimized native libraries, is the prevailing wisdom
  • Zeroing during collection has poor cache performance. Need to zero somewhat ahead of allocator, but soon enough that page stays in cache

11.2 Finding Pointers

The collector must be certain that no references to an object remain before freeing it, but it may choose to retain garbage, with no penalty beyond wasted memory footprint

Conservative Pointer Finding

  • Treat every segment of bytes that could be a pointer value as a pointer value
  • Must be fast
    • Filters out values that do not refer to any heap areas in memory
    • See if the reference refers to data which has been allocated.
      • Language requirements on pointers referring to start of object or at known offset can simplify this
      • Difficult for languages which allow flexible interior pointer locations (C)
    • May need to choose to not allocate blocks referred to by observed literals which point into the heap range.

Accurate Pointer Finding Using Tagged Values

  • Many languages include a type tag with each pointer
  • Two Approaches:
    • Bit Stealing
      • Align objects so the low bits of pointers are zero, then use those bits to mark a value as a pointer (see the tagging sketch at the end of this section)
      • Reduces the number of bits available for numeric values; may require de-mangling for external numeric libraries to work.
        • Some processors have instructions for working with tagged integers
    • Big Bag of Pages
      • Associates each block with a type
      • Better numeric convenience
      • Allows for memoization of value allocation, if the value is already in the block for the given type, simply use that instead
        • Called hash-consing
        • Should use weak references for bookkeeping structures internal to table/blocks
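
A sketch of bit stealing in C on a word-aligned heap: the low bit distinguishes tagged integers from pointers, at the cost of one bit of integer range (representation hypothetical):

    #include <assert.h>
    #include <stdint.h>

    typedef uintptr_t value_t;   /* either a pointer or a tagged integer */

    /* Aligned pointers have low bits 0; low bit 1 marks an integer. */
    static value_t  box_int(intptr_t i)  { return ((uintptr_t)i << 1) | 1; }
    static intptr_t unbox_int(value_t v) { return (intptr_t)v >> 1; }
    static int      is_int(value_t v)    { return v & 1; }

    static void *as_ptr(value_t v) {
        assert(!is_int(v));      /* only untagged values are pointers */
        return (void *)v;
    }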

Accurate Pointer Finding in Objects

  • Objects typically associate pointer offsets with the type-level information such as method dispatch
    • Can alter offsets to improve cache performance with modern set-associative caches
  • May be easier to partition all pointers from all non-pointers
    • If polymorphism is retained into runtime, can become very problematic
    • Better idea may be to place pointers before reference offset into object, and values after it.
  • Some systems require that pointer-chasing methods actually be written by programmer
  • Possible to avoid objects headers entirely in statically-typed languages, but this might cause more work for collector when parsing the heap.

Accurate Pointer Finding in Stacks and Registers

  • Some runtimes heap-allocate call stacks to improve parsability
  • Others make the stack look like heap objects(GHC)
  • Three issues:
    • Finding frames / activation records in stack
      • Also needs to be done by exception handler, continuations
      • Many runtimes have a reference field in each frame which points to the start of the previous frame.
      • May include maps from return addresses to functions which called that address
      • Stack maps
        • Metadata added to each frame on the stack.
        • Collection during initialization of stack map is dangerous
        • Single map per frame not always possible
          • Stack Maps can't know type of polymorphic reference. A pointer in a list of one type is treated the same as a pointer in a list of another.
          • Needs to check types in call predecessor to find.
        • Compiler will need to cooperate, propagate type information through the backend.
    • Finding pointers in each frame
    • Dealing with calling / stack layout conventions
  • Finding pointers in registers
    • Difficult because we can't impose invariants on registers as easily as on stack fields. Fewer registers demands more flexibility.
    • Compiler optimizations may lead to a register holding an obfuscated pointer.
    • If called function must preserve register contents, cannot know if register contains pointer. Must get information from calling function.
      • If the stack doesn't have fields that refer to the previous frame, necessitates another way to unwind the stack.
      • Solution:
        • Store metadata in a side table that says which stack fields the registers are saved to
        • Move from top of stack downwards, 'unsaving' registers and seeing where they were filled in calling frame. After traversal, we will have pointers that the collector can update
        • After collector runs, walk back up stack, and resave modified pointers to stack fields
        • Make sure that we don't process roots from stack twice if they are also in the heap
  • Compressing Stack Maps
    • Stack maps can occupy a relatively large amount of space
    • Compressing them makes sense
      • Multiple safe-gc points have same map
      • If pointers placed similarly together, even more have same map

Accurate Pointer Finding in Code

  • Pointers in code (including dynamically generated code) may refer to data in the heap
  • General solution is to have compiler create a side table that marks embedded pointers.
  • Target Objects that Move
    • If a target of an embedded pointer moves, must alter the pointer in the code
      • May not be able to alter code segment because of OS permissions
        • Costly to change permissions temporarily
          • Caches don't automatically update when code modified
          • Must specifically force invalidation
            • Expensive
        • May be easier to disallow embedded references
        • If code is moved, then you must alter the return addresses and function pointers on the stack.
    • Handling Interior Pointers
      • Interior pointers refer to object but aren't the 'standard reference'
      • Can be tricky. Can be outside of object itself. Ex: C allows pointers one location beyond end of the array to be considered valid.
      • Problem lies in finding object that pointer belongs to
      • Solutions:
        • Use a table or bitmask to mark the start of each object per block
        • May simply store references to power-of-two blocks and parse area containing pointer
        • Big Bag of Pages can use constant object size to simply find start using rounding
      • While they may be rare, storing structures to handle interior pointers can add space/performance overhead
    • Handling Derived Pointers
      • May be treated similarly to interior pointers
      • We know that if the address is p + i, must belong to object at p if i < size(typeof(p))
      • If i is large, may point outside of p.
      • May refer to something like p1 - p2, the distance between the two pointers
      • If i is constant, can make tidy pointer, or standard reference address, at compile time.
      • If the tidy pointer made by addition can be optimized away, we must explicitly require that the compiler keep around a computed tidy pointer for each derived pointer.
      • For moving collector, must compute distance moved and add to derived pointer expression
      • May prohibit certain code optimizations

11.3 Object Tables

Rather than treating references as direct pointers to objects, can treat them as pointers into a table of object addresses, or as offsets into that table (see the sketch at the end of this section).

Allows for straightforward relocation of data. Only one pointer needs to be updated, rather than multiple.

Should be weak references from table

May be beneficial to store attributes and bookkeeping in table rather than in object headers.

Can compact the table itself.

Can be problematic with interior or derived pointers

May be required by some language semantics

  • Ex: Smalltalk's object swapping. Would otherwise require heap parsing.
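
A sketch of object-table indirection in C (names hypothetical): a reference is an index, so relocating an object means updating exactly one table slot, no matter how many references exist:

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static void *object_table[TABLE_SIZE];   /* slot -> current address */

    typedef uint32_t ref_t;                  /* a reference is an index */

    static void *deref(ref_t r) { return object_table[r]; }

    /* The collector moves the object's bytes, then repoints one slot. */
    static void relocate(ref_t r, void *new_addr) {
        object_table[r] = new_addr;
    }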

11.4 References From External Code

Many managed languages allow interop with other languages, those other languages may modify the heap of the managed language

  • System IO, file handles also an issue

Must consider external reference when tracing

  • Only during external function call

Must ensure external code can find internal object

  • Moving collectors a problem
  • May provide layer of indirection, so that moving the object in internal heap is safe
    • Called handles. Are pointers into the "registered objects" table.
    • May not work if external code doesn't follow access protocol. Ex: system calls and IO
  • May simply allocate the objects which will be "pinned" into non-moving space
    • Difficult to determine. May offer programmer a pin/unpin function.

Options for collectors

  • Defer collection of pinned object's region while pinned
  • Immediately collect and move to non-moving region
  • Collector can tolerate non-moved objects. Can be inefficient/problematic.
    • Will introduce fragmentation likely
  • Other language features may require pinning, such as keeping interior pointers on the stack. It may be easier to pin than handle rewriting.

11.5 Stack Barriers

If we need to scan a stack while a thread is running, we can place a stack barrier and not allow thread to return past that level of the stack

  • Allows for incremental scanning

Hijacks return address of frame that would return past it

Scanning thread can move barriers as it scans.

Can also be used to mark stack frames that have not changed since last sweep

Can be used to limit pause time

11.6 GC-safe points and mutator suspension

Two questions for any given code address: is it safe to pause there, and does a stack map exist for it (more mapped points mean larger stack map tables).

Certain sequences must be atomic/not interrupted for GC, such as write/read barriers and object initialization.

Need to specify "safe points."

Threads may be frozen at unsafe points

  • Must either interpret the remaining instructions, or let the thread run until it hits a safe point

Needs maps + safe points anywhere that allocation can occur, or when a function call may happen which will do allocation

In order to avoid unbounded thread wait times, may want other safe points.

  • In each loop
  • At each function entry and return
  • Must make these explicit "GC checkpoints", since these points never allocate and so would never otherwise end up in the GC

Need way to activate these checkpoints when thread restarted

  • Polling: At safepoint check if moving for collection
  • Patching: Insert a collection yield call at safepoint. Harder to get right, but faster.

Thread notification may be synchronous or async

  • If async, must still wait for the thread to reach a safe point and respond, so it is effectively synchronous.

11.7 Garbage Collecting Code

Many languages dynamically create and compile code

Need way to free code that won't be executed

Need runtime support to mark code that won't be executed; difficult to trace because code is frequently referred to by the symbol table or global variables

Special Cases:

  • Closures = function that keeps environment of bindings
    • Can determine which variables are used by code and which can be pruned from closure, freeing other closures most likely
  • Classful OO
    • Normally objects point to classes, code. Pointer ignored by tracer because in non-collected, non-moving area.
    • Need to scan pointers, costly
    • Might want to only run on special cue.
    • Also must ensure that either class unloading/loading is idempotent(immutable class variables for example) or that nothing references class in future.
  • Code which has been optimized, and previous version still needed by executing code
    • May need RC, tracing
    • On-stack replacement allows old code to be freed.

11.8 Read and Write Barriers

Prologue

  • Without read/write barriers, there are many situations where the mutator can hide pointers from the collector.
  • Must consider any references from outside subset of heap being collected as roots, since can't know if they're garbage.
  • Can break into the detection and recording of interesting pointers.

Engineering

  • Barriers usually apply checks before allowing an action to progress
  • Inlining very desirable
    • Can't inline code sequences which are long
      • Solution is to inline the fast path and call out to the slow path when the fast-path check fails.
      • Should order the slow path's checks so the most common cases are handled first.
  • Must be able to access data structures quickly as well
  • Requires cooperation of compiler
    • May need to not optimize the allocator sequences for atomicity requirements.

Precision of Write Barriers

  • Whether to include metadata and to record more types of interesting pointers
  • Choice of trade off of more work for the mutator or collector
    • Collector will have to generate the data that the mutator doesn't save
  • May be required to provide some information for some collection schemes.
  • Decisions:
    • How accurately to record pointer writes?
      • May be cheaper to log everything
      • Depends on remembered set implementation. May be costly to store unnecessary information.
      • May have to filter for concurrent/incremental collectors.
      • May want to inline some filters and call out to others
    • At what granularity to record pointer location?
      • Most precise is address of field in object.
        • Will make remembered set larger
      • May simply point to object, let collector scavenge
      • May use cards(fixed-size regions of memory) or pages.
        • Simply mark card/page as dirty and then let collector traverse
    • Should the remembered set be allowed to contain duplicates?
      • Card/page (and sometimes object) granularity avoids duplicates.
  • Collector's cost:
    • Cards/pages: depends on the number of dirty cards/pages
    • Pointer-recording without duplicates: number of objects modified
    • Pointer-recording with duplicates: number of pointer writes

Hash Tables

  • For pointer / object deduplication, remembered set must be a set.
  • Adding items must be fast
  • Must have one per generation
  • Example: Circular hash table for each generation, with linear hashing
  • Strategy: With multiple generations/remembered sets, let mutator place interesting pointers in "scratch" location, and let collector sort them out.
    • Removes need for mutator to find correct target generation
  • Hash tables used by train algorithm

Sequential Store Buffers

  • Faster recording of pointers
  • Can have one per thread
  • "Bump a pointer" insertion
    • Can avoid out of bounds check using hardware paging.
      • Very costly to handle trap. Must be faster than running the check at each iteration.
    • Can also use architecture-specific tricks
      • Solaris UTRAP for example.
  • May be used directly as remembered sets, or may be front-end to hash tables
    • Choice depends on whether sets are discarded at the end of each collection, on collection scheme.
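
A sketch of a sequential store buffer in C: recording a slot is a single bump-pointer store, with the overflow test written explicitly here rather than elided via a guard page (names hypothetical):

    typedef struct {
        void **next;    /* next free slot in the buffer */
        void **limit;   /* one past the end of the buffer */
    } SSB;

    extern void ssb_overflow(SSB *b);   /* empty into hash table, or grow */

    /* Record the address of a modified slot for the collector to scan. */
    static inline void ssb_record(SSB *b, void **slot) {
        if (b->next == b->limit)
            ssb_overflow(b);            /* a guard page could elide this */
        *b->next++ = slot;
    }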

Overflow Action

  • SSBs and Hash tables can overflow
    • Can enlarge or empty into another, larger structure if being used as frontend

Card Tables

  • Divides heap into fixed-sized cards
  • Marks a card dirty when an interesting pointer written into a field in that card
  • Moves costs from mutator to collector
  • May implement as a bit array (one bit per card), but not the best choice
    • Bit manipulations not atomic. Software synchronization costly.
    • Usually use array of bytes
      • Many efficient ways exist to zero an arbitrary byte, so a card is usually dirtied by setting its byte to 0 (as in the sketch below).
    • Can reduce overhead by instead marking the card holding the object's standard reference rather than the modified slot.
  • Card tables are usually very small compared to the heap.
  • Card size is a compromise between space usage and collector time
  • Promote objects in a generational collector to a clean card to avoid needless scanning next cycle (p. 199 for clarification)
  • Since most cards will be clean, will waste time in sequential scan of cards looking for dirty cards
    • May want to create set of larger cards, and use it to find the indices of the array of smaller cards to sweep.
  • In generational collector with no interesting pointers in nursery, only needs cards for older generation.
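
A sketch of an unconditional byte-per-card barrier in C, dirtying with a zero store as described above (card size and names hypothetical):

    #include <stdint.h>

    #define LOG_CARD_SIZE 9            /* 512-byte cards */

    extern uint8_t   card_table[];     /* one byte per card; 1 = clean */
    extern uintptr_t heap_base;

    /* Unconditional barrier: the pointer store plus one byte store. */
    static inline void write_ref(void **slot, void *value) {
        *slot = value;
        card_table[((uintptr_t)slot - heap_base) >> LOG_CARD_SIZE] = 0;
    }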

Crossing Maps

  • Start of card may not necessarily align with start of objects. Since very large objects can have fields many cards apart, can cause problems in scanning cards.
  • Need a crossing map that determines how objects cross cards.
  • One entry for every card. Stores the offset into the card that the first object starts.
  • Depends on whether cards record slots or objects.
    • Slot-recording must store index to last object in card, negative value if no objects start in card
    • When given a card, can move backwards in the map until a non-negative offset is found; that card contains the start of an object at the recorded offset (sketched below).
  • Object-recording can traverse relevant card from first object.
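
A sketch of the backward search in C for a slot-recording crossing map, assuming negative entries mean "no object starts in this card" and that some object starts at or before the first card (names hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    #define CARD_BYTES 512

    extern int16_t   crossing_map[];   /* per card: offset of first object
                                          start, or negative if none */
    extern uintptr_t heap_base;

    /* Find the first object to scan for a dirty card: walk back to the
     * nearest card where an object starts, then compute its address. */
    void *first_object_for_card(size_t card) {
        while (crossing_map[card] < 0)
            card--;                    /* an object straddles this card */
        return (void *)(heap_base + card * CARD_BYTES
                        + (uintptr_t)crossing_map[card]);
    }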

Summarising Cards

  • If not all objects promoted en masse, some cards will remain dirty between collections.
  • Rather than re-scan these cards for pointers, why not store the interesting pointers in a hash table?
  • May also store information related to generation of referent in table, other information the collector must compute each sweep.
    • Led to an 80% improvement in a 5-generation SML collector

Hardware Virtual Memory Techniques

  • May intercept page eviction and collect 'dirty' bit at this time.
  • May also intercept any modifications of a non-nursery page, and set dirty bit at this point.
    • Poor choice on modern machines, but done historically
    • Using virtual memory for locking can lead to 'trap storms' when many pages (tospace, newly grown older generation) need to have attributes modified at the end of collection.
  • Works without compiler support.

Write Barrier Mechanisms in Summary

  • No winner in performance comparison between different strategies.
  • Write barriers usually have costs less than 2%.
  • For each platform, a different technique might lend itself to easier implementation.

Chunked Lists

  • A common data structure used throughout runtimes/compilers is the chunked list
  • Consists of a linked list of nodes containing arrays.
    • Chunk sizes and starting addresses are typically powers of two (see the sketch below).
  • Desirable for parallel collectors because can take individual chunks. Less contention for lock to modify linked list.
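
A sketch of the structure in C, with a power-of-two chunk capacity so parallel collector threads can claim whole chunks at a time (names hypothetical):

    #include <stdlib.h>

    #define CHUNK_ENTRIES 256          /* power of two */

    typedef struct Chunk {
        struct Chunk *next;            /* link to the previous chunk */
        int           used;
        void         *items[CHUNK_ENTRIES];
    } Chunk;

    /* Append an item, starting a fresh chunk when the head fills; a
     * worker can later detach one Chunk with a single list operation. */
    void chunked_push(Chunk **head, void *item) {
        if (!*head || (*head)->used == CHUNK_ENTRIES) {
            Chunk *c = calloc(1, sizeof *c);
            if (!c) abort();           /* out of memory */
            c->next = *head;
            *head = c;
        }
        (*head)->items[(*head)->used++] = item;
    }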

11.9 Managing Address Space

Some algorithms require large, contiguous memory regions.

  • Some 32-bit machines can't provide that space.
  • Dynamic libraries can be located anywhere in address space.

Many require contiguous regions so that latency-sensitive code can determine an address's space with simple pointer comparisons.

Better to not require contiguous memory, if table lookup cost is acceptable.

Solution is to work with power-of-two sized frames. May mark frames with generations or other attributes.

Can use software emulation of virtual memory, or simply use the OS's virtual memory hardware.

Terminology

  • Reserving = Using the managed system's notion of requesting address space for a use.
  • Allocating = Requesting memory from the operating system.

11.10 Applications of Virtual Memory Page Protection

Prologue

  • Many GC-semantic checks can be done cheaply through configuration of the virtual paging hardware. A violation will be caught as a memory protection fault.
  • Modifying protections can be expensive
  • Useful when dealing with uncooperative code.

Double Mapping

  • System maps same page at two addresses, with different protections at each address.
    • Ex: Preventing fromspace access in concurrent copying collection by setting address mutator knows as "no access" and then mapping somewhere else for collector to use.
  • May be harder in 32-bit systems
  • Some caches index by virtual address. Cache coherency can be threatened by double mapping.

Application of No-Access Pages

  • Uses:
    • Unconditional read barrier
    • Detect null pointer dereferences (trying to access address 0, or near it)
    • Guard Paging
      • Doesn't need to check that the allocation pointer is in bounds if the buffer is followed by a no-access page (sketched at the end of this section).
      • Can use same idea for stack or heap overflows.
    • Can allow large logical space but small virtual memory space
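
A sketch using POSIX mmap/mprotect: placing a no-access page after a buffer turns the bounds check into a hardware trap (error handling abbreviated):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate 'bytes' (rounded up to pages) followed by one guard page;
     * running a bump pointer off the end faults instead of needing an
     * explicit limit check on every allocation. */
    char *make_guarded_buffer(size_t bytes) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t rounded = (bytes + page - 1) & ~(page - 1);
        char *p = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        mprotect(p + rounded, page, PROT_NONE);   /* the guard page */
        return p;
    }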

11.11 Choosing Heap Size

Larger heap sizes generally lead to less thrashing and more throughput.

Too big a heap can lead to excessively large working set, take up too much data in bookkeeping data structures to perform well.

Schemes:

  • After each collection, collect system and heap usage statistics to decide how much heap space is safe to use before next collection
  • If collection frees less than a certain fraction of the heap, expand the heap. This has no provisions for shrinking the heap.
  • If the user can inform the program of the target working set size, the collector can grow the heap until that size is met. Good observed performance.
    • Also marked fromspace as discardable to virtual memory manager
  • Can rely on special OS syscalls to inform program how much memory it may use
  • Can use number of allocation stalls as indicator of memory load
    • Tries to grow nursery every collection until stalls, and then shrinks heap to last observed size that didn't have stalls.
  • If we have knowledge of all programs running on the system, we can figure out heap sizes for maximum throughput across all processes.
  • Dynamic heap sizing is also possible. Change heap size in reaction to page faults

11.12 Issues to Consider

Initialization/allocation requirements for stability and performance must be considered above all

The degree to which the allocation sequence can be inlined.

Whether we can spare a single pointer-register for bump allocation

  • Not an option with too few registers or if register allocation would be too difficult.

Whether more information(pointer location, etc) can be passed on by compiler or whether conservative collection can be used.

Chapter 12: Language-Specific Concerns

12.1 Finalization

Prologue

  • Some resources which lie outside the scope of the collector (file handles, network connections, etc.) require cleanup actions when the program is done with them
  • Need a way to perform these freeing actions when the objects encapsulating the resource are freed.
  • Store a table of finalisers for objects. An object is finaliser-reachable if it can only be reached from the finaliser table.

When Do Finalisers Run?

  • Problem: If they are run during collection, they may need to allocate on the (currently full) heap.
  • Solution: Run after collection
  • Create a queue of finalisers to run and process after collection
  • To run before mutator threads resumed or not?
    • If so, deadlock is possible if a finaliser expects a response from a mutator thread or needs to acquire locks one holds. Running finalisers before resuming is prompter, but lowers throughput.
  • Ideally, finalizer semantics should not limit optimizations in runtime.

Which Thread Runs a Finalizer?

  • Idea: Use a separate thread to run finalisers.
    • Finalisers must now be reentrant.
    • Constructors and finalisers for the same type must be safe to run concurrently.

Can Finalisers Run Concurrently with Each Other?

  • If they must be safe to run concurrently with arbitrary non-finaliser code, they should be safe to run concurrently with each other.

Can Finalisers Access the Object Which Became Unreachable?

  • A finaliser needs the ability to specialize for a given resource: it must take arguments (object fields) or access the object.
  • If so, then collector must trace references of garbage object and ensure they survive

When are finalized objects reclaimed?

  • Some more dynamic languages allow a finaliser to run more than once. Others do not guarantee that it runs only once, but forbid the programmer from explicitly invoking it multiple times.

What happens if there is an error in the finalizer?

  • Finalisers will have access to the language's error recovery mechanism. Not a GC problem.

Is there a guaranteed order to finalization

  • Sane default is to guarantee finalization in the order of reachability. If A and B are garbage, and A encapsulates or references B, but B only references the resource, free B first.
  • Cyclical references will require programmer understanding of semantics of finalisers in question.

The Finalisation Race Problem

  • If the compiler optimizes away the last reference to an object with a finaliser while its resource is still in use, the collector may run the finaliser and free the resource out from under the code still using it.
  • Solution is to provide a keepalive function that the compiler won't optimize away.

Finalisers and locks

  • Finalizers may modify global data structures related to resources, but appear to run concurrently to programmer-provided code.
  • Solutions include:
    • Locking everything, always
    • Adding finalization calls to a queue, calling them at a programmer- or compiler-defined "safe point", and running all finalizers at that point
      • The pain is in making sure the location really is safe.
      • If the thread running the finalizer already holds a shared lock, the finalizer will run without the lock-acquiring semantics that the programmer designed.
      • Ex: this is why Java will not run finalizers while the thread holds any user-visible locks.

12.2 Weak References

Prologue

  • Determining object liveness through ordinary (strong) pointers alone can have suboptimal behavior
  • There may be data we want to reference only while other objects are associated with it. If no other objects refer to it, we have no use for it.
  • These are called weak pointers, or weak references.
  • May be useful to be notified when these references are nullified, so that we can remove them from data structures.
  • Tracing collectors can implement them by not chasing weak pointers during the marking phase, and then updating them with the new object address or nullifying them during the collection phase.
  • The collector must be able to identify a reference as weak.
  • Solutions:
    • May alter the pointer's least significant bits to record the information
      • Requires undoing the alteration in order to dereference
    • May create a weak reference object. It is special only to the collector and can behave like a normal object to user code.
  • The system must give the programmer a way to specify a reference as weak
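
A minimal Java sketch of the weak-reference-object approach, including clearing notification through a queue (sizes and timeouts are arbitrary):

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;

    class WeakDemo {
        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
            // The reference object looks ordinary to user code but is special to
            // the collector, which clears it once the referent is only weakly
            // reachable, then enqueues it for notification.
            WeakReference<byte[]> ref = new WeakReference<>(new byte[1024], queue);

            System.gc();                                   // a hint, not a guarantee
            Reference<? extends byte[]> cleared = queue.remove(1000); // null on timeout
            System.out.println(cleared == ref);            // true once cleared and enqueued
            System.out.println(ref.get());                 // null after clearing
        }
    }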

Additional Motivations

  • Caches and other programs that make a space-time tradeoff. Getting "low space" feedback from the runtime can help simplify cache expiry logic.

Supporting Multiple Pointer Strengths

  • Prologue
    • We use the phrase x-reachable if an object can be reached from the roots by pointers of strength x and no stronger; x*-reachable means reachable by pointers of at least strength x.
    • Strengths:
      • Strong: Collector never clears.
      • Soft: Collector clears if space is at a premium.
      • Weak: Collector always clears if the object is only weakly-reachable.
      • Finaliser: The reference from the table of finalizers to an object with a finalizer.
      • Phantom: Collector must clear. Cannot be dereferenced.
    • Harder to implement with reference counting. Need to store more metadata, have consistent way to decide when to clear soft references.
    • Detecting cycles must account for these semantics
  • Using Phantom Objects to control finalization order
    • Phantom objects can have a phantom reference to one object, but a strong reference to another. When the first object runs through finalization, the phantom reference is cleared. Then the second can run.
    • Phantoms purposefully designed to have weaker reachability than finalizers for this reason.
  • Race in weak pointer clearing
    • Finalizer race condition involving compiler optimizing references away can happen here
  • Notification of Weak Pointer Clearing
    • Placing cleared pointers in a queue allows for easy notification
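
A hedged Java sketch of the phantom pattern above: the phantom can never be dereferenced, and its enqueueing signals that the referent has passed through finalization, so dependent cleanup can safely run afterwards:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;

    class PhantomDemo {
        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<Object> queue = new ReferenceQueue<>();
            Object resource = new Object();
            PhantomReference<Object> phantom = new PhantomReference<>(resource, queue);

            resource = null;                      // drop the last strong reference
            System.gc();                          // a hint, not a guarantee
            Reference<?> r = queue.remove(1000);  // enqueued only after finalization; null on timeout
            if (r == phantom) {
                System.out.println("referent finalized; safe to release dependents");
            }
        }
    }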

12.3 Issues to Consider

Finalization and weak pointer semantics determined at language design time.

Both of these features add complexity to tracing that should be considered from the outset. Do not bolt them on later.

Concurrent access and clearing of weak references is an issue

  • May be better to clear weak references during an infrequent stop-the-world phase of an otherwise concurrent collector.

Chapter 13: Concurrency Preliminaries

13.1 Hardware

Threads

  • Thread states = ready, blocked, or running
  • Threads can move processors when descheduled
    • Can use processor affinity
  • Simultaneous multithreading or hyperthreading (multiple hardware threads sharing the same execution pipeline) is not actual multiprocessing, but is similarly named
  • Multiprocessor = More than one processor
  • Multicore/CMP/Many-core = many processors on same integrated circuit
  • Multithreaded software runs concurrently on a multiprocessor
  • Multiprogrammed = software executing multiple threads on a single processor

Interconnect

  • Shared memory differentiates local concurrent software from distributed systems
  • Access mediated by bus or interconnect
    • Bus = single point that all requests pass through
    • Granularity of memory access is a tradeoff between bandwidth and the number of busses
    • Busses become a bottleneck on 8-16+ core systems
  • SMP vs NUMA
    • SMP = All processors have equal access time to memory
    • NUMA = Each processor has a small bank of memory that they have faster access to

Memory

  • Because of concurrent access to memory, hard to describe a single state
  • This is alright because each unit of memory is accessed individually, and that access is consistent

Caches

  • Memory access is slow relative to processors, so we use faster and smaller caches
  • If needed address in cache, called a cache hit
  • If needed address not in cache, called a cache miss
    • On cache miss, we bring memory's values into cache
    • If cache full, we must evict a cache line. Evicted line called victim.
  • If data is written to memory immediately on a write, the cache is called write-through. If it is written only when the line is flushed, it is called write-back.
  • Cache Organization
    • Cache Location
      • Fully Associative
        • Lines from any memory addresses can reside in the cache together at the same time
      • Direct-Mapped
        • Each address has a specific place in cache it must reside
      • k-way Set Associative Caches
        • Each address has k possible cache locations in which it may reside
        • Eviction may choose any of the k lines in the set
      • Victim caches
        • On eviction, a line is put in a fully associative victim cache
        • Allows easier recall of victims
    • Cache level relationship
      • Strictly inclusive if a line held by a lower level must also be held by the higher level
      • Strictly exclusive if a line held by a lower level may not be held by the higher level
      • Can be neither, allowing anything

Coherence

  • Multiple levels of cache may get out of sync with one another, holding different values for the same address
  • Common cache coherency protocol: MESI
    • Modified: This cache holds the only copy of the line; it has been modified but not written back to main memory.
    • Exclusive: Only one cache holds the line; it is not modified.
    • Shared: Other caches may hold the line; all have the same value.
    • Invalid: The cache does not hold the line.
    • To satisfy a read, the cache must hold the line in the Modified, Exclusive, or Shared state. How to satisfy a write depends on the state of the address.
    • Point is that there is only one writer for a given line, and no caches hold disagreeing values.
    • Does not scale well to many processors.
  • False sharing = if multiple processors frequently access and update different data in the same cache line, the line will be moved and invalidated between processors frequently. "Ping-ponging"
  • Performance example: Spin locks
    • Poor because threads on different processors will contend for line with lock and invalidate it frequently.
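
A test-and-test-and-set spin lock illustrates the fix: spin on a read (the line stays Shared in MESI) and attempt the invalidating write only when the lock looks free. A Java sketch, not a production lock:

    import java.util.concurrent.atomic.AtomicBoolean;

    class TTASLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void lock() {
            for (;;) {
                while (held.get()) Thread.onSpinWait();      // read-only spin on the cached line
                if (held.compareAndSet(false, true)) return; // one invalidating write to acquire
            }
        }

        void unlock() {
            held.set(false);
        }
    }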

13.2 Hardware Memory Consistency

Prologue

  • Writes to a given address have a total order, and a processor won't see a value change without a write having happened between the two accesses.
  • The order of writes to more than one address need not correspond to the actual order of changes in memory or cache
    • Ex: A write buffer that coalesces multiple writes to an address by keeping the later write rather than performing both the earlier and later ones. A read will fetch the value from the write buffer if it is there.
    • Ex: Compilers may reorder instructions
    • Almost universally done for performance gains
  • Non-atomic, relaxed ordering can break program invariants
  • Three access types: read, write, atomic.
    • Atomic operations use primitives such as test-and-set to ensure an atomic read-modify-write

Fences and happens-before

  • A fence prevents certain types of accesses on one side of it from being reordered to the other side
  • An alternative is an acquire-release pair, which prevents accesses between them from moving before the acquire or after the release.
    • Useful for protecting critical sections
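
In Java these semantics surface as volatile accesses: a volatile write acts as a release and a volatile read as an acquire. A minimal sketch of safe publication under that assumption:

    class Publish {
        private int payload;                // plain field
        private volatile boolean ready;     // volatile write = release, volatile read = acquire

        void writer() {
            payload = 42;                   // cannot be reordered after the release below
            ready = true;                   // release
        }

        void reader() {
            if (ready) {                    // acquire
                assert payload == 42;       // guaranteed visible after the acquire
            }
        }
    }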

Consistency models

  • Strict consistency => All operations have a total ordering everywhere in the system
  • Sequential consistency => All operations appear in a single total order that is consistent with each processor's program order
  • Weak consistency => Atomic operations act as total fences
  • Causal consistency => Reads may not be reordered across writes that may alter the value read

13.3 Hardware Primitives

AtomicExchange and test-and-set are both fairly simple. They set a value unconditionally and return the old value. This can be seen as setting a value conditionally if the calling code's semantics are "only proceed if a certain value was returned."

Compare and Set

  • Takes an old and a new value; if the old value equals the value currently at the address, writes the new value. Returns whether or not it updated.

Compare and Swap = Same as compare-and-set, but returns the old value

  • The trap is that the value might have changed multiple times and is now equal again to the old value being checked for
  • Ex: Misreading an int for a pointer because the bit patterns happen to match, and proceeding.
  • Called the ABA problem
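
The canonical retry loop, sketched in Java. AtomicInteger.compareAndSet returns a boolean (compare-and-set semantics), while compareAndExchange (Java 9+) returns the witnessed old value (compare-and-swap semantics):

    import java.util.concurrent.atomic.AtomicInteger;

    class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        int increment() {
            for (;;) {
                int old = value.get();
                if (value.compareAndSet(old, old + 1)) {
                    return old + 1;   // success: no interfering write occurred
                }
                // failure: another thread won the race; reload and retry
            }
        }
    }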

Load-linked / Store Conditionally

  • Use processor's cache coherence mechanism to detect changes at a memory address.
  • Solves the ABA problem very well

Atomic Arithmetic: Atomic increments and decrements

  • Not as powerful as other primitives. They have finite consensus numbers
  • A primitive with a consensus number of N allows N threads to correctly agree on a single value

More powerful primitives

  • Multi-word compare-and-swap can use a counter in the second word to avoid the ABA problem
  • More useful to update two non-adjacent words atomically
    • Some chips like 88000 have such primitives

Overheads of atomic primitives

  • The expense of atomic primitives leads programmers to avoid them, which invites error
  • Expensive for two reasons:
    • Cache coherence
      • Must acquire exclusive access to the line
    • Memory fences sacrifice the optimizations that reordering would allow

13.4 Progress Guarantees

Important to guarantee that contending threads will make progress, and to bound wait times

Guarantees:

  • Wait-free: Every thread can always make progress regardless of the actions of other threads.
    • Typically involves threads helping each other along
    • If a thread's action would make another thread that already holds resources unable to proceed, the first thread waits until the other leaves its critical section
  • Obstruction-free: Given a period of isolated execution, each thread will complete in a finite number of steps.
    • Easier to provide than wait-freedom, but may require scheduler cooperation
    • Threads can tell when they are contending and will back off
  • Lock-free: Infinitely often, some thread finishes in a finite number of steps
    • Requires only that at least one contender makes progress at any given time.

Guarantees and Concurrent Collection

  • Parallel collectors use multiple threads to collect, but stop mutator threads while collecting.
  • Concurrent collectors perform some collection while mutator threads run
  • Concurrent collection has implementation edge cases
    • When objects are freed or become garbage during collection
    • Proof of termination
      • Mutators may need to be throttled back to avoid making more work than the collector threads can complete
      • Need to prevent mutators from using all memory before collection finishes.
  • May offer different guarantees for different portions.

13.5 Notation Used for Concurrent Algorithms: Not relevant. Used for describing pseudocode.

13.6 Mutual Exclusion

Need to balance progress guarantees and mutual exclusion

  • Peterson's algorithm is a good tool, as sketched below
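
A sketch of Peterson's algorithm for two threads (ids 0 and 1) in Java. Volatile/atomic accesses are needed so the flag and turn writes are not reordered:

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class Peterson {
        private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 1 = wants to enter
        private volatile int turn;

        void lock(int self) {
            int other = 1 - self;
            flag.set(self, 1);                  // announce intent
            turn = other;                       // politely yield priority
            while (flag.get(other) == 1 && turn == other) {
                Thread.onSpinWait();            // wait while the other thread has priority
            }
        }

        void unlock(int self) {
            flag.set(self, 0);
        }
    }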

13.7 Work Sharing and Termination Detection

Parallel algorithms need a way to detect that all of their threads have terminated

Simple if processing a finite amount of data, but sometimes threads create more work to process

  • Can push work to other threads or pull from other threads. Pulling is called work-stealing
  • No work may be lost
  • Counters may become bottlenecks and cause cache contention

May use a separate thread whose only job is to detect termination

  • Workers set a busy flag. The detector scans the flags.
  • Needs to track jobs pushed or pulled in order to atomically scan all flags.

Rendezvous Barriers

  • Common to need all threads to reach the same phase of collection.
  • May simply need all threads to reach a given point, called a rendezvous barrier

13.8 Concurrent Data Structures

If a structure is accessed rarely, we can simply lock it entirely with a mutex, but this won't scale

Concurrent structures exist that allow interleaving of certain actions on the structure.

  • A structure is linearizable if overlapping actions have the same effect as separating them.
  • Each action has a point at which it can be observed to have occurred, called its linearization point.
  • Variety of locking patterns that can be used:
    • Coarse-grained locking: One large lock for the whole structure.
    • Fine-grained locking: Lock subsets or individual elements. Needs to ensure the locking order is consistent, or it will cause deadlock.
    • Optimistic locking: Searches the structure without locks, then locks and verifies that the structure hasn't changed since the search and that the right elements are locked.
    • Lazy update: Use metadata to mark an element as being removed or altered, release the lock, and then acquire the lock to make the change. This prevents blocking reading threads.
    • Non-blocking: Use of atomic primitives to remove the need for locking.

Concurrent Stacks
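
A minimal sketch of the classic non-blocking (Treiber-style) stack, using the CAS pattern above. In a garbage-collected language the ABA problem is largely defused here, since a node's address cannot be reused while a thread still holds it:

    import java.util.concurrent.atomic.AtomicReference;

    class TreiberStack<T> {
        private static final class Node<T> {
            final T value;
            final Node<T> next;
            Node(T value, Node<T> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void push(T v) {
            Node<T> h;
            do {
                h = head.get();
            } while (!head.compareAndSet(h, new Node<>(v, h))); // retry on contention
        }

        T pop() {
            Node<T> h;
            do {
                h = head.get();
                if (h == null) return null;                     // empty stack
            } while (!head.compareAndSet(h, h.next));
            return h.value;
        }
    }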

13.9 Transactional Memory

Implementation:

  • All reads and writes in a transaction should appear not to interleave with those of any other thread.
  • Should work across many independent words.
  • Attempts to commit the transaction may fail. On success, the changes become visible. On failure, software may respond by retrying or taking other steps.
    • Allows transactions to execute speculatively.
  • Atomic, Consistent, and Isolated.
  • If two transactions would interfere with each other, one must be killed to allow the other to proceed.
  • Atomicity methods
    • Undoing
      • Creates an undo log, actually makes writes to main memory.
      • On failure, undoes instructions in log.
    • Buffering
      • Create a scratch buffer and store all writes to it. On success, write to memory. On failure, discard.
      • More work for successful commits, and success is common case.
  • Conflict detection
    • Can use eager or lazy methods
    • Eager:
      • Checks all accesses for conflicts as they happen
    • Lazy:
      • Checks for conflicts only when the transaction attempts to commit

Use for collection

  • Caveats:
    • STM has a significant overhead
    • Hardware transactional memory may have gotchas, not obvious how code maps to transactions, may have strange cache behavior.
    • May interfere with guarantees because the retry behavior is less understandable than hand-written locking
    • May require very careful performance tuning.

Interaction with languages with transactional memory

  • Memory manager may cause more transactions to fail, cause many retries
  • May have collector and mutator fight for transaction, constantly restart
    • Harder to handle with hardware transactional memory
  • Transaction logs need to be part of the root set, or garbage created during a failed transaction would cause premature freeing
  • Undoing allocation may require a lock on allocation pointers or free-list metadata, which is undesirable
  • If non-transaction actions allowed during transaction, need to make sure that allocated memory is initialized.

13.10 Issues to Consider

Hard to get right

What guarantees are actually necessary?

What atomic primitives are supported on our architectures?

Chapter 14: Parallel Garbage Collection

14.1 Is There Sufficient Work to Parallelize?

Goal is to reduce overhead by exploiting hardware

In stop-the-world, will shorten pause times

In concurrent schemes, will shorten GC cycle time

Must be worth it in the tradeoff of the overhead of synchronization primitives

In practice there are not many stalls, because object graphs are rarely very deep. This makes them amenable to parallelism.

Harder to parallelize tracing, easier to parallelize "sweep" activities like fixing references.

14.2 Load Balancing

Needs to allocate work to threads properly, or execution will be mostly sequential

Has coordination cost

Methods

  • May schedule all work ahead of time, such as with a sweep that divides work by region. Called static balancing
  • Most tasks need dynamic balancing. Use a heuristic to divide labor.
    • Good solution is to over-partition the work into more pieces than threads and allow threads to take them off a queue.
      • More accurately partitions based on real work than heuristics

14.3 Synchronization

Over-dividing work into too-small granules incurs synchronization overhead, despite better processor utilization

Need to prevent double-copying, which causes problems like future updates affecting only one copy and inconsistent state.

Synchronization may be removed where actions are idempotent and repeating work not too costly.

14.4 Taxonomy

Processor-centric algorithms

  • Gets work granules of different sizes
  • Usually by work-stealing
  • No consideration of locality

Memory-centric algorithms

  • Take locality into account.

14.5 Parallel Marking

Prologue

  • Marking is three actions: getting an object, marking it, and adding its children to a work list
  • All known algorithms are processor-centric
    • No synchronization is required if the work list is thread-local
    • Marking can be idempotent (setting a bit)
    • Synchronization is required for work-stealing
  • If marks are stored in a bitmap, access to it needs synchronization.
  • May need to move work out of the thread-local store

Processor-centric Techniques

  • Work stealing (see the sketch after this section)
    • If its work list is empty, a thread tries to steal from other threads'
    • Needs to lock, but can give up if unable to acquire the lock and move on to the next thread's queue
    • Problem with processing a large array of pointers: the local work stack will overflow.
    • Solution is to push arrays onto the global work queue as smaller lists of pointers
    • Only requires synchronization when taking work; faster than other algorithms, which always use synchronization
    • Termination for work stealing
      • Use of an atomic counter is possible, but contention can lead to serialization
      • Thread-local counters allow introspection without locks
      • Detection involves setting a "no threads stole work" flag and trying to steal work; any thread that acquires work unsets the flag; if stealing fails and the flag is still set, terminate.
      • May be worthwhile to have a lock for termination detection
  • Grey Packets
    • Divide work into packets and have threads contend for work packets
    • Each thread processes an input packet of grey objects and fills an output packet
      • Adds the children of each marked object to the output packet
      • Called grey packets because the input and output packets are both grey in the tricolor abstraction
      • Allows prefetching of a packet's objects.
      • Only requires synchronization to get an input packet and to hand in an output packet
      • The termination state is that the number of empty packets (outputs with no grey elements) equals the total number of packets.
  • Channel
    • Create a channel (a queue with a single reader and a single writer) between each thread and each other thread
    • Store the channels in a circular buffer
    • Push the children of a marked object onto another thread's queue
      • If there are no children, take work from the global pool and push that instead.
    • No synchronization is needed, and this performs better than work stealing on machines with many processors
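
A structural sketch of the work-stealing variant in Java (illustrative names; termination detection elided). The only synchronization is the atomic mark bit and the steal from another thread's deque:

    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.atomic.AtomicBoolean;

    class Marker {
        static class Node {
            final AtomicBoolean mark = new AtomicBoolean(false);
            Node[] children = new Node[0];
        }

        final ConcurrentLinkedDeque<Node> local = new ConcurrentLinkedDeque<>();
        final List<Marker> others;                 // the other marking threads

        Marker(List<Marker> others) { this.others = others; }

        void run() {
            for (;;) {
                Node o = local.pollLast();         // own work: LIFO end, better locality
                if (o == null && (o = steal()) == null) return; // no work anywhere (termination elided)
                if (o.mark.compareAndSet(false, true)) {        // idempotent marking
                    for (Node child : o.children) {
                        if (child != null) local.addLast(child);
                    }
                }
            }
        }

        private Node steal() {
            for (Marker m : others) {
                Node o = m.local.pollFirst();      // steal from the FIFO end
                if (o != null) return o;
            }
            return null;
        }
    }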

14.6 Parallel Copying

Differs from marking in that an object may be copied only once

Processor-Centric Techniques

  • Work dividing
    • Each copying thread given a stack of work to do.
      • Stacks offer easier synchronization
    • Synchronization works with "rooms," divided into pushing rooms and popping rooms
      • Pushing and popping never overlap: one of the two rooms must always be empty
      • A new thread moves into the non-empty popping room and pops work out of the shared stack, putting the work it generates onto its own stack. It leaves the pop room when done.
      • Once all threads leave the pop room, it is locked against future entrance and the generated work is emptied into the push room's shared stack. Each thread leaves the push room when done.
      • The last thread to leave the push room unlocks the pop room.
    • Problem with waiting: because work time varies between threads, there may be significant wait time.
      • Can fix by making threads leave the pop room as soon as they get work, since working doesn't depend on the shared stack.
    • Terminates when the shared stack is empty as the last thread leaves the push room, and no objects are being worked on outside the rooms.
  • Copying in parallel
    • If all allocation is done in the same region, need to synchronize on allocation.
    • May race to write a "busy" pointer into the object's forwarding slot.
      • If thread-local buffers are used for allocation, threads can race to write the forwarding location directly.
        • May need to un-allocate if another thread has already copied the object in the interim.
    • If local allocation buffers are bounded, NUMA architectures may suffer due to allocation on another processor's memory
      • NUMA awareness should be baked into the allocator

Memory-Centric Techniques

  • Per-thread fromspace and to-space
    • Risks poor load-balancing and space overflow
  • Block-structured heaps
    • Over-partition fromspace and tospace and have threads compete for granules of fromspace
    • Small blocks a good idea
    • A good way to fight wasted space due to fragmentation in each subset of tospace is to use big-bag-of-pages.
    • A naive implementation uses breadth-first copying.
      • Separates parents from children, giving poor locality
  • Channels
    • Removes need for atomic synchronization primitives
    • Designed for NUMA
    • Over-partition the heap and consider each subset a work list
    • If a thread sees a reference into its own partition, it adds it to its work list
    • Otherwise it sends the reference to the owning thread through a channel
  • Card Tables
    • Difficult to balance load with card tables
    • A straightforward partition has problems because cards tend to have uneven amounts of live data and work
    • The solution is to over-partition

14.7 Parallel Sweeping

Embarrassingly parallel

Either partition heap statically or over-partition

  • Adding garbage to the free-list becomes a bottleneck
  • Segregated fits is a good solution, as is thread-local allocation

Giving each thread multiple blocks is a good way to reduce contention

14.8 Parallel Compaction

Easy to parallelize, since after marking we know the future location of each object (we know the amount of live data seen before it during marking).

Multiple passes are required:

  • 1. Calculate forwarding addresses and insert into headers
  • 2. Update all references in objects
  • 3. Copy objects

Can do each pass in parallel, but all threads need to reach a rendezvous before the next phase starts.
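
A sketch of pass 1 under assumed per-object live sizes: each object's forwarding address is just the total live data that precedes it. Computed sequentially here for clarity; a parallel version gives each thread a heap region and prefix-sums the per-region totals:

    class Forwarding {
        // liveSizes[i] = size in bytes of the i-th live object, in address order.
        static long[] computeForwardingAddresses(long[] liveSizes, long heapBase) {
            long[] forward = new long[liveSizes.length];
            long next = heapBase;                 // next free address after compaction
            for (int i = 0; i < liveSizes.length; i++) {
                forward[i] = next;                // future location of object i
                next += liveSizes[i];
            }
            return forward;
        }
    }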

Chapter 15: Concurrent Garbage Collection

15.1 Correctness Of Concurrent Collection

Must be able to retain all reachable objects, and must eventually complete collection cycle

Tricolor Abstraction

  • Easy to reason about correctness with the tricolor abstraction
  • Collection can be seen as turning the population of grey (unscanned) objects into white (dead) and black (live) ones.
  • Concurrent mutation can make the mutator's and collector's views of the world inconsistent
  • If the mutator changes white or grey objects, that is fine, because the collector still needs to scan them
  • If the mutator changes black (already scanned) objects, that is alright as long as any new references are already known to be reachable.

Can still cause problems by freeing live objects

Lost Object problem

  • Two failure scenarios
    • A mutator can hide a white object directly reachable from a grey object by inserting a reference to it into the fields of an already-scanned object and deleting the link from the grey object
    • Can hide a transitively reachable object by inserting a pointer from the chain into already-scanned objects, and then breaking the pointer chain among the unscanned objects.
  • Only possible to lose objects if two conditions hold:
    • The mutator stores a pointer to a white object in a black object
    • All paths from any grey objects to the white object are destroyed
  • Strong and weak tricolor invariants:
    • Weak tricolor invariant: All white objects pointed to by a black object are grey-protected.
      • To prevent both conditions from holding, we must make sure that all references from black objects to white objects are tracked
      • Non-copying collectors can ensure this through their traversal pattern, so that all references from black objects lead to black objects
    • Strong tricolor invariant: There are no pointers from black objects to white objects.
    • Strong implies weak, but not the converse
    • Ensuring the strong invariant is sufficient for a copying collector
  • Precision
    • Varying the precision of live-object tracking means holding onto some garbage for a duration
    • Stop-the-world has maximal precision; concurrent schemes typically trade precision for concurrency
  • Mutator color
    • Useful to consider mutator thread as if it were an object, and roots like references.
    • If a mutator is allowed to become grey, its roots need to be rescanned
    • May be multiple mutator threads, need to handle
  • Allocation color
    • Need to consider which color to give new allocations
    • Must not let the mutator's references to the allocated cell break the tricolor invariant.
  • Incremental update solutions
    • Need a write barrier to inform collector of mutator actions which would violate the invariants
  • Snapshot-at-the-beginning solutions
    • If a reference to an object is removed during collection, the object must conservatively be treated as a root
    • Requires write barrier

15.2 Barrier Techniques for Concurrent Collection

Collector can respond to concurrent mutation in only three useful ways

  1. Add to the wavefront (the black/white divide) by shading an object grey if it was white (essentially adding a root)
  2. Advance the wavefront by scanning an object to make it black
  3. Retreat the wavefront by marking a black object grey ('unscanning' it)
  • Any other actions(reverting object to white or shading black without scanning) would break invariants.

Grey Mutator Barrier Techniques

  • Assumes a grey mutator
  • Use an insertion barrier to prevent storing white pointers in black objects
  • Because the mutator is grey, no read barrier is needed
  • Techniques:
    • On writing a pointer into a black object, make the object grey to require re-traversal. (Steele)
    • Can mark any object with a pointer written into it grey, since shading is idempotent if it is grey and harmless if it is white. Removes the need to check the state of the object being modified. (Boehm)
    • Always shade the target of an inserted pointer, regardless of whether the reference is later removed. Ensures progress of concurrent threads. (Dijkstra; see the barrier sketch below)
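
A sketch of the Dijkstra- and Steele-style barriers above (object layout and shade() are illustrative; a real shade() would also push the object onto the collector's work list):

    class Obj {
        static final int WHITE = 0, GREY = 1, BLACK = 2;
        volatile int colour = WHITE;
        final Obj[] fields = new Obj[8];
    }

    class InsertionBarriers {
        void shade(Obj o) {
            if (o.colour == Obj.WHITE) o.colour = Obj.GREY;  // and enqueue for scanning
        }

        // Dijkstra-style: shade the target of the inserted pointer.
        void writeDijkstra(Obj src, int i, Obj target) {
            src.fields[i] = target;
            if (target != null) shade(target);
        }

        // Steele-style: revert an already-scanned source to grey for re-traversal.
        void writeSteele(Obj src, int i, Obj target) {
            src.fields[i] = target;
            if (src.colour == Obj.BLACK) src.colour = Obj.GREY;
        }
    }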

Black Mutator Techniques

  • Can use a read barrier to mark white objects grey as soon as their reference is read, written, or deleted.
  • Can use virtual memory hardware to implement an access barrier keeping the black mutator away from unscanned memory.
  • Can do the converse and shade the targets of deleted references, preventing removal of the last observed pointer from freeing live objects.

Completeness of mutator techniques

  • No other organization schemes can preserve the tricolor invariant under concurrent collection(Pirinen 1998)
  • Can make variations by coarsening granularity and promptness of these techniques.
    • Ex: Mark targets of altered references grey and re-scan
    • May cause difficulties in ensuring termination
  • Strong invariant techniques need only ensure that one of the two conditions cannot hold; they don't need to do any more
  • Weak invariant / snapshot techniques need a black mutator

Concurrent write barrier mechanisms

  • Prologue
    • Mutator barriers need to store references to grey objects in a structure, which collector needs to use to color relevant objects
    • Structures need to be safe for both mutator-collector concurrency and mutator-mutator concurrency
    • Can use a log to record grey objects, also card tables.
    • Card tables
      • Dirties a bit to record that some object in a range of addresses is grey
      • What counts as grey (targets of interesting pointers, as well as transitive references through the stack's roots) varies with the collection technique
      • The card table becomes the collector's work list
        • Mutators can re-dirty cards, so the collector needs to scan repeatedly
      • One level card tables
        • Card is in one of three states: dirty, refining, or clean
        • Mutator barriers set to dirty using store instruction, no expensive atomic primitives
        • Collector sets to refining without atomic instruction, starts cleaning
        • The collector tries to set the card to clean with an atomic primitive (test-and-set, where the test checks that it is still refining). If it cannot write, a mutator must have set the card to dirty while the collector was refining, and it must be re-scanned (see the sketch after this list).
      • Two level card tables
        • The overhead of a card table is in scanning cost, since most cards will be clean.
        • Cheaper to also keep a card table that records the locations of dirtied cards at a much coarser (2^n) granularity.
        • Concurrency issue: the mutator dirties fine-grained and then coarse-grained cards, while the collector reads coarse-grained and then fine-grained ones. A race may make both appear clean. Preserving atomicity may need primitives with overhead in the mutator.
    • Reducing Work
      • Don't need to recursively mark children in the mutator if we know that during collection we will trace through the children on the dirty card.
        • Avoids traversing the graph twice
      • Reduce the number of dirty cards by only marking a card dirty if it is untraced.
        • Can keep a card table of traced objects
        • Can use local allocation buffer to record allocations in a buffer to trace. When the buffer overflows, trace all and mark all as clean.
          • Leaves some garbage for next cycle, but faster and ensures progress of threads.
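
A sketch of the one-level card protocol described above: the mutator dirties with a plain store; the collector moves DIRTY to REFINING without atomics, scans, then attempts REFINING to CLEAN atomically, where failure means the card was re-dirtied mid-scan:

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class OneLevelCardTable {
        static final int CLEAN = 0, REFINING = 1, DIRTY = 2;
        final AtomicIntegerArray cards = new AtomicIntegerArray(1 << 20);

        void mutatorWriteBarrier(int card) {
            cards.lazySet(card, DIRTY);           // plain store; no atomic primitive needed
        }

        boolean collectorCleanCard(int card) {
            cards.set(card, REFINING);            // ordinary store by the collector
            scan(card);                           // trace the objects on this card
            // Atomic step: succeeds only if no mutator re-dirtied during the scan.
            return cards.compareAndSet(card, REFINING, CLEAN);
        }

        private void scan(int card) { /* trace grey objects on the card */ }
    }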

Issues to consider

In some systems, the throughput that mutator barriers cost is outweighed by what shorter collection pauses gain

  • Ex: A multi-phase payment system where a GC pause means a timeout, which requires reprocessing.

Costs vary by collection scheme.

  • Ex: Reference counting = high, while mark-sweep is low

A large impact on pause time is whether precision must be high, or whether we can allow garbage to "float" from one collection cycle into the next, where it will be freed.

Concurrent collection is very useful for large heaps where the number of live objects is very high. Even with a parallel collector, pause times would be long.

Downsides:

  • Needs more headroom, because allocation continues during collection.

Variants:

  • Generational partially-concurrent hybrid collector
    • Manage the older generation concurrently
    • Since most objects die young, nursery collection is quick enough
    • Avoids a mutator barrier on young objects, because nursery collection is stop-the-world
    • The biggest performance loss in concurrent collectors is that they always mark objects allocated during collection black, yet most objects die young
      • The hybrid is the solution

Chapter 16: Concurrent Mark-sweep

Prologue

Mark-sweep very easy to parallelize

Also fairly performant, because it doesn't require a read barrier.

  • A write barrier costs 2-3% overhead in benchmarks; a read barrier costs 20%.

16.1 Initialization

Rather than running at heap exhaustion, collectors can run concurrently with allocation

When to trigger new mark phase critical

  • Must not run out of memory before collection finished

A simple implementation is to require all allocation to do a bit of collection work (hijacking the mutator thread)

Usually needs to stop the world to get mutator roots

  • Can use a grey reference access barrier to avoid stopping all mutators, incrementally adding work as mutators yield to collection work, but at least one mutator must be scanned to seed the work list.

16.2 Termination

If all mutator threads are black and no grey pointers are held, collection can terminate: anything not scanned is garbage.

Grey mutator termination requires re-scanning mutator roots to ensure that no new references to unmarked objects have been introduced.

16.3 Allocation

New objects colored according to mutator

Since a black mutator could not have learned a white or grey object's reference, we allocate black.

A grey mutator has options:

  • Can simply color new objects black during marking and add them to the roots, and allocate white otherwise
  • If an object is initialized with all its fields (immutable languages), we can color it based on its references: if none are white, we can allocate black. Otherwise we must allocate white.
  • Allocating white may let a high allocation rate in the mutator prevent prompt termination

16.4 Concurrent Marking and Sweeping

Need a way to differentiate between newly allocated objects and the garbage identified by marking.

Solution = a new color for the abstraction: purple

  • After marking, set unmarked objects to be pending freedom
  • Allocate new objects without this mark
  • On sweeping, remove only the objects marked as pending
  • The problem is that this requires touching all objects in the heap before the phase in which mutators are paused can end
  • The solution is to interpret the same colors differently during different collection cycles
    • An easy solution is an 'even-odd' allocation scheme: live data switches sign each cycle, allocation creates objects with the new sign, and the sweep frees the old sign.

16.5 On-The-Fly Marking

Prologue

  • An alternative to stopping the world to scan roots is to scan each mutator independently
  • Uses a handshake to ask each mutator to stop at its next safe point
  • Scans the mutator's roots, then sends it on its way
  • Can use a stack barrier to limit the mutator's actions to lower frames, so that the collector can scan the higher roots
    • Makes the handshake and pause much shorter

Write barriers for on-the-fly collection

  • The existence of intermixed white and black mutator threads allows a deletion by a white mutator plus an insertion by a black mutator to leave the black mutator holding the only reference to a white object, violating the strong tricolor invariant

Doligez-Leroy-Gonthier

  • Leveraged ML's immutability to allocate immutable data locally and mutable data in a shared heap
  • Use on-the-fly marking with series of handshakes for phases of collection.
  • Java Implementation
    • No thread-local heaps, single shared heap
    • Used queues between mutators and collectors into which grey objects are inserted, to avoid rescanning threads' roots

Sliding views

  • Uses sliding views: a thread-local buffer records all fields of an object as they were before the mutator first modified it after marking started.
  • All mutators handshake on the phase change, and then disable their buffers.

16.7 Issues to Consider

Usually much more difficult than non-concurrent collection

While generational collection improves expected pause times, can't help with worst-case pause times. Concurrent collectors offer shorter and more predictable pause times.

To truly bound pause times, a collector must be both concurrent and on-the-fly.

Concurrent mark-sweep has same problems with fragmentation and slower free-list allocation that serial mark-sweep has

Mark sweep has easier heap coherency model

Has many possible variations of implementation, each choice with performance impacts

Chapter 17: Concurrent Copying & Compaction

Prologue

Must protect collector and mutator threads from each other

Black mutator tospace invariant: a black mutator must hold only tospace pointers

  • Otherwise it could hold fromspace references we'd never scan

In the absence of a read barrier, a grey mutator cannot find pointers to tospace objects inside fromspace objects (since a copying collector doesn't alter fromspace fields)

  • Called the grey mutator fromspace invariant
  • At the termination of the collection cycle we must have only tospace pointers, so roots must be forwarded somehow.

Mutator updates to fromspace objects must be propagated to tospace.

17.1 Mostly-Concurrent copying: Baker's algorithm

Maintaining a tospace invariant requires only replacing roots with the addresses of their tospace copies

  • Roots are now black, but point to grey objects that still hold fromspace references.

Introduce a black-mutator read barrier to prevent the mutator from acquiring a fromspace pointer

  • On access, atomically copy the target of the pointer and return the forwarded reference.

Mostly-concurrent, mostly-copying

  • Mostly-copying collectors fit well with mostly-concurrent collection; we can pin ambiguously-referenced objects in mutator threads.
  • Used historically with page protection to allow concurrent copying for compilers without stack maps
  • The read barrier only needs to acquire roots atomically when forwarding, so it can avoid the costs of synchronization until the last minute.

17.2 Brooks's Indirection Barrier

Hold a forwarding reference in every object header, pointing to either the object itself or its tospace copy (see the sketch below).

Uses a write barrier to prevent modification of an object while it is being copied.

Since the mutator is now grey, the collector needs to scan its stack for fromspace references and replace them before swapping semispaces.
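
A sketch of the indirection itself (layout illustrative): the header slot initially points at the object, so every access pays one unconditional extra load rather than a conditional test:

    class BrooksObject {
        volatile BrooksObject forward = this;   // self until copied, then the tospace copy
        int field;                              // example payload
    }

    class BrooksBarrier {
        static int read(BrooksObject o) {
            return o.forward.field;             // unconditional indirection on every access
        }

        static void write(BrooksObject o, int v) {
            o.forward.field = v;                // writes likewise go through the header
        }
    }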

17.3 Self-Erasing Read Barriers

Read barriers are more costly than write barriers because reads are more common

Read barriers are conditional: forward the object if it is not already in tospace

Have each closure contain a code pointer, and compile two copies of the code. One does a Baker-style copy while the other directly chases references.

When collecting, set a bit in closures so they execute the Baker code. When not collecting, clear it.

The beauty is that code executed for an object that has already been copied never enters the read barrier.

Downside: duplication of code. 25% larger binaries in GHC.

17.4 Replication Copying

The mutator references the old objects and never pays the instruction/cache/space overhead of Brooks's indirection pointers

When an object is copied, a write barrier logs mutator modifications of the old object and applies them to the new one

Terminates when all of tospace has been scanned, the mutators' roots have been scanned, and the write log is empty

Requires pausing to ensure the termination conditions hold for all threads when termination is signalled.

  • Also needs to stop to rewrite pointers in roots.

17.5 Multi-Version Copying

Both previous approaches must still pause the world

Each processor has its own fromspace and tospace subset.

One tospace, multiple fromspaces. Old, invalid copies accumulate in the fromspaces.

  • Barrier on access, so mutators must search for the new reference and update their root.

The secret is to decouple fromspace reclamation from the phase change, or 'flip'.

Each processor scans its tospace for fromspace references and fixes them.

A processor flips when its tospace is full. A fromspace can't be freed until all references have left it.

  • Upon collecting, a processor checks bits set by other processors to determine whether they have flipped
  • If so, it updates all of the fromspace references it has in its to/from spaces to point to the new tospace
  • When all processors have done so, the fromspace is discarded.
  • This has the effect of ensuring termination: if the mutators outpaced the collector, the mutators would all scan and eventually flip until enough fromspace was released.

17.6 Sapphire

Prologue

  • If from/tospace are divided equally between threads/cores, we may end up with multiple threads unable to allocate until one thread frees space.
    • This is the largest problem on non-parallel systems.
  • The solution is to place collection work asymmetrically on one or more collector threads.
    • Can modify priorities to set the mutator-collector balance
  • New objects are allocated black in a separate newspace that isn't scanned
    • Has a write barrier to track references written into it.

Collector Phases

  • MarkCopy
    • Overview
      • All reads are from fromspace; all writes are duplicated to both copies
        • Buffered changes, eventually consistent
      • At application-level thread synchronization points, all copies are made consistent
  • Flip
    • Sets all roots to point to tospace copies
    • Until this point, mutators can refer to tospace and fromspace copies interchangeably
      • Need a barrier for all pointer-comparison operations (the two copies have different addresses).
  • Volatile Fields (Java, etc.)
    • Require synchronization at every read/write between fromspace and tospace

17.7 Concurrent Compaction

Compressor

  • Scan roots, mark objects, and compute the post-compaction address of every live object from knowledge of the live data size and the object's offset
  • Set mutator roots to point into tospace and unpause
  • Use an access barrier on tospace to copy an object if it is not already there.
  • Concurrently copy fromspace to tospace and update references, but don't copy children; we don't need to in order to know their new references, since we know all future addresses.
  • Extensive use of double mapping and virtual memory hardware, which is expensive, so batching is required

Pauseless

  • More intelligently protects only the objects actually being copied, rather than all tospace pages.
  • Protects them incrementally while copying and updating references (which are known without having to copy children)

17.8 Issues to Consider

Pauses/poor real-time behavior of Baker

  • With a Baker-style access barrier on tospace, any access may take an unknowable amount of time, because it may have to recursively copy the object and the references through it.

Chapter 18: Concurrent Reference Counting

Prologue

Previous reference-counting optimizations (including the elimination of cycles) involved stopping the world for tracing or for batching changes of references.

18.1 Simple Reference Counting Revisited

Difficult to do concurrently, because increments and decrements of counts might interleave badly in the mutator even if the reads and writes themselves are atomic

One solution is to lock each object before altering its fields.

Lockless solution would be nice

  • Primitives on a single memory location are not enough, because another mutator can drop the reference count to zero and free the object between the pointer load and the reference increment. The attempt to increment might then corrupt another object allocated at that location.
  • If we can alter two words atomically, that's sufficient, because we can access the address and the count together.

18.2 Buffered Reference Counting

All reference count changes placed in a buffer that a single thread scans and executes.

  • All increments executed before decrements to prevent premature freeing

Doesn't by itself make the increment/decrement and the pointer load/store atomic.

Can do so by giving each thread a local buffer and an epoch count, using ragged epochs: when a local buffer fills, the thread takes a new one and increments its epoch. When all threads' epochs reach a level, the local buffers of that level and below are processed together (all increments first).

Variant of deferred reference counting.
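
A hedged sketch of the buffering idea (names and the count store are illustrative): mutators only append to thread-local logs, and a single reference-counting thread applies all increments before any decrements, so no count can reach zero prematurely:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RcBuffer {
        record Op(Object target, boolean isIncrement) {}

        final List<Op> log = new ArrayList<>();   // one buffer per mutator thread

        void inc(Object o) { log.add(new Op(o, true)); }
        void dec(Object o) { log.add(new Op(o, false)); }

        // Run by the single reference-counting thread once an epoch's buffers are collected.
        static void process(List<RcBuffer> buffers, Map<Object, Integer> counts) {
            for (RcBuffer b : buffers)            // all increments first...
                for (Op op : b.log)
                    if (op.isIncrement()) counts.merge(op.target(), 1, Integer::sum);
            for (RcBuffer b : buffers)            // ...then decrements; frees occur at zero
                for (Op op : b.log)
                    if (!op.isIncrement()) counts.merge(op.target(), -1, Integer::sum);
        }
    }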

18.3 Concurrent, Cyclic Reference Counting

Trial deletion has problems

  1. Can't guarantee that the subgraph it retraces hasn't changed since last trial deletion.
  2. Pointer deletions may separate the graph
  3. Reference counts may go out of date

The only current good solution is to run the reference counting concurrently, set a mark, and then free the object in the next cycle if it is still garbage.

In theory, may leave garbage around indefinitely

18.4 Taking a Snapshot of the Heap

Refinement of coalesced reference counting

The problem with the previous description is that although we can restart threads once we get their local buffers, if we are going to decrement children's counts before freeing an object, we need to know the state of the object at the time the change happened, not what the mutator has left behind.

Solution:

  • If the object is clean, the old reference is still valid
  • If not, the mutator's log buffer contains the change, so we can find the children by reading the object's state before the reference change from the buffer.

18.5 Sliding views

Don't need to stop all mutator threads.

Only collect states of a subset of objects each time, fix subset's references each time

Applications:

  • Age-based
    • Use sliding reference counting for long-lived data and mark-sweep for nursery

With only a subset of the heap in view, we can use trial deletion on a graph made from local allocation buffers, preventing the graph from changing under its feet.

18.6 Issues to Consider

A large literature exists on ways to safely reclaim memory for insertion/deletion in certain data structures; one might want to specialize these for the language.

Chapter 19: Real-Time Garbage Collection

Prologue

All earlier algorithms are not truly real-time, because they didn't guarantee progress of the mutator

Thread scheduling may starve mutator threads when there are many collector threads

19.1 Real-Time Systems

Impose operational deadline on tasks

May be soft: an example is video. A pause degrades performance

  • Should use goals for maximum collector time, time slice, and failure rate

May be hard: a pause causes failure. Example: an ignition system

  • Performance less important than predictability of performance
  • WCET: Worst case execution time of a task
  • Embedded realtime systems have tight space constraints

Needs a specific collection scheme, different from the previous concurrent GC schemes.

19.2 Scheduling Real-Time Collection

Incremental collectors will impose cost on all mutation

Concurrent collectors as before

To maintain steady-state space usage, the collector needs to free and recycle objects at the allocation rate.

  • Must also avoid fragmentation to provide constant space
  • Moving objects has costs

Alternative methods

  • Work-based: Imposes cost on mutator actions
  • Slack-based: Runs work during slack portions of the collector time-slice. Slack can be significant in periodic real-time systems.
  • Time-based: Schedules a constant time slice of collection while the world is stopped

19.3 Work-Based Real-Time Collection

Classic baker algorithm

  • Fixed-size objects assumed
  • Read barrier
    • Reading copies into tospace. Bounded by object size.
    • Also marks a bounded number of objects
  • Has specific bounds:
    • Space = 2R(1 + 1/K)
    • R = reachable space
    • K = Adjustable time bound, defined as number of fields scanned each allocation

Parallel, concurrent replication

  • Adds concurrent and parallel collector
  • Because work-based, can suffer from variation in distribution of pauses.
    • Difficult to obtain bounded mutation times
      • Solutions exist, 19.5
  • Algorithm
    • On read, copy to tospace. Also copy concurrently.
    • All copies to tospace leave a forwarding address in header of previous object. All reads by mutator in future go through this address
    • Fromspace and tospace divided between threads
    • When collecting, all allocation allocates into both fromspace and tospace, with forward.

Uneven work and its impact on work-based scheduling

  • If we use worst-case execution time, we have the problem that the worst-case is much worse than the expected case
    • An array of things with references is a pathological worst case
  • We can bound worst case by breaking down objects and arrays into fixed-size units
    • Places overhead on access, space
  • Considered a nail in the coffin of work-based scheduling
    • The opinion is that it overly starves the mutator
  • The problem is that even when work is non-copying and fairly constant, allocation storms cause massive variance in GC time
    • The response has been to treat collection work as something that must be scheduled like any other real-time job.
    • This leads to better worst-case to expected-case ratios

19.4 Slack-Based Real-Time Collection

Implementation: Henriksson

  • Forbids collection during high-priority tasks
  • GC work delayed until no high-priority tasks executing
    • Also called the slack-period
  • Low-priority tasks do collector work during allocation
  • During high-priority tasks, record omitted GC work and perform it during slack
  • Threads for collecting high-priority tasks have lower priority than high-priority work but higher priority than low-priority work
  • Incremental copying is done by low-priority work in standard concurrent copying collector way
    • All from-space accesses copy into tospace. All allocation done in tospace.
    • Enforces constraint that no tospace object refers to a fromspace object.
  • High-priority tasks pay less mutation cost
    • Objects are evacuated lazily: allocated in fromspace, but not actually forwarded through the fromspace header
    • High-priority collection threads do the copying during slack time
    • Uses an unconditional read indirection barrier that points through the forwarding address in the fromspace header, avoiding mutator overhead
  • Collector works as coroutine through well-defined yield points.

Scheduling the collector work

  • Let Wmax be the maximum amount of collection work that must be done, starting at the beginning of slack time, so that the next high-priority interval does not run out of memory
  • Let Fmin be the amount of memory that the high-priority interval will need
  • Minimum GC ratio = GCRmin = Wmax / Fmin
  • Current GC ratio = GCR = performed GC work / amount of new objects in fromspace = W / A
    • Allocation increases A, collection increases W
    • Allocation is throttled by yielding to GC until GCR >= GCRmin
    • The ability to arbitrarily enter a high-priority period means that space overhead may be required to ensure that all tospace allocation is satisfiable.
      • The programmer must know the worst-case allocation needed by the high-priority task.
      • Not as bad as it sounds, because high-priority, hard real-time tasks typically use little memory

Execution overheads

  • Since caches alter memory access times, instruction count is not an accurate measure of time
    • Needs to be tested repeatedly to establish predictable cache behavior, or worst-case times must be assumed
    • Collection work inside mutation has synchronization overhead.
    • Forwarding pointers impose overhead on reads.
    • Allocation is simple bump-pointer allocation.

Programmer input

  • Programmer must inform GC about worst-case allocation requirements and period in which it runs.

19.5 Time-Based Real-Time Collection: Metronome

Mutator utilisation

Supporting predictability

Analysis

Prologue

  • Treats minimum mutator utilisation as an input
  • For Java
  • Concurrent incremental mark-sweep, with occasional compaction to avoid fragmentation
  • Uses a deletion write barrier to enforce the weak tricolor invariant
  • Uses Brooks-style forwarding pointers

Mutator utilisation

  • Ensures a minimum mutator utilisation. All other time is at the collector's discretion; it can give it to the mutator or use it.
  • Is a trade-off between space and mutator utilisation.
  • Split time into quanta, and ensure that for a given window the MMU ratio is maintained
  • The scheduler will usually schedule collection until the MMU is about to be violated, and then back off
  • When considered over 'windows' of time between deadlines, this leads to mini-pauses or jitter, where it collects for a while and then backs off.

Supporting Predictability

  • Fragmentation is a threat to predictability. Many considerations:
    • Arraylets: Break arrays and variable-sized objects into fixed sizes to fight fragmentation.
    • Read barrier: To avoid the performance cost of a Brooks-style read barrier, eagerly change references in roots and on the stack to point to tospace when an object is copied.
    • Scheduling the collector: Use an 'alarm thread' to check whether it should switch to a collection quantum; if so, tell mutator threads to reach a safe point and yield. The collector thread then collects.
    • Suspending mutator threads: At a pause, on reaching a safe point, mutator threads store all references from their stack/roots in a specific location. On reload they read them back in, allowing faster changes.
    • Ragged root scanning: Deeply nested stacks are a problem; the collector may not be able to scan them during its quanta. The solution is concurrent scans plus a write barrier into already-scanned stacks to prevent overlooking references.

Analysis

  • Metronome gave formalized model for concurrent, realtime GC
  • Parameters
    • A(t) is the instantaneous allocation rate at time t
    • G(t) is the instantaneous garbage generation rate at time t
    • P is the GC processing rate as a function of live data
    • Memory allocated and garbage created are simple integrals of these rates
    • The mutator quantum and the collector quantum are the times each can run without yielding
    • Only works if the parameters have accurate estimates.
  • Minimum mutator utilisation = (QT * floor(t / (QT + CT)) + x) / t, where QT is the mutator quantum, CT the collector quantum, t the time elapsed, and x the size of the remaining partial mutator quantum
    • Ex: with QT = 10ms, CT = 2ms and t = 60ms, floor(60/12) = 5 full periods and x = 0, so the MMU is (10*5 + 0)/60, roughly 83%.
  • Can be used to analyze work-based and other scheduling approaches

Robustness

  • If the parameters are poorly known, the only graceful way to degrade is to slow the allocation rate
  • One way to slow the allocation rate is to use a generational scheme
    • The nursery filters objects into the primary heap
    • Removes short-lived data, reducing total floating memory
    • Unacceptable in the naive approach, due to the variation in pause time when the nursery overfills during a mutator slice
    • Syncopation is a solution
      • Always copy out of the nursery during the collector's slice
      • Use the nursery survival rate as a parameter to size the nursery so that only one evacuation is needed

19.6 Combining Scheduling Approaches: Tax-and-Spend

Prologue

  • Metronome is best on uniprocessors because of its need to stop all mutation
  • Work-based collectors have an order of magnitude worse latency than time-based ones
  • Slack-based is fragile and will break when there is no slack, i.e. under load
  • Tax-and-spend combines all of them.
  • Performant: cut Metronome's latencies down to a third of what they were
  • Mutators must do an amount of collection work dependent on the desired MMU.
  • Slack time is used by the collector to build up MMU 'credit'.
  • Taxation can be scheduled at any safe point, but is always checked when a thread-local allocation buffer is exhausted.

Tax-and-spend scheduling

  • MMU is superior to maximum pause time because it allows for computation of the clustering of pauses that would cause a missed deadline.
  • MMU can vary by thread to prevent missing deadlines with tax-and-spend
  • Can also use other cores to offload collection work
  • Per-thread scheduling
    • Must track mutator activity to determine tax to be paid
    • Must avoid letting tax get too high, such that a deadline would be missed.
    • By interleaving collection and mutation in same thread, you prevent the OS scheduler from placing some other mutator or an unrelated process on the core after the mutator yields.
    • Allowing MMU to vary between threads important when allocation rates vary
  • Tax-based vs slack-based scheduling
    • Slack-based degrades with no slack
      • Poor for interactive real-time systems with sustained peaks of use
    • Time-based scheduling may introduce jitter, because the collector may run until the MMU is almost violated and then pause, such as at the start of a new window of time under consideration.
      • This means that for time periods smaller than a window, mutator responsiveness is inconsistent
      • Tax-based is more consistently real-time
  • Combining tax-based and slack-based scheduling
    • Tax-and-spend combines all the other scheduling approaches by using an economic model.
    • Mutator threads pay back tax by collecting incrementally during allocation
    • Dedicated collector threads with low priority exist to work during slack
      • They deposit work done as 'credit' in a global pool for mutators to draw from
      • Usually one per processor
      • Must sleep intermittently to allow mutator threads that appear to be scheduled over them
    • Tax over all threads must be high enough to finish collection before memory runs out
    • When tax is due, a thread tries to withdraw from the global credit bank
      • If it can, the collector is keeping up with the mutator, so it doesn't need to pay the tax.
      • Treats slack on a uniprocessor and unexploited parallelism on a multiprocessor the same.

Tax-and-spend prerequisites

  • Needs an underlying collector that's both incremental and concurrent. Should be parallel on multiprocessors.
  • Metronome changes for synchronization without halting mutator threads:
    • Ragged epochs for global consensus: Needs global states (such as a change of fromspace/tospace pointers in the read barrier) to hold for all threads before a phase of collection can change. Uses an epoch number: threads that change state when yielding to collection increment their epoch. Once the lowest epoch among all threads reaches the value at which the phase change started, all mutators can be assumed to have reached that phase.
    • Phase agreement with 'last one out': There is a global phase. When a mutator thread starts to pay tax, it increments a worker count. If no work is being done and the worker count is 0, it atomically changes the phase and decrements the worker count.
    • Per-thread callbacks: All threads are requested to make a change (flush a thread-local buffer, copy roots out of the mutator thread's stack) by a global thread whenever collection reaches a step that demands it. They're given a callback to perform that step. Threads blocking on IO or executing native code are prevented from returning while the action is performed on their behalf.
    • Priority boosting to ensure progress: If higher-priority worker threads saturate the processor, the above steps can be starved. Threads requesting actions from other threads need their priority boosted.

19.7 Controlling Fragmentation

Prologue

  • Fragmentation can cause violation of real-time space bound.
  • Can compact to avoid fragmentation
  • Can also use arraylets and object-lets to get fixed-size allocation blocks, at the cost of some internal fragmentation and higher memory usage
  • Concurrent replicating copying collectors have the problem of keeping both copies of an object consistent. Locking doesn't scale and can cause mutator thread starvation.
  • The Compressor and Pauseless algorithms both rely on traps and page-level synchronization, and so suffer from poor MMU.
    • Work-based, and cause a trap storm at phase changes (interfering with all threads)
  • Lock-freedom is very desirable in a real-time system. Ways to obtain it follow:

Incremental Compaction in Metronome

  • Metronome is non-copying; it compacts while mutator threads are stopped
  • The tax-and-spend version doesn't perform any compaction
  • Bacon et al (2003) use segregated-fits allocation and incrementally defragment size classes in the slow/infrequent path of the allocator (sketched after this list):
    1. Sort pages by the number of unused object slots per page, from dense to sparse
    2. Set the allocation page to the densest non-full page in the resulting list.
    3. Set the page to evacuate from to the sparsest page in the same size class
    4. While the target number of pages to evacuate in the size class has not been met, and the page to evacuate does not equal the allocation page, evacuate objects from the sparsest page into the allocation page, moving to the next allocation page when the current one fills.
  • Moves the minimal number of objects and frees the maximal number of pages
  • Always choosing the densest page can lead to hopping between pages and poor locality; requiring a minimum number of free cells in the allocation page avoids this
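
A rough Java sketch of the page-selection loop above, with an invented Page type and occupancy counters standing in for real object evacuation:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class Defragmenter {
        static class Page {
            int live, capacity;                     // live objects and total slots
            boolean isFull()  { return live == capacity; }
            boolean isEmpty() { return live == 0; }
        }

        // Evacuate the sparsest pages of one size class into the densest
        // non-full pages until the quota of freed pages is met.
        static void defragment(List<Page> sizeClass, int pagesToEvacuate) {
            List<Page> ranked = new ArrayList<>(sizeClass);
            ranked.sort(Comparator.comparingInt(p -> p.capacity - p.live)); // dense first

            int dense = 0, sparse = ranked.size() - 1, freed = 0;
            while (dense < ranked.size() && ranked.get(dense).isFull())
                dense++;                            // densest non-full page

            while (freed < pagesToEvacuate && dense < sparse) {
                Page to = ranked.get(dense), from = ranked.get(sparse);
                while (!to.isFull() && !from.isEmpty()) {
                    from.live--;                    // stand-in for moving one object
                    to.live++;
                }
                if (to.isFull()) dense++;
                if (from.isEmpty()) { sparse--; freed++; }
            }
        }
    }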

Incremental Compaction on Uniprocessors

  • To obtain atomicity on a uniprocessor, one can simply disable scheduler interrupts or allow thread-switching only at GC safe points.
  • If Brooks forwarding pointers are used, access can be optimized (a sketch follows this list).
    • In replication, the forwarding pointer of each copy points to the other copy. If there is only one copy, the pointer points to the object itself.
    • The mutator reads through whichever address it already has; no synchronization is needed.
    • On write, the mutator writes to both the original and the copy. (Kalibera 2009, for Java)
    • The double write is cheap, and it makes reads much faster.
    • There are many more reads than writes.
    • It doesn't hurt the cache, since a read doesn't need both copies' lines in memory.
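
A minimal Java sketch of these replication barriers, assuming an invented one-field object type whose header carries a Brooks-style forwarding pointer:

    // Obj is an invented stand-in for a heap object with one field.
    class Obj {
        Obj forward = this;     // Brooks-style pointer: the replica, or self if none
        long field;

        // Read barrier: use whichever copy the mutator already holds; no
        // forwarding or synchronization is needed.
        static long read(Obj o) { return o.field; }

        // Write barrier: keep both copies consistent by also writing through
        // the forwarding pointer. When forward == this, the second store is
        // redundant but harmless; on a uniprocessor that only switches threads
        // at GC safe points, the pair of stores is atomic w.r.t. the collector.
        static void write(Obj o, long v) {
            o.field = v;
            o.forward.field = v;
        }
    }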

Incremental Compaction on Multiprocessors

  • Stopless: lock-free garbage collection (barriers sketched after this list)
    • Ensures lock-freedom
    • Doesn't double-write
    • At the start of collection, threads race to allocate a "wide" version of the object and to set the redirection pointer to that object.
      • The wide copy has a state for each field that is either 'in original', 'in wide', or 'in copy'. All are set to 'in original' at first.
      • On write, the mutator writes to the wide copy and sets the field's state to 'in wide'.
        • The copying collector does the same when it copies the value held in that field
      • On read, the mutator reads from the original or the wide copy, based on the state.
      • Once all fields are 'in wide', the copying collector makes a 'narrow' version in tospace and installs its address as the forwarding reference
      • It then copies the fields to the narrow copy, setting each wide field's status to 'in copy'
    • Downsides:
      • Doesn't work on double-word fields if double-word writes are not atomic on the architecture
      • Has a space overhead of three copies of the object for a period
        • Space in pages emptied during incremental compaction can be used to hold the wide copies.
  • Staccato: best-effort compaction with mutator wait-freedom
    • Doesn't require atomic instructions in the common case, but uses memory fences at GC safe points to ensure up-to-date references
    • The collector reserves a bit in the forwarding address stating that the object is currently being copied.
    • Algorithm:
      1. Set the copying bit using compare-and-swap. Mutators access the forwarding pointer without atomic operations, so the change takes time to propagate. If a mutator accesses the object, it clears the copying bit atomically with compare-and-swap; if that fails, copying has finished and the mutator should go through the forwarding reference.
      2. Wait for ragged synchronization, in which each mutator passes a read memory fence at a safe point.
      3. Perform a read fence on the object, to ensure that changes made by mutators between setting the bit and the ragged synchronization are seen by the collector thread.
      4. Allocate the copy, and copy the fields
      5. Issue a write fence so mutators will see the copied fields.
      6. Once ragged synchronization happens, make the copy refer to itself through its forwarding address, and clear the copying bit on the original with compare-and-swap, installing the forwarding pointer. If this fails, a mutator must have cleared the bit in the meantime, so the copy is aborted.
    • The memory-fence overhead can be amortized, since many objects are usually evacuated from/to the same page.
    • If the collector tries to incrementally copy objects with high mutator locality, the mutator threads will cause an "abort storm".
      • Unlikely in practice, because only sparsely populated pages are moved, so recently co-allocated objects are unlikely to be moved together.
  • Chicken: best-effort compaction with mutator wait-freedom for x86
    • Relies on the x86 memory model's strong read ordering; it only needs to abort copies on writes
  • Clover: guaranteed compaction with probabilistic mutator lock-freedom (barriers sketched after this list)
    • Each move selects a random value alpha and establishes the invariant that mutators won't write that value into the heap
      • If one tries to, that mutator is blocked.
      • Best to pick a value that is illegal for the program, such as a pointer outside the memory range or a NaN variant not used by the hosted language.
    • Once a field is copied, it is replaced with alpha via compare-and-swap.
    • If a mutator reads an alpha from a field, it redirects the read/write to the second copy.
  • Stopless vs Chicken vs Clover
    • Stopless can't guarantee collector progress, because of potentially unbounded writes to the wide copy's fields.
    • Chicken guarantees progress, but aborts copies
    • Clover may interleave with repeated mutator changes to a field: it must keep re-copying, because each change defeats the subsequent compare-and-swap.
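
A simplified Java sketch of the Stopless read and write barriers above, for a single-field object. This is a sketch, not the published algorithm: where Stopless packs each field's value and status into one word updated by a single wide CAS, an immutable pair behind an AtomicReference stands in for that here, and the invented tryMoveFieldToCopy collapses the collector's two copying steps into one.

    import java.util.concurrent.atomic.AtomicReference;

    class StoplessObject {
        enum Where { IN_ORIGINAL, IN_WIDE, IN_COPY }

        // One field's value and its location, swapped as a unit.
        static final class Slot {
            final Where where; final long value;
            Slot(Where w, long v) { where = w; value = v; }
        }

        volatile long originalField;              // the field in the fromspace original
        volatile long copyField;                  // the field in the tospace narrow copy
        final AtomicReference<Slot> wide =
            new AtomicReference<>(new Slot(Where.IN_ORIGINAL, 0));

        // Mutator write barrier: publish the new value into the wide copy.
        void write(long v) {
            while (true) {
                Slot s = wide.get();
                if (s.where == Where.IN_COPY) { copyField = v; return; }
                if (wide.compareAndSet(s, new Slot(Where.IN_WIDE, v))) return;
            }
        }

        // Mutator read barrier: the status says which copy owns the field.
        long read() {
            Slot s = wide.get();
            switch (s.where) {
                case IN_ORIGINAL: return originalField;
                case IN_WIDE:     return s.value;
                default:          return copyField;
            }
        }

        // Collector: try to move one field into the narrow copy; a racing
        // mutator write makes the CAS fail, and the collector retries later.
        boolean tryMoveFieldToCopy() {
            Slot s = wide.get();
            long v = (s.where == Where.IN_ORIGINAL) ? originalField : s.value;
            copyField = v;
            return wide.compareAndSet(s, new Slot(Where.IN_COPY, v));
        }
    }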
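
And a similar sketch of Clover's barriers. ALPHA here is a fixed stand-in for the random per-move value, and the blocking path for a mutator that genuinely wants to store alpha is omitted:

    import java.util.concurrent.atomic.AtomicLong;

    class CloverObject {
        // The real scheme draws a fresh random alpha per move, chosen so the
        // program will (almost) never store it.
        static final long ALPHA = 0xDEAD_BEEF_0BAD_F00DL;

        final AtomicLong fromField = new AtomicLong(); // field in the fromspace original
        volatile long toField;                         // field in the tospace copy

        // Mutator read barrier: alpha means the field has moved to the copy.
        long read() {
            long v = fromField.get();
            return v == ALPHA ? toField : v;
        }

        // Mutator write barrier. A mutator wanting to store the value ALPHA
        // itself would have to block instead (not shown), which is why the
        // lock-freedom guarantee is only probabilistic.
        void write(long v) {
            while (true) {
                long old = fromField.get();
                if (old == ALPHA) { toField = v; return; }   // field already moved
                if (fromField.compareAndSet(old, v)) return;
            }
        }

        // Collector: copy the field, then seal the original with alpha. A racing
        // mutator write defeats the CAS, so the collector must copy again.
        void collectorCopyField() {
            while (true) {
                long v = fromField.get();
                toField = v;
                if (fromField.compareAndSet(v, ALPHA)) return;
            }
        }
    }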

Fragmented Allocation

  • All compaction techniques have overhead, so a process must trade throughput against fragmentation
  • A solution may be to allocate only in fixed-size granules, with no constraints on locality.
  • Arrays are split into a trie of array fragments
    • The problem is that access time is now variable, not constant
    • Java places no fixed bound on array length, so deadlines on memory reads through a trie can't be proven.
    • Solution: Schism, by Pizlo et al (indexing sketched after this list)
      • Uses a spine of arraylets, borrowed from Metronome.
      • Access is O(1), because field locations are computed from values in a 'sentinel fragment' in the header at the start of the object or array. Access becomes one layer of indirection through a standard reference.
      • Dynamic arrays remain a problem
      • Memory needed is bounded by 1.3104 × the size of the live set.
      • When there is space to allocate all of an array's arraylets sequentially, this can be marked in the sentinel fragment and conventional access techniques used.
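
A simplified Java sketch of arraylet indexing through a spine. The names are invented and FRAGMENT's value is illustrative; Schism's sentinel-fragment header and spine placement are more involved:

    class ArrayletArray {
        static final int FRAGMENT = 32;        // elements per fixed-size fragment

        final long[][] spine;                  // spine: one pointer per fragment
        final int length;

        ArrayletArray(int length) {
            this.length = length;
            int fragments = (length + FRAGMENT - 1) / FRAGMENT;
            spine = new long[fragments][];
            for (int i = 0; i < fragments; i++)
                spine[i] = new long[FRAGMENT]; // fragments need not be contiguous
        }

        // O(1) access: exactly one indirection through the spine, regardless
        // of array length.
        long get(int i)         { return spine[i / FRAGMENT][i % FRAGMENT]; }
        void set(int i, long v) { spine[i / FRAGMENT][i % FRAGMENT] = v; }
    }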

19.8 Issues to Consider

Real-time systems almost always require the developer to collect statistics about the expected application's behaviour.

The difficulty lies in ensuring that you are truly seeing the worst case possible, which requires forcing the slow paths to execute.
