Sunday, September 28, 2014

Chapters 13-19

Chapter 13: Concurrency Preliminaries

13.1 Hardware



Threads

  • Thread states = ready, blocked, or running
  • Threads can move processors when descheduled
    • Can use processor affinity
  • Simultaneous multithreading or hyperthreading (multiple hardware threads sharing the same execution pipeline) is not actual parallel threading, but similarly named
  • Multiprocessor = More than one processor
  • Multicore/CMP/Many-core = many processors on same integrated circuit
  • Multithreaded = software whose threads run concurrently on a multiprocessor
  • Multiprogrammed = software executing multiple threads on a single processor

Interconnect

  • Shared memory differentiates local concurrent software from distributed systems
  • Access mediated by bus or interconnect
    • Bus = single point that all requests pass through
    • Granularity of memory access a tradeoff between bandwidth and number of busses
    • Busses a bottleneck on 8-16+ core systems
  • SMP vs NUMA
    • SMP = All processors have equal access time to memory
    • NUMA = Each processor has a small bank of memory to which it has faster access

Memory

  • Because of concurrent access to memory, hard to describe it as being in a single global state
  • Acceptable because we only ever access one unit of memory at a time, and each such access is consistent

Caches

  • Memory access is slow relative to processors, so we use faster and smaller caches
  • If needed address in cache, called a cache hit
  • If needed address not in cache, called a cache miss
    • On cache miss, we bring memory's values into cache
    • If cache full, we must evict a cache line. Evicted line called victim.
  • If data is written to memory immediately on every write, the cache is called write-through. If modified lines are written back only when flushed, it is called write-back.
  • Cache Organization
    • Cache Location
      • Fully Associative
        • Lines from any set of memory addresses can reside in the cache together at the same time
      • Direct-Mapped
        • Each address has a specific place in cache it must reside
      • k-way Set Associative Caches
        • Each address has k possible cache locations in which it may reside
        • Can evict whichever of the k lines it needs
      • Victim caches
        • On eviction, a line is put in a fully associative victim cache
        • Allows victims to be recalled more easily
    • Cache level relationship
      • Strictly inclusive if a line held by a lower level must also be held by the higher level
      • Strictly exclusive if a line held by a lower level may not be held by the higher level
      • Can be neither, may allow anything

Coherence

  • Multiple levels of cache may get out of sync with one another, holding different values for the same address
  • Common cache coherency protocol: MESI
    | State     | Description                                                                                                      |
    | Modified  | This cache is the only one holding a copy of the line; it has been modified but not written back to main memory. |
    | Exclusive | Only this cache holds the address; the line is not modified.                                                     |
    | Shared    | Other caches may hold the line; all hold the same value.                                                          |
    | Invalid   | The cache does not hold the line.                                                                                 |
    • To satisfy a read, the cache must hold the line in M, E, or S state. How to satisfy a write depends on the state of the line.
    • Point is that there is only one writer for a given line, and no caches hold disagreeing values.
    • Does not scale well to many processors.
  • False sharing = if multiple processors frequently access and update different data in the same cache line, the line will be moved and invalidated between processors frequently. "Ping-ponging"
  • Performance example: Spin locks
    • Poor because threads on different processors will contend for line with lock and invalidate it frequently.
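
A minimal Java sketch of the ping-ponging described above (class and constants are mine; assumes 64-byte cache lines): two threads increment adjacent array slots sharing a cache line, then slots padded a full line apart.

  import java.util.concurrent.atomic.AtomicLongArray;

  public class FalseSharingSketch {
      static final int PADDED_STRIDE = 8;  // 8 longs = 64 bytes, one cache line on most machines

      public static void main(String[] args) throws InterruptedException {
          run(1);              // slots 0 and 1: same cache line, heavy invalidation traffic
          run(PADDED_STRIDE);  // slots 0 and 8: separate lines, no false sharing
      }

      static void run(int stride) throws InterruptedException {
          AtomicLongArray counters = new AtomicLongArray(2 * PADDED_STRIDE);
          Runnable incFirst = () -> { for (int i = 0; i < 50_000_000; i++) counters.incrementAndGet(0); };
          Runnable incSecond = () -> { for (int i = 0; i < 50_000_000; i++) counters.incrementAndGet(stride); };
          long start = System.nanoTime();
          Thread a = new Thread(incFirst), b = new Thread(incSecond);
          a.start(); b.start(); a.join(); b.join();
          System.out.printf("stride %d: %d ms%n", stride, (System.nanoTime() - start) / 1_000_000);
      }
  }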

13.2 Hardware Memory Consistency

Prologue

  • Writes to a given address have a total order, and processors won't see a state change without a write having happened between both accesses.
  • Order of writes to more than one address do not correspond to actual changes in memory or cache
    • Ex: Write buffer that coalesces multiple writes by keeping only the later value rather than writing both earlier and later values. A read will fetch the value from the write buffer if it's in there.
    • Ex: Compilers may reorder instructions
    • Almost universally done for performance gains
  • Non-atomic, relaxed ordering can break program invariants
  • Three access types: read, write, atomic.
    • Atomic = read-modify-write operations such as test-and-set

Fences and happens-before

  • Fences prevent certain types of access from being reordered across a point in the program
  • Alternative is an acquire-release pair, which prevents accesses from moving before the acquire or after the release.
    • Useful for protecting critical sections
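
A sketch of an acquire-release pair using Java's VarHandle API (class and field names are mine): the release store cannot be reordered with the earlier write to data, and the acquire load cannot be reordered with the later read, which is exactly the critical-section protection described above.

  import java.lang.invoke.MethodHandles;
  import java.lang.invoke.VarHandle;

  public class AcquireReleaseSketch {
      static final VarHandle READY;
      static {
          try {
              READY = MethodHandles.lookup()
                  .findStaticVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
          } catch (ReflectiveOperationException e) {
              throw new ExceptionInInitializerError(e);
          }
      }
      static boolean ready;
      static int data;

      static void producer() {
          data = 42;               // plain write, ordered before the release store
          READY.setRelease(true);  // release: publishes data to any acquiring reader
      }

      static void consumer() {
          while (!(boolean) READY.getAcquire()) Thread.onSpinWait();
          assert data == 42;       // acquire guarantees the write to data is visible
      }
  }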

Consistency models

  • Strict consistency => All operations have a total ordering everywhere in system
  • Sequential consistency => All operations appear in a single total order that is consistent with each processor's program order.
  • Weak consistency => All atomic operations are total fences
  • Causal consistency => Enforces that reads not be reordered across writes that may alter the read's value.

13.3 Hardware Primitives

AtomicExchange and Test-and-set are both fairly simple. They set a value unconditionally and return the old value. This can be seen as setting a value conditionally if the semantics are "only proceed if a certain value was returned."
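
A Java sketch of that idiom (my code, not the book's pseudocode): AtomicBoolean.getAndSet is an unconditional exchange, and the lock is taken only when the returned old value was false. The inner plain-read loop is the test-and-test-and-set refinement, which spins on the locally cached line rather than ping-ponging it (see the spin lock discussion in 13.1).

  import java.util.concurrent.atomic.AtomicBoolean;

  class SpinLock {
      private final AtomicBoolean locked = new AtomicBoolean(false);

      void lock() {
          while (true) {
              // spin on a plain read until the lock looks free
              while (locked.get()) Thread.onSpinWait();
              // unconditional exchange; proceed only if the old value was false
              if (!locked.getAndSet(true)) return;
          }
      }

      void unlock() { locked.set(false); }
  }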

Compare and Set

  • Takes an old and new value and if the old value is equal to the value currently in the address, write the new value. Returns whether or not updated.

Compare and Swap = Same as compare and set but returns old value

  • Trap is that the value might have changed multiple times and is now equal to the older value being checked for
  • Ex: A pointer is freed and its memory reused for an int that happens to have the same bit pattern; a compare-and-swap sees equal bits and wrongly proceeds.
  • Called the ABA problem
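
Java's AtomicStampedReference is a standard-library answer to ABA: it pairs the reference with a version stamp, so a compare-and-set fails if the value was changed away and back. A minimal sketch (wrapper class is mine):

  import java.util.concurrent.atomic.AtomicStampedReference;

  class StampedCell<T> {
      private final AtomicStampedReference<T> cell;

      StampedCell(T initial) { cell = new AtomicStampedReference<>(initial, 0); }

      boolean update(T expected, T replacement) {
          int[] stampHolder = new int[1];
          T current = cell.get(stampHolder);       // read value and stamp together
          if (current != expected) return false;
          // Fails if the reference OR the stamp changed in the interim, even if
          // the reference is once again equal to `expected` (the ABA case).
          return cell.compareAndSet(expected, replacement, stampHolder[0], stampHolder[0] + 1);
      }
  }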

Load-Linked / Store-Conditional

  • Use processor's cache coherence mechanism to detect changes at a memory address.
  • Solves the ABA problem very well

Atomic Arithmetic: Atomic increments and decrements

  • Not as powerful as other primitives. Have finite consensus numbers
  • A primitive with a consensus number of N allows N threads to correctly agree on (reach consensus about) a single value

More powerful primitives

  • Multi-word compare-and-swap can use a counter in the second word to avoid the ABA problem
  • More useful to update two non-adjacent words atomically
    • Some chips like 88000 have such primitives

Overheads of atomic primitives

  • The expense of atomic primitives leads programmers to avoid them, and thence to error
  • Expensive for two reasons:
    • Cache Coherence
      • Must acquire exclusive access to the line
    • Memory fences sacrifice the optimizations that reordering allows

13.4 Progress Guarantees

Important to guarantee that the threads contending will make progress, bound wait times

Guarantees:

  • Wait-free: Every thread can always make progress regardless of actions of other threads.
    • Typically involves threads helping each other along
    • If a thread's action would make another thread that already holds resources unable to proceed, the first thread waits until the other leaves its critical section
  • Obstruction-free: If given a period of isolated execution, each thread will complete in a finite number of steps.
    • Easier to achieve than wait-freedom, but may require scheduler cooperation
    • Threads can tell if they are contending and will back off
  • Lock-free: Infinitely often, some thread will finish its operation in a finite number of steps
    • Requires only that at least one contender makes progress at any given time.

Guarantees and Concurrent Collection

  • Parallel collectors use multiple threads to collect, but stop mutation threads to collect.
  • Concurrent collectors perform some actions while mutation threads run
  • Concurrent collection has implementation edge cases
    • When objects are freed or become garbage during collection
    • Proof of termination
      • Mutators may need to be throttled back to avoid making more work than the collector threads can complete
      • Need to prevent mutators from using all memory before collection finishes.
  • May offer different guarantees for different portions.

13.5 Notation Used for Concurrent Algorithms: Not relevant. Used for describing pseudocode.

13.6 Mutual Exclusion

Need to balance progress guarantees and mutual exclusion

  • Peterson's Algorithm a good tool (see the sketch below)
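
A sketch of Peterson's algorithm for two threads in Java (my rendition; the volatile fields supply the sequentially consistent accesses the algorithm requires, per the consistency models in 13.2):

  class Peterson {
      private volatile boolean wants0, wants1;
      private volatile int turn;

      void lock(int me) {  // me is 0 or 1
          if (me == 0) {
              wants0 = true; turn = 1;                        // announce intent, defer to other
              while (wants1 && turn == 1) Thread.onSpinWait();
          } else {
              wants1 = true; turn = 0;
              while (wants0 && turn == 0) Thread.onSpinWait();
          }
      }

      void unlock(int me) { if (me == 0) wants0 = false; else wants1 = false; }
  }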

13.7 Work Sharing and Termination Detection

Parallel algorithms need a way to detect termination of parallel algorithm

Simple if processing a finite amount of data, but sometimes threads create more work to process

  • Can push work to other threads or pull from other threads. Pulling called work-stealing
  • No work may be lost
  • Counters may become bottlenecks, cause cache contention

May use a separate thread whose only job is to detect termination

  • Workers set a busy flag; the detector scans the flags.
  • Needs to track jobs pushed or pulled so that it can atomically scan all flags.

Rendezvous Barriers

  • Common to need all threads to hit same phase of collection.
  • May simply need all threads to reach a given point, called a rendezvous barrier
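
java.util.concurrent's CyclicBarrier is exactly such a rendezvous barrier; a sketch (class and phase names are mine) of GC workers rendezvousing between phases:

  import java.util.concurrent.BrokenBarrierException;
  import java.util.concurrent.CyclicBarrier;

  class PhasedWorker implements Runnable {
      private final CyclicBarrier rendezvous;   // shared by all worker threads
      private final Runnable markPhase, sweepPhase;

      PhasedWorker(CyclicBarrier rendezvous, Runnable mark, Runnable sweep) {
          this.rendezvous = rendezvous; this.markPhase = mark; this.sweepPhase = sweep;
      }

      public void run() {
          try {
              markPhase.run();
              rendezvous.await();   // no worker sweeps until every worker finished marking
              sweepPhase.run();
              rendezvous.await();
          } catch (InterruptedException | BrokenBarrierException e) {
              Thread.currentThread().interrupt();
          }
      }
  }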

13.8 Concurrent Data Structures

If structure accessed rarely, can simply lock it entirely with a mutex, but won't scale

Concurrent structures exist that allow an interleaving of certain actions on the structure.

  • Is linearizable if overlapping actions have the same effect as some sequential ordering of them.
  • Each action has a point at which it can be observed to have occurred. Called the linearization point.
  • Variety of locking patterns that can be used:
    | Name                   | Description |
    | Coarse-grained locking | One large lock for the whole structure |
    | Fine-grained locking   | Lock subsets or individual elements. Locking order must be kept consistent or deadlock will result. |
    | Optimistic locking     | Search the structure without locks, then lock and verify that the structure hasn't changed since searching and that the right elements are locked. |
    | Lazy update            | Use metadata to mark an element as being removed or altered, release the lock, then acquire a lock to make the change. Prevents blocking reading threads. |
    | Non-blocking           | Use atomic primitives to remove the need for locking. |

Concurrent Stacks
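
A minimal Java sketch of Treiber's classic lock-free stack (my rendition, not the book's pseudocode). The successful compare-and-set on the head is each operation's linearization point; note that Java's GC sidesteps the ABA problem a manual-memory version would face when a popped node's address is reused.

  import java.util.concurrent.atomic.AtomicReference;

  class TreiberStack<T> {
      private static final class Node<T> {
          final T value; final Node<T> next;
          Node(T value, Node<T> next) { this.value = value; this.next = next; }
      }

      private final AtomicReference<Node<T>> head = new AtomicReference<>();

      void push(T value) {
          Node<T> oldHead;
          do {
              oldHead = head.get();
          } while (!head.compareAndSet(oldHead, new Node<>(value, oldHead)));
      }

      T pop() {
          Node<T> oldHead;
          do {
              oldHead = head.get();
              if (oldHead == null) return null;   // empty stack
          } while (!head.compareAndSet(oldHead, oldHead.next));
          return oldHead.value;
      }
  }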

13.9 Transactional Memory

Implementation:

  • All reads and writes in a transaction should appear to not interleave with any other threads'.
  • Should allow over many independent words.
  • Attempts to commit the transaction may fail. If succeeds, changes seen. If fails, software may respond by retrying or doing other steps.
    • Allows transactions to execute speculatively.
  • Atomic, Consistent, and Isolated.
  • If two transactions will interfere with each other, one must be killed to allow the other to proceed.
  • Atomicity methods
    • Undoing
      • Creates an undo log, actually makes writes to main memory.
      • On failure, undoes instructions in log.
    • Buffering
      • Create a scratch buffer and store all writes to it. On success, write to memory. On failure, discard.
      • More work for successful commits, and success is common case.
  • Conflict detection
    • Can use eager or lazy methods
    • Eager:
      • Checks all accesses for conflicts
    • Lazy:
      • Checks for conflicts only when a transaction attempts to commit
Use for collection

  • Caveats:
    • STM has a significant overhead
    • Hardware transactional memory may have gotchas, not obvious how code maps to transactions, may have strange cache behavior.
    • May interfere with guarantees because the retry behavior is less understandable than hand-written locking
    • May require very careful performance tuning.

Interaction with languages with transaction memory

  • Memory manager may cause more transactions to fail, cause many retries
  • May have collector and mutator fight for transaction, constantly restart
    • Harder to handle with hardware transactional memory
  • Transaction logs need to be part of the root set, or garbage created during a failed transaction would cause premature freeing
  • Undoing allocation may require a lock on allocation pointers or free-list metadata, undesirable
  • If non-transaction actions allowed during transaction, need to make sure that allocated memory is initialized.

13.10 Issues to Consider

Hard to get right

What guarantees are actually necessary?

What atomic primitives are supported on our architectures?

Chapter 14: Parallel Garbage Collection

14.1 Is There Sufficient Work to Parallelize?

Goal is to reduce overhead by exploiting hardware

In stop-the-world, will shorten pause times

In concurrent schemes, will shorten GC cycle time

Must be worth it in the tradeoff of the overhead of synchronization primitives

In practice, not many stalls because the object graph is not very deep. This is amenable to parallelism.

Harder to parallelize tracing, easier to parallelize "sweep" activities like fixing references.

14.2 Load Balancing

Needs to properly allocate work to threads, or will be mostly sequential execution

Has coordination cost

Methods

  • May schedule all work ahead of time, such as with a sweep that works by dividing by region. Called static balancing
  • Most tasks need dynamic balancing. Use a heuristic to divide labor.
    • Good solution is to over-partition the work into more pieces than threads and allow threads to take them off a queue.
      • More accurately partitions based on real work than heuristics

14.3 Synchronization

Over-dividing work into too-small granules incurs synchronization overhead, despite better processor use

Need to prevent double-copying because of problems like future updates only affecting one version, inconsistent state.

Synchronization may be removed where actions are idempotent and repeating work not too costly.

14.4 Taxonomy

Processor-centric algorithms

  • Gets work granules of different sizes
  • Usually by work-stealing
  • No consideration of locality

Memory-centric algorithms

  • Take locality into account.

14.5 Parallel Marking

Prologue

  • Marking is 3 actions: Getting an object, marking it, and adding children to a work-list
  • All known algorithms are processor-centric
    • No synchronization required if work-list is thread-local
    • Marking can be idempotent (setting a bit)
    • Synchronization required for work-stealing
  • If marks are stored in bitmap, access needs synchronization.
  • May need to move work from thread-local store
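
A sketch of those three marking actions with marks in a shared bitmap (all names are mine; childrenOf stands in for whatever object-graph accessor the collector uses). The compare-and-set on the bitmap word makes concurrent marking of the same object safe and idempotent, while the work-list stays thread-local and synchronization-free:

  import java.util.ArrayDeque;
  import java.util.concurrent.atomic.AtomicLongArray;
  import java.util.function.IntFunction;

  class Marker {
      private final AtomicLongArray markBits;                          // one bit per object slot
      private final ArrayDeque<Integer> workList = new ArrayDeque<>(); // thread-local

      Marker(int heapSlots) { markBits = new AtomicLongArray((heapSlots + 63) / 64); }

      // Returns true only for the thread that wins the race to mark the object.
      boolean setMarked(int slot) {
          int word = slot >>> 6;
          long bit = 1L << (slot & 63);
          while (true) {
              long old = markBits.get(word);
              if ((old & bit) != 0) return false;                      // already marked
              if (markBits.compareAndSet(word, old, old | bit)) return true;
          }
      }

      void mark(int root, IntFunction<int[]> childrenOf) {
          workList.push(root);
          while (!workList.isEmpty()) {
              int obj = workList.pop();                                // 1. get an object
              if (setMarked(obj))                                      // 2. mark it
                  for (int child : childrenOf.apply(obj))
                      workList.push(child);                            // 3. add its children
          }
      }
  }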

Processor-centric Techniques

  • Work stealing
    • If its work-list is empty, a thread tries to steal from other threads' work-lists
    • Needs to lock, but can give up if unable to acquire the lock and move to the next thread's queue
    • Problem with processing large array of pointers. Local work stack will overflow.
    • Solution is to push arrays onto global work queue as smaller lists of pointers
    • Only requires synchronization on taking of work, faster than other algorithms which always use synchronization
    • Termination for work stealing
      • Use of an atomic counter possible, but contention can lead to serialization
      • Thread-local counters allow for introspection without locks
      • Detection involves setting a "no threads stole work" flag, trying to steal work, and on failure checking whether the flag has been unset by a thread acquiring work. If it is still set, terminate.
      • May be worthwhile to have lock for termination detection
  • Grey Packets
    • Divide work into packets and have threads contend for work packets
    • Each thread has an output packet, and steals work to add to input packet
      • Adds children of marked object to output packet
      • Called grey packets because input and output packets are both grey in tricolor abstraction
      • Allows for packet object prefetching.
      • Only require synchronization to get input packet and put output packet
      • Termination state is that the total number of empty packets (outputs with no grey elements) equals the total number of packets.
  • Channel
    • Create a channel (a queue with a single reader and writer) from each thread to each other thread
    • Store channels in circular buffer
    • Push children of marked object onto other thread's queue
      • If no children, take work from global pool and push instead.
    • No synchronization needed, and performs better than work stealing on computers with many processors

14.6 Parallel Copying

Differs from marking in that may only copy once

Processor-Centric Techniques

  • Work dividing
    • Each copying thread given a stack of work to do.
      • Stacks offer easier synchronization
    • Synchronization works with "rooms," divided into pushing rooms and popping rooms
      • Must always have one empty pushing and popping room
      • A new thread moves to the non-empty popping room and pops work out of it, putting generated work onto its stack. Leaves the pop room when done.
      • Once all threads leave the pop room, the pop room is locked to future thread entrance and each thread empties its queue into the push room's shared stack, leaving the push room when done.
      • Last thread to leave push room unlocks pop room.
    • Problem with waiting. Because work time varies between threads, may have significant wait time.
      • Can fix by making threads leave pop room as soon as they get work, since working doesn't depend on shared stack.
    • Terminates when the shared stack is empty as the last thread leaves the push room, and no objects are being worked on outside of the rooms.
  • Copying in parallel
    • If all allocation done in the same region, need to synchronize on allocation.
    • May race to write "busy" pointer into object's forwarding slot.
      • If use thread-local buffers for allocation, can race to write forwarding location directly.
        • May need to un-allocate if another thread has already copied in the interim.
    • If local allocation buffer bounded, NUMA architectures may suffer due to allocation on another processor
      • NUMA awareness should be baked into the allocator

Memory-Centric Techniques

  • Per-thread fromspace and to-space
    • Risks poor load-balancing and space overflow
  • Block-structured heaps
    • Over-partition fromspace and tospace and have threads compete for granules of fromspace
    • Small blocks a good idea
    • A good way to fight wasted space due to fragmentation in each subset of tospace is to use big-bag-of-pages.
    • Naive implementation uses breadth-first copying.
      • Separates parents from children, poor locality
  • Channels
    • Removes need for atomic synchronization primitives
    • Designed for NUMA
    • Over-partition heap, consider each subset a work-list
    • If a thread sees a reference in its own partition, adds it to its work list
    • Otherwise sends to other thread through channel
  • Card Tables
    • Difficult to balance load with card tables
    • Straightforward partition has problems because cards tend to have uneven amount of live data, work
    • Solution is to overpartition

14.7 Parallel Sweeping

Embarrassingly parallel

Either partition heap statically or over-partition

  • Adding garbage to free list becomes bottleneck
  • Segregated fits a good solution, also thread-local allocation

Giving each thread multiple blocks a good way to reduce contention

14.8 Parallel Compaction

Easy to parallelize since after marking we know the future location of each live object (since we know the amount of live data seen before it during marking).

Multiple sweeps required:

  • 1. Calculate forwarding addresses and insert into headers
  • 2. Update all references in objects
  • 3. Copy objects

Can do each in parallel, but need all threads to reach a rendezvous before the next phase starts.
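
A sketch of phase 1 for a single thread's region (the tables and names are mine, standing in for the mark data a real collector keeps): a live object's forwarding address is just a running prefix sum of live-object sizes added to the region base.

  class ForwardingCalc {
      // live[i] and size[i] describe object slot i; returns each live slot's new address.
      static long[] computeForwarding(boolean[] live, int[] size, long regionBase) {
          long[] forward = new long[live.length];
          long next = regionBase;                // next free address in the target region
          for (int i = 0; i < live.length; i++) {
              if (live[i]) {
                  forward[i] = next;             // destination = base + live bytes before it
                  next += size[i];
              }
          }
          return forward;
      }
  }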

Chapter 15: Concurrent Garbage Collection

15.1 Correctness Of Concurrent Collection

Must be able to retain all reachable objects, and must eventually complete collection cycle

Tricolor Abstraction

  • Easy to reason about correctness with tricolor abstraction
  • Collection can be seen as turning the pool of grey (reached but unscanned) objects into black (scanned, live) ones, leaving unreached objects white (dead).
  • Concurrent mutation can make mutator and collector's views of world inconsistent
  • If mutator changes white or grey objects it is fine because collector still needs to scan
  • If mutator changes black(already scanned) objects, is alright as long as new references already known as reachable.

Can still cause problems by freeing live objects

Lost Object problem

  • Two failure scenarios
    • A mutator can hide a white object directly reachable from a grey object by inserting a reference to it into a field of an already-scanned object and deleting the link from the grey object
    • Can hide a transitively reachable object by inserting a pointer from the chain into an already-scanned object, and then breaking the pointer chain in unscanned objects.
  • Only possible to lose objects if two conditions hold:
    • Mutator stores a pointer to white object in black object
    • All paths from any grey objects to the white object are destroyed
  • Strong and weak tricolor invariants:
    • Weak tricolor invariant: All white objects pointed to by a black object are grey protected (reachable from some grey object).
      • To prevent both from holding, we must make sure that all references from black objects to white objects are tracked
      • Non-copying collectors ensure this because of traversal pattern, all references from black objects are black
    • Strong tricolor invariant: There are no pointers from black objects to white objects.
    • Strong implies weak, but not the converse
    • Ensuring strong is sufficient for copying collector
  • Precision
    • Varying precision of live object tracking mean holding onto some garbage for a duration
    • Stop the world has maximal precision, concurrent schemes typically trade precision for concurrency
  • Mutator color
    • Useful to consider mutator thread as if it were an object, and roots like references.
    • If a mutator is allowed to become grey, its roots need to be rescanned
    • May be multiple mutator threads, need to handle
  • Allocation color
    • Need to consider which color to color new allocations
    • Must not let mutator's references to allocated cell break tricolor invariant.
  • Incremental update solutions
    • Need a write barrier to inform collector of mutator actions which would violate the invariants
  • Snapshot-at-the-beginning solutions
    • If a reference to an object removed during collection, need to conservatively treat as root
    • Requires write barrier

15.2 Barrier Techniques for Concurrent Collection

Collector can respond to concurrent mutation in only three useful ways

  1. Add to the wavefront (the grey black/white divide) by shading an object grey if it was white (essentially adding a root)
  2. Advance wavefront by scanning object to make black
  3. Retreat the wavefront by marking black object grey ('unscanning')
  • Any other action (reverting an object to white, or shading it black without scanning) would break the invariants.

Grey Mutator Barrier Techniques

  • Assumes a grey mutator
  • Use an insertion barrier to prevent storing white pointers in black objects
  • Because grey, no read barrier needed
  • Techniques:
    • On writing a pointer into a black object, make object grey to require re-traversal. (Steele)
    • Can mark any object with a pointer written into it grey, since doing so is idempotent if already grey and harmless if white. Removes the need to check the state of the object being modified. (Boehm)
    • Always shade the target of an inserted pointer, regardless of whether the reference is later removed. Ensures progress of concurrent threads. (Dijkstra; see the sketch below)
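
A sketch of the Dijkstra-style insertion barrier (all types and helpers here are hypothetical stand-ins, not the book's code): shading the target of every inserted pointer means the mutator can never install a black-to-white pointer, preserving the strong invariant.

  class DijkstraBarrier {
      static final int WHITE = 0, GREY = 1, BLACK = 2;

      static class Obj { volatile int colour = WHITE; Obj[] fields = new Obj[4]; }

      static volatile boolean marking;   // true while a collection cycle is running

      static void shade(Obj o) {
          if (o.colour == WHITE) o.colour = GREY;   // plus: enqueue o on the collector's work list
      }

      static void writePointer(Obj src, int i, Obj target) {
          if (marking && target != null) shade(target);  // shade the new target unconditionally
          src.fields[i] = target;                        // the actual pointer store
      }
  }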

Black Mutator Techniques

  • Can use a read barrier to mark any white objects grey as soon as their reference is read, written, or deleted.
  • Can use virtual memory hardware to implement access barrier of black fromspace.
  • Can do the converse and shade the targets of deleted references grey, to prevent removal of the last observed pointer from freeing live objects.

Completeness of mutator techniques

  • No other organization schemes can preserve the tricolor invariant under concurrent collection (Pirinen 1998)
  • Can make variations by coarsening granularity and promptness of these techniques.
    • Ex: Mark targets of altered references grey and re-scan
    • May cause difficulties in ensuring termination
  • Strong invariant techniques need only to ensure that one of the two conditions cannot hold, don't need to do any more
  • Weak invariant / snapshot techniques need black mutator

Concurrent write barrier mechanisms

  • Prologue
    • Mutator barriers need to store references to grey objects in a structure, which collector needs to use to color relevant objects
    • Structures need to be safe for both mutator-collector concurrency and mutator-mutator concurrency
    • Can use a log to record grey objects, also card tables.
    • Card tables
      • Dirties a bit to record the grey color of an object in a range of addresses
      • What counts as grey (targets of interesting pointers, as well as simply transitive references through the stack's roots) varies with collection technique
      • Card table becomes collector's work list
        • Mutator can re-dirty cards so collector needs to scan repeatedly
      • One level card tables
        • Card is in one of three states: dirty, refining, or clean
        • Mutator barriers set to dirty using store instruction, no expensive atomic primitives
        • Collector sets to refining without atomic instruction, starts cleaning
        • Collector tries to set it to clean with an atomic primitive (a compare-and-set that checks it is still refining). If unable to write, a mutator must have set it to dirty while the collector was refining, and it must be re-scanned.
      • Two level card tables
        • Overhead of card table is in scanning cost, since most will be clean.
        • Cheaper to also keep a card table which records the location of dirtied cards at a much coarser (2^n) granularity.
        • Concurrency issue: Mutator dirties fine grained and then coarse-grained cards while collector reads coarse-grained and then fine-grained. May consider all clean in race. Preserving atomicity may need primitives with overhead in mutator.
    • Reducing Work
      • Don't need to recursively mark children in mutator if we know that during collection we will have to trace through children on the dirty card.
        • Avoids traversing graph twice
      • Reduce number of dirty cards by only marking dirty if untraced.
        • Can keep a card table of traced objects
        • Can use local allocation buffer to record allocations in a buffer to trace. When the buffer overflows, trace all and mark all as clean.
          • Leaves some garbage for next cycle, but faster and ensures progress of threads.
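
A sketch of the one-level dirty/refining/clean protocol described above (card size and types are assumptions): the mutator barrier is a cheap unconditional store, and only the collector's final refining-to-clean transition needs an atomic instruction.

  import java.util.concurrent.atomic.AtomicIntegerArray;

  class CardTable {
      static final int LOG_CARD_SIZE = 9;      // 512-byte cards (an assumption)
      static final int CLEAN = 0, REFINING = 1, DIRTY = 2;

      final AtomicIntegerArray cards;

      CardTable(long heapBytes) { cards = new AtomicIntegerArray((int) (heapBytes >>> LOG_CARD_SIZE)); }

      // Mutator write barrier: a plain (ordered, non-atomic) store, no expensive primitives.
      void dirty(long address) { cards.lazySet((int) (address >>> LOG_CARD_SIZE), DIRTY); }

      // Collector: returns false if a mutator re-dirtied the card while we were
      // refining, in which case the card must be scanned again.
      boolean clean(int card, Runnable scanCard) {
          cards.set(card, REFINING);           // no atomic instruction needed here
          scanCard.run();                      // trace the objects on this card
          return cards.compareAndSet(card, REFINING, CLEAN);
      }
  }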

Issues to consider

In some systems, mutator barriers harm throughput less than the shorter collection times help

  • Ex: A multi-phase payment system where a GC pause means a timeout, which requires reprocessing.

Costs vary by collection scheme.

  • Ex: Reference counting = high, while mark-sweep is low

Large impact on pause time is whether precision must be high, or if we can allow garbage to "float" from one sweep into the next, where it will be freed in the next sweep.

Very useful for large heaps where the amount of live data is very high; even with a parallel collector, pause times would be long.

Downsides:

  • Needs more headroom because allocating during collection.

Variants:

  • Generational partially-concurrent hybrid collector
    • Manage older generation frequently
    • Since most objects die young, nursery collection is quick enough
    • Avoids mutator barrier to younger objects because stop-the-world
    • Biggest performance loss in concurrent collectors is that they always mark objects allocated during collection black, though most objects die young
      • Hybrid is solution

Chapter 16: Concurrent Mark-sweep

Prologue

Mark-sweep very easy to parallelize

Also fairly performant because don't require read barrier.

  • Write barrier is 2-3% overhead in benchmarks. Read barrier is 20% overhead.

16.1 Initialization

Rather than running at exhaustion, collectors can run concurrently with allocation

When to trigger new mark phase critical

  • Must not run out of memory before collection finished

Simple implementation is to require all allocation to do a bit of collection work (hijacking the mutator thread)

Needs to stop the world to get mutator roots usually

  • Can use grey reference access barrier to avoid stopping all mutators, incrementally add work as mutators yield to collection work, but at least one mutator needs to be scanned to start work list.

16.2 Termination

If all mutator threads black, can terminate since no grey pointers held, so anything not scanned is garbage.

Grey mutator termination requires scanning mutator roots to ensure that no new references introduced to unmarked objects.

16.3 Allocation

New objects colored according to mutator

Since a black mutator could not have obtained a white/grey object reference, we allocate objects black.

Grey mutator has options:

  • Can simply color black during marking and add to roots, and white otherwise
  • If objects are initialized with all their fields at creation (immutable languages), we can color based on the references: if none are white, can allocate black; otherwise must allocate white.
  • Allocating white may lead to high allocation rate in mutator preventing prompt termination

16.4 Concurrent Marking and Sweeping

Need way to differentiate between newly allocated data and marked garbage.

Solution = new color for abstraction: purple

  • After marking, mark unmarked objects as pending freedom
  • Allocate new objects without this mark
  • On sweeping, free only objects marked as pending
  • Problem is that this requires touching all objects in the heap before the paused-mutator phase can end
  • Solution is to interpret the same colors differently during different collection cycles
    • Easy solution is an 'even-odd' scheme: the mark sense flips each cycle. Live data carries the current sense, allocation creates objects with the current sense, and sweep frees objects with the old sense.

16.5 On-The-Fly Marking

Prologue

  • Alternative to stopping world to scan roots is to scan each mutator independently
  • Uses handshake to ask each mutator to stop at next safe point
  • Scans roots of mutator, and then sends on way
  • Can use stack barrier to limit mutator's actions to lower frames, so that collector can scan higher roots
    • Makes handshake and pause much shorter

Write barriers for on-the-fly collection

  • The existence of intermixed white and black mutator threads allows a deletion by a white mutator plus an insertion by a black mutator to leave the black mutator holding the only reference to a white object, violating the strong tricolor invariant

Doligez-Leroy-Gonthier

  • Leveraged ML immutability to allocate immutable data locally and mutable data in shared heap
  • Use on-the-fly marking with series of handshakes for phases of collection.
  • Java Implementation
    • No thread-local heaps, single shared heap
    • Used queues between mutators and collectors into which grey objects are inserted, to avoid rescanning threads' roots

Sliding views

  • Takes sliding views: a thread-local buffer records all fields of an object before the mutator first modifies it after marking has started.
  • All mutators handshake on phase change, and then disable buffers.

16.7 Issues to Consider

Much more difficult than non-concurrent collection usually

While generational collection improves expected pause times, can't help with worst-case pause times. Concurrent collectors offer shorter and more predictable pause times.

To truly bound pause times, must be both concurrent and on-the-fly.

Concurrent mark-sweep has same problems with fragmentation and slower free-list allocation that serial mark-sweep has

Mark sweep has easier heap coherency model

Has many possible variations of implementation, each choice with performance impacts

Chapter 17: Concurrent Copying & Compaction

Prologue

Must protect collector and mutator threads from each other

Black mutator tospace invariant: Black mutator must hold only tospace pointers

  • Otherwise we'd never scan them

In the absence of a read barrier, a grey mutator cannot find pointers to tospace objects in fromspace objects (since the copying collector doesn't alter fromspace fields)

  • Called grey mutator fromspace invariant
  • At collection cycle termination, we must only have tospace pointers, so roots must be forwarded somehow.

Mutator updates to fromspace objects must be propagated to tospace.

17.1 Mostly-Concurrent copying: Baker's algorithm

Maintaining a tospace invariant requires only replacing roots with addresses of tospace objects

  • Roots now black, but point to grey objects with fromspace references.

Introduce a black mutator read barrier to prevent a mutator accessing a fromspace pointer

  • On access, atomically copy target of pointer and return forward reference.

Mostly concurrent, mostly-copying

  • Mostly-copying collectors fit well with mostly-concurrent collection; we can pin objects referenced by ambiguous roots in mutator threads.
  • Used historically with page protection to allow concurrent copying for compilers without stack maps
  • The read barrier only needs to be atomic when actually forwarding, so synchronization costs can be avoided until the last minute.

17.2 Brooks's Indirection Barrier

Hold forwarding reference in every object header. Points to either self or tospace copy.

Uses write barrier to prevent modification of object when copying.

Since mutator now grey, needs to scan stack for fromspace references and replace before swapping semispaces.

17.3 Self-Erasing Read Barriers

Read barriers more costly than write because more common

Read barriers conditional, forward object if not in tospace

Have each closure contain a code pointer, and compile two copies of the code: one does a Baker-style copy while the other directly chases references.

When collecting, set bit in closures to execute baker code. When not, don't.

Beauty is that if the object has already been copied, the executed code never enters the read barrier.

Downside: Duplication of code. 25% larger binaries in GHC.

17.4 Replication Copying

Mutator references old objects, never pays the instruction/cache/space overhead of Brooks's indirection pointers

When copied, write barrier logs modifications of the old object, modifies new

When all of tospace scanned, mutator's roots scanned, and write log empty, terminates

Requires pausing to ensure termination conditions hold for all threads when termination signalled.

  • Also needs to stop to rewrite pointers in roots.

17.5 Multi-Version Copying

Both previous versions must still pause world

Processors have own fromspace and tospace subset.

One tospace, multiple fromspaces. Accumulate old, invalid copies in fromspaces.

  • Barrier on access, so mutators must search for new reference and update root.

Secret is to decouple fromspace reclamation from the phase change, or 'flip'.

Each processor scans tospace for fromspace references and fixes.

A processor flips when its tospace is full. Can't free a fromspace until all references have left it.

  • Upon collecting, a processor checks bits set by other processors to determine if they've flipped
  • If so, it updates all of the fromspace references in its to/from spaces to point to the new tospace
  • When all processors have done so, fromspace discarded.
  • This has the effect of ensuring termination because if the mutator outpaced the collector, the mutators would all scan and eventually flip until enough fromspace was released.

17.6 Sapphire

Prologue

  • If from/tospace divided equally between threads/cores, may end up with multiple threads unable to allocate until one thread frees.
    • Largest problem on non-parallel systems.
  • Solution is to place collection work asymmetrically on one or more collector threads.
    • Can modify priority to set mutator-collector balance
  • New objects allocated as black in a separate newspace that isn't scanned
    • Has write barrier to track references written into it.

Collector Phases

  • MarkCopy
    • Overview
      • All reads are from fromspace, all writes are duplicated
        • Buffered changes, eventually consistent
      • At application-level thread synchronization points, all made consistent
  • Flip
    • Sets all roots to point to tospace copies
    • Until this point, all mutators can refer to tospace/fromspace interchangeably
      • Need barrier for all pointer comparison operations.
  • Volatile Fields (Java, etc.)
    • Require synchronization at every read/write between fromspace and tospace

17.7 Concurrent Compaction

Compressor

  • Scan roots, mark objects, and compute new compaction address of all live objects from knowledge of live data size and offset of object
  • Set mutator roots to fromspace and unpause
  • Use an access barrier on tospace to copy an object if it is not already there.
  • Concurrently copy fromspace to tospace, and update references, but don't copy children. Don't need to in order to know their new reference, since we know all future addresses.
  • Extensive use of double mapping and virtual memory hardware, which is expensive so batching required

Pauseless

  • More intelligently only protects objects actually being copied, rather than all tospace pages.
  • Protect them incrementally while copying and updating references(known without having to copy children)

17.8 Issues to Consider

Pauses/Poor real-time behavior of baker

  • With a Baker-style access barrier on tospace, any access may take an unknowable amount of time, because it may have to recursively copy an object and the references through it.

Chapter 18: Concurrent Reference Counting

Prologue

Previous reference counting optimizations (including elimination of cycles) involved stopping the world for tracing or batching changes of references.

18.1 Simple Reference Counting Revisited

Difficult to do concurrently because increment/decrement of references might interleave in the mutator even if reads/writes atomic

One solution to lock each object before altering fields.

Lockless solution would be nice

  • Primitives for single memory location not enough because another mutator can drop reference to zero and free between pointer load and reference increment. Attempt to increment might corrupt another object allocated in that location.
  • If we can atomically alter two words, that's sufficient, because we can operate on the address and the count together.

18.2 Buffered Reference Counting

All reference count changes placed in a buffer that a single thread scans and executes.

  • All increments executed before decrements to prevent premature freeing

Doesn't fix problem of making increment/decrement and pointer load/store atomic.

Can do so by giving each thread a local buffer and an epoch count and using ragged epochs. When a local buffer is full, get a new one and increment the epoch. When all threads' epochs reach a level, the local buffers of that level and below are processed together (all increments first).

Variant of deferred reference counting.

18.3 Concurrent, Cyclic Reference Counting

Trial deletion has problems

  1. Can't guarantee that the subgraph it retraces hasn't changed since last trial deletion.
  2. Pointer deletions may separate the graph
  3. Reference counts may go out of date

Only current good solution is to run the reference counting concurrently, set a mark, and then free the object next cycle if it's still garbage.

In theory, may leave garbage around indefinitely

18.4 Taking a Snapshot of the Heap

Refinement of coalesced reference counting

Problem with the previous description: while we can restart threads once we get their local buffers, if we're going to decrement children's counts before freeing an object, we need the state of the object at the time the change happened, not what the mutator has since left in it.

Solution:

  • If the object is clean, the old reference is still valid
  • If not, the mutator's log buffer holds the change, so we can find the children by reading the object's state before the reference change from the buffer.

18.5 Sliding views

Don't need to stop all mutator threads.

Only collect states of a subset of objects each time, fix subset's references each time

Applications:

  • Age-based
    • Use sliding reference counting for long-lived data and mark-sweep for nursery

With only a subset of the heap in view, we can use trial deletion on a graph made from local allocation buffers, preventing the graph from changing under its feet.

18.6 Issues to Consider

A large literature exists on ways to safely reclaim memory for insertion/deletion with certain data structures; might want to specialize for the language.

Chapter 19: Real-Time Garbage Collection

Prologue

All earlier algorithms not truly real-time because they didn't guarantee progress of mutator

Thread scheduling may starve mutator threads when there are many collector threads

19.1 Real-Time Systems

Impose operational deadline on tasks

May be soft: example is video. Pause degrades performance

  • Should use goal of maximum collector time, time slice, and failure rate

May be hard: Pause causes failure. Example: ignition system

  • Performance less important than predictability of performance
  • WCET: Worst case execution time of a task
  • Embedded realtime systems have tight space constraints

Needs a specific collection scheme, different from the previous concurrent GC schemes.

19.2 Scheduling Real-Time Collection

Incremental collectors will impose cost on all mutation

Concurrent collectors as before

To maintain steady-state space usage, needs to free and recycle objects at the rate of allocation.

  • Must avoid fragmentation to provide constant space as well
  • Moving has costs

Alternative methods

| Name        | Description |
| Work-based  | Imposes collection cost on mutator actions |
| Slack-based | Runs collection work during slack portions of the real-time schedule; slack can be significant in periodic real-time systems. |
| Time-based  | Schedules a constant time slice of collection while the world is stopped |

19.3 Work-Based Real-Time Collection

Classic baker algorithm

  • Fixed-size objects assumed
  • Read barrier
    • Reading copies into tospace. Bounded by object size.
    • Also marks a bounded number of objects
  • Has specific bounds:
    • Space = 2R(1 + 1/K)
    • R = reachable space
    • K = Adjustable time bound, defined as number of fields scanned each allocation
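
Illustrative arithmetic (numbers are mine, not Baker's): with R = 64 MB of reachable data and K = 4 fields scanned per allocation, the bound is 2 · 64 MB · (1 + 1/4) = 160 MB. Raising K shrinks the required heap but makes each allocation do more scanning work.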

Parallel, concurrent replication

  • Adds concurrent and parallel collector
  • Because work-based, can suffer from variation in distribution of pauses.
    • Difficult to obtain bounded mutation times
      • Solutions exist; see 19.5
  • Algorithm
    • On read, copy to tospace. Also copy concurrently.
    • All copies to tospace leave a forwarding address in the header of the old object. All future mutator reads go through this address
    • Fromspace and tospace divided between threads
    • When collecting, all allocation allocates into both fromspace and tospace, with a forwarding pointer.

Uneven work and its impact on work-based scheduling

  • If we use worst-case execution time, we have the problem that the worst-case is much worse than the expected case
    • An array of references is a pathological worst case
  • We can bound worst case by breaking down objects and arrays into fixed-size units
    • Places overhead on access, space
  • Considered a nail in coffin of work-based scheduling
    • Opinion is that it overly starves the mutator
  • Problem is that even when non-copying and fairly constant, allocation storms cause massive variance in time of GC
    • Response has been to treat mutator work as something that must be scheduled like any other real-time job.
    • Leads to better worst-case and expected-case ratios

19.4 Slack-Based Real-Time Collection

Implementation: Henriksson

  • Forbids collection during high-priority tasks
  • GC work delayed until no high-priority tasks executing
    • Also called the slack-period
  • Low-priority tasks do collector work during allocation
  • During high-priority tasks, record omitted GC work and perform it during slack
  • Threads for collecting high-priority tasks have lower priority than high-priority work but higher priority than low-priority work
  • Incremental copying is done by low-priority work in standard concurrent copying collector way
    • All from-space accesses copy into tospace. All allocation done in tospace.
    • Enforces constraint that no tospace object refers to a fromspace object.
  • High priority tasks do less mutation cost
    • Objects evacuated lazily: allocate in fromspace but don't actually forward through the fromspace header
    • High-priority collection threads during slack time do copying
    • Uses unconditional read indirection barrier and points to fromspace header through forwarding address, avoids mutator overhead
  • Collector works as coroutine through well-defined yield points.

Scheduling the collector work

  • Let Wmax be the maximum amount of work that must be done starting at the start of slack time to allow the next high-priority interval to not run out of memory
  • Let Fmin be the amount of memory that the high-priority interval will need
  • Minimum GC ratio = GCRmin = Wmax / Fmin
  • Current GC ratio = GCR = Performed GC work / Amount of new objects in fromspace = W / A
    • Allocation increases A, collection increases W
    • The allocator throttles by yielding to GC work until GCR >= GCRmin
    • Ability to arbitrarily enter a high-priority period means that space overhead may be required to ensure that all tospace allocation is satisfiable.
      • Programmer must know worst-case allocation needed by high-priority task.
      • Not as bad as it sounds because high-performance, hard real-time tasks typically use little memory
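
Illustrative arithmetic (numbers are mine): if the worst-case high-priority burst needs Fmin = 1 MB of fresh memory and Wmax = 4 ms of GC work must be finished before it starts, then GCRmin = 4 ms/MB. A low-priority task that has allocated A = 0.5 MB must then yield to collection until W >= GCRmin · A = 2 ms of GC work has been performed.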

Execution overheads

  • Since caches alter memory access time, instruction count not accurate measure of time
    • Needs to be tested repeatedly to know predictable cache behavior, or assume worst-case time
    • Collection in mutation has synchronization overhead.
    • Forwarding pointer overhead on reads.
    • Allocation is simple bump-a-pointer allocation.

Programmer input

  • Programmer must inform GC about worst-case allocation requirements and period in which it runs.

19.5 Time-Based Real-Time Collection: Metronome

Prologue

  • Treats minimum mutator utilisation as an input
  • For Java
  • Concurrent incremental mark-sweep, with occasional compaction to avoid fragmentation
  • Uses deletion write barrier to enforce weak tricolor abstraction
  • Uses Brooks-style forwarding pointers

Mutator utilisation

  • Ensures a minimum mutator utilisation. All other time is at the collector's discretion; it can give it to the mutator or use it.
  • Is a trade-off between space and mutator utilisation.
  • Split time into quanta, ensure that for a given window the MMU ratio is maintained
  • Scheduler will usually schedule collection until MMU about to be violated and then back off
  • When considered over 'windows' of time between deadlines, leads to mini-pauses or jitter, where it collects for a while and then backs off.

Supporting Predictability

  • Fragmentation a threat to predictability. Many considerations
    | Name                       | Description |
    | Arraylets                  | Break arrays and variable-sized objects into fixed sizes to fight fragmentation. |
    | Read barrier               | To avoid the performance cost of a Brooks-style read barrier, eagerly change references in roots and on the stack to point to tospace when an object is copied. |
    | Scheduling collector       | Use an 'alarm thread' to check whether it should switch to a collection quantum; if so, tell mutator threads to reach a safe point and yield. The collector thread then collects. |
    | Suspending mutator threads | On reaching a safe point at a pause, mutator threads store all references on the stack/in roots in a specific location. On reload they read them back in, to allow faster changes. |
    | Ragged root scanning       | Deeply nested stacks are a problem: they may not allow the collector to finish scanning during its quanta. Solution is concurrent scanning plus a write barrier into already-scanned stacks to prevent overlooking references. |

Analysis

  • Metronome gave formalized model for concurrent, realtime GC
  • Parameters
    • A(t) is instantaneous allocation rate at time t
    • G(t) is instantaneous garbage generation rate at time t
    • P is GC processing rate as function of live data
    • Memory allocated and garbage created over an interval are simple integrals of these rates
    • Mutator quantum and collector quantum are the times each can run without yielding
    • Only works if parameters have accurate estimate.
  • Minimum mutator utilization over an elapsed time Δt, with mutator quantum m and collector quantum c:
    MMU(Δt) = (m · ⌊Δt / (m + c)⌋ + x) / Δt
    • x = size of the remaining partial mutator quantum in the window, x = max(0, Δt − (m + c) · ⌊Δt / (m + c)⌋ − c)
  • Can be used to analyze work-based and other scheduling approaches
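
Worked example (numbers are mine): with mutator quantum m = 10 ms, collector quantum c = 5 ms, and a window of Δt = 40 ms, the window holds ⌊40 / 15⌋ = 2 whole quantum pairs, and the worst-case tail contributes x = max(0, 40 − 2·15 − 5) = 5 ms of mutator time, giving MMU = (2·10 + 5) / 40 = 62.5%.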

Robustness

  • If parameters are poorly known, only graceful way to degrade is to slow down allocation rate
  • One way to slow down the allocation rate is to use a generational scheme
    • Nursery filters into primary heap
    • Removes short-lived data, reduces total floating memory
    • Unacceptable in naive approach due to variation in pause time when nursery overfills during mutator slice
    • Syncopation a solution
      • Always copy out of nursery during collector slice
      • Use the nursery survival rate as a parameter to size the nursery so that only one evacuation is needed

19.6 Combining Scheduling Approaches: Tax-and-Spend

Prologue

  • Metronome is best on uniprocessors because of the need to stop all mutation
  • Work-based collectors have order of magnitude worse latency than time-based
  • Slack-based is fragile; it breaks when there is no slack, i.e. under load
  • Tax-and-spend combines all of them.
  • Performant; cut Metronome latencies down to a third of their previous values
  • Mutators must do an amount of collection work dependent on desired MMU.
  • Slack time used by collector to give MMU 'credit'.
  • Taxation can be scheduled during any safe point, but always checks when thread-local allocation buffer exhausted.

Tax-and-spend scheduling

  • MMU is superior to maximum pause time because it allows for computation of the clustering of pauses that would cause a missed deadline.
  • MMU can vary by thread to prevent missing deadlines with tax-and-spend
  • Can also use other cores to offload collection work
  • Per-thread scheduling
    • Must track mutator activity to determine tax to be paid
    • Must avoid letting tax get too high, such that a deadline would be missed.
    • By interleaving collection and mutation in same thread, you prevent the OS scheduler from placing some other mutator or an unrelated process on the core after the mutator yields.
    • Allowing MMU to vary between threads important when allocation rates vary
  • Tax-based vs slack-based scheduling
    • Slack-based degrades with no slack
      • Poor for interactive real-time systems with sustained peaks of use
    • Time-based scheduling may introduce jitter, because it may collect until the MMU is almost violated, such as at the start of a new window of time under consideration.
      • Means that for time periods smaller than a window, mutator responsiveness is inconsistent
      • Tax-based more consistently realtime
  • Combining tax-based and slack-based scheduling
    • Tax-and-spend combines all the other scheduling approaches by using an economic model.
    • Mutator threads pay back tax by collecting incrementally during allocation
    • Dedicated mutator threads with low priority exist to work during slack
      • Deposit work done as a 'credit' in a global pool for mutators to take from
      • One per processor usually
      • Must sleep intermittently to allow newly runnable mutator threads to be scheduled over them
    • Tax over all threads must be high enough to finish collection before memory runs out
    • When tax is due, a thread tries to withdraw from the global credit bank
      • If can, then collector is keeping up with mutator so doesn't need to pay tax.
      • Treats slack on uniprocessor and unexploited parallelism on a multiprocessor the same.

Tax-and-spend prerequisites

  • Needs an underlying collector that's both incremental and concurrent. Should be parallel on multiprocessors.
  • Metronome changes for synchronization without halting mutator threads:
    NameDescription
    Raggeed epochs for global consensusNeeds global states to hold (such as change of fromspace/tospace pointers in read barrier) for all threads for phases of collection to change. Uses an epoch number. Threads that change state during yielding to collection increment their epoch. Once the lowest epoch among any threads is value of phase change started, can assume all mutators have hit that phase.
    Phase agreement with 'last one out'Global phase. When mutator thread starts to pay tax, increments worker count. If no work being done, and if worker count = 0, then it atomically changes phase and decrements worker count.
    Per-thread callbacksAll threads are requested to make change(flush thread-local buffer, copy out roots from mutator thread stack) by global thread whenever collection being done and hits step that demands it. They're given a callback to perform to do that step. If blocking on IO or executing native code, prevented from returning and action performed on their behalf.
    Priority boosting to ensure progressIf higher-priority work threads saturate processor, above steps can be starved. Need to boost threads requesting actions from other threads.

19.7 Controlling Fragmentation

Prologue

  • Fragmentation can cause violation of real-time space bound.
  • Can compact to avoid fragmentation
  • Can also use arraylets and object-lets to get fixed-size allocation blocks, at the cost of some internal fragmentation and higher memory usage
  • Concurrent replicating copying collectors have problem of keeping both copies of object consistent. Locking doesn't scale, and can cause mutator thread starvation.
  • Compressor and Pauseless algorithms both rely on traps and page-level synchronization, and so suffer from poor MMU.
    • Work-based, and causes a trap storm at phase changes (interferes with all threads)
  • Lock-freedom very desirable in a realtime system. Ways to obtain it follow:

Incremental Compaction in Metronome

  • Metronome non-copying, compacts while threads are stopped
  • Tax-and-spend version doesn't perform any compaction
  • Bacon (2003) uses segregated-fits and incrementally defragments size classes in the slow/infrequent path of the allocator
    1. Sort pages by number of unused objects per page from dense to sparse
    2. Set the allocation page to the densest non-full page in resulting list.
    3. Set the page to evacuate to the sparsest page in same size class
    4. While target number of pages to evacuate in size class not met, and page to evacuate does not equal page in which to allocate, evacuate from sparsest to next allocation page, moving to next page when allocation page filled.
  • Moves minimal number of objects and frees maximal number of pages
  • Always choosing the densest page can lead to hopping between pages and poor locality; can require a minimum number of free cells in the allocation page

Incremental Compaction on Uniprocessors

  • To obtain atomicity on a uniprocessor can simply disable scheduler interrupts or only allow thread-switching during a gc safe-point.
  • If brooks forwarding pointers used, can optimize access.
    • In replication, forwarding pointer of each copy points to other copy. If only one, pointer points to self.
    • Mutator uses whichever address it has, no synchronization needed.
    • On write, write to both the original and the copy. (Kalibera 2009, Java)
    • Double write is cheap, and makes reads much faster.
    • Many more reads than writes.
    • Doesn't harm the cache by needing both lines in memory for each read.

Incremental Compaction on Multiprocessors

  • Stopless: lock-free garbage collection
    • Ensures lock-freedom
    • Doesn't double write
    • On start of collection, threads race to allocate a "wide" version of the object and set the redirection pointer to it.
      • The wide copy has a state for each field: 'in original', 'in wide copy', or 'in copy'. All are set to 'in original' at first.
      • On write, write to the wide copy and set the field's state to 'in wide copy'.
        • Also done by the copying collector when it copies the value of that field
      • On read, read from the original or the wide copy based on the state.
      • Once all fields are 'in wide copy', the copying collector makes a 'narrow' version in tospace and installs its address as the forwarding reference
      • It then copies the fields to the narrow copy, setting each wide field's status to 'in copy'
    • Downsides:
      • Doesn't work on double-word fields if double-word writes not atomic on architecture
      • Has space overhead of 3 copies of object for a period
        • Can use space filled by copies of objects emptied from pages during incremental compaction to hold copy.
  • Staccato: best-effort compaction with mutator-wait freedom
    • Doesn't require atomic instructions, but can use memory fence at GC safe points to ensure up-to-date references
    • Collector reserves bit in forwarding address to state that object is currently being copied.
    • Algorithm:
      1. Set the copying bit using compare-and-set. Mutators access the forwarding pointer without atomic operations, so this change takes time to propagate. If a mutator writes to the object, it unsets the copying bit atomically with compare-and-set; if that fails, copying has finished and the mutator must go through the new reference.
      2. Wait for ragged synchronization at a read memory fence at a safe point.
      3. Perform a read fence of object to ensure that changes made by mutators between setting bit and ragged synchronization are seen by collection thread.
      4. Allocate copy, and copy fields
      5. Set read fence so mutators see field changes.
      6. Once ragged synchronization happens, make the copy refer to itself through its forwarding address, and unset the copying bit using compare-and-swap. If this fails, a mutator accessed the old object and unset the bit, so abort the copy.
    • Can amortize memory fence overhead since many objects usually evacuated from/to same page.
    • If collector tries to incrementally copy objects with high mutator locality, the mutator threads will cause an "abort storm".
      • Unlikely because only sparsely populated pages moved so unlikely to move objects all allocated together recently.
  • Chicken: best-effort compaction with mutator-wait freedom for x86
    • Relies on the x86 memory model's strong read ordering; only needs to abort copies on writes
  • Clover: Guaranteed compaction with probabilistic mutator lock-freedom
    • Each move selects a random value alpha and establishes the invariant that the mutator never writes that value into the heap
      • If it tries to, block the mutator.
      • Best to choose a value illegal for the program, such as a pointer outside the memory range or a NaN variant not used by the hosted language.
    • Once a field is copied, it is replaced with alpha via compare-and-swap.
    • If a mutator reads an alpha, it redirects the read/write to the second copy.
  • Stopless vs Chicken vs Clover
    • Stopless can't guarantee progress to collector because of potentially infinite writes to wide copy's fields.
    • Chicken guarantees progress, but aborts copies
    • Clover may be unable to finish if a mutator repeatedly changes a field: each change invalidates the pending compare-and-swap, so the collector must keep re-copying.
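
A much-simplified sketch of Stopless's per-field states using C11 atomics. Names are hypothetical, and the real protocol updates value and status together with a wide CAS and routes accesses to the narrow copy once a field is 'in copy'; this only illustrates the state machine:

    #include <stdatomic.h>
    #include <stdint.h>

    enum { IN_ORIGINAL, IN_WIDE, IN_COPY };

    typedef struct {
        _Atomic intptr_t value;    /* the field's value in the wide copy */
        _Atomic int      status;   /* which copy currently holds the field */
    } WideField;

    /* Mutator read: the status word says which copy to consult. */
    intptr_t stopless_read(intptr_t *orig_slot, WideField *wide) {
        if (atomic_load(&wide->status) == IN_ORIGINAL)
            return *orig_slot;
        return atomic_load(&wide->value);   /* 'in copy' case omitted */
    }

    /* Mutator write once a wide copy exists: publish into the wide copy,
     * then advance the field's status past 'in original'. */
    void stopless_write(WideField *wide, intptr_t v) {
        atomic_store(&wide->value, v);
        int expected = IN_ORIGINAL;
        atomic_compare_exchange_strong(&wide->status, &expected, IN_WIDE);
    }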

Fragmented Allocation

  • All compaction techniques have overhead, so process must trade throughput for fragmentation
  • Solution may be to only allocate in fixed-size granules with no constraints on locality.
  • Arrays split into a trie of array fragments
    • Problem is that access time is variable, not constant now
    • Java allows arrays with no maximum length, so can't prove deadline on memory reads.
    • Solution: Schism by Pizlo
      • Uses a spine of arraylets, borrowed from Metronome.
      • Access is O(1): field locations are computed from values in a 'sentinel fragment' in the header at the start of the object or array, adding one layer of indirection. The sentinel serves as the standard reference. (See the indexing sketch below.)
      • Problem of dynamic arrays
      • Memory needed bounded by 1.3104 * size of live set.
      • When there is space for sequential allocation of all arraylets, can mark that in the sentinel fragment and use conventional access techniques.
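
A sketch of constant-time arraylet indexing in C, assuming fixed-size fragments reached through a spine of fragment pointers (sizes and names hypothetical, not Schism's actual layout):

    #include <stdint.h>

    #define FRAGMENT_BYTES 128
    #define ELEMS_PER_FRAG (FRAGMENT_BYTES / sizeof(int32_t))

    typedef struct {
        int32_t   length;
        int32_t **spine;   /* pointers to fixed-size fragments */
    } ArrayletArray;

    /* O(1): one divide/modulo plus one extra indirection per access. */
    static int32_t arraylet_get(const ArrayletArray *a, int32_t i) {
        return a->spine[i / ELEMS_PER_FRAG][i % ELEMS_PER_FRAG];
    }

    static void arraylet_set(ArrayletArray *a, int32_t i, int32_t v) {
        a->spine[i / ELEMS_PER_FRAG][i % ELEMS_PER_FRAG] = v;
    }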

19.8 Issues to Consider

Real-time systems almost always require statistics about the expected application to be collected by the developer.

The difficulty lies in ensuring that measurements actually exercise the true worst case by forcing slow paths to execute.


Chapters 9-12

Chapter 9: Generational Garbage Collection

Prologue

Tracing collectors have best case when heap is mostly garbage, but long-lived objects tend to exhibit pathologically bad behavior for copying and mark-sweep

  • Ex: Mark-sweep accumulates a dense prefix of long-lived data near the bottom of the heap; copying moves that same data again and again unnecessarily

Solution: Segment the heap into generations and collect the youngest generation (the nursery) most frequently. Tenure into older generations after surviving a number of sweeps.

  • Most of the time use copying
  • Has good mark/cons ratio, so should be faster for young generation
  • Can tune pause times / frequency of pauses by altering nursery size

Only improves average-case time. Worst case still needs to sweep whole heap.

Downsides:

  • Garbage sticks around in older generation longer
  • Cross-generational pointers need bookkeeping, will likely impose mutator overhead
  • Hard to tune perfectly

9.1 Example

Minor collection = Collection of only the nursery

Need to decide how/when to promote objects

  • Collector needs a way to measure the age of objects to decide if they've survived enough minor collections to be tenured
  • If a collector only sweeps one generation at a time, must record inter-generational pointers
    • Inter-generational pointers are created both by mutator writes in the program and by promoting objects between generations. Both must be tracked.
      • Requires a mutator write barrier and a copy barrier; both must check for creation of inter-generational pointers (see the barrier sketch below).
    • Inter-generational pointers can add false roots to collections. Until the older generation is collected, garbage in the old generation can keep young objects alive, and can even lead to promotion. Called "nepotism".
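
A minimal write-barrier sketch in C: remember the slot only when the store creates an old-to-young pointer. in_nursery and remember_slot are hypothetical helpers standing in for an address-range test and a remembered-set insert:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Object Object;

    extern bool in_nursery(const void *p);      /* address-range test */
    extern void remember_slot(Object **slot);   /* remembered-set insert */

    /* Mutator write barrier: young-to-old stores need no bookkeeping. */
    void write_ref(Object **slot, Object *value) {
        if (value != NULL && in_nursery(value) && !in_nursery(slot))
            remember_slot(slot);
        *slot = value;   /* the actual pointer store */
    }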

9.2 Measuring Time

Two choices: Bytes allocated or seconds elapsed

Seconds Elapsed

  • Considers the distribution of pause time
  • Good for ensuring requirements such as maximum gc pause time

Bytes Allocated

  • For reasoning about lifetimes, bytes allocated over an object's life give more information than wall-clock time
  • Largely machine/environment independent
  • Tricky in multithreaded systems

Reality

  • Most of the time, the number of collections survived is used as a proxy for bytes allocated

9.3 Generational Hypotheses

Weak Generational Hypothesis: Most objects die very young

  • Appears valid across many languages/domains
  • Common for most data to die within one cycle, with the remaining minority of objects being effectively immortal.

Strong Generational Hypothesis: Younger objects are more likely to die than older ones, even among objects that are not in the youngest generation

  • Not decidedly correct
  • Evidence to contrary:
    • Tend to see many objects die in clumps because programs operate in clumps
    • Immortal objects can be created at arbitrary locations in program flow

9.4 Generations and the Heap Layout

Can treat generation management in many different ways

  • Ex:
    • Fixed size of nursery vs ratio of heap
    • Homogeneous generations or generations segmented by age
    • Shared large-object subspace or per-generational one

Must decide sizes of generations

  • Benefits of small collection space:
    • Fast collection times because small generation
  • Downsides to small generations:
    • More frequent collections
    • Less time for objects to die; will have to copy objects between semispaces (or even to the older generation) and waste time.
    • Older generation will fill too fast because of a higher promotion rate
      • Larger older generation negates many benefits of generational collection.
      • Larger older generation also causes nepotism because objects promoted too soon likely to both become garbage and refer to young objects
      • Younger objects are modified more frequently. Tenured objects require a mutator overhead on writes for inter-generational references. By promoting too early, we will spend a lot more time in this barrier.
      • Will "dilute working set" by messing up assumptions about behaviors of spaces.

9.5 Multiple Generations

Adding more generations is one solution to the tension between short nursery pause times and infrequent major collections.

Problem: If the collector promoted all live objects during minor collection, then it promoted objects made right before collection

  • Objects made right before collection haven't had time to become garbage, which they likely will
  • Solution: Multiple Generations. Can sweep the intermediate generation with frequency between nursery and oldest generation. Will free these freshly made objects much sooner than if they were tenured.
    • Helps prevent nepotism
    • Downsides:
      • Most systems collect younger generations whenever an older one is collected, so only old-to-young pointers (which are less frequent) need a write barrier
      • Ensures that pause time will be between nursery and major collection. Longer than nursery, but more frequent than oldest.
      • Increasing number of generations increases number of inter-generational pointers which increases the time spent in the mutator's barrier
      • Increases the size of root set, and may have pointers from garbage into nursery's root set keeping large garbage structure alive.
  • Most collectors only have 2 generations
    • Clever control of promotion rates used instead of multiple generations

9.6 Age Recording

En Masse Promotion

  • Easiest strategy is to have each generation but oldest have a single semispace and copy live objects "en masse" to the next generation on collection.
    • Can lead to a 50% to 100% higher promotion rate than if we require that an object survive at least one collection before promotion from the nursery
      • Increasing survival requirements for copying above 2 has diminishing returns.

Aging Semispaces

  • Create a generation partitioned into semispaces and copy objects back and forth between semispaces a number of times before promoting semispace as a whole.
  • Either all are copied to another internal semispace or all are copied to next generation, depending on generation's age as a whole.
    • Allows oldest objects in the generation time to die, but the youngest in the generation may still be prematurely promoted. Can avoid with bookkeeping.
  • Bucket Brigade
    • Idea is to create n buckets per generation and to copy from younger bucket to progressively older buckets. The oldest bucket is copied into the next generation.
    • Important: All buckets in a generation collected at same time, inter-bucket pointers not an issue.
      • Use of steps in youngest generation can solve problem of mutator overhead with intermediate generations

Survivor Spaces and Flexibility

  • En Masse / Aging Semispaces both wasteful because need to reserve twice as much space as data for copying.
  • Solution is to use an eden + 2 survivor spaces structure
    • Youngest generation is one large creation space (eden).
    • On collection, live data in eden evacuated to younger survivor tospace.
    • All objects in younger survivor fromspace evacuated to either older survivor tospace or to eden depending on age
    • Works well because eden is very large (consistent with the weak generational hypothesis). Ex: HotSpot uses a 32:1 ratio between eden and the survivor semispaces; a pure semispace scheme would need twice this space as copy reserve. (Promotion logic sketched below.)
  • Can combine with aging semispaces by creating buckets in the survivor spaces. Can use to keep age bookkeeping easy.
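
A sketch of the per-object promotion decision in C, assuming an age counter in each header and a fixed tenuring threshold (all names hypothetical):

    typedef struct Obj { int age; /* ... */ } Obj;

    enum { TENURE_THRESHOLD = 4 };

    extern Obj *copy_to_survivor(Obj *o);   /* evacuate to survivor tospace */
    extern Obj *copy_to_old(Obj *o);        /* promote to old generation */

    /* Called for each live object found in eden or survivor fromspace. */
    Obj *evacuate(Obj *o) {
        if (++o->age >= TENURE_THRESHOLD)
            return copy_to_old(o);       /* old enough: tenure it */
        return copy_to_survivor(o);      /* give it more time to die */
    }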

9.7 Adapting to Program Behavior

Prologue

  • If program does not behave according to weak generational hypothesis, a generational garbage collector's performance will suffer
  • Might be managed with a way to control tenuring.
  • Desirable to alter behavior of generations (ex: nursery size) to match behavior of mutator, to avoid premature tenuring and to avoid holding on to excessive garbage.
    • Consider the behavior of allocating frequently and then freeing in clumps.

Appel-style Garbage Collection

  • Three spaces: reserve, nursery, and old generation. On collection, all live data is moved to the old generation and the rest of the space is split between nursery and reserve (sketched below). If free space falls below a threshold, full collection.
  • Reserve size is dynamically set this way
  • Promotes all inter-generational garbage into older generation, which is then freed at next major collection.
  • Can also work in block-structured heap or by using shrinking nurseries with mark-sweep/other noncopying algorithm of the older generation.
  • Pathological worst case: rapid allocation with a very low promotion rate leads to a shrinking nursery and heap thrashing with minor collections, while a major collection never occurs because of the low promotion rate.
    • Solution: Major collection whenever young heap falls below certain size.
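
A sketch of the space recomputation in C, assuming one contiguous heap with the old generation growing up from the bottom and old_top marking the top of the survivors (names hypothetical):

    #include <stddef.h>

    extern char *heap_start, *heap_end;   /* the whole generational heap */
    extern char *old_top;                 /* survivors live below here   */
    extern void major_collection(void);

    /* After a minor collection, split the remaining space evenly between
     * a new nursery and the copy reserve; force a major collection when
     * free space drops below a threshold to avoid thrashing. */
    void recompute_spaces(char **nursery, size_t *nursery_size) {
        if ((size_t)(heap_end - old_top) < (size_t)(heap_end - heap_start) / 8)
            major_collection();
        size_t free_bytes = (size_t)(heap_end - old_top);
        *nursery      = old_top;
        *nursery_size = free_bytes / 2;   /* the other half is the reserve */
    }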

Feedback Controlled Promotion

  • Mostly related to pause time reduction
  • Demographic feedback-mediated tenuring
    • Fights pause time caused by collective death of many promoted objects
    • Volume of objects promoted used as predictor for length of next collection. Sets rate of promotion.
      • If the volume of survivors exceeds a threshold, the survivor count is looked up in a table giving the age threshold for promotion at the next collection.
    • Can't move old objects into nursery.
      • Threatening-boundary solutions to this exist.
        • Move threshold between two generations, can move in either direction.
        • Must track more pointers because can't know at allocation time which generation things may lie in because of boundary motion.
  • Ergonomics from HotSpot
    • Focuses on three soft goals. In order of priority:
      1. Maximum pause time
      2. Throughput
      3. Memory footprint
    • Pause time addressed by shrinking generations, one with longest observed collection time first
    • Throughput addressed by increasing generations in proportion to collection time
    • Vengerov gives the ThruMax algorithm for reasoning about HotSpot's tuning information.
      • Focuses on ratio of free space in old collection after major collection to amount of space used during promotion
      • ThruMax invoked after steady-state of survivor spaces reached.
  • Ergonomics has many tunable options for end-user, most of which are interdependent.

9.8 Inter-Generational Pointers

Prologue

  • Roots for each generation contain pointers in registers, stacks, globals AND all references into the generation from spaces not being collected.
    • Includes older generations and spaces outside of the generational heap(large object, immortal, code)
  • Can be created three ways:
    1. Object creation
    2. Mutator updates to pointer slots
    3. Promotion
  • Call these pointers "interesting pointers" because they must be kept track of.
  • Problem of "boot-image" objects, ones alive when system starts. Must keep track of references in objects, but costly to scan. Can use tracing, scanning, or remembered sets.

Remembered Sets

  • Holds the location that a potentially interesting pointer could be stored.
  • Stores the source, not the destination.
    • Allows a copying collector to update the reference
    • Goal is to track as few pointers as possible
    • Needs write barrier to update reference by mutator and not collector

Pointer Direction

  • All stores do not need to be recorded
    • Ex: Some languages store activation record on the heap. Can be parsed to determine current pointers in objects.
  • If young generation always collected whenever the older is, don't need to store pointers from young to old.
  • Most mutation is initialization, initialization must refer to older objects.
    • Ex: Purely functional languages will only refer to older objects, except for thunk replacement.
    • Ex: ML requires mutable variables be annotated. They're the only source of old-to-young pointers.
    • Exception is OO languages like Java.
      • Many of these languages are written in an initialization-heavy way.
      • Programs vary heavily. Many of them will reassign a field very frequently. Can change policies for remembered set implementation.
  • In systems with multiple, independently collected heaps, may want to store mutation info to guess at which spaces hold the most garbage. In this case, need write barriers in both directions.

9.9 Space Management

Minor GC is short, fast. Major GC slow because also sweeps younger generations.

Major GC may use copying for older generation, but bad choice because the space requirements for copying will make nursery allocation thrash smaller heap more often. Also moves long-lived data.

Choice of mark-sweep can be good for older generation

  • Allocation is slower for free-list, but cost deferred to collector since mutator allocates out of nursery.
  • Fragmentation can be controlled by infrequent mark-compact run
    • Can ignore "dense prefix" of very long-lived data.

Copying collection usually used for nursery

  • Cost of tospace overhead not bad in average-case because few objects survive first collection

Space usage can be improved by switching to compacting collector "on the fly" during a collection

9.10 Older-first Garbage Collection

Other possibilities for age-based collection exist

  • Youngest-only (generational): Only collect the youngest objects in the heap
  • Oldest-only: Only collect the oldest objects, on the premise that they are likely to die. Probably ineffective due to the weak generational hypothesis
  • Older-first: Focuses on middle-aged objects. Gives them time to die without wasting time on "sediment"
  • Renewal older-first: Considers the age of an object to be the time since it was last collected or created. Mixes up object order in the heap. Mutator overhead.
  • Deferred older-first: Considers a sliding window of the heap. Moves survivors to one side of the window and marches down toward the younger side, starting over at the oldest side. Mutator overhead.

9.11 Beltway

Beltway collector exploits five generational insights:

  1. Most objects die young
  2. Avoid repeatedly collecting old objects
  3. Exploit incrementality (bound size of spaces, use small nurseries, etc.)
  4. Small nurseries with sequential allocation are good for spatial locality
  5. Objects need sufficient time to die(avoid pretenuring)

Can be configured to behave like any other copying collector

Beltway breaks collection into units called "increments"

Increments are ordered into queues called belts

Increments selected FIFO for collection

  • Usually selects oldest increment on youngest belt
  • Configuration specifies whether survivors move into an increment on the same belt or are promoted to a higher belt

Not inherently generational. Collects increment by increment, doesn't treat belt like generation and collect all of each belt.

Ex:

  • Simple semispace: One belt, two increments
  • Generational: One belt per generation
    • Nursery increment on youngest belt may be of fixed size or allowed to grow (Appel-style)
    • Mutator must track references from older to younger increments and from older to younger belts

Because it is so configurable, its attributes depend heavily on configuration

9.12 Analytic Support For Generational Collection

Generational Collectors handle young objects well but old objects poorly.

Problems:

  • Collection of garbage in older generations not prompt
  • Long-lived objects must be copied between generations/spaces until they reach oldest generations

Solutions:

  • Can improve with collection of metrics to determine what to pretenure.
    • Machine Learning possibly used here
  • Can analyze source to determine which objects likely to be created/changed/die together and keep objects in same space
    • Good improvements seen. Doesn't need to be perfect.
  • Functional languages can know that immutable objects can only point to older objects.

9.13 Issues to Consider

If weak generational hypothesis does not hold for a program, generational GC will perform poorly.

Only improves expected pause times, not worst-case pause times because must do major collections eventually

Offers many tunable parameters which may be confusing/interrelated.

  • Number of generations. Most often two, plus an immortal generation. Intermediate generations can reduce premature promotion, but there are better ways to do that.
  • Promotion rate depends on nursery size. May set fixed size or vary at runtime to meet throughput goals.
  • Promotion can be controlled by setting the number of collection cycles an object must survive before promotion.
    • En masse = 1 cycle.
    • Alternatives are aging semispaces and eden/survivor spaces

Mutator cost dependent on organization

  • Higher promotion rates can place a higher burden on the mutator
  • If we demand a collection of younger generations whenever older generations are collected, only need to track old-to-young pointers.

9.14 Abstract Generational Garbage Collection

In the same way that we can view tracing as deferred reference counting, generational collection is a form of deferred reference counting that keeps a remembered set and typically tries to collect the nursery before defaulting to a full collection

Chapter 10: Other Partition Schemes

10.1 Large Object Spaces

Can use actual size or size relative to allocation blocks

Large objects are expensive to allocate and likely to cause fragmentation (Ch. 8 criteria)

Usually don't move because of associated cost, but eventually might need to

Common techniques include

  • Mark-sweep, and freelists
  • Allocating directly into old generation
  • Making a header that is treated as a regular object and points to non-moving data body, which is freed when header is.

Treadmill Collector

  • Makes circular, doubly linked list of objects with black, white, grey, and freed segments. Uses tricolour abstraction.
  • Allocates until free-buffer region of doubly linked list is empty
  • Collects like a copying collector, but "copies" by relinking pointers rather than moving data (see the sketch below).
  • Allocation and copying are cheap, constant-time.
  • Pointer policy (which end of the grey segment objects are taken from) determines breadth- or depth-first traversal.
  • Overhead of the list pointers, but removes the need for a separate tospace.
  • Problem of allocation of different object sizes
    • For large objects, not a problem since the actual data is kept on a separate page.
    • May store the pointers on the page, or separately. Good argument for separately, for cache reasons.
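
A sketch of the treadmill's pointer-only "copy" in C: a reachable white object is unlinked from the white segment and relinked at the grey end, with no data motion (list discipline simplified; names hypothetical):

    typedef struct Node {
        struct Node *prev, *next;
        /* payload, or a pointer to off-page data for large objects */
    } Node;

    static void unlink_node(Node *n) {
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }

    static void insert_after(Node *pos, Node *n) {
        n->next = pos->next;
        n->prev = pos;
        pos->next->prev = n;
        pos->next = n;
    }

    /* "Copying" a reachable white object is two constant-time list ops. */
    void treadmill_copy(Node *white_obj, Node *grey_end) {
        unlink_node(white_obj);
        insert_after(grey_end, white_obj);
    }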

Moving objects with OS support

  • Can simply change virtual address that page is mapped to
  • Also able to lazily initialize data pages, not load until referenced.

10.2 Topological Collectors

Mature Object Space GC / Train Collector

  • Collection of the oldest generation has unbounded pause time (a function of live data in the heap)
  • Solution is to manage this oldest generation("MOS" or mature object space) as a set of areas of memory and collect/evacuate one region each collection
  • Each area is a "car", cars are stacked in a "train" queue.
    • Important that this moves cyclical garbage into the same car, so can free all at once.
    • Algorithm:
      1. Select lowest-numbered car from lowest-numbered train. Call it fromCar.
      2. If no remember set to train, and no roots to train, collect entire train.
      3. Otherwise if referenced by root, copy to higher-numbered car toCar in higher-numbered train, making one if needed.
      4. Copy everything in fromCar that is reachable from toCar to toCar, creating cars for overflow on the same train.
      5. Move promoted objects to the train holding references to them.
      6. Move every object in the remembered set reachable from another train to that train.
      7. Copy any objects reachable from other objects in current train to the end of the current train.
    • Benefits:
      • Incremental
      • Bounds amount of copying to size of car
      • Moves objects close to other objects which refer to them
      • Only need to track higher-to-lower train pointers
    • Downsides:
      • Garbage cycles which span multiple cars may end up in situations where they are never collected.
      • Does not bound remembered sets, which may be huge for popular objects.

Connectivity-Based Collection

  • Lifetimes of objects correlated with connections.
  • Ex:
    • Stack-only references are freed quickly
    • Globals tend to be immortal or near-immortal
    • Objects connected by chains of pointers die together
  • Idea: Create partitions based on connectivity. Chain partitions in DAG. Everything in partition can only be referred to by other things in partitions or by partition's predecessor in DAG.
  • Does not require write-barriers or remembered sets
  • Did not perform as well as hoped in practice. Evidence suggests better information about connectivity could improve performance.

Thread-local Collection

  • If objects local to thread, can perform stop-the-world on single thread to collect this "heaplet" of data.
    • Must still synchronize in case of shared, mutable data.
  • Usually partitions into heaplets and single, shared space.
  • Objects in heaplet may reference other objects in heaplet or shared heap. Shared heap may never reference objects in heaplet. Thread may never reference other thread's data.
  • Need to perform accurate escape analysis to determine where allocation should be done, and if reference was legal.
    • Can use mutator barriers to accomplish this
  • In case of immutable local data and shared mutable data
    • Can simply place forwarding address in shared object's header. Future references will use the new object through the forwarding address, and existing uses will see the old value.
    • Can use for generational as well

Stack Allocation

  • Allocating on the stack where possible has opportunities.
  • Requires escape analysis to ensure that nothing with a longer lifespan references data on the stack
    • Read/write barriers would need to move such data to the heap in this case.
      • Must be rare to be worth it because costly
      • Used by Azul Systems' hardware JVM appliances using a hardware trap.
  • Stack allocation excels in same environment where generational collection excels.

Region Inferencing

  • Objects are allocated into regions, which are freed as soon as nothing in the region is needed by the program.
  • Ex: Real Time Specification for Java
  • Requires either programmer annotation or a significantly intelligent compiler
  • Compiler-directed escape analysis and region inferencing is difficult and slow. Best example = ML Kit Algorithm.

10.3 Hybrid Mark-Sweep, Copying Collectors

Prologue

  • Collectors can be described by choice of how much data to consider "enough" to evacuate a block, and the expected allocation portion of a block
  • Uncompacted heaps can eventually run into performance problems from fragmentation
  • Solution: Combine mark-sweep with copying
    • Divide heap into k+1 windows, keep one empty
    • During collection, choose one window to be a fromspace, and use free window as tospace. Use mark-sweep for rest of heap.
    • Will take k collections to compact the heap, with a space overhead of 1/k.

Garbage First

  • Goal is that no more than x milliseconds every y time slice are used for GC
  • Used in hotspot
  • Divides heap into windows of same size. Allocates from current allocation window.
  • Allows for arbitrary window number
    • Imposes higher write barrier, many more interesting pointers.
  • Can be combined with generational.

Immix and Others

  • Dimpsey / IBM Parallel Mark/Sweep GC
    • Used thread-local allocation buffers
    • Found that the dense prefix was slowing down allocation, so made a separate freelist for allocating buffer-sized memory.
      • Great performance improvements. Didn't need to scan dense prefix for large enough block. Also shortened freelists.
  • Immix Collector
    • Uses liveness estimate from previous collection cycle to determine whether to use copying collection or marking on a block
    • Allocates into either free or partially-filled blocks using a linear, next-fit strategy.
    • Allows allocation into smaller "lines" in blocks
      • A line is 128 bytes, roughly the size of a cache line.
    • When marking, can "implicitly" mark lines because expects empty lines after last line in sequence of lines to free.
    • Distinguishes "medium-sized" objects from small ones, and bump-allocates them into an empty block.

Copying Collection in Constrained Memory

  • Idea is to split oldest generation into many block-sized windows and to copying-collect oldest generation one block at a time, during the same collection period.
  • Uses mark phase to create remembered sets
  • Lower space usage = fewer collections
  • Downsides:
    • Each full collection touches every object twice: to mark and to copy
    • Still has overhead: requires space for bookkeeping data structures for marking/collection

MC-squared collector

  • Relaxes the need for blocks to be laid out sequentially by numbering them.
  • Allows for oldest generation to avoid copying full blocks by simply modifying numbers.
  • Marks incrementally.
  • Segregates popular objects into special block without remembered sets. (Reversibly marks as immortal)

10.4 Bookmarking Garbage Collection

Page faults have incredibly high performance overhead. Bookmarking collector prevents paging during collection and mitigates overhead for mutator.

Works with OS to choose page to evict.

Default choice likely to be least-used page, which for a copying collector is likely to be in the next tospace. Worst possible choice, since we will allocate from it soon. Fromspace page better choice

If the runtime must choose a page to evict, we "bookmark" live references in it before evicting it. We use this bookmark in collection, and then erase it when we next load the page.

Sets empty pages to be first to free, otherwise chooses victim and marks outgoing references.

Bookmarked objects never moved to avoid having to update pointers on non-resident page.

10.5 Ulterior Reference Counting

Segregates objects based on mutation rate.

Tracing great for mutation-high, allocation-high, short-lived data. Reference counting great for data with low mutation rate. Both algorithms are at worst-case for other input. Makes sense to use right algorithm for right data.

Uses a bounded-sized nursery with a copying collector. Survivors are promoted to the reference counted area. Pointers from the RC area to the tracing area are tracked.

Collects cycles when free heap space falls below a limit.

10.6 Things to Consider

Specialized behavior good for large objects, and objects without pointers.

Objects that are part of large pointer cycles should all be stored together, so the whole cycle can be freed by freeing a single page.

Tends to allow for very-specialized collection schemes for objects known to have special traits

Chapter 11: Run-Time Interface

11.1 Interface To Allocation

Prologue

  • Languages require that an object be both allocated and initialized (to a varying degree)
  • Three steps to allocation/initialization:
    1. Allocation: allocate a cell of the correct size and alignment. Overseen by the allocation subsystem of the memory manager.
    2. System initialization: set the fields required for the object to be legal (header bits, method dispatch).
    3. Secondary initialization: set fields after the object has been returned from the allocation subsystem.
    • Languages which require complete initialization in steps 1 + 2 and forbid step 3 (Haskell, ML) must pass an "initialization closure" to the allocator to retrieve values from. Stems from the problem of needing to specialize allocation to each type.
      • Alternative (Haskell) is to inline all initialization and call the collector on exhaustion. Possible because allocation is a simple "bump a pointer"
    • Complete initialization allows for optimizations.
      • A generational write barrier isn't necessary because all references must be to older objects.
  • Paramount that steps 1 + 2 be atomic with respect to collector/system. Otherwise may try to interact with an object lacking headers.
  • Possible variations on divisions of labor between steps.
  • Parameters to allocation(Step 1) may include
    1. The size requested
    2. An alignment constraint
    3. Kind of object to allocate (for header fields, size, space to allocate into)
    4. Type of object (object type, including fields to initialize)
  • Post conditions for step 1: May or may not zero, object header may be filled or not, full or partial initialization
  • Must be atomic if inlined.

Speeding Allocation

  • Solution may be to inline the "common case" of success and call out to the less common cases (fast path sketched below)
  • Sequential allocation fast, especially if inlined
    • If pointer to bump in register, assumes per-thread allocation areas
  • May make segregated fits as fast if we know the list to allocate from at compile time
    • Can be as simple as "bump a pointer with checks" in sequential allocation.
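
A sketch of the inlined fast path in C, assuming a per-thread allocation buffer and an out-of-line slow path that refills the buffer or triggers collection (names hypothetical):

    #include <stddef.h>

    typedef struct { char *bump, *limit; } AllocBuf;

    extern void *alloc_slow(AllocBuf *b, size_t size);  /* refill / collect */

    /* Fast path: one add and one compare -- small enough to inline. */
    static inline void *alloc(AllocBuf *b, size_t size) {
        char *p = b->bump;
        if (p + size > b->limit)
            return alloc_slow(b, size);   /* uncommon case, out of line */
        b->bump = p + size;
        return p;
    }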

Zeroing

  • Some languages require zeroing, others do not
    • May be used for debugging, but not the standard
  • Zeroing in bulk, using optimized native libraries, is the prevailing wisdom
  • Zeroing during collection has poor cache performance. Need to zero somewhat ahead of allocator, but soon enough that page stays in cache

11.2 Finding Pointers

The collector must be certain that no references to an object remain before freeing it, but it may choose to retain garbage, with no penalty beyond wasted memory footprint

Conservative Pointer Finding

  • Treat every segment of bytes that could be a pointer value as a pointer value
  • Must be fast
    • Filters out values that do not refer to any heap areas in memory
    • See if the reference refers to data which has been allocated.
      • Language requirements on pointers referring to start of object or at known offset can simplify this
      • Difficult for languages which allow flexible interior pointer locations (C)
    • May need to choose to not allocate blocks referred to by observed literals which point into the heap range.

Accurate Pointer Finding Using Tagged Values

  • Many languages include a type tag with each pointer
  • Two Approaches:
    • Bit Stealing
      • Align objects so the low bits of pointers are zero, then use those bits to mark a value as a pointer (see the tagging sketch at the end of this section)
      • Reduces the number of bits available for numeric values; may require de-mangling for external numeric libraries to work.
        • Some processors have instructions for working with tagged integers
    • Big Bag of Pages
      • Associates each block with a type
      • Better numeric convenience
      • Allows for memoization of value allocation, if the value is already in the block for the given type, simply use that instead
        • Called hash-consing
        • Should use weak references for bookkeeping structures internal to table/blocks
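
A sketch of bit stealing in C on a word-aligned heap: the low bit distinguishes tagged integers from pointers, at the cost of one bit of integer range (representation hypothetical):

    #include <assert.h>
    #include <stdint.h>

    typedef uintptr_t value_t;   /* either a pointer or a tagged integer */

    /* Aligned pointers have low bits 0; low bit 1 marks an integer. */
    static value_t  box_int(intptr_t i)  { return ((uintptr_t)i << 1) | 1; }
    static intptr_t unbox_int(value_t v) { return (intptr_t)v >> 1; }
    static int      is_int(value_t v)    { return v & 1; }

    static void *as_ptr(value_t v) {
        assert(!is_int(v));      /* only untagged values are pointers */
        return (void *)v;
    }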

Accurate Pointer Finding in Objects

  • Objects typically associate pointer offsets with the type-level information such as method dispatch
    • Can alter offsets to improve cache performance with modern set-associative caches
  • May be easier to partition all pointers from all non-pointers
    • If polymorphism is retained into runtime, can become very problematic
    • Better idea may be to place pointers before reference offset into object, and values after it.
  • Some systems require that pointer-chasing methods actually be written by programmer
  • Possible to avoid objects headers entirely in statically-typed languages, but this might cause more work for collector when parsing the heap.

Accurate Pointer Finding in Stacks and Registers

  • Some runtimes heap-allocate call stacks to improve parsability
  • Others make the stack look like heap objects(GHC)
  • Three issues:
    • Finding frames / activation records in stack
      • Also needs to be done by exception handler, continuations
      • Many runtimes have a reference field in each frame which points to the start of the previous frame.
      • May include maps from return addresses to functions which called that address
      • Stack maps
        • Metadata added to each frame on the stack.
        • Collection during initialization of stack map is dangerous
        • Single map per frame not always possible
          • Stack Maps can't know type of polymorphic reference. A pointer in a list of one type is treated the same as a pointer in a list of another.
          • Needs to check types in call predecessor to find.
        • Compiler will need to cooperate, propagate type information through the backend.
    • Finding pointers in each frame
    • Dealing with calling / stack layout conventions
  • Finding pointers in registers
    • Difficult because we can't impose invariants on registers as easily as on stack fields. Fewer registers demands more flexibility.
    • Compiler optimizations may lead to a register holding an obfuscated pointer.
    • If called function must preserve register contents, cannot know if register contains pointer. Must get information from calling function.
      • If the stack doesn't have fields that refer to the previous frame, necessitates another way to unwind the stack.
      • Solution:
        • Store metadata in a side table that says which stack fields the registers are saved to
        • Move from top of stack downwards, 'unsaving' registers and seeing where they were filled in calling frame. After traversal, we will have pointers that the collector can update
        • After collector runs, walk back up stack, and resave modified pointers to stack fields
        • Make sure that we don't process roots from stack twice if they are also in the heap
  • Compressing Stack Maps
    • Stack maps can occupy a relatively large amount of space
    • Compressing them makes sense
      • Multiple safe-gc points have same map
      • If pointers placed similarly together, even more have same map

Accurate Pointer Finding in Code

  • Pointers in code (including dynamically generated code) may refer to data in the heap
  • General solution is to have compiler create a side table that marks embedded pointers.
  • Target Objects that Move
    • If a target of an embedded pointer moves, must alter the pointer in the code
      • May not be able to alter code segment because of OS permissions
        • Costly to change permissions temporarily
          • Caches don't automatically update when code modified
          • Must specifically force invalidation
            • Expensive
        • May be easier to disallow embedded references
        • If code is moved, then you must alter the return addresses and function pointers on the stack.
    • Handling Interior Pointers
      • Interior pointers refer to object but aren't the 'standard reference'
      • Can be tricky. Can be outside of object itself. Ex: C allows pointers one location beyond end of the array to be considered valid.
      • Problem lies in finding object that pointer belongs to
      • Solutions:
        • Use a table or bitmask to mark the start of each object per block
        • May simply store references to power-of-two blocks and parse area containing pointer
        • Big Bag of Pages can use constant object size to simply find start using rounding
      • While they may be rare, storing structures to handle interior pointers can add space/performance overhead
    • Handling Derived Pointers
      • May be treated similarly to interior pointers
      • We know that if the address is p + i, must belong to object at p if i < size(typeof(p))
      • If i is large, may point outside of p.
      • May refer to something like p1 - p2, the distance between the two pointers
      • If i is constant, can make tidy pointer, or standard reference address, at compile time.
      • If the tidy pointer made by addition can be optimized away, we must explicitly require that the compiler keep around a computed tidy pointer for each derived pointer.
      • For moving collector, must compute distance moved and add to derived pointer expression
      • May prohibit certain code optimizations

11.3 Object Tables

Rather than treating references as direct pointers to objects, can treat them as pointers into a table of object addresses, or as offsets into that table (see the sketch at the end of this section).

Allows for straightforward relocation of data. Only one pointer needs to be updated, rather than multiple.

Should be weak references from table

May be beneficial to store attributes and bookkeeping in table rather than in object headers.

Can compact the table itself.

Can be problematic with interior or derived pointers

May be required by some language semantics

  • Ex: Smalltalk's object swapping. Would otherwise require heap parsing.
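
A sketch of object-table indirection in C (names hypothetical): a reference is an index, so relocating an object means updating exactly one table slot, no matter how many references exist:

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static void *object_table[TABLE_SIZE];   /* slot -> current address */

    typedef uint32_t ref_t;                  /* a reference is an index */

    static void *deref(ref_t r) { return object_table[r]; }

    /* The collector moves the object's bytes, then repoints one slot. */
    static void relocate(ref_t r, void *new_addr) {
        object_table[r] = new_addr;
    }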

11.4 References From External Code

Many managed languages allow interop with other languages, those other languages may modify the heap of the managed language

  • System IO, file handles also an issue

Must consider external reference when tracing

  • Only during external function call

Must ensure external code can find internal object

  • Moving collectors a problem
  • May provide layer of indirection, so that moving the object in internal heap is safe
    • Called handles. Are pointers into the "registered objects" table.
    • May not work if external code doesn't follow access protocol. Ex: system calls and IO
  • May simply allocate the objects which will be "pinned" into non-moving space
    • Difficult to determine. May offer programmer a pin/unpin function.

Options for collectors

  • Defer collection of pinned object's region while pinned
  • Immediately collect and move to non-moving region
  • Collector can tolerate non-moved objects. Can be inefficient/problematic.
    • Will introduce fragmentation likely
  • Other language features may require pinning, such as keeping interior pointers on the stack. It may be easier to pin than handle rewriting.

11.5 Stack Barriers

If we need to scan a stack while a thread is running, we can place a stack barrier and not allow thread to return past that level of the stack

  • Allows for incremental scanning

Hijacks return address of frame that would return past it

Scanning thread can move barriers as it scans.

Can also be used to mark stack frames that have not changed since last sweep

Can be used to limit pause time

11.6 GC-safe points and mutator suspension

Two questions for any given code address: is it safe to pause there, and does a stack map exist for it (more mapped points mean larger stack map tables).

Certain sequences must be atomic/not interrupted for GC, such as write/read barriers and object initialization.

Need to specify "safe points."

Threads may be frozen at unsafe points

  • Must either interpret the remaining instructions, or let the thread run until it hits a safe point

Needs maps + safe points anywhere that allocation can occur, or when a function call may happen which will do allocation

In order to avoid unbounded thread wait times, may want other safe points.

  • In each loop
  • At each function entry and return
  • Must make these explicit "GC checkpoints", since these points never allocate and so would never otherwise end up in the GC

Need way to activate these checkpoints when thread restarted

  • Polling: At safepoint check if moving for collection
  • Patching: Insert a collection yield call at safepoint. Harder to get right, but faster.

Thread notification may be synchronous or async

  • If async, must still wait for the thread to reach a safe point and respond, so it is effectively synchronous.

11.7 Garbage Collecting Code

Many languages dynamically create and compile code

Need way to free code that won't be executed

Need runtime support to mark code that won't be executed; difficult to trace because code is frequently referred to by the symbol table or global variables

Special Cases:

  • Closures = function that keeps environment of bindings
    • Can determine which variables are used by code and which can be pruned from closure, freeing other closures most likely
  • Classful OO
    • Normally objects point to classes, code. Pointer ignored by tracer because in non-collected, non-moving area.
    • Need to scan pointers, costly
    • Might want to only run on special cue.
    • Also must ensure that either class unloading/loading is idempotent(immutable class variables for example) or that nothing references class in future.
  • Code which has been optimized, and previous version still needed by executing code
    • May need RC, tracing
    • On-stack replacement allows old code to be freed.

11.8 Read and Write Barriers

Prologue

  • Without read/write barriers, there are many situations where the mutator can hide pointers from the collector.
  • Must consider any references from outside subset of heap being collected as roots, since can't know if they're garbage.
  • Can break into the detection and recording of interesting pointers.

Engineering

  • Barriers usually apply checks before allowing an action to progress
  • Inlining very desirable
    • Can't inline code sequences which are long
      • Solution is to inline the fast path and call out to the slow path when the fast-path check fails.
      • Should order the slow path's checks so the most common cases are handled first.
  • Must be able to access data structures quickly as well
  • Requires cooperation of compiler
    • May need to not optimize the allocator sequences for atomicity requirements.

Precision of Write Barriers

  • Whether to include metadata and to record more types of interesting pointers
  • Choice of trade off of more work for the mutator or collector
    • Collector will have to generate the data that the mutator doesn't save
  • May be required to provide some information for some collection schemes.
  • Decisions:
    • How accurately to record pointer writes?
      • May be cheaper to log everything
      • Depends on remembered set implementation. May be costly to store unnecessary information.
      • May have to filter for concurrent/incremental collectors.
      • May want to inline some filters and call out to others
    • At what granularity to record pointer location?
      • Most precise is address of field in object.
        • Will make remembered set larger
      • May simply point to object, let collector scavenge
      • May use cards(fixed-size regions of memory) or pages.
        • Simply mark card/page as dirty and then let collector traverse
    • Should the remembered set be allowed to contain duplicates?
      • Card/page (and sometimes object) granularity avoids duplicates.
  • Collector's cost:
    • Cards/pages: depends on the number of dirty cards/pages
    • Pointer-recording without duplicates: number of objects modified
    • Pointer-recording with duplicates: number of pointer writes

Hash Tables

  • For pointer / object deduplication, remembered set must be a set.
  • Adding items must be fast
  • Must have one per generation
  • Example: Circular hash table for each generation, with linear hashing
  • Strategy: With multiple generations/remembered sets, let mutator place interesting pointers in "scratch" location, and let collector sort them out.
    • Removes need for mutator to find correct target generation
  • Hash tables used by train algorithm

Sequential Store Buffers

  • Faster recording of pointers
  • Can have one per thread
  • "Bump a pointer" insertion
    • Can avoid out of bounds check using hardware paging.
      • Very costly to handle trap. Must be faster than running the check at each iteration.
    • Can also use architecture-specific tricks
      • Solaris UTRAP for example.
  • May be used directly as remembered sets, or may be front-end to hash tables
    • Choice depends on whether sets are discarded at the end of each collection, on collection scheme.
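
A sketch of a sequential store buffer in C: recording a slot is a single bump-pointer store, with the overflow test written explicitly here rather than elided via a guard page (names hypothetical):

    typedef struct {
        void **next;    /* next free slot in the buffer */
        void **limit;   /* one past the end of the buffer */
    } SSB;

    extern void ssb_overflow(SSB *b);   /* empty into hash table, or grow */

    /* Record the address of a modified slot for the collector to scan. */
    static inline void ssb_record(SSB *b, void **slot) {
        if (b->next == b->limit)
            ssb_overflow(b);            /* a guard page could elide this */
        *b->next++ = slot;
    }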

Overflow Action

  • SSBs and Hash tables can overflow
    • Can enlarge or empty into another, larger structure if being used as frontend

Card Tables

  • Divides heap into fixed-sized cards
  • Marks a card dirty when an interesting pointer written into a field in that card
  • Moves costs from mutator to collector
  • May implement as a bit array (one bit per card), but not the best choice
    • Bit manipulations not atomic. Software synchronization costly.
    • Usually use array of bytes
      • Many efficient ways exist to zero an arbitrary byte, so a card is usually dirtied by setting its byte to 0 (as in the sketch below).
    • Can reduce overhead by instead marking the card holding the object's standard reference rather than the modified slot.
  • Card tables are usually very small compared to the heap.
  • Card size is a compromise between space usage and collector time
  • Promote objects in a generational collector to a clean card to avoid needless scanning next cycle (p. 199 for clarification)
  • Since most cards will be clean, will waste time in sequential scan of cards looking for dirty cards
    • May want to create set of larger cards, and use it to find the indices of the array of smaller cards to sweep.
  • In generational collector with no interesting pointers in nursery, only needs cards for older generation.
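
A sketch of an unconditional byte-per-card barrier in C, dirtying with a zero store as described above (card size and names hypothetical):

    #include <stdint.h>

    #define LOG_CARD_SIZE 9            /* 512-byte cards */

    extern uint8_t   card_table[];     /* one byte per card; 1 = clean */
    extern uintptr_t heap_base;

    /* Unconditional barrier: the pointer store plus one byte store. */
    static inline void write_ref(void **slot, void *value) {
        *slot = value;
        card_table[((uintptr_t)slot - heap_base) >> LOG_CARD_SIZE] = 0;
    }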

Crossing Maps

  • Start of card may not necessarily align with start of objects. Since very large objects can have fields many cards apart, can cause problems in scanning cards.
  • Need a crossing map that determines how objects cross cards.
  • One entry for every card. Stores the offset into the card that the first object starts.
  • Depends on whether cards record slots or objects.
    • Slot-recording must store index to last object in card, negative value if no objects start in card
    • When given a card, can move backwards in the map until a non-negative offset is found; that card contains the start of an object at the recorded offset (sketched below).
  • Object-recording can traverse relevant card from first object.
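
A sketch of the backward search in C for a slot-recording crossing map, assuming negative entries mean "no object starts in this card" and that some object starts at or before the first card (names hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    #define CARD_BYTES 512

    extern int16_t   crossing_map[];   /* per card: offset of first object
                                          start, or negative if none */
    extern uintptr_t heap_base;

    /* Find the first object to scan for a dirty card: walk back to the
     * nearest card where an object starts, then compute its address. */
    void *first_object_for_card(size_t card) {
        while (crossing_map[card] < 0)
            card--;                    /* an object straddles this card */
        return (void *)(heap_base + card * CARD_BYTES
                        + (uintptr_t)crossing_map[card]);
    }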

Summarising Cards

  • If not all objects promoted en masse, some cards will remain dirty between collections.
  • Rather than re-scan these cards for pointers, why not store the interesting pointers in a hash table?
  • May also store information related to generation of referent in table, other information the collector must compute each sweep.
    • Led to an 80% improvement in a 5-generation SML collector

Hardware Virtual Memory Techniques

  • May intercept page eviction and collect 'dirty' bit at this time.
  • May also intercept any modifications of a non-nursery page, and set dirty bit at this point.
    • Poor choice on modern machines, but done historically
    • Using virtual memory for locking can lead to 'trap storms' when many pages (tospace, newly grown older generation) need to have attributes modified at the end of collection.
  • Works without compiler support.

Write Barrier Mechanisms in Summary

  • No winner in performance comparison between different strategies.
  • Write barriers usually have costs less than 2%.
  • For each platform, a different technique might lend itself to easier implementation.

Chunked Lists

  • A common data structure used throughout runtimes/compilers is the chunked list
  • Consists of a linked list of nodes containing arrays.
    • Chunk sizes and starting addresses are typically powers of two (see the sketch below).
  • Desirable for parallel collectors because can take individual chunks. Less contention for lock to modify linked list.
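
A sketch of the structure in C, with a power-of-two chunk capacity so parallel collector threads can claim whole chunks at a time (names hypothetical):

    #include <stdlib.h>

    #define CHUNK_ENTRIES 256          /* power of two */

    typedef struct Chunk {
        struct Chunk *next;            /* link to the previous chunk */
        int           used;
        void         *items[CHUNK_ENTRIES];
    } Chunk;

    /* Append an item, starting a fresh chunk when the head fills; a
     * worker can later detach one Chunk with a single list operation. */
    void chunked_push(Chunk **head, void *item) {
        if (!*head || (*head)->used == CHUNK_ENTRIES) {
            Chunk *c = calloc(1, sizeof *c);
            if (!c) abort();           /* out of memory */
            c->next = *head;
            *head = c;
        }
        (*head)->items[(*head)->used++] = item;
    }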

11.9 Managing Address Space

Some algorithms require large, contiguous memory regions.

  • Some 32-bit machines can't provide that space.
  • Dynamic libraries can be located anywhere in address space.

Many require contiguous regions so that latency-sensitive code can determine an address's space with simple pointer comparisons.

Better to not require contiguous memory, if table lookup cost is acceptable.

Solution is to work with power-of-two sized frames. May mark frames with generations or other attributes.

Can use software emulation of virtual memory, or simply use the OS's virtual memory hardware.

Terminology

  • Reserving = Using the managed system's notion of requesting address space for a use.
  • Allocating = Requesting memory from the operating system.

11.10 Applications of Virtual Memory Page Protection

Prologue

  • Many GC-semantic checks can be done cheaply through configuration of the virtual paging hardware. A violation will be caught as a memory protection fault.
  • Modifying protections can be expensive
  • Useful when dealing with uncooperative code.

Double Mapping

  • System maps same page at two addresses, with different protections at each address.
    • Ex: Preventing fromspace access in concurrent copying collection by setting address mutator knows as "no access" and then mapping somewhere else for collector to use.
  • May be harder in 32-bit systems
  • Some caches index by virtual address. Cache coherency can be threatened by double mapping.

Application of No-Access Pages

  • Uses:
    • Unconditional read barrier
    • Detect null pointer dereferences (trying to access address 0, or near it)
    • Guard Paging
      • Doesn't need to check that the allocation pointer is in bounds if the buffer is followed by a no-access page (sketched at the end of this section).
      • Can use same idea for stack or heap overflows.
    • Can allow large logical space but small virtual memory space
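
A sketch using POSIX mmap/mprotect: placing a no-access page after a buffer turns the bounds check into a hardware trap (error handling abbreviated):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate 'bytes' (rounded up to pages) followed by one guard page;
     * running a bump pointer off the end faults instead of needing an
     * explicit limit check on every allocation. */
    char *make_guarded_buffer(size_t bytes) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t rounded = (bytes + page - 1) & ~(page - 1);
        char *p = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        mprotect(p + rounded, page, PROT_NONE);   /* the guard page */
        return p;
    }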

11.11 Choosing Heap Size

Larger heap sizes generally lead to less thrashing and more throughput.

Too big a heap can lead to excessively large working set, take up too much data in bookkeeping data structures to perform well.

Schemes:

  • After each collection, collect system and heap usage statistics to decide how much heap space is safe to use before next collection
  • If collection frees less than a certain fraction of the heap, expand the heap. This has no provisions for shrinking the heap.
  • If the user can inform the program of the target working set size, the collector can grow the heap until that size is met. Good observed performance.
    • Also marked fromspace as discardable to virtual memory manager
  • Can rely on special OS syscalls to inform program how much memory it may use
  • Can use number of allocation stalls as indicator of memory load
    • Tries to grow nursery every collection until stalls, and then shrinks heap to last observed size that didn't have stalls.
  • If we have knowledge of all programs running on the system, we can figure out heap sizes for maximum throughput across all processes.
  • Dynamic heap sizing is also possible. Change heap size in reaction to page faults

11.12 Issues to Consider

Initialization/allocation requirements for stability and performance must be considered above all

The degree to which the allocation sequence can be inlined.

Whether we can spare a single pointer-register for bump allocation

  • Not an option with too few registers or if register allocation would be too difficult.

Whether more information(pointer location, etc) can be passed on by compiler or whether conservative collection can be used.

Chapter 12: Language-Specific Concerns

12.1 Finalization

Prologue

  • Some resources which lie outside the scope of the collector (file handles, network connections, etc.) require cleanup actions when the program is done with them
  • Need a way to perform these freeing actions when the objects encapsulating the resource are freed.
  • Store a table of finalisers for objects. An object is finaliser-reachable if it can only be reached from the finaliser table.

When Do Finalisers Run?

  • Problem: If they are run during collection, they may need to allocate on the (currently full) heap.
  • Solution: Run after collection
  • Create a queue of finalisers to run and process after collection
  • To run before mutator threads resumed or not?
    • If so, deadlock is possible if a finaliser expects a response from a mutator thread or needs to acquire locks one holds. Running finalisers before resuming is prompter, but lowers throughput.
  • Ideally, finalizer semantics should not limit optimizations in runtime.

Which Thread Runs a Finalizer?

  • Idea: Use a separate thread to run finalisers.
    • Finalisers must now be reentrant.
    • Constructors and finalisers for the same type must be safe to run concurrently.

Can Finalisers Run Concurrently with Each Other?

  • If they must be safe to run concurrently with arbitrary non-finaliser code, they should be safe to run concurrently with each other.

Can Finalisers Access the Object Which Became Unreachable?

  • A finaliser needs the ability to specialize for a given resource: it must take arguments (object fields) or access the object.
  • If so, then collector must trace references of garbage object and ensure they survive

When are finalized objects reclaimed?

  • Some more dynamic languages allow a finaliser to run more than once. Others do not guarantee that it runs only once, but forbid the programmer from explicitly invoking it multiple times.

What happens if there is an error in the finalizer?

  • Finalisers will have access to the language's error recovery mechanism. Not a GC problem.

Is there a guaranteed order to finalization

  • Sane default is to guarantee finalization in the order of reachability. If A and B are garbage, and A encapsulates or references B, but B only references the resource, free B first.
  • Cyclical references will require programmer understanding of semantics of finalisers in question.

The Finalisation Race Problem

  • If the compiler optimizes away the last reference to an object with a finaliser while its resource is still in use, the collector may run the finaliser and free the resource out from under the code still using it.
  • Solution is to provide a keepalive function that the compiler won't optimize away.

Finalisers and locks

  • Finalizers may modify global data structures related to resources, but appear to run concurrently to programmer-provided code.
  • Solutions include:
    • Locking everything, always
    • Adding finalization calls to a queue, calling them at a programmer- or compiler-defined "safe point", and running all finalizers at that point
      • The pain is in making sure the location really is safe.
      • If the thread running the finalizer already holds a shared lock, the finalizer will run without the lock-acquiring semantics that the programmer designed.
      • Ex: this is why Java will not run finalizers while the thread holds any user-visible locks.

12.2 Weak References

Prologue

  • Determining object liveness through ordinary (strong) pointers alone can have suboptimal behavior
  • There may be data we want to reference only while other objects are associated with it. If no other objects refer to it, we have no use for it.
  • These are called weak pointers, or weak references.
  • May be useful to be notified when these references are nullified, so that we can remove them from data structures.
  • Tracing collectors can implement them by not chasing weak pointers during the marking phase, and then updating them with the new object address or nullifying them during the collection phase.
  • The collector must be able to identify a reference as weak.
  • Solutions:
    • May alter the pointer's least significant bits to record the information
      • Requires undoing the alteration in order to dereference
    • May create a weak reference object. It is special only to the collector and can behave like a normal object to user code.
  • The system must give the programmer a way to specify a reference as weak
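
A minimal Java sketch of the weak-reference-object approach, including clearing notification through a queue (sizes and timeouts are arbitrary):

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;

    class WeakDemo {
        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
            // The reference object looks ordinary to user code but is special to
            // the collector, which clears it once the referent is only weakly
            // reachable, then enqueues it for notification.
            WeakReference<byte[]> ref = new WeakReference<>(new byte[1024], queue);

            System.gc();                                   // a hint, not a guarantee
            Reference<? extends byte[]> cleared = queue.remove(1000); // null on timeout
            System.out.println(cleared == ref);            // true once cleared and enqueued
            System.out.println(ref.get());                 // null after clearing
        }
    }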

Additional Motivations

  • Caches and other programs that make a space-time tradeoff. Getting "low space" feedback from the runtime can help simplify cache expiry logic.

Supporting Multiple Pointer Strengths

  • Prologue
    • We use the phrase x-reachable if an object can be reached from the roots by pointers of strength x and no stronger; x*-reachable means reachable by pointers of at least strength x.
    • Strengths:
      • Strong: Collector never clears.
      • Soft: Collector clears if space is at a premium.
      • Weak: Collector always clears if the object is only weakly-reachable.
      • Finaliser: The reference from the table of finalizers to an object with a finalizer.
      • Phantom: Collector must clear. Cannot be dereferenced.
    • Harder to implement with reference counting. Need to store more metadata, have consistent way to decide when to clear soft references.
    • Detecting cycles must account for these semantics
  • Using Phantom Objects to control finalization order
    • Phantom objects can have a phantom reference to one object, but a strong reference to another. When the first object runs through finalization, the phantom reference is cleared. Then the second can run.
    • Phantoms purposefully designed to have weaker reachability than finalizers for this reason.
  • Race in weak pointer clearing
    • Finalizer race condition involving compiler optimizing references away can happen here
  • Notification of Weak Pointer Clearing
    • Placing cleared pointers in a queue allows for easy notification
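
A hedged Java sketch of the phantom pattern above: the phantom can never be dereferenced, and its enqueueing signals that the referent has passed through finalization, so dependent cleanup can safely run afterwards:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;

    class PhantomDemo {
        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<Object> queue = new ReferenceQueue<>();
            Object resource = new Object();
            PhantomReference<Object> phantom = new PhantomReference<>(resource, queue);

            resource = null;                      // drop the last strong reference
            System.gc();                          // a hint, not a guarantee
            Reference<?> r = queue.remove(1000);  // enqueued only after finalization; null on timeout
            if (r == phantom) {
                System.out.println("referent finalized; safe to release dependents");
            }
        }
    }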

12.3 Issues to Consider

Finalization and weak pointer semantics determined at language design time.

Both of these features add complexity to tracing that should be considered from the outset. Do not bolt them on later.

Concurrent access and clearing of weak references is an issue

  • May be better to clear weak references during an infrequent stop-the-world phase of an otherwise concurrent collector.

Chapter 13: Concurrency Preliminaries

13.1 Hardware

Threads

  • Thread states = ready, blocked, or running
  • Threads can move processors when descheduled
    • Can use processor affinity
  • Simultaneous multithreading or hyperthreading (multiple hardware threads sharing the same execution pipeline) is not actual multiprocessing, but is similarly named
  • Multiprocessor = More than one processor
  • Multicore/CMP/Many-core = many processors on same integrated circuit
  • Multithreaded software runs concurrently on a multiprocessor
  • Multiprogrammed = software executing multiple threads on a single processor

Interconnect

  • Shared memory differentiates local concurrent software from distributed systems
  • Access mediated by bus or interconnect
    • Bus = single point that all requests pass through
    • Granularity of memory access is a tradeoff between bandwidth and the number of busses
    • Busses become a bottleneck on 8-16+ core systems
  • SMP vs NUMA
    • SMP = All processors have equal access time to memory
    • NUMA = Each processor has a small bank of memory that they have faster access to

Memory

  • Because of concurrent access to memory, hard to describe a single state
  • This is alright because each unit of memory is accessed individually, and that access is consistent

Caches

  • Memory access is slow relative to processors, so we use faster and smaller caches
  • If needed address in cache, called a cache hit
  • If needed address not in cache, called a cache miss
    • On cache miss, we bring memory's values into cache
    • If cache full, we must evict a cache line. Evicted line called victim.
  • If data is written to memory immediately on a write, the cache is called write-through. If it is written only when the line is flushed, it is called write-back.
  • Cache Organization
    • Cache Location
      • Fully Associative
        • Lines from any memory addresses can reside in the cache together at the same time
      • Direct-Mapped
        • Each address has a specific place in cache it must reside
      • k-way Set Associative Caches
        • Each address has k possible cache locations in which it may reside
        • Eviction may choose any of the k lines in the set
      • Victim caches
        • On eviction, a line is put in a fully associative victim cache
        • Allows easier recall of victims
    • Cache level relationship
      • Strictly inclusive if a line held by a lower level must also be held by the higher level
      • Strictly exclusive if a line held by a lower level may not be held by the higher level
      • Can be neither, allowing anything

Coherence

  • Multiple levels of cache may get out of sync with one another, holding different values for the same address
  • Common cache coherency protocol: MESI
    • Modified: This cache holds the only copy of the line; it has been modified but not written back to main memory.
    • Exclusive: Only one cache holds the line; it is not modified.
    • Shared: Other caches may hold the line; all have the same value.
    • Invalid: The cache does not hold the line.
    • To satisfy a read, the cache must hold the line in the Modified, Exclusive, or Shared state. How to satisfy a write depends on the state of the address.
    • Point is that there is only one writer for a given line, and no caches hold disagreeing values.
    • Does not scale well to many processors.
  • False sharing = if multiple processors frequently access and update different data in the same cache line, the line will be moved and invalidated between processors frequently. "Ping-ponging"
  • Performance example: Spin locks
    • Poor because threads on different processors will contend for line with lock and invalidate it frequently.
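
A test-and-test-and-set spin lock illustrates the fix: spin on a read (the line stays Shared in MESI) and attempt the invalidating write only when the lock looks free. A Java sketch, not a production lock:

    import java.util.concurrent.atomic.AtomicBoolean;

    class TTASLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void lock() {
            for (;;) {
                while (held.get()) Thread.onSpinWait();      // read-only spin on the cached line
                if (held.compareAndSet(false, true)) return; // one invalidating write to acquire
            }
        }

        void unlock() {
            held.set(false);
        }
    }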

13.2 Hardware Memory Consistency

Prologue

  • Writes to a given address have a total order, and a processor won't see a value change without a write having happened between the two accesses.
  • The order of writes to more than one address need not correspond to the actual order of changes in memory or cache
    • Ex: A write buffer that coalesces multiple writes to an address by keeping the later write rather than performing both the earlier and later ones. A read will fetch the value from the write buffer if it is there.
    • Ex: Compilers may reorder instructions
    • Almost universally done for performance gains
  • Non-atomic, relaxed ordering can break program invariants
  • Three access types: read, write, atomic.
    • Atomic operations use primitives such as test-and-set to ensure an atomic read-modify-write

Fences and happens-before

  • A fence prevents certain types of accesses on one side of it from being reordered to the other side
  • An alternative is an acquire-release pair, which prevents accesses between them from moving before the acquire or after the release.
    • Useful for protecting critical sections
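
In Java these semantics surface as volatile accesses: a volatile write acts as a release and a volatile read as an acquire. A minimal sketch of safe publication under that assumption:

    class Publish {
        private int payload;                // plain field
        private volatile boolean ready;     // volatile write = release, volatile read = acquire

        void writer() {
            payload = 42;                   // cannot be reordered after the release below
            ready = true;                   // release
        }

        void reader() {
            if (ready) {                    // acquire
                assert payload == 42;       // guaranteed visible after the acquire
            }
        }
    }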

Consistency models

  • Strict consistency => All operations have a total ordering everywhere in the system
  • Sequential consistency => All operations appear in a single total order that is consistent with each processor's program order
  • Weak consistency => Atomic operations act as total fences
  • Causal consistency => Reads may not be reordered across writes that may alter the value read

13.3 Hardware Primitives

AtomicExchange and test-and-set are both fairly simple. They set a value unconditionally and return the old value. This can be seen as setting a value conditionally if the calling code's semantics are "only proceed if a certain value was returned."

Compare and Set

  • Takes an old and a new value; if the old value equals the value currently at the address, writes the new value. Returns whether or not it updated.

Compare and Swap = Same as compare-and-set, but returns the old value

  • The trap is that the value might have changed multiple times and is now equal again to the old value being checked for
  • Ex: Misreading an int for a pointer because the bit patterns happen to match, and proceeding.
  • Called the ABA problem
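
The canonical retry loop, sketched in Java. AtomicInteger.compareAndSet returns a boolean (compare-and-set semantics), while compareAndExchange (Java 9+) returns the witnessed old value (compare-and-swap semantics):

    import java.util.concurrent.atomic.AtomicInteger;

    class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        int increment() {
            for (;;) {
                int old = value.get();
                if (value.compareAndSet(old, old + 1)) {
                    return old + 1;   // success: no interfering write occurred
                }
                // failure: another thread won the race; reload and retry
            }
        }
    }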

Load-linked / Store Conditionally

  • Use processor's cache coherence mechanism to detect changes at a memory address.
  • Solves the ABA problem very well

Atomic Arithmetic: Atomic increments and decrements

  • Not as powerful as other primitives. They have finite consensus numbers
  • A primitive with a consensus number of N allows N threads to correctly agree on a single value

More powerful primitives

  • Multi-word compare-and-swap can use a counter in the second word to avoid the ABA problem
  • More useful to update two non-adjacent words atomically
    • Some chips like 88000 have such primitives

Overheads of atomic primitives

  • The expense of atomic primitives leads programmers to avoid them, which invites error
  • Expensive for two reasons:
    • Cache coherence
      • Must acquire exclusive access to the line
    • Memory fences sacrifice the optimizations that reordering would allow

13.4 Progress Guarantees

Important to guarantee that contending threads will make progress, and to bound wait times

Guarantees:

  • Wait-free: Every thread can always make progress regardless of the actions of other threads.
    • Typically involves threads helping each other along
    • If a thread's action would make another thread that already holds resources unable to proceed, the first thread waits until the other leaves its critical section
  • Obstruction-free: Given a period of isolated execution, each thread will complete in a finite number of steps.
    • Easier to provide than wait-freedom, but may require scheduler cooperation
    • Threads can tell when they are contending and will back off
  • Lock-free: Infinitely often, some thread finishes in a finite number of steps
    • Requires only that at least one contender makes progress at any given time.

Guarantees and Concurrent Collection

  • Parallel collectors use multiple threads to collect, but stop mutator threads while collecting.
  • Concurrent collectors perform some collection while mutator threads run
  • Concurrent collection has implementation edge cases
    • When objects are freed or become garbage during collection
    • Proof of termination
      • Mutators may need to be throttled back to avoid making more work than the collector threads can complete
      • Need to prevent mutators from using all memory before collection finishes.
  • May offer different guarantees for different portions.

13.5 Notation Used for Concurrent Algorithms: Not relevant. Used for describing pseudocode.

13.6 Mutual Exclusion

Need to balance progress guarantees and mutual exclusion

  • Peterson's algorithm is a good tool, as sketched below
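
A sketch of Peterson's algorithm for two threads (ids 0 and 1) in Java. Volatile/atomic accesses are needed so the flag and turn writes are not reordered:

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class Peterson {
        private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 1 = wants to enter
        private volatile int turn;

        void lock(int self) {
            int other = 1 - self;
            flag.set(self, 1);                  // announce intent
            turn = other;                       // politely yield priority
            while (flag.get(other) == 1 && turn == other) {
                Thread.onSpinWait();            // wait while the other thread has priority
            }
        }

        void unlock(int self) {
            flag.set(self, 0);
        }
    }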

13.7 Work Sharing and Termination Detection

Parallel algorithms need a way to detect that all of their threads have terminated

Simple if processing a finite amount of data, but sometimes threads create more work to process

  • Can push work to other threads or pull from other threads. Pulling is called work-stealing
  • No work may be lost
  • Counters may become bottlenecks and cause cache contention

May use a separate thread whose only job is to detect termination

  • Workers set a busy flag. The detector scans the flags.
  • Needs to track jobs pushed or pulled in order to atomically scan all flags.

Rendezvous Barriers

  • Common to need all threads to reach the same phase of collection.
  • May simply need all threads to reach a given point, called a rendezvous barrier

13.8 Concurrent Data Structures

If a structure is accessed rarely, we can simply lock it entirely with a mutex, but this won't scale

Concurrent structures exist that allow interleaving of certain actions on the structure.

  • A structure is linearizable if overlapping actions have the same effect as separating them.
  • Each action has a point at which it can be observed to have occurred, called its linearization point.
  • Variety of locking patterns that can be used:
    • Coarse-grained locking: One large lock for the whole structure.
    • Fine-grained locking: Lock subsets or individual elements. Needs to ensure the locking order is consistent, or it will cause deadlock.
    • Optimistic locking: Searches the structure without locks, then locks and verifies that the structure hasn't changed since the search and that the right elements are locked.
    • Lazy update: Use metadata to mark an element as being removed or altered, release the lock, and then acquire the lock to make the change. This prevents blocking reading threads.
    • Non-blocking: Use of atomic primitives to remove the need for locking.

Concurrent Stacks
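
A minimal sketch of the classic non-blocking (Treiber-style) stack, using the CAS pattern above. In a garbage-collected language the ABA problem is largely defused here, since a node's address cannot be reused while a thread still holds it:

    import java.util.concurrent.atomic.AtomicReference;

    class TreiberStack<T> {
        private static final class Node<T> {
            final T value;
            final Node<T> next;
            Node(T value, Node<T> next) { this.value = value; this.next = next; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void push(T v) {
            Node<T> h;
            do {
                h = head.get();
            } while (!head.compareAndSet(h, new Node<>(v, h))); // retry on contention
        }

        T pop() {
            Node<T> h;
            do {
                h = head.get();
                if (h == null) return null;                     // empty stack
            } while (!head.compareAndSet(h, h.next));
            return h.value;
        }
    }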

13.9 Transactional Memory

Implementation:

  • All reads and writes in a transaction should appear not to interleave with those of any other thread.
  • Should work across many independent words.
  • Attempts to commit the transaction may fail. On success, the changes become visible. On failure, software may respond by retrying or taking other steps.
    • Allows transactions to execute speculatively.
  • Atomic, Consistent, and Isolated.
  • If two transactions would interfere with each other, one must be killed to allow the other to proceed.
  • Atomicity methods
    • Undoing
      • Creates an undo log, actually makes writes to main memory.
      • On failure, undoes instructions in log.
    • Buffering
      • Create a scratch buffer and store all writes to it. On success, write to memory. On failure, discard.
      • More work for successful commits, and success is common case.
  • Conflict detection
    • Can use eager or lazy methods
    • Eager:
      • Checks all accesses for conflicts as they happen
    • Lazy:
      • Checks for conflicts only when the transaction attempts to commit

Use for collection

  • Caveats:
    • STM has a significant overhead
    • Hardware transactional memory may have gotchas, not obvious how code maps to transactions, may have strange cache behavior.
    • May interfere with guarantees because the retry behavior is less understandable than hand-written locking
    • May require very careful performance tuning.

Interaction with languages with transactional memory

  • Memory manager may cause more transactions to fail, cause many retries
  • May have collector and mutator fight for transaction, constantly restart
    • Harder to handle with hardware transactional memory
  • Transaction logs need to be part of the root set, or garbage created during a failed transaction would cause premature freeing
  • Undoing allocation may require a lock on allocation pointers or free-list metadata, which is undesirable
  • If non-transaction actions allowed during transaction, need to make sure that allocated memory is initialized.

13.10 Issues to Consider

Hard to get right

What guarantees are actually necessary?

What atomic primitives are supported on our architectures?

Chapter 14: Parallel Garbage Collection

14.1 Is There Sufficient Work to Parallelize?

Goal is to reduce overhead by exploiting hardware

In stop-the-world, will shorten pause times

In concurrent schemes, will shorten GC cycle time

Must be worth it in the tradeoff of the overhead of synchronization primitives

In practice there are not many stalls, because object graphs are rarely very deep. This makes them amenable to parallelism.

Harder to parallelize tracing, easier to parallelize "sweep" activities like fixing references.

14.2 Load Balancing

Needs to allocate work to threads properly, or execution will be mostly sequential

Has coordination cost

Methods

  • May schedule all work ahead of time, such as with a sweep that divides work by region. Called static balancing
  • Most tasks need dynamic balancing. Use a heuristic to divide labor.
    • Good solution is to over-partition the work into more pieces than threads and allow threads to take them off a queue.
      • More accurately partitions based on real work than heuristics

14.3 Synchronization

Over-dividing work into too-small granules incurs synchronization overhead, despite better processor utilization

Need to prevent double-copying, which causes problems like future updates affecting only one copy and inconsistent state.

Synchronization may be removed where actions are idempotent and repeating work not too costly.

14.4 Taxonomy

Processor-centric algorithms

  • Gets work granules of different sizes
  • Usually by work-stealing
  • No consideration of locality

Memory-centric algorithms

  • Take locality into account.

14.5 Parallel Marking

Prologue

  • Marking is three actions: getting an object, marking it, and adding its children to a work list
  • All known algorithms are processor-centric
    • No synchronization is required if the work list is thread-local
    • Marking can be idempotent (setting a bit)
    • Synchronization is required for work-stealing
  • If marks are stored in a bitmap, access to it needs synchronization.
  • May need to move work out of the thread-local store

Processor-centric Techniques

  • Work stealing (see the sketch after this section)
    • If its work list is empty, a thread tries to steal from other threads'
    • Needs to lock, but can give up if unable to acquire the lock and move on to the next thread's queue
    • Problem with processing a large array of pointers: the local work stack will overflow.
    • Solution is to push arrays onto the global work queue as smaller lists of pointers
    • Only requires synchronization when taking work; faster than other algorithms, which always use synchronization
    • Termination for work stealing
      • Use of an atomic counter is possible, but contention can lead to serialization
      • Thread-local counters allow introspection without locks
      • Detection involves setting a "no threads stole work" flag and trying to steal work; any thread that acquires work unsets the flag; if stealing fails and the flag is still set, terminate.
      • May be worthwhile to have a lock for termination detection
  • Grey Packets
    • Divide work into packets and have threads contend for work packets
    • Each thread processes an input packet of grey objects and fills an output packet
      • Adds the children of each marked object to the output packet
      • Called grey packets because the input and output packets are both grey in the tricolor abstraction
      • Allows prefetching of a packet's objects.
      • Only requires synchronization to get an input packet and to hand in an output packet
      • The termination state is that the number of empty packets (outputs with no grey elements) equals the total number of packets.
  • Channel
    • Create a channel (a queue with a single reader and a single writer) between each thread and each other thread
    • Store the channels in a circular buffer
    • Push the children of a marked object onto another thread's queue
      • If there are no children, take work from the global pool and push that instead.
    • No synchronization is needed, and this performs better than work stealing on machines with many processors
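
A structural sketch of the work-stealing variant in Java (illustrative names; termination detection elided). The only synchronization is the atomic mark bit and the steal from another thread's deque:

    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.atomic.AtomicBoolean;

    class Marker {
        static class Node {
            final AtomicBoolean mark = new AtomicBoolean(false);
            Node[] children = new Node[0];
        }

        final ConcurrentLinkedDeque<Node> local = new ConcurrentLinkedDeque<>();
        final List<Marker> others;                 // the other marking threads

        Marker(List<Marker> others) { this.others = others; }

        void run() {
            for (;;) {
                Node o = local.pollLast();         // own work: LIFO end, better locality
                if (o == null && (o = steal()) == null) return; // no work anywhere (termination elided)
                if (o.mark.compareAndSet(false, true)) {        // idempotent marking
                    for (Node child : o.children) {
                        if (child != null) local.addLast(child);
                    }
                }
            }
        }

        private Node steal() {
            for (Marker m : others) {
                Node o = m.local.pollFirst();      // steal from the FIFO end
                if (o != null) return o;
            }
            return null;
        }
    }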

14.6 Parallel Copying

Differs from marking in that an object may be copied only once

Processor-Centric Techniques

  • Work dividing
    • Each copying thread given a stack of work to do.
      • Stacks offer easier synchronization
    • Synchronization works with "rooms," divided into pushing rooms and popping rooms
      • Pushing and popping never overlap: one of the two rooms must always be empty
      • A new thread moves into the non-empty popping room and pops work out of the shared stack, putting the work it generates onto its own stack. It leaves the pop room when done.
      • Once all threads leave the pop room, it is locked against future entrance and the generated work is emptied into the push room's shared stack. Each thread leaves the push room when done.
      • The last thread to leave the push room unlocks the pop room.
    • Problem with waiting: because work time varies between threads, there may be significant wait time.
      • Can fix by making threads leave the pop room as soon as they get work, since working doesn't depend on the shared stack.
    • Terminates when the shared stack is empty as the last thread leaves the push room, and no objects are being worked on outside the rooms.
  • Copying in parallel
    • If all allocation is done in the same region, need to synchronize on allocation.
    • May race to write a "busy" pointer into the object's forwarding slot.
      • If thread-local buffers are used for allocation, threads can race to write the forwarding location directly.
        • May need to un-allocate if another thread has already copied the object in the interim.
    • If local allocation buffers are bounded, NUMA architectures may suffer due to allocation on another processor's memory
      • NUMA awareness should be baked into the allocator

Memory-Centric Techniques

  • Per-thread fromspace and to-space
    • Risks poor load-balancing and space overflow
  • Block-structured heaps
    • Over-partition fromspace and tospace and have threads compete for granules of fromspace
    • Small blocks a good idea
    • A good way to fight wasted space due to fragmentation in each subset of tospace is to use big-bag-of-pages.
    • A naive implementation uses breadth-first copying.
      • Separates parents from children, giving poor locality
  • Channels
    • Removes need for atomic synchronization primitives
    • Designed for NUMA
    • Over-partition the heap and consider each subset a work list
    • If a thread sees a reference into its own partition, it adds it to its work list
    • Otherwise it sends the reference to the owning thread through a channel
  • Card Tables
    • Difficult to balance load with card tables
    • A straightforward partition has problems because cards tend to have uneven amounts of live data and work
    • The solution is to over-partition

14.7 Parallel Sweeping

Embarrassingly parallel

Either partition heap statically or over-partition

  • Adding garbage to the free-list becomes a bottleneck
  • Segregated fits is a good solution, as is thread-local allocation

Giving each thread multiple blocks is a good way to reduce contention

14.8 Parallel Compaction

Easy to parallelize, since after marking we know the future location of each object (we know the amount of live data seen before it during marking).

Multiple passes are required:

  • 1. Calculate forwarding addresses and insert into headers
  • 2. Update all references in objects
  • 3. Copy objects

Can do each pass in parallel, but all threads need to reach a rendezvous before the next phase starts.
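
A sketch of pass 1 under assumed per-object live sizes: each object's forwarding address is just the total live data that precedes it. Computed sequentially here for clarity; a parallel version gives each thread a heap region and prefix-sums the per-region totals:

    class Forwarding {
        // liveSizes[i] = size in bytes of the i-th live object, in address order.
        static long[] computeForwardingAddresses(long[] liveSizes, long heapBase) {
            long[] forward = new long[liveSizes.length];
            long next = heapBase;                 // next free address after compaction
            for (int i = 0; i < liveSizes.length; i++) {
                forward[i] = next;                // future location of object i
                next += liveSizes[i];
            }
            return forward;
        }
    }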

Chapter 15: Concurrent Garbage Collection

15.1 Correctness Of Concurrent Collection

Must be able to retain all reachable objects, and must eventually complete collection cycle

Tricolor Abstraction

  • Easy to reason about correctness with the tricolor abstraction
  • Collection can be seen as turning the population of grey (unscanned) objects into white (dead) and black (live) ones.
  • Concurrent mutation can make the mutator's and collector's views of the world inconsistent
  • If the mutator changes white or grey objects, that is fine, because the collector still needs to scan them
  • If the mutator changes black (already scanned) objects, that is alright as long as any new references are already known to be reachable.

Can still cause problems by freeing live objects

Lost Object problem

  • Two failure scenarios
    • A mutator can hide a white object directly reachable from a grey object by inserting a reference to it into the fields of an already-scanned object and deleting the link from the grey object
    • Can hide a transitively reachable object by inserting a pointer from the chain into already-scanned objects, and then breaking the pointer chain among the unscanned objects.
  • Only possible to lose objects if two conditions hold:
    • The mutator stores a pointer to a white object in a black object
    • All paths from any grey objects to the white object are destroyed
  • Strong and weak tricolor invariants:
    • Weak tricolor invariant: All white objects pointed to by a black object are grey-protected.
      • To prevent both conditions from holding, we must make sure that all references from black objects to white objects are tracked
      • Non-copying collectors can ensure this through their traversal pattern, so that all references from black objects lead to black objects
    • Strong tricolor invariant: There are no pointers from black objects to white objects.
    • Strong implies weak, but not the converse
    • Ensuring the strong invariant is sufficient for a copying collector
  • Precision
    • Varying the precision of live-object tracking means holding onto some garbage for a duration
    • Stop-the-world has maximal precision; concurrent schemes typically trade precision for concurrency
  • Mutator color
    • Useful to consider mutator thread as if it were an object, and roots like references.
    • If a mutator is allowed to become grey, its roots need to be rescanned
    • May be multiple mutator threads, need to handle
  • Allocation color
    • Need to consider which color to give new allocations
    • Must not let the mutator's references to the allocated cell break the tricolor invariant.
  • Incremental update solutions
    • Need a write barrier to inform collector of mutator actions which would violate the invariants
  • Snapshot-at-the-beginning solutions
    • If a reference to an object is removed during collection, the object must conservatively be treated as a root
    • Requires write barrier

15.2 Barrier Techniques for Concurrent Collection

Collector can respond to concurrent mutation in only three useful ways

  1. Add to the wavefront (the black/white divide) by shading an object grey if it was white (essentially adding a root)
  2. Advance the wavefront by scanning an object to make it black
  3. Retreat the wavefront by marking a black object grey ('unscanning' it)
  • Any other actions(reverting object to white or shading black without scanning) would break invariants.

Grey Mutator Barrier Techniques

  • Assumes a grey mutator
  • Use an insertion barrier to prevent storing white pointers in black objects
  • Because the mutator is grey, no read barrier is needed
  • Techniques:
    • On writing a pointer into a black object, make the object grey to require re-traversal. (Steele)
    • Can mark any object with a pointer written into it grey, since shading is idempotent if it is grey and harmless if it is white. Removes the need to check the state of the object being modified. (Boehm)
    • Always shade the target of an inserted pointer, regardless of whether the reference is later removed. Ensures progress of concurrent threads. (Dijkstra; see the barrier sketch below)
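
A sketch of the Dijkstra- and Steele-style barriers above (object layout and shade() are illustrative; a real shade() would also push the object onto the collector's work list):

    class Obj {
        static final int WHITE = 0, GREY = 1, BLACK = 2;
        volatile int colour = WHITE;
        final Obj[] fields = new Obj[8];
    }

    class InsertionBarriers {
        void shade(Obj o) {
            if (o.colour == Obj.WHITE) o.colour = Obj.GREY;  // and enqueue for scanning
        }

        // Dijkstra-style: shade the target of the inserted pointer.
        void writeDijkstra(Obj src, int i, Obj target) {
            src.fields[i] = target;
            if (target != null) shade(target);
        }

        // Steele-style: revert an already-scanned source to grey for re-traversal.
        void writeSteele(Obj src, int i, Obj target) {
            src.fields[i] = target;
            if (src.colour == Obj.BLACK) src.colour = Obj.GREY;
        }
    }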

Black Mutator Techniques

  • Can use a read barrier to mark white objects grey as soon as their reference is read, written, or deleted.
  • Can use virtual memory hardware to implement an access barrier keeping the black mutator away from unscanned memory.
  • Can do the converse and shade the targets of deleted references, preventing removal of the last observed pointer from freeing live objects.

Completeness of mutator techniques

  • No other organization schemes can preserve the tricolor invariant under concurrent collection(Pirinen 1998)
  • Can make variations by coarsening granularity and promptness of these techniques.
    • Ex: Mark targets of altered references grey and re-scan
    • May cause difficulties in ensuring termination
  • Strong invariant techniques need only ensure that one of the two conditions cannot hold; they don't need to do any more
  • Weak invariant / snapshot techniques need a black mutator

Concurrent write barrier mechanisms

  • Prologue
    • Mutator barriers need to store references to grey objects in a structure, which collector needs to use to color relevant objects
    • Structures need to be safe for both mutator-collector concurrency and mutator-mutator concurrency
    • Can use a log to record grey objects, also card tables.
    • Card tables
      • Dirties a bit to record that some object in a range of addresses is grey
      • What counts as grey (targets of interesting pointers, as well as transitive references through the stack's roots) varies with the collection technique
      • The card table becomes the collector's work list
        • Mutators can re-dirty cards, so the collector needs to scan repeatedly
      • One level card tables
        • Card is in one of three states: dirty, refining, or clean
        • Mutator barriers set to dirty using store instruction, no expensive atomic primitives
        • Collector sets to refining without atomic instruction, starts cleaning
        • The collector tries to set the card to clean with an atomic primitive (test-and-set, where the test checks that it is still refining). If it cannot write, a mutator must have set the card to dirty while the collector was refining, and it must be re-scanned (see the sketch after this list).
      • Two level card tables
        • The overhead of a card table is in scanning cost, since most cards will be clean.
        • Cheaper to also keep a card table that records the locations of dirtied cards at a much coarser (2^n) granularity.
        • Concurrency issue: the mutator dirties fine-grained and then coarse-grained cards, while the collector reads coarse-grained and then fine-grained ones. A race may make both appear clean. Preserving atomicity may need primitives with overhead in the mutator.
    • Reducing Work
      • Don't need to recursively mark children in the mutator if we know that during collection we will trace through the children on the dirty card.
        • Avoids traversing the graph twice
      • Reduce the number of dirty cards by only marking a card dirty if it is untraced.
        • Can keep a card table of traced objects
        • Can use local allocation buffer to record allocations in a buffer to trace. When the buffer overflows, trace all and mark all as clean.
          • Leaves some garbage for next cycle, but faster and ensures progress of threads.
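
A sketch of the one-level card protocol described above: the mutator dirties with a plain store; the collector moves DIRTY to REFINING without atomics, scans, then attempts REFINING to CLEAN atomically, where failure means the card was re-dirtied mid-scan:

    import java.util.concurrent.atomic.AtomicIntegerArray;

    class OneLevelCardTable {
        static final int CLEAN = 0, REFINING = 1, DIRTY = 2;
        final AtomicIntegerArray cards = new AtomicIntegerArray(1 << 20);

        void mutatorWriteBarrier(int card) {
            cards.lazySet(card, DIRTY);           // plain store; no atomic primitive needed
        }

        boolean collectorCleanCard(int card) {
            cards.set(card, REFINING);            // ordinary store by the collector
            scan(card);                           // trace the objects on this card
            // Atomic step: succeeds only if no mutator re-dirtied during the scan.
            return cards.compareAndSet(card, REFINING, CLEAN);
        }

        private void scan(int card) { /* trace grey objects on the card */ }
    }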

Issues to consider

In some systems, the throughput that mutator barriers cost is outweighed by what shorter collection pauses gain

  • Ex: A multi-phase payment system where a GC pause means a timeout, which requires reprocessing.

Costs vary by collection scheme.

  • Ex: Reference counting = high, while mark-sweep is low

A large impact on pause time is whether precision must be high, or whether we can allow garbage to "float" from one collection cycle into the next, where it will be freed.

Concurrent collection is very useful for large heaps where the number of live objects is very high. Even with a parallel collector, pause times would be long.

Downsides:

  • Needs more headroom, because allocation continues during collection.

Variants:

  • Generational partially-concurrent hybrid collector
    • Manage the older generation concurrently
    • Since most objects die young, nursery collection is quick enough
    • Avoids a mutator barrier on young objects, because nursery collection is stop-the-world
    • The biggest performance loss in concurrent collectors is that they always mark objects allocated during collection black, yet most objects die young
      • The hybrid is the solution

Chapter 16: Concurrent Mark-sweep

Prologue

Mark-sweep very easy to parallelize

Also fairly performant, because it doesn't require a read barrier.

  • A write barrier costs 2-3% overhead in benchmarks; a read barrier costs 20%.

16.1 Initialization

Rather than running at heap exhaustion, collectors can run concurrently with allocation

When to trigger new mark phase critical

  • Must not run out of memory before collection finished

A simple implementation is to require all allocation to do a bit of collection work (hijacking the mutator thread)

Usually needs to stop the world to get mutator roots

  • Can use a grey reference access barrier to avoid stopping all mutators, incrementally adding work as mutators yield to collection work, but at least one mutator must be scanned to seed the work list.

16.2 Termination

If all mutator threads are black and no grey pointers are held, collection can terminate: anything not scanned is garbage.

Grey mutator termination requires re-scanning mutator roots to ensure that no new references to unmarked objects have been introduced.

16.3 Allocation

New objects colored according to mutator

Since a black mutator could not have learned a white or grey object's reference, we allocate black.

A grey mutator has options:

  • Can simply color new objects black during marking and add them to the roots, and allocate white otherwise
  • If an object is initialized with all its fields (immutable languages), we can color it based on its references: if none are white, we can allocate black. Otherwise we must allocate white.
  • Allocating white may let a high allocation rate in the mutator prevent prompt termination

16.4 Concurrent Marking and Sweeping

Need a way to differentiate between newly allocated objects and the garbage identified by marking.

Solution = a new color for the abstraction: purple

  • After marking, set unmarked objects to be pending freedom
  • Allocate new objects without this mark
  • On sweeping, remove only the objects marked as pending
  • The problem is that this requires touching all objects in the heap before the phase in which mutators are paused can end
  • The solution is to interpret the same colors differently during different collection cycles
    • An easy solution is an 'even-odd' allocation scheme: live data switches sign each cycle, allocation creates objects with the new sign, and the sweep frees the old sign.

16.5 On-The-Fly Marking

Prologue

  • An alternative to stopping the world to scan roots is to scan each mutator independently
  • Uses a handshake to ask each mutator to stop at its next safe point
  • Scans the mutator's roots, then sends it on its way
  • Can use a stack barrier to limit the mutator's actions to lower frames, so that the collector can scan the higher roots
    • Makes the handshake and pause much shorter

Write barriers for on-the-fly collection

  • The existence of intermixed white and black mutator threads allows a deletion by a white mutator plus an insertion by a black mutator to leave the black mutator holding the only reference to a white object, violating the strong tricolor invariant

Doligez-Leroy-Gonthier

  • Leveraged ML's immutability to allocate immutable data locally and mutable data in a shared heap
  • Use on-the-fly marking with series of handshakes for phases of collection.
  • Java Implementation
    • No thread-local heaps, single shared heap
    • Used queues between mutators and collectors into which grey objects are inserted, to avoid rescanning threads' roots

Sliding views

  • Uses sliding views: a thread-local buffer records all fields of an object as they were before the mutator first modified it after marking started.
  • All mutators handshake on the phase change, and then disable their buffers.

16.7 Issues to Consider

Usually much more difficult than non-concurrent collection

While generational collection improves expected pause times, can't help with worst-case pause times. Concurrent collectors offer shorter and more predictable pause times.

To truly bound pause times, a collector must be both concurrent and on-the-fly.

Concurrent mark-sweep has same problems with fragmentation and slower free-list allocation that serial mark-sweep has

Mark sweep has easier heap coherency model

Has many possible variations of implementation, each choice with performance impacts

Chapter 17: Concurrent Copying & Compaction

Prologue

Must protect collector and mutator threads from each other

Black mutator tospace invariant: a black mutator must hold only tospace pointers

  • Otherwise it could hold fromspace references we'd never scan

In the absence of a read barrier, a grey mutator cannot find pointers to tospace objects inside fromspace objects (since a copying collector doesn't alter fromspace fields)

  • Called the grey mutator fromspace invariant
  • At the termination of the collection cycle we must have only tospace pointers, so roots must be forwarded somehow.

Mutator updates to fromspace objects must be propagated to tospace.

17.1 Mostly-Concurrent copying: Baker's algorithm

Maintaining a tospace invariant requires only replacing roots with the addresses of their tospace copies

  • Roots are now black, but point to grey objects that still hold fromspace references.

Introduce a black-mutator read barrier to prevent the mutator from acquiring a fromspace pointer

  • On access, atomically copy the target of the pointer and return the forwarded reference.

Mostly-concurrent, mostly-copying

  • Mostly-copying collectors fit well with mostly-concurrent collection; we can pin ambiguously-referenced objects in mutator threads.
  • Used historically with page protection to allow concurrent copying for compilers without stack maps
  • The read barrier only needs to acquire roots atomically when forwarding, so it can avoid the costs of synchronization until the last minute.

17.2 Brooks's Indirection Barrier

Hold a forwarding reference in every object header, pointing to either the object itself or its tospace copy (see the sketch below).

Uses a write barrier to prevent modification of an object while it is being copied.

Since the mutator is now grey, the collector needs to scan its stack for fromspace references and replace them before swapping semispaces.
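
A sketch of the indirection itself (layout illustrative): the header slot initially points at the object, so every access pays one unconditional extra load rather than a conditional test:

    class BrooksObject {
        volatile BrooksObject forward = this;   // self until copied, then the tospace copy
        int field;                              // example payload
    }

    class BrooksBarrier {
        static int read(BrooksObject o) {
            return o.forward.field;             // unconditional indirection on every access
        }

        static void write(BrooksObject o, int v) {
            o.forward.field = v;                // writes likewise go through the header
        }
    }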

17.3 Self-Erasing Read Barriers

Read barriers are more costly than write barriers because reads are more common

Read barriers are conditional: forward the object if it is not already in tospace

Have each closure contain a code pointer, and compile two copies of the code. One does a Baker-style copy while the other directly chases references.

When collecting, set a bit in closures so they execute the Baker code. When not collecting, clear it.

The beauty is that code executed for an object that has already been copied never enters the read barrier.

Downside: duplication of code. 25% larger binaries in GHC.

17.4 Replication Copying

The mutator references the old objects and never pays the instruction/cache/space overhead of Brooks's indirection pointers

When an object is copied, a write barrier logs mutator modifications of the old object and applies them to the new one

Terminates when all of tospace has been scanned, the mutators' roots have been scanned, and the write log is empty

Requires pausing to ensure the termination conditions hold for all threads when termination is signalled.

  • Also needs to stop to rewrite pointers in roots.

17.5 Multi-Version Copying

Both previous approaches must still pause the world

Each processor has its own fromspace and tospace subset.

One tospace, multiple fromspaces. Old, invalid copies accumulate in the fromspaces.

  • Barrier on access, so mutators must search for the new reference and update their root.

The secret is to decouple fromspace reclamation from the phase change, or 'flip'.

Each processor scans its tospace for fromspace references and fixes them.

A processor flips when its tospace is full. A fromspace can't be freed until all references have left it.

  • Upon collecting, a processor checks bits set by other processors to determine whether they have flipped
  • If so, it updates all of the fromspace references it has in its to/from spaces to point to the new tospace
  • When all processors have done so, the fromspace is discarded.
  • This has the effect of ensuring termination: if the mutators outpaced the collector, the mutators would all scan and eventually flip until enough fromspace was released.

17.6 Sapphire

Prologue

  • If from/tospace are divided equally between threads/cores, we may end up with multiple threads unable to allocate until one thread frees space.
    • This is the largest problem on non-parallel systems.
  • The solution is to place collection work asymmetrically on one or more collector threads.
    • Can modify priorities to set the mutator-collector balance
  • New objects are allocated black in a separate newspace that isn't scanned
    • Has a write barrier to track references written into it.

Collector Phases

  • MarkCopy
    • Overview
      • All reads are from fromspace; all writes are duplicated to both copies
        • Buffered changes, eventually consistent
      • At application-level thread synchronization points, all copies are made consistent
  • Flip
    • Sets all roots to point to tospace copies
    • Until this point, mutators can refer to tospace and fromspace copies interchangeably
      • Need a barrier for all pointer-comparison operations (the two copies have different addresses).
  • Volatile Fields (Java, etc.)
    • Require synchronization at every read/write between fromspace and tospace

17.7 Concurrent Compaction

Compressor

  • Scan roots, mark objects, and compute the post-compaction address of every live object from knowledge of the live data size and the object's offset
  • Set mutator roots to point into tospace and unpause
  • Use an access barrier on tospace to copy an object if it is not already there.
  • Concurrently copy fromspace to tospace and update references, but don't copy children; we don't need to in order to know their new references, since we know all future addresses.
  • Extensive use of double mapping and virtual memory hardware, which is expensive, so batching is required

Pauseless

  • More intelligently protects only the objects actually being copied, rather than all tospace pages.
  • Protects them incrementally while copying and updating references (which are known without having to copy children)

17.8 Issues to Consider

Pauses/poor real-time behavior of Baker

  • With a Baker-style access barrier on tospace, any access may take an unknowable amount of time, because it may have to recursively copy the object and the references through it.

Chapter 18: Concurrent Reference Counting

Prologue

Previous reference-counting optimizations (including the elimination of cycles) involved stopping the world for tracing or for batching changes of references.

18.1 Simple Reference Counting Revisited

Difficult to do concurrently, because increments and decrements of counts might interleave badly in the mutator even if the reads and writes themselves are atomic

One solution is to lock each object before altering its fields.

Lockless solution would be nice

  • Primitives on a single memory location are not enough, because another mutator can drop the reference count to zero and free the object between the pointer load and the reference increment. The attempt to increment might then corrupt another object allocated at that location.
  • If we can alter two words atomically, that's sufficient, because we can access the address and the count together.

18.2 Buffered Reference Counting

All reference count changes placed in a buffer that a single thread scans and executes.

  • All increments executed before decrements to prevent premature freeing

Doesn't by itself make the increment/decrement and the pointer load/store atomic.

Can do so by giving each thread a local buffer and an epoch count, using ragged epochs: when a local buffer fills, the thread takes a new one and increments its epoch. When all threads' epochs reach a level, the local buffers of that level and below are processed together (all increments first).

Variant of deferred reference counting.
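
A hedged sketch of the buffering idea (names and the count store are illustrative): mutators only append to thread-local logs, and a single reference-counting thread applies all increments before any decrements, so no count can reach zero prematurely:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RcBuffer {
        record Op(Object target, boolean isIncrement) {}

        final List<Op> log = new ArrayList<>();   // one buffer per mutator thread

        void inc(Object o) { log.add(new Op(o, true)); }
        void dec(Object o) { log.add(new Op(o, false)); }

        // Run by the single reference-counting thread once an epoch's buffers are collected.
        static void process(List<RcBuffer> buffers, Map<Object, Integer> counts) {
            for (RcBuffer b : buffers)            // all increments first...
                for (Op op : b.log)
                    if (op.isIncrement()) counts.merge(op.target(), 1, Integer::sum);
            for (RcBuffer b : buffers)            // ...then decrements; frees occur at zero
                for (Op op : b.log)
                    if (!op.isIncrement()) counts.merge(op.target(), -1, Integer::sum);
        }
    }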

18.3 Concurrent, Cyclic Reference Counting

Trial deletion has problems

  1. Can't guarantee that the subgraph it retraces hasn't changed since last trial deletion.
  2. Pointer deletions may separate the graph
  3. Reference counts may go out of date

The only current good solution is to run the reference counting concurrently, set a mark, and then free the object in the next cycle if it is still garbage.

In theory, may leave garbage around indefinitely

18.4 Taking a Snapshot of the Heap

Refinement of coalesced reference counting

The problem with the previous description is that although we can restart threads once we get their local buffers, if we are going to decrement children's counts before freeing an object, we need to know the state of the object at the time the change happened, not what the mutator has left behind.

Solution:

  • If the object is clean, the old reference is still valid
  • If not, the mutator's log buffer contains the change, so we can find the children by reading the object's state before the reference change from the buffer.

18.5 Sliding views

Don't need to stop all mutator threads.

Only collect states of a subset of objects each time, fix subset's references each time

Applications:

  • Age-based
    • Use sliding reference counting for long-lived data and mark-sweep for nursery

With only a subset of the heap in view, we can use trial deletion on a graph made from local allocation buffers, preventing the graph from changing under its feet.

18.6 Issues to Consider

A large literature exists on ways to safely reclaim memory for insertion/deletion in certain data structures; one might want to specialize these for the language.

Chapter 19: Real-Time Garbage Collection

Prologue

All earlier algorithms are not truly real-time, because they didn't guarantee progress of the mutator

Thread scheduling may starve mutator threads when there are many collector threads

19.1 Real-Time Systems

Impose operational deadline on tasks

May be soft: an example is video. A pause degrades performance

  • Should use goals for maximum collector time, time slice, and failure rate

May be hard: a pause causes failure. Example: an ignition system

  • Performance less important than predictability of performance
  • WCET: Worst case execution time of a task
  • Embedded realtime systems have tight space constraints

Needs a specific collection scheme, different from the previous concurrent GC schemes.

19.2 Scheduling Real-Time Collection

Incremental collectors will impose cost on all mutation

Concurrent collectors as before

To maintain steady-state space usage, the collector needs to free and recycle objects at the allocation rate.

  • Must also avoid fragmentation to provide constant space
  • Moving objects has costs

Alternative methods

  • Work-based: Imposes cost on mutator actions
  • Slack-based: Runs work during slack portions of the collector time-slice. Slack can be significant in periodic real-time systems.
  • Time-based: Schedules a constant time slice of collection while the world is stopped

19.3 Work-Based Real-Time Collection

Classic baker algorithm

  • Fixed-size objects assumed
  • Read barrier
    • Reading copies into tospace. Bounded by object size.
    • Also marks a bounded number of objects
  • Has specific bounds:
    • Space = 2R(1 + 1/K)
    • R = reachable space
    • K = Adjustable time bound, defined as number of fields scanned each allocation

Parallel, concurrent replication

  • Adds concurrent and parallel collector
  • Because work-based, can suffer from variation in distribution of pauses.
    • Difficult to obtain bounded mutation times
      • Solutions exist, 19.5
  • Algorithm
    • On read, copy to tospace. Also copy concurrently.
    • All copies to tospace leave a forwarding address in header of previous object. All reads by mutator in future go through this address
    • Fromspace and tospace divided between threads
    • When collecting, all allocation allocates into both fromspace and tospace, with forward.

Uneven work and its impact on work-based scheduling

  • If we use worst-case execution time, we have the problem that the worst-case is much worse than the expected case
    • An array of things with references is a pathological worst case
  • We can bound worst case by breaking down objects and arrays into fixed-size units
    • Places overhead on access, space
  • Considered a nail in the coffin of work-based scheduling
    • The opinion is that it overly starves the mutator
  • The problem is that even when work is non-copying and fairly constant, allocation storms cause massive variance in GC time
    • The response has been to treat collection work as something that must be scheduled like any other real-time job.
    • This leads to better worst-case to expected-case ratios

19.4 Slack-Based Real-Time Collection

Implementation: Henriksson

  • Forbids collection during high-priority tasks
  • GC work delayed until no high-priority tasks executing
    • Also called the slack-period
  • Low-priority tasks do collector work during allocation
  • During high-priority tasks, record omitted GC work and perform it during slack
  • Threads for collecting high-priority tasks have lower priority than high-priority work but higher priority than low-priority work
  • Incremental copying is done by low-priority work in standard concurrent copying collector way
    • All from-space accesses copy into tospace. All allocation done in tospace.
    • Enforces constraint that no tospace object refers to a fromspace object.
  • High-priority tasks pay less mutation cost
    • Objects are evacuated lazily: allocated in fromspace, but not actually forwarded through the fromspace header
    • High-priority collection threads do the copying during slack time
    • Uses an unconditional read indirection barrier that points through the forwarding address in the fromspace header, avoiding mutator overhead
  • Collector works as coroutine through well-defined yield points.

Scheduling the collector work

  • Let Wmax be the maximum amount of collection work that must be done, starting at the beginning of slack time, so that the next high-priority interval does not run out of memory
  • Let Fmin be the amount of memory that the high-priority interval will need
  • Minimum GC ratio = GCRmin = Wmax / Fmin
  • Current GC ratio = GCR = performed GC work / amount of new objects in fromspace = W / A
    • Allocation increases A, collection increases W
    • Allocation is throttled by yielding to GC until GCR >= GCRmin
    • The ability to arbitrarily enter a high-priority period means that space overhead may be required to ensure that all tospace allocation is satisfiable.
      • The programmer must know the worst-case allocation needed by the high-priority task.
      • Not as bad as it sounds, because high-priority, hard real-time tasks typically use little memory

Execution overheads

  • Since caches alter memory access times, instruction count is not an accurate measure of time
    • Needs to be tested repeatedly to establish predictable cache behavior, or worst-case times must be assumed
    • Collection work inside mutation has synchronization overhead.
    • Forwarding pointers impose overhead on reads.
    • Allocation is simple bump-pointer allocation.

Programmer input

  • Programmer must inform GC about worst-case allocation requirements and period in which it runs.

19.5 Time-Based Real-Time Collection: Metronome

Mutator utilisation

Supporting predictability

Analysis

Prologue

  • Treats minimum mutator utilisation as an input
  • For Java
  • Concurrent incremental mark-sweep, with occasional compaction to avoid fragmentation
  • Uses a deletion write barrier to enforce the weak tricolor invariant
  • Uses Brooks-style forwarding pointers

Mutator utilisation

  • Ensures a minimum mutator utilisation. All other time is at the collector's discretion; it can give it to the mutator or use it.
  • Is a trade-off between space and mutator utilisation.
  • Split time into quanta, and ensure that for a given window the MMU ratio is maintained
  • The scheduler will usually schedule collection until the MMU is about to be violated, and then back off
  • When considered over 'windows' of time between deadlines, this leads to mini-pauses or jitter, where it collects for a while and then backs off.

Supporting Predictability

  • Fragmentation is a threat to predictability. Many considerations:
    • Arraylets: Break arrays and variable-sized objects into fixed sizes to fight fragmentation.
    • Read barrier: To avoid the performance cost of a Brooks-style read barrier, eagerly change references in roots and on the stack to point to tospace when an object is copied.
    • Scheduling the collector: Use an 'alarm thread' to check whether it should switch to a collection quantum; if so, tell mutator threads to reach a safe point and yield. The collector thread then collects.
    • Suspending mutator threads: At a pause, on reaching a safe point, mutator threads store all references from their stack/roots in a specific location. On reload they read them back in, allowing faster changes.
    • Ragged root scanning: Deeply nested stacks are a problem; the collector may not be able to scan them during its quanta. The solution is concurrent scans plus a write barrier into already-scanned stacks to prevent overlooking references.

Analysis

  • Metronome gave formalized model for concurrent, realtime GC
  • Parameters
    • A(t) is the instantaneous allocation rate at time t
    • G(t) is the instantaneous garbage generation rate at time t
    • P is the GC processing rate as a function of live data
    • Memory allocated and garbage created are simple integrals of these rates
    • The mutator quantum and the collector quantum are the times each can run without yielding
    • Only works if the parameters have accurate estimates.
  • Minimum mutator utilisation = (QT * floor(t / (QT + CT)) + x) / t, where QT is the mutator quantum, CT the collector quantum, t the time elapsed, and x the size of the remaining partial mutator quantum
    • Ex: with QT = 10ms, CT = 2ms and t = 60ms, floor(60/12) = 5 full periods and x = 0, so the MMU is (10*5 + 0)/60, roughly 83%.
  • Can be used to analyze work-based and other scheduling approaches

Robustness

  • If the parameters are poorly known, the only graceful way to degrade is to slow the allocation rate
  • One way to slow the allocation rate is to use a generational scheme
    • The nursery filters objects into the primary heap
    • Removes short-lived data, reducing total floating memory
    • Unacceptable in the naive approach, due to the variation in pause time when the nursery overfills during a mutator slice
    • Syncopation is a solution
      • Always copy out of the nursery during the collector's slice
      • Use the nursery survival rate as a parameter to size the nursery so that only one evacuation is needed

19.6 Combining Scheduling Approaches: Tax-and-Spend

Prologue

  • Metronome is best on uniprocessors because of its need to stop all mutation
  • Work-based collectors have an order of magnitude worse latency than time-based ones
  • Slack-based is fragile and will break when there is no slack, i.e. under load
  • Tax-and-spend combines all of them.
  • Performant: cut Metronome's latencies down to a third of what they were
  • Mutators must do an amount of collection work dependent on the desired MMU.
  • Slack time is used by the collector to build up MMU 'credit'.
  • Taxation can be scheduled at any safe point, but is always checked when a thread-local allocation buffer is exhausted.

Tax-and-spend scheduling

  • MMU is superior to maximum pause time because it allows for computation of the clustering of pauses that would cause a missed deadline.
  • MMU can vary by thread to prevent missing deadlines with tax-and-spend
  • Can also use other cores to offload collection work
  • Per-thread scheduling
    • Must track mutator activity to determine tax to be paid
    • Must avoid letting tax get too high, such that a deadline would be missed.
    • By interleaving collection and mutation in same thread, you prevent the OS scheduler from placing some other mutator or an unrelated process on the core after the mutator yields.
    • Allowing MMU to vary between threads important when allocation rates vary
  • Tax-based vs slack-based scheduling
    • Slack-based degrades with no slack
      • Poor for interactive real-time systems with sustained peaks of use
    • Time-based scheduling may introduce jitter, because the collector may run until the MMU is almost violated and then pause, such as at the start of a new window of time under consideration.
      • This means that for time periods smaller than a window, mutator responsiveness is inconsistent
      • Tax-based is more consistently real-time
  • Combining tax-based and slack-based scheduling
    • Tax-and-spend combines all the other scheduling approaches by using an economic model.
    • Mutator threads pay back tax by collecting incrementally during allocation
    • Dedicated collector threads with low priority exist to work during slack
      • They deposit work done as 'credit' in a global pool for mutators to draw from
      • Usually one per processor
      • Must sleep intermittently to allow mutator threads that appear to be scheduled over them
    • Tax over all threads must be high enough to finish collection before memory runs out
    • When tax is due, a thread tries to withdraw from the global credit bank
      • If it can, the collector is keeping up with the mutator, so it doesn't need to pay the tax.
      • Treats slack on a uniprocessor and unexploited parallelism on a multiprocessor the same.

Tax-and-spend prerequisites

  • Needs an underlying collector that's both incremental and concurrent. Should be parallel on multiprocessors.
  • Metronome changes for synchronization without halting mutator threads:
    • Ragged epochs for global consensus: Needs global states (such as a change of fromspace/tospace pointers in the read barrier) to hold for all threads before a phase of collection can change. Uses an epoch number: threads that change state when yielding to collection increment their epoch. Once the lowest epoch among all threads reaches the value at which the phase change started, all mutators can be assumed to have reached that phase.
    • Phase agreement with 'last one out': There is a global phase. When a mutator thread starts to pay tax, it increments a worker count. If no work is being done and the worker count is 0, it atomically changes the phase and decrements the worker count.
    • Per-thread callbacks: All threads are requested to make a change (flush a thread-local buffer, copy roots out of the mutator thread's stack) by a global thread whenever collection reaches a step that demands it. They're given a callback to perform that step. Threads blocking on IO or executing native code are prevented from returning while the action is performed on their behalf.
    • Priority boosting to ensure progress: If higher-priority worker threads saturate the processor, the above steps can be starved. Threads requesting actions from other threads need their priority boosted.

19.7 Controlling Fragmentation

Prologue

  • Fragmentation can cause violation of real-time space bound.
  • Can compact to avoid fragmentation
  • Can also use arraylets and object-lets to get fixed-size allocation blocks, at the cost of some internal fragmentation and higher memory usage
  • Concurrent replicating copying collectors have the problem of keeping both copies of an object consistent. Locking doesn't scale and can cause mutator thread starvation.
  • The Compressor and Pauseless algorithms both rely on traps and page-level synchronization, and so suffer from poor MMU.
    • Work-based, and cause a trap storm at phase changes (interfering with all threads)
  • Lock-freedom is very desirable in a real-time system. Ways to obtain it follow:

Incremental Compaction in Metronome

  • Metronome is non-copying; it compacts while mutator threads are stopped
  • The tax-and-spend version doesn't perform any compaction
  • Bacon et al (2003) use segregated-fits allocation and incrementally defragment size classes in the slow/infrequent path of the allocator (sketched after this list):
    1. Sort pages by the number of unused object slots per page, from dense to sparse
    2. Set the allocation page to the densest non-full page in the resulting list.
    3. Set the page to evacuate from to the sparsest page in the same size class
    4. While the target number of pages to evacuate in the size class has not been met, and the page to evacuate does not equal the allocation page, evacuate objects from the sparsest page into the allocation page, moving to the next allocation page when the current one fills.
  • Moves the minimal number of objects and frees the maximal number of pages
  • Always choosing the densest page can lead to hopping between pages and poor locality; requiring a minimum number of free cells in the allocation page avoids this
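
A rough Java sketch of the page-selection loop above, with an invented Page type and occupancy counters standing in for real object evacuation:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class Defragmenter {
        static class Page {
            int live, capacity;                     // live objects and total slots
            boolean isFull()  { return live == capacity; }
            boolean isEmpty() { return live == 0; }
        }

        // Evacuate the sparsest pages of one size class into the densest
        // non-full pages until the quota of freed pages is met.
        static void defragment(List<Page> sizeClass, int pagesToEvacuate) {
            List<Page> ranked = new ArrayList<>(sizeClass);
            ranked.sort(Comparator.comparingInt(p -> p.capacity - p.live)); // dense first

            int dense = 0, sparse = ranked.size() - 1, freed = 0;
            while (dense < ranked.size() && ranked.get(dense).isFull())
                dense++;                            // densest non-full page

            while (freed < pagesToEvacuate && dense < sparse) {
                Page to = ranked.get(dense), from = ranked.get(sparse);
                while (!to.isFull() && !from.isEmpty()) {
                    from.live--;                    // stand-in for moving one object
                    to.live++;
                }
                if (to.isFull()) dense++;
                if (from.isEmpty()) { sparse--; freed++; }
            }
        }
    }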

Incremental Compaction on Uniprocessors

  • To obtain atomicity on a uniprocessor, one can simply disable scheduler interrupts or allow thread-switching only at GC safe points.
  • If Brooks forwarding pointers are used, access can be optimized (a sketch follows this list).
    • In replication, the forwarding pointer of each copy points to the other copy. If there is only one copy, the pointer points to the object itself.
    • The mutator reads through whichever address it already has; no synchronization is needed.
    • On write, the mutator writes to both the original and the copy. (Kalibera 2009, for Java)
    • The double write is cheap, and it makes reads much faster.
    • There are many more reads than writes.
    • It doesn't hurt the cache, since a read doesn't need both copies' lines in memory.
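
A minimal Java sketch of these replication barriers, assuming an invented one-field object type whose header carries a Brooks-style forwarding pointer:

    // Obj is an invented stand-in for a heap object with one field.
    class Obj {
        Obj forward = this;     // Brooks-style pointer: the replica, or self if none
        long field;

        // Read barrier: use whichever copy the mutator already holds; no
        // forwarding or synchronization is needed.
        static long read(Obj o) { return o.field; }

        // Write barrier: keep both copies consistent by also writing through
        // the forwarding pointer. When forward == this, the second store is
        // redundant but harmless; on a uniprocessor that only switches threads
        // at GC safe points, the pair of stores is atomic w.r.t. the collector.
        static void write(Obj o, long v) {
            o.field = v;
            o.forward.field = v;
        }
    }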

Incremental Compaction on Multiprocessors

  • Stopless: lock-free garbage collection (barriers sketched after this list)
    • Ensures lock-freedom
    • Doesn't double-write
    • At the start of collection, threads race to allocate a "wide" version of the object and to set the redirection pointer to that object.
      • The wide copy has a state for each field that is either 'in original', 'in wide', or 'in copy'. All are set to 'in original' at first.
      • On write, the mutator writes to the wide copy and sets the field's state to 'in wide'.
        • The copying collector does the same when it copies the value held in that field
      • On read, the mutator reads from the original or the wide copy, based on the state.
      • Once all fields are 'in wide', the copying collector makes a 'narrow' version in tospace and installs its address as the forwarding reference
      • It then copies the fields to the narrow copy, setting each wide field's status to 'in copy'
    • Downsides:
      • Doesn't work on double-word fields if double-word writes are not atomic on the architecture
      • Has a space overhead of three copies of the object for a period
        • Space in pages emptied during incremental compaction can be used to hold the wide copies.
  • Staccato: best-effort compaction with mutator wait-freedom
    • Doesn't require atomic instructions in the common case, but uses memory fences at GC safe points to ensure up-to-date references
    • The collector reserves a bit in the forwarding address stating that the object is currently being copied.
    • Algorithm:
      1. Set the copying bit using compare-and-swap. Mutators access the forwarding pointer without atomic operations, so the change takes time to propagate. If a mutator accesses the object, it clears the copying bit atomically with compare-and-swap; if that fails, copying has finished and the mutator should go through the forwarding reference.
      2. Wait for ragged synchronization, in which each mutator passes a read memory fence at a safe point.
      3. Perform a read fence on the object, to ensure that changes made by mutators between setting the bit and the ragged synchronization are seen by the collector thread.
      4. Allocate the copy, and copy the fields
      5. Issue a write fence so mutators will see the copied fields.
      6. Once ragged synchronization happens, make the copy refer to itself through its forwarding address, and clear the copying bit on the original with compare-and-swap, installing the forwarding pointer. If this fails, a mutator must have cleared the bit in the meantime, so the copy is aborted.
    • The memory-fence overhead can be amortized, since many objects are usually evacuated from/to the same page.
    • If the collector tries to incrementally copy objects with high mutator locality, the mutator threads will cause an "abort storm".
      • Unlikely in practice, because only sparsely populated pages are moved, so recently co-allocated objects are unlikely to be moved together.
  • Chicken: best-effort compaction with mutator wait-freedom for x86
    • Relies on the x86 memory model's strong read ordering; it only needs to abort copies on writes
  • Clover: guaranteed compaction with probabilistic mutator lock-freedom (barriers sketched after this list)
    • Each move selects a random value alpha and establishes the invariant that mutators won't write that value into the heap
      • If one tries to, that mutator is blocked.
      • Best to pick a value that is illegal for the program, such as a pointer outside the memory range or a NaN variant not used by the hosted language.
    • Once a field is copied, it is replaced with alpha via compare-and-swap.
    • If a mutator reads an alpha from a field, it redirects the read/write to the second copy.
  • Stopless vs Chicken vs Clover
    • Stopless can't guarantee collector progress, because of potentially unbounded writes to the wide copy's fields.
    • Chicken guarantees progress, but aborts copies
    • Clover may interleave with repeated mutator changes to a field: it must keep re-copying, because each change defeats the subsequent compare-and-swap.
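
A simplified Java sketch of the Stopless read and write barriers above, for a single-field object. This is a sketch, not the published algorithm: where Stopless packs each field's value and status into one word updated by a single wide CAS, an immutable pair behind an AtomicReference stands in for that here, and the invented tryMoveFieldToCopy collapses the collector's two copying steps into one.

    import java.util.concurrent.atomic.AtomicReference;

    class StoplessObject {
        enum Where { IN_ORIGINAL, IN_WIDE, IN_COPY }

        // One field's value and its location, swapped as a unit.
        static final class Slot {
            final Where where; final long value;
            Slot(Where w, long v) { where = w; value = v; }
        }

        volatile long originalField;              // the field in the fromspace original
        volatile long copyField;                  // the field in the tospace narrow copy
        final AtomicReference<Slot> wide =
            new AtomicReference<>(new Slot(Where.IN_ORIGINAL, 0));

        // Mutator write barrier: publish the new value into the wide copy.
        void write(long v) {
            while (true) {
                Slot s = wide.get();
                if (s.where == Where.IN_COPY) { copyField = v; return; }
                if (wide.compareAndSet(s, new Slot(Where.IN_WIDE, v))) return;
            }
        }

        // Mutator read barrier: the status says which copy owns the field.
        long read() {
            Slot s = wide.get();
            switch (s.where) {
                case IN_ORIGINAL: return originalField;
                case IN_WIDE:     return s.value;
                default:          return copyField;
            }
        }

        // Collector: try to move one field into the narrow copy; a racing
        // mutator write makes the CAS fail, and the collector retries later.
        boolean tryMoveFieldToCopy() {
            Slot s = wide.get();
            long v = (s.where == Where.IN_ORIGINAL) ? originalField : s.value;
            copyField = v;
            return wide.compareAndSet(s, new Slot(Where.IN_COPY, v));
        }
    }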
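
And a similar sketch of Clover's barriers. ALPHA here is a fixed stand-in for the random per-move value, and the blocking path for a mutator that genuinely wants to store alpha is omitted:

    import java.util.concurrent.atomic.AtomicLong;

    class CloverObject {
        // The real scheme draws a fresh random alpha per move, chosen so the
        // program will (almost) never store it.
        static final long ALPHA = 0xDEAD_BEEF_0BAD_F00DL;

        final AtomicLong fromField = new AtomicLong(); // field in the fromspace original
        volatile long toField;                         // field in the tospace copy

        // Mutator read barrier: alpha means the field has moved to the copy.
        long read() {
            long v = fromField.get();
            return v == ALPHA ? toField : v;
        }

        // Mutator write barrier. A mutator wanting to store the value ALPHA
        // itself would have to block instead (not shown), which is why the
        // lock-freedom guarantee is only probabilistic.
        void write(long v) {
            while (true) {
                long old = fromField.get();
                if (old == ALPHA) { toField = v; return; }   // field already moved
                if (fromField.compareAndSet(old, v)) return;
            }
        }

        // Collector: copy the field, then seal the original with alpha. A racing
        // mutator write defeats the CAS, so the collector must copy again.
        void collectorCopyField() {
            while (true) {
                long v = fromField.get();
                toField = v;
                if (fromField.compareAndSet(v, ALPHA)) return;
            }
        }
    }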

Fragmented Allocation

  • All compaction techniques have overhead, so a process must trade throughput against fragmentation
  • A solution may be to allocate only in fixed-size granules, with no constraints on locality.
  • Arrays are split into a trie of array fragments
    • The problem is that access time is now variable, not constant
    • Java places no fixed bound on array length, so deadlines on memory reads through a trie can't be proven.
    • Solution: Schism, by Pizlo et al (indexing sketched after this list)
      • Uses a spine of arraylets, borrowed from Metronome.
      • Access is O(1), because field locations are computed from values in a 'sentinel fragment' in the header at the start of the object or array. Access becomes one layer of indirection through a standard reference.
      • Dynamic arrays remain a problem
      • Memory needed is bounded by 1.3104 × the size of the live set.
      • When there is space to allocate all of an array's arraylets sequentially, this can be marked in the sentinel fragment and conventional access techniques used.
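
A simplified Java sketch of arraylet indexing through a spine. The names are invented and FRAGMENT's value is illustrative; Schism's sentinel-fragment header and spine placement are more involved:

    class ArrayletArray {
        static final int FRAGMENT = 32;        // elements per fixed-size fragment

        final long[][] spine;                  // spine: one pointer per fragment
        final int length;

        ArrayletArray(int length) {
            this.length = length;
            int fragments = (length + FRAGMENT - 1) / FRAGMENT;
            spine = new long[fragments][];
            for (int i = 0; i < fragments; i++)
                spine[i] = new long[FRAGMENT]; // fragments need not be contiguous
        }

        // O(1) access: exactly one indirection through the spine, regardless
        // of array length.
        long get(int i)         { return spine[i / FRAGMENT][i % FRAGMENT]; }
        void set(int i, long v) { spine[i / FRAGMENT][i % FRAGMENT] = v; }
    }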

19.8 Issues to Consider

Real-time systems almost always require the developer to collect statistics about the expected application's behaviour.

The difficulty lies in ensuring that you are truly seeing the worst case possible, which requires forcing the slow paths to execute.
