Tuesday, September 16, 2014

Chapters 9-12

Chapter 9: Generational Garbage Collection

Prologue

Tracing collectors perform best when the heap is mostly garbage, but long-lived objects tend to cause pathologically bad behavior for copying and mark-sweep collectors

  • Ex: Mark-sweep repeatedly processes a dense prefix of live data near the bottom of the heap; copying moves long-lived data unnecessarily

Solution: Segment the heap into generations and collect the youngest generation (nursery) most frequently. Objects are tenured into older generations after surviving a number of collections.

  • Most of the time use copying
  • Has good mark/cons ratio, so should be faster for young generation
  • Can tune pause times / frequency of pauses by altering nursery size

Only improves average-case time. Worst case still needs to sweep whole heap.

Downsides:

  • Garbage sticks around in older generation longer
  • Cross-generational pointers need bookkeeping, will likely impose mutator overhead
  • Hard to tune perfectly

9.1 Example

Minor collection = Collection of only the nursery

Need to decide how/when to promote objects

  • Collector needs a way to measure the age of objects to decide if they've survived enough minor collections to be tenured
  • If a collector only sweeps one generation at a time, must record inter-generational pointers
    • Can create inter-generational pointers via mutator writes in the program or via promoting an object between generations. Must track both.
      • Requires a mutator write barrier and a copy barrier. Both must check for the creation of an inter-generational pointer (a minimal barrier sketch follows this list).
    • Inter-generational pointers can add false roots to collections. Until the older generation is collected, garbage in the old generation can keep young objects alive, and can even lead to their promotion. Called "nepotism".
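A minimal sketch (C) of the mutator write barrier described above, for a simple two-generation heap. The helper names (in_old_gen, in_young_gen, remembered_set_add) are hypothetical stand-ins for real address-range checks and remembered-set insertion; the copy barrier used during promotion would perform the same check.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Object Object;

    /* Hypothetical helpers: address-range membership tests and remembered-set insert. */
    extern bool in_old_gen(void *p);
    extern bool in_young_gen(void *p);
    extern void remembered_set_add(Object **slot);

    /* Every pointer store by the mutator goes through this barrier.  It records
     * the slot when the store creates an old-to-young ("interesting") pointer. */
    static inline void write_barrier(Object **slot, Object *new_value)
    {
        if (in_old_gen(slot) && new_value != NULL && in_young_gen(new_value))
            remembered_set_add(slot);   /* record the source slot, not the target */
        *slot = new_value;              /* the actual pointer store */
    }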

9.2 Measuring Time

Two choices: Bytes allocated or seconds elapsed

Seconds Elapsed

  • Considers the distribution of pause time
  • Good for ensuring requirements such as maximum gc pause time

Bytes Allocated

  • For reasoning about object lifetimes, the number of bytes allocated between an object's birth and death gives more information about the object
  • Largely machine/environment independent (though object sizes can vary across architectures)
  • Tricky in multithreaded systems

Reality

  • Most of the time, the number of collections survived is used as a proxy for bytes allocated

9.3 Generational Hypotheses

Weak Generational Hypothesis: Most objects die very young

  • Appears valid across many languages/domains
  • It is common for most data to survive only one collection cycle, with the remaining minority of objects being effectively immortal.

Strong Generational Hypothesis: Even among objects that are not in the youngest generation, younger objects are more likely to die than older ones

  • Not decidedly correct
  • Evidence to contrary:
    • Tend to see many objects die in clumps because programs operate in clumps
    • Immortal objects can be created at arbitrary locations in program flow

9.4 Generations and the Heap Layout

Can treat generation management in many different ways

  • Ex:
    • Fixed size of nursery vs ratio of heap
    • Homogenous generation or generation segmented by ages
    • Shared large-object subspace or per-generational one

Must decide sizes of generations

  • Benefits of small collection space:
    • Fast collection times because small generation
  • Downsides to small generations:
    • More frequent collections
    • Less time for objects to die, will have to copy objects between semispaces (or even to an older generation) and waste time.
    • Older generation will fill too fast because of a higher promotion rate
      • Larger older generation negates many benefits of generational collection.
      • Larger older generation also causes nepotism because objects promoted too soon likely to both become garbage and refer to young objects
      • Younger objects are modified more frequently. Tenured objects require a mutator overhead on writes for inter-generational references. By promoting too early, we will spend a lot more time in this barrier.
      • Will "dilute working set" by messing up assumptions about behaviors of spaces.

9.5 Multiple Generations

Adding more generations is one solution to the competing needs for short nursery pause times and infrequent major collections.

Problem: If the collector promotes all live objects during a minor collection, then it also promotes objects created just before the collection

  • Objects made right before collection haven't had time to become garbage, which they likely will
  • Solution: Multiple Generations. Can sweep the intermediate generation with frequency between nursery and oldest generation. Will free these freshly made objects much sooner than if they were tenured.
    • Helps prevent nepotism
    • Downsides:
      • Most systems collect all younger generations whenever an older one is collected, because then only old-to-young pointers (which are less frequent) require a write barrier
      • Ensures that pause times fall between those of nursery and major collections: longer than nursery collections, but more frequent than collections of the oldest generation.
      • Increasing number of generations increases number of inter-generational pointers which increases the time spent in the mutator's barrier
      • Increases the size of root set, and may have pointers from garbage into nursery's root set keeping large garbage structure alive.
  • Most collectors only have 2 generations
    • Clever control of promotion rates used instead of multiple generations

9.6 Age Recording

En Masse Promotion

  • Easiest strategy is for each generation but the oldest to consist of a single semispace, and to copy live objects "en masse" to the next generation on collection.
    • Can lead to a 50% to 100% higher promotion rate than requiring that an object survive at least one collection before promotion from the nursery
      • Increasing the survival requirement for copying above 2 has diminishing returns.

Aging Semispaces

  • Create a generation partitioned into semispaces and copy objects back and forth between the semispaces a number of times before promoting the semispace as a whole.
  • Either all objects are copied to the other internal semispace or all are copied to the next generation, depending on the generation's age as a whole.
    • Allows the oldest objects in the generation time to die, but the youngest in the generation may still be prematurely promoted. Can avoid with bookkeeping.
  • Bucket Brigade
    • Idea is to create n buckets per generation and to copy from younger bucket to progressively older buckets. The oldest bucket is copied into the next generation.
    • Important: All buckets in a generation collected at same time, inter-bucket pointers not an issue.
      • Use of steps in youngest generation can solve problem of mutator overhead with intermediate generations

Survivor Spaces and Flexibility

  • En Masse / Aging Semispaces both wasteful because need to reserve twice as much space as data for copying.
  • Solution is to use an eden + 2 survivor spaces structure (see the evacuation sketch after this list)
    • Youngest generation is one large creation space (eden).
    • On collection, live data in eden is evacuated to the younger survivor tospace.
    • All objects in the younger survivor fromspace are evacuated to either the older survivor tospace or to the old generation, depending on age
    • Works well because eden is very large (which makes sense given the weak generational hypothesis). Ex: HotSpot uses a ratio of 32:1 between eden and the semispaces. A plain semispace scheme would need twice this space as copying reserve.
  • Can combine with aging semispaces by creating buckets in the survivor spaces. Can use to keep age bookkeeping easy.
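A sketch (C) of the per-object copy decision during a minor collection of such a layout. The age threshold, field names and copy routines (copy_to_survivor_tospace, copy_to_old_generation) are assumptions for illustration.

    #include <stdint.h>

    typedef struct Object {
        uint8_t age;                /* minor collections survived so far */
        /* ... payload ... */
    } Object;

    enum { TENURE_THRESHOLD = 4 };  /* tunable; 1 would give en masse promotion */

    /* Hypothetical copy routines for the spaces involved. */
    extern Object *copy_to_survivor_tospace(Object *obj);
    extern Object *copy_to_old_generation(Object *obj);

    /* Called for each live object found in eden or the survivor fromspace. */
    static Object *evacuate(Object *obj)
    {
        if (obj->age < TENURE_THRESHOLD) {
            obj->age++;                          /* survived one more collection */
            return copy_to_survivor_tospace(obj);
        }
        return copy_to_old_generation(obj);      /* tenure (promote) */
    }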

9.7 Adapting to Program Behavior

Prologue

  • If program does not behave according to weak generational hypothesis, a generational garbage collector's performance will suffer
  • Might be managed with a way to control tenuring.
  • Desirable to alter behavior of generations (ex: nursery space) to match behavior of mutator to avoid premature tenuring and to avoid holding on to excessive garbage.
    • Consider the behavior of allocating frequently and then freeing in clumps.

Appel-style Garbage Collection

  • Three spaces: reserve, nursery, and old generation. On a minor collection, all surviving data is moved to the old generation and the remaining free space is split between nursery and reserve. If free space falls below a threshold, do a full collection.
  • The reserve size is dynamically set this way (a sizing sketch follows this list)
  • Promotes all inter-generational garbage into older generation, which is then freed at next major collection.
  • Can also work in block-structured heap or by using shrinking nurseries with mark-sweep/other noncopying algorithm of the older generation.
  • Pathological worst case: Quick allocation with a very low promotion rate leads to a shrinking nursery and heap thrashing with minor collections. A major collection never occurs, though, because of the low promotion rate.
    • Solution: Major collection whenever young heap falls below certain size.
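A sketch (C) of the sizing logic implied above, with hypothetical field names: after a minor collection the space left over from the old generation is split evenly between nursery and copy reserve, and a major collection is forced when the nursery would shrink below a minimum.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical description of an Appel-style heap. */
    typedef struct {
        size_t heap_size;      /* total heap size in bytes                   */
        size_t old_gen_used;   /* bytes currently held by the old generation */
        size_t min_nursery;    /* below this, force a major collection       */
    } AppelHeap;

    /* The reserve must be able to hold a full copy of the nursery, so the
     * space not occupied by the old generation is split in half. */
    static size_t nursery_size(const AppelHeap *h)
    {
        return (h->heap_size - h->old_gen_used) / 2;
    }

    /* Guards against the pathological case above: thrashing minor collections
     * in an ever-shrinking nursery without ever triggering a major collection. */
    static bool needs_major_collection(const AppelHeap *h)
    {
        return nursery_size(h) < h->min_nursery;
    }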

Feedback Controlled Promotion

  • Mostly related to pause time reduction
  • Demographic feedback-mediated tenuring
    • Fights pause time caused by collective death of many promoted objects
    • Volume of objects promoted used as predictor for length of next collection. Sets rate of promotion.
      • If the volume of survivors exceeds a threshold, the survivor count is looked up in a table that gives the age threshold for promotion at the next collection.
    • Can't move old objects into nursery.
      • Threatening-boundary solutions to this exist.
        • Move threshold between two generations, can move in either direction.
        • Must track more pointers because can't know at allocation time which generation things may lie in because of boundary motion.
  • Ergonomics from HotSpot
    • Focuses on three soft goals. In order of priority:
      1. Maximum pause time
      2. Throughput
      3. Memory footprint
    • Pause time addressed by shrinking generations, one with longest observed collection time first
    • Throughput addressed by increasing generations in proportion to collection time
    • Vengerov has the ThruMax algorithm for reasoning about HotSpot information.
      • Focuses on the ratio of free space in the old generation after a major collection to the amount of space used by promotion
      • ThruMax invoked after steady-state of survivor spaces reached.
  • Ergonomics has many tunable options for end-user, most of which are interdependent.

9.8 Inter-Generational Pointers

Prologue

  • Roots for each generation contain pointers in registers, stacks, and globals, AND all references into the generation from parts of the heap not being collected.
    • Includes older generations and spaces outside of the generational heap(large object, immortal, code)
  • Can be created three ways:
    1. Object creation
    2. Mutator updates to pointer slots
    3. Promotion
  • Call these pointers "interesting pointers" because they must be tracked.
  • Problem of "boot-image" objects, ones alive when system starts. Must keep track of references in objects, but costly to scan. Can use tracing, scanning, or remembered sets.

Remembered Sets

  • Holds the location that a potentially interesting pointer could be stored.
  • Stores the source, not the destination.
    • Allows a copying collector to update the reference
    • Goal is to track as few pointers as possible
    • Needs a write barrier to record pointer updates made by the mutator; the collector can record the inter-generational pointers it creates itself (by promotion)

Pointer Direction

  • All stores do not need to be recorded
    • Ex: Some languages store activation record on the heap. Can be parsed to determine current pointers in objects.
  • If young generation always collected whenever the older is, don't need to store pointers from young to old.
  • Most mutation is initialization, and initializing stores can only refer to older objects.
    • Ex: Purely functional languages will only refer to older objects, except for thunk replacement.
    • Ex: ML requires mutable variables be annotated. They're the only source of old-to-young pointers.
    • Exception is OO languages like Java.
      • Many of these languages are written in an initialization-heavy way.
      • Programs vary heavily. Many of them will reassign a field very frequently. Can change policies for remembered set implementation.
  • In systems with multiple, independently collected heaps, we may want to record mutation information to guess which spaces contain the most garbage. In this case, a write barrier is needed in both directions.

9.9 Space Management

Minor GC is short, fast. Major GC slow because also sweeps younger generations.

Major GC may use copying for the older generation, but it is a poor choice: the copy reserve shrinks the usable heap, so nursery allocation thrashes the smaller heap more often, and long-lived data is moved repeatedly.

Choice of mark-sweep can be good for older generation

  • Allocation is slower for free-list, but cost deferred to collector since mutator allocates out of nursery.
  • Fragmentation can be controlled by infrequent mark-compact run
    • Can ignore "dense prefix" of very long-lived data.

Copying collection usually used for nursery

  • Cost of tospace overhead not bad in average-case because few objects survive first collection

Space usage can be improved by switching to compacting collector "on the fly" during a collection

9.10 Older-first Garbage Collection

Other possibilities for age-based collection exist

  • Youngest-only (generational): Only collect the youngest objects in the heap
  • Oldest-only: Only collect the oldest objects, on the premise that they are likely to die. Probably ineffective due to the weak generational hypothesis
  • Older-first: Focuses on middle-aged objects. Gives them time to die without wasting time on "sediment"
  • Renewal older-first: Consider the age of an object to be the time since it was last collected or created. Mixes up object order in the heap. Mutator overhead.
  • Deferred older-first: Consider a sliding window over the heap. Move survivors to one side of the window and march the window down toward the younger side; start over at the oldest side. Mutator overhead.

9.11 Beltway

Beltway collector exploits five generational insights:

  1. Most objects die young
  2. Avoid repeatedly collecting old objects
  3. Exploit incrementality (bound the size of spaces, use small nurseries, etc)
  4. Small nurseries with sequential allocation are good for spatial locality
  5. Objects need sufficient time to die(avoid pretenuring)

Can be configured to behave like any other copying collector

Beltway breaks collection into units called "increments"

Increments are ordered into queues called belts

Increments selected FIFO for collection

  • Usually selects oldest increment on youngest belt
  • Configuration specifies whether survivors should be moved into an increment on the same belt or promoted to a higher belt

Not inherently generational. Collects increment by increment, doesn't treat belt like generation and collect all of each belt.

Ex:

  • Simple semispace: One belt, two increments
  • Generational: One belt per generation
    • Nursery increment on the youngest belt may be of fixed size or allowed to grow (Appel-style)
    • Mutator must track references from older to younger increments and from older to younger belts

Because it is so configurable, its attributes depend heavily on configuration

9.12 Analytic Support For Generational Collection

Generational Collectors handle young objects well but old objects poorly.

Problems:

  • Collection of garbage in older generations not prompt
  • Long-lived objects must be copied between generations/spaces until they reach oldest generations

Solutions:

  • Can improve with collection of metrics to determine what to pretenure.
    • Machine Learning possibly used here
  • Can analyze source to determine which objects likely to be created/changed/die together and keep objects in same space
    • Good improvements seen. Doesn't need to be perfect.
  • Functional languages can know that immutable objects can only point to older objects.

9.13 Issues to Consider

If weak generational hypothesis does not hold for a program, generational GC will perform poorly.

Only improves expected pause times, not worst-case pause times because must do major collections eventually

Offers many tunable parameters which may be confusing/interrelated.

  • Number of generations. Most often two plus an immortal generation. Intermediate generations can be used to reduce premature promotion, but there are better ways to do that.
  • Promotion rate depends on nursery size. May set fixed size or vary at runtime to meet throughput goals.
  • Promotion can be controlled by requiring a number of survived collection cycles before an object is tenured.
    • En masse = 1 cycle.
    • Alternatives are aging semispaces and eden/survivor spaces

Mutator cost dependent on organization

  • Higher promotion rates can place a higher burden on the mutator
  • If we demand a collection of younger generations whenever older generations are collected, we only need to track old-to-young pointers.

9.14 Abstract Generational Garbage Collection

In the same way that we can see tracing as deferred reference counting, generational collection is a form of deferred reference counting that keeps a remembered set and typically tries to collect the nursery before defaulting to a full collection

Chapter 10: Other Partition Schemes

10.1 Large Object Spaces

Can use actual size or size relative to allocation blocks

Large objects are expensive to allocate and likely to cause fragmentation (see the Chapter 8 criteria)

Usually don't move because of associated cost, but eventually might need to

Common techniques include

  • Mark-sweep, and freelists
  • Allocating directly into old generation
  • Making a header that is treated as a regular object and points to non-moving data body, which is freed when header is.

Treadmill Collector

  • Makes circular, doubly linked list of objects with black, white, grey, and freed segments. Uses tricolour abstraction.
  • Allocates until free-buffer region of doubly linked list is empty
  • Frees as copying collector, but copies by changing pointers.
  • Allocation and copying are cheap, constant-time.
  • Smart pointer policies (which end of the grey segment of the list is used determines breadth- or depth-first traversal).
  • Overhead of the pointers, but removes the need for a separate tospace.
  • Problem of allocating different object sizes
    • For large objects, not a problem since the actual data is kept on a separate page.
    • May store the pointers on the page, or separately. There's a good argument for storing them separately for cache reasons.

Moving objects with OS support

  • Can simply change virtual address that page is mapped to
  • Also able to lazily initialize data pages, not load until referenced.

10.2 Topological Collectors

Mature Object Space GC / Train Collector

  • Collection of the oldest generation has unbounded pause time (a function of the live data in the heap)
  • Solution is to manage this oldest generation("MOS" or mature object space) as a set of areas of memory and collect/evacuate one region each collection
  • Each area is a "car", cars are stacked in a "train" queue.
    • Important: the algorithm gradually moves linked (including cyclic) garbage structures into the same train, so the whole structure can be freed at once.
    • Algorithm:
      1. Select lowest-numbered car from lowest-numbered train. Call it fromCar.
      2. If there is no remembered-set entry into the train and no roots refer to it, reclaim the entire train.
      3. Otherwise, if an object in fromCar is referenced by a root, copy it to a higher-numbered car toCar in a higher-numbered train, creating one if needed.
      4. Copy everything in fromCar that is reachable from toCar into toCar, creating additional cars on the same train for overflow.
      5. Move promoted objects to a train holding references to them.
      6. Move every object in the remembered set that is reachable from another train to that train.
      7. Copy any objects reachable from other objects in the current train to the end of the current train.
    • Benefits:
      • Incremental
      • Bounds amount of copying to size of car
      • Moves objects close to other objects which refer to them
      • Only need to track higher-to-lower train pointers
    • Downsides:
      • Garbage cycles which span multiple cars may end up in situations where they are never collected.
      • Does not bound remembered sets, which may be huge for popular objects.

Connectivity-Based Collection

  • Lifetimes of objects correlated with connections.
  • Ex:
    • Stack-only references are freed quickly
    • Globals tend to be immortal or near-immortal
    • Objects connected by chains of pointers die together
  • Idea: Create partitions based on connectivity. Chain partitions in DAG. Everything in partition can only be referred to by other things in partitions or by partition's predecessor in DAG.
  • Does not require write-barriers or remembered sets
  • Did not perform as well as hoped in practice. Evidence suggests that better information about connectivity could improve performance.

Thread-local Collection

  • If objects local to thread, can perform stop-the-world on single thread to collect this "heaplet" of data.
    • Must still synchronize in case of shared, mutable data.
  • Usually partitions into heaplets and single, shared space.
  • Objects in heaplet may reference other objects in heaplet or shared heap. Shared heap may never reference objects in heaplet. Thread may never reference other thread's data.
  • Need to perform accurate escape analysis to determine where allocation should be done, and if reference was legal.
    • Can use mutator barriers to accomplish this
  • In case of immutable local data and shared mutable data
    • Can simply place forwarding address in shared object's header. Future references will use the new object through the forwarding address, and existing uses will see the old value.
    • Can use for generational as well

Stack Allocation

  • Allocating on the stack where possible has opportunities.
  • Requires that nothing with longer lifespan reference data on stack, escape analysis
    • Otherwise, read/write barriers would be needed to detect escapes and move the data to the heap.
      • Must be rare to be worth it because costly
      • Used by Azul Systems' hardware JVM appliances via a hardware trap.
  • Stack allocation excels in same environment where generational collection excels.

Region Inferencing

  • Objects are allocated into regions, which are freed as soon as nothing in the region is needed by the program.
  • Ex: Real Time Specification for Java
  • Requires either programmer annotation or a sufficiently intelligent compiler
  • Compiler-directed escape analysis and region inferencing is difficult and slow. Best example: the ML Kit algorithm.

10.3 Hybrid Mark-Sweep, Copying Collectors

Prologue

  • Collectors can be described by choice of how much data to consider "enough" to evacuate a block, and the expected allocation portion of a block
  • Uncompacted heaps can eventually run into performance problems from fragmentation
  • Solution: Combine mark-sweep with copying
    • Divide heap into k+1 windows, keep one empty
    • During collection, choose one window to be a fromspace, and use free window as tospace. Use mark-sweep for rest of heap.
    • Will take k collections to compact the heap, with a space overhead of 1/k.

Garbage First

  • Goal is that no more than x milliseconds every y time slice are used for GC
  • Used in HotSpot
  • Divides heap into windows of same size. Allocates from current allocation window.
  • Allows for arbitrary window number
    • Imposes higher write barrier, many more interesting pointers.
  • Can be combined with generational.

Immix and Others

  • Dimpsey / IBM Parallel Mark/Sweep GC
    • Used thread-local allocation buffers
    • Found that a dense prefix was slowing down allocation, so a separate freelist was made for allocation of buffer-sized memory.
      • Great performance improvements. Didn't need to scan dense prefix for large enough block. Also shortened freelists.
  • Immix Collector
    • Uses liveness estimate from previous collection cycle to determine whether to use copying collection or marking on a block
    • Allocates into either free or partially-filled blocks using a linear, next-fit strategy.
    • Allows allocation into smaller "lines" in blocks
      • A line is 128 bytes, roughly the size of a cache line.
    • When marking, lines can be marked "implicitly": a small object may spill into the line after the last explicitly marked line in a sequence, so that following line is conservatively skipped rather than freed.
    • Distinguishes "medium-sized" objects from small ones, and bump-ballocates into an empty block.

Copying Collection in Constrained Memory

  • Idea is to split oldest generation into many block-sized windows and to copying-collect oldest generation one block at a time, during the same collection period.
  • Uses mark phase to create remembered sets
  • Lower space usage = fewer collections
  • Downsides:
    • Each full collection touches every object twice: to mark and to copy
    • Still has overhead: requires space for bookkeeping data structures for marking/collection

MC-squared collector

  • Relaxes the need for blocks to be laid out sequentially by numbering them.
  • Allows for oldest generation to avoid copying full blocks by simply modifying numbers.
  • Marks incrementally.
  • Segregates popular objects into special block without remembered sets. (Reversibly marks as immortal)

10.4 Bookmarking Garbage Collection

Page faults have incredibly high performance overhead. Bookmarking collector prevents paging during collection and mitigates overhead for mutator.

Works with OS to choose page to evict.

The OS's default choice is likely to be a least-recently-used page, which for a copying collector is likely to be in the next tospace. That is the worst possible choice, since we will allocate from it soon. A fromspace page is a better choice

If the runtime must choose a page to evict, we "bookmark" live references in it before evicting it. We use this bookmark in collection, and then erase it when we next load the page.

Sets empty pages to be first to free, otherwise chooses victim and marks outgoing references.

Bookmarked objects never moved to avoid having to update pointers on non-resident page.

10.5 Ulterior Reference Counting

Segregates objects based on mutation rate.

Tracing great for mutation-high, allocation-high, short-lived data. Reference counting great for data with low mutation rate. Both algorithms are at worst-case for other input. Makes sense to use right algorithm for right data.

Uses a bounded-sized nursery with a copying collector. Survivors are promoted to the reference counted area. Pointers from the RC area to the tracing area are tracked.

Collects cycles when available heap space falls below a limit.

10.6 Things to Consider

Specialized behavior good for large objects, and objects without pointers.

Objects that are part of large pointer cycles should all be stored together, so the whole structure can be freed by freeing a single page.

Tends to allow for very-specialized collection schemes for objects known to have special traits

Chapter 11: Run-Time Interface

11.1 Interface To Allocation

Prologue

  • Languages require that the object be both allocated and initialized(to a varying degree)
  • Three phases to allocation/initialization (a sketch of this division of labor follows this list):
    1. Allocation: allocate a cell of the correct size and alignment. Overseen by the allocation subsystem of the memory manager.
    2. System initialization: set the fields required for the object to be legal (header bits, method dispatch).
    3. Secondary initialization: set fields after the object has been returned from the allocation subsystem.
    • Languages which require complete initialization in steps 1 + 2 and forbid step 3 (Haskell, ML) must pass an "initialization closure" to the allocator to retrieve values from. Stems from the problem of needing to specialize allocation for each type.
      • An alternative (Haskell) is to inline all initialization and call the collector on exhaustion. Possible because allocation is a simple "bump a pointer"
    • Complete initialization allows for optimizations.
      • A generational write barrier isn't necessary because all references must be to older objects.
  • Paramount that steps 1 + 2 be atomic with respect to collector/system. Otherwise may try to interact with an object lacking headers.
  • Possible variations on divisions of labor between steps.
  • Parameters to allocation (Step 1) may include
    1. The size requested
    2. An alignment constraint
    3. Kind of object to allocate (for header fields, size, space to allocate into)
    4. Type of object(Object type, including fields to initialize)
  • Post conditions for step 1: May or may not zero, object header may be filled or not, full or partial initialization
  • Must be atomic if inlined.
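A sketch (C) of that division of labor, with hypothetical names (gc_alloc_raw, ObjectHeader, make_point): steps 1 and 2 happen inside the memory manager and must be atomic with respect to the collector, while step 3 is ordinary compiled code running afterwards.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct ObjectHeader { uintptr_t class_ptr; uint32_t flags; } ObjectHeader;

    /* Hypothetical allocator entry point (step 1): size, alignment, kind. */
    extern void *gc_alloc_raw(size_t size, size_t align, int kind);

    /* Steps 1 + 2: performed by the memory manager, atomic w.r.t. the collector,
     * so a collection never sees a cell without a valid header. */
    static void *allocate_and_system_init(size_t size, uintptr_t class_ptr)
    {
        ObjectHeader *h = gc_alloc_raw(size, sizeof(uintptr_t), /*kind=*/0);
        h->class_ptr = class_ptr;   /* type info / method dispatch */
        h->flags = 0;               /* header bits required for GC legality */
        return h;
    }

    /* Step 3: secondary initialization, emitted by the compiler once the object
     * has been returned from the allocation subsystem. */
    typedef struct Point { ObjectHeader hdr; double x, y; } Point;

    Point *make_point(uintptr_t point_class, double x, double y)
    {
        Point *p = allocate_and_system_init(sizeof(Point), point_class);
        p->x = x;                   /* language-level constructor work */
        p->y = y;
        return p;
    }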

Speeding Allocation

  • Solution may be to inline the "common case" of success and call out to less common cases
  • Sequential allocation is fast, especially if inlined (a fast-path sketch follows this list)
    • Keeping the pointer to bump in a register assumes per-thread allocation areas
  • May make segregated fits as fast if we know the list to allocate from at compile time
    • Can be as simple as "bump a pointer with checks" in sequential allocation.
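A sketch (C) of the inlined common case: a per-thread bump pointer with a call out to a hypothetical slow path (gc_alloc_slow) when the local buffer is exhausted.

    #include <stddef.h>

    /* Hypothetical per-thread allocation buffer; ideally "free" lives in a register. */
    typedef struct {
        char *free;     /* next free byte    */
        char *limit;    /* end of the buffer */
    } AllocBuffer;

    /* Out-of-line slow path: refill the buffer or trigger a collection. */
    extern void *gc_alloc_slow(AllocBuffer *buf, size_t size);

    /* Inlined fast path: bump a pointer, with a single bounds check. */
    static inline void *gc_alloc(AllocBuffer *buf, size_t size)
    {
        char *result = buf->free;               /* size assumed already aligned */
        if (result + size > buf->limit)
            return gc_alloc_slow(buf, size);    /* uncommon case: call out */
        buf->free = result + size;
        return result;
    }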

Zeroing

  • Some languages require zeroing, others do not
    • May be used for debugging, but not the standard
  • Zeroing in bulk, using optimized native libraries, is the prevailing wisdom
  • Zeroing during collection has poor cache performance. Need to zero somewhat ahead of the allocator, but not so far ahead that the page falls out of cache before it is allocated

11.2 Finding Pointers

The collector must be certain that nothing it frees is still referenced, but it may choose not to free some garbage; the only penalty is wasted memory footprint, not correctness

Conservative Pointer Finding

  • Treat every segment of bytes that could be a pointer value as a pointer value
  • Must be fast
    • Filters out values that do not refer to any heap areas in memory
    • See if the reference refers to data which has been allocated.
      • Language requirements on pointers referring to start of object or at known offset can simplify this
      • Difficult for languages which allow flexible interior pointer locations(C)
    • May need to choose not to allocate ("blacklist") blocks at addresses referred to by observed literal values that happen to point into the heap range.

Accurate Pointer Finding Using Tagged Values

  • Many languages include a type tag with each pointer
  • Two Approaches:
    • Bit Stealing
      • Align objects so the low bits of a pointer are zero, then use those bits to mark a word as a pointer or a tagged value (see the tagging sketch after this list)
      • Reduces the number of bits available for numeric values, and may require de-mangling for external numeric libraries to work.
        • Some processors have instructions for working with tagged integers
    • Big Bag of Pages
      • Associates each block with a type
      • Better numeric convenience
      • Allows for memoization of value allocation, if the value is already in the block for the given type, simply use that instead
        • Called hash-consing
        • Should use weak references for bookkeeping structures internal to table/blocks
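A sketch (C) of bit stealing under one assumed scheme: 8-byte object alignment leaves the low three bits free, here using tag 0 for pointers and tag 1 for small integers. The exact tag assignment is an assumption, not a standard.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed scheme: objects are 8-byte aligned, so a real pointer has its
     * low 3 bits clear.  Tag 0 = pointer, tag 1 = small integer ("fixnum"). */
    #define TAG_MASK   0x7u
    #define TAG_FIXNUM 0x1u

    typedef uintptr_t value_t;

    static inline bool is_pointer(value_t v) { return (v & TAG_MASK) == 0; }
    static inline bool is_fixnum(value_t v)  { return (v & TAG_MASK) == TAG_FIXNUM; }

    static inline value_t box_fixnum(intptr_t n)
    {
        return ((uintptr_t)n << 3) | TAG_FIXNUM;
    }

    static inline intptr_t unbox_fixnum(value_t v)
    {
        assert(is_fixnum(v));
        return (intptr_t)v >> 3;    /* assumes arithmetic right shift */
    }

    static inline void *as_pointer(value_t v)
    {
        assert(is_pointer(v));
        return (void *)v;
    }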

Accurate Pointer Finding in Objects

  • Objects typically associate pointer offsets with the type-level information such as method dispatch
    • Can alter offsets to improve cache performance with modern set-associative caches
  • May be easier to partition all pointers from all non-pointers
    • If polymorphism is retained into runtime, can become very problematic
    • Better idea may be to place pointers before reference offset into object, and values after it.
  • Some systems require that pointer-chasing methods actually be written by programmer
  • Possible to avoid object headers entirely in statically-typed languages, but this might cause more work for the collector when parsing the heap.

Accurate Pointer Finding in Stacks and Registers

  • Some runtimes heap-allocate call stacks to improve parsability
  • Others make the stack look like heap objects(GHC)
  • Three issues:
    • Finding frames / activation records in stack
      • Also needs to be done by exception handler, continuations
      • Many runtimes have a reference field in each frame which points to the start of the previous frame.
      • May include maps from return addresses to functions which called that address
      • Stack maps
        • Metadata added to each frame on the stack.
        • Collection during initialization of stack map is dangerous
        • Single map per frame not always possible
          • Stack Maps can't know type of polymorphic reference. A pointer in a list of one type is treated the same as a pointer in a list of another.
          • Needs to check types in call predecessor to find.
        • Compiler will need to cooperate, propagate type information through the backend.
    • Finding pointers in each frame
    • Dealing with calling / stack layout conventions
  • Finding pointers in registers
    • Difficult because we can't impose invariants on registers as easily as on stack fields. Fewer registers demands more flexibility.
    • Compiler optimizations may lead to a register holding an obfuscated pointer.
    • If called function must preserve register contents, cannot know if register contains pointer. Must get information from calling function.
      • If the stack doesn't have fields that refer to the previous frame, this necessitates another way to unwind the stack.
      • Solution:
        • Store metadata in a stack table that says which stack fields the registers are saved to
        • Move from top of stack downwards, 'unsaving' registers and seeing where they were filled in calling frame. After traversal, we will have pointers that the collector can update
        • After collector runs, walk back up stack, and resave modified pointers to stack fields
        • Make sure that we don't process roots from stack twice if they are also in the heap
  • Compressing Stack Maps
    • Stack maps can occupy a lot of space relatively
    • Compressing them makes sense
      • Multiple safe-gc points have same map
      • If pointers placed similarly together, even more have same map

Accurate Pointer Finding in Code

  • Pointers in code (including dynamically generated code) may refer to data in the heap
  • General solution is to have compiler create a side table that marks embedded pointers.
  • Target Objects that Move
    • If a target of an embedded pointer moves, must alter the pointer in the code
      • May not be able to alter code segment because of OS permissions
        • Costly to change permissions temporarily
          • Caches don't automatically update when code modified
          • Must specifically force invalidation
            • Expensive
        • May be easier to disallow embedded references
        • If code is moved, then you must alter the return addresses and function pointers on the stack.
    • Handling Interior Pointers
      • Interior pointers refer to object but aren't the 'standard reference'
      • Can be tricky. Can be outside of object itself. Ex: C allows pointers one location beyond end of the array to be considered valid.
      • Problem lies in finding object that pointer belongs to
      • Solutions:
        • Use a table or bitmask to mark the start of each object per block
        • May simply store references to power-of-two blocks and parse area containing pointer
        • Big Bag of Pages can use constant object size to simply find start using rounding
      • While they may be rare, storing structures to handle interior pointers can add space/performance overhead
    • Handling Derived Pointers
      • May be treated similarly to interior pointers
      • We know that if the address is p + i, must belong to object at p if i < size(typeof(p))
      • If i is large, may point outside of p.
      • May refer to something like p1 - p2, the distance between the two pointers
      • If i is constant, can make tidy pointer, or standard reference address, at compile time.
      • If the tidy pointer made by addition can be optimized away, we must explicitly require that the compiler keep around a computed tidy pointer for each derived pointer.
      • For moving collector, must compute distance moved and add to derived pointer expression
      • May prohibit certain code optimizations

11.3 Object Tables

Rather than treating references as pointers to the objects, we can treat them as pointers to entries in a table, or as offsets into that table.

Allows for straightforward relocation of data: only the table entry needs to be updated, rather than every reference (see the sketch at the end of this section).

References from the table to objects should be weak, so the table itself does not keep objects alive

May be beneficial to store attributes and bookkeeping in table rather than in object headers.

Can compact the table itself.

Can be problematic with interior or derived pointers

May be required by some language semantics

  • Ex: Smalltalk's object swapping. Would otherwise require heap parsing.
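A sketch (C) of the indirection described in this section, with hypothetical names: mutator references are indices into an object table, so the collector relocates an object by rewriting a single table slot.

    #include <stddef.h>

    typedef struct Object Object;

    /* Hypothetical object table: one slot per object. */
    #define TABLE_SIZE 4096
    static Object *object_table[TABLE_SIZE];

    /* A reference held by the mutator is an index into the table,
     * not a direct pointer into the heap. */
    typedef size_t obj_ref;

    static Object *deref(obj_ref r)
    {
        return object_table[r];
    }

    /* After a copying/compacting collection, only the table slot changes;
     * every obj_ref held by the mutator remains valid. */
    static void relocate(obj_ref r, Object *new_addr)
    {
        object_table[r] = new_addr;
    }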

11.4 References From External Code

Many managed languages allow interop with other languages, those other languages may modify the heap of the managed language

  • System IO, file handles also an issue

Must consider external reference when tracing

  • Only during external function call

Must ensure external code can find internal object

  • Moving collectors a problem
  • May provide layer of indirection, so that moving the object in internal heap is safe
    • Called handles. Are pointers into the "registered objects" table.
    • May not work if external code doesn't follow access protocol. Ex: system calls and IO
  • May simply allocate the objects which will be "pinned" into non-moving space
    • Difficult to determine. May offer programmer a pin/unpin function.

Options for collectors

  • Defer collection of pinned object's region while pinned
  • Immediately collect and move to non-moving region
  • Collector can tolerate non-moved objects. Can be inefficient/problematic.
    • Likely to introduce fragmentation
  • Other language features may require pinning, such as keeping interior pointers on the stack. It may be easier to pin than handle rewriting.

11.5 Stack Barriers

If we need to scan a stack while a thread is running, we can place a stack barrier and not allow thread to return past that level of the stack

  • Allows for incremental scanning

Hijacks return address of frame that would return past it

Scanning thread can move barriers as it scans.

Can also be used to mark stack frames that have not changed since last sweep

Can be used to limit pause time

11.6 GC-safe points and mutator suspension

Two concerns for a given address of code being executed: whether it is safe to pause there, and the size of the stack map tables needed to describe every such point.

Certain sequences must be atomic/not interrupted for GC, such as write/read barriers and object initialization.

Need to specify "safe points."

Threads may be frozen at unsafe points

  • Must either interpret the remaining instructions up to a safe point, or wake the thread and let it run until it hits a safe point

Needs maps + safe points anywhere that allocation can occur, or when a function call may happen which will do allocation

In order to avoid unbounded thread wait times, may want other safe points.

  • In each loop
  • At each function entry and return
  • Must make these explicit "GC checkpoints", since they won't ever allocate and so would never end up in the GC on their own

Need way to activate these checkpoints when thread restarted

  • Polling: At each safe point, check whether a collection is pending (a polling sketch follows this list)
  • Patching: Insert a collection yield call at the safe point. Harder to get right, but faster.
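A sketch (C) of the polling approach: the compiler plants a cheap check at loop back-edges and call boundaries; the collector sets a flag and each thread parks itself at its next poll. The names and the use of a plain flag (rather than, say, a guard-page read) are assumptions.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Set by the collector when it wants all mutator threads to stop. */
    static atomic_bool gc_requested;

    /* Out of line: save this thread's state, mark it stopped, and block
     * until the collection finishes. */
    extern void gc_safepoint_slow(void);

    /* The poll emitted at "GC checkpoints" that would otherwise never allocate. */
    static inline void gc_safepoint(void)
    {
        if (atomic_load_explicit(&gc_requested, memory_order_acquire))
            gc_safepoint_slow();
    }

    /* Example: a long-running loop that never allocates still yields promptly. */
    long sum(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += a[i];
            gc_safepoint();
        }
        return s;
    }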

Thread notification may be synchronous or async

  • Even if async, the collector must still wait for the thread to reach a safe point and respond, so it is effectively synchronous.

11.7 Garbage Collecting Code

Many languages dynamically create and compile code

Need way to free code that won't be executed

Need runtime support to identify code that won't be executed again; difficult to trace because code is frequently referred to by a symbol table or global variables

Special Cases:

  • Closures = function that keeps environment of bindings
    • Can determine which variables are used by the code and prune the rest from the closure, which will most likely free other closures
  • Classful OO
    • Normally objects point to classes, code. Pointer ignored by tracer because in non-collected, non-moving area.
    • Need to scan pointers, costly
    • Might want to only run on special cue.
    • Also must ensure that either class unloading/loading is idempotent(immutable class variables for example) or that nothing references class in future.
  • Code which has been optimized, and previous version still needed by executing code
    • May need RC, tracing
    • On-stack replacement allows old code to be freed.

11.8 Read and Write Barriers

Prologue

  • Without read/write barriers, there are many situations where the mutator can hide pointers from the collector.
  • Must consider any references from outside subset of heap being collected as roots, since can't know if they're garbage.
  • Can break into the detection and recording of interesting pointers.

Engineering

  • Barriers usually apply checks before allowing an action to progress
  • Inlining very desirable
    • Can't inline code sequences which are long
      • Solution is to inline the fast path and call out to the slow path when the inline check indicates more work is needed.
      • The out-of-line path should order its checks so that the most common cases are resolved as early as possible.
  • Must be able to access data structures quickly as well
  • Requires cooperation of compiler
    • May need to not optimize the allocator sequences for atomicity requirements.

Precision of Write Barriers

  • Whether to include metadata and to record more types of interesting pointers
  • Choice of trade off of more work for the mutator or collector
    • Collector will have to generate the data that the mutator doesn't save
  • May be required to provide some information for some collection schemes.
  • Decisions:
    • How accurately to record pointer writes?
      • May be cheaper to log everything
      • Depends on remembered set implementation. May be costly to store unnecessary information.
      • May have to filter for concurrent/incremental collectors.
      • May want to inline some filters and call out to others
    • At what granularity to record pointer location?
      • Most precise is address of field in object.
        • Will make remembered set larger
      • May simply point to object, let collector scavenge
      • May use cards(fixed-size regions of memory) or pages.
        • Simply mark card/page as dirty and then let collector traverse
    • Should the remembered set be allowed to contain duplicates?
      • Card/page granularity (and sometimes object granularity) avoids duplicates.
  • Collector's cost:
    • Cards/pages: depends on the number of dirty cards/pages
    • Pointer recording without duplicates: depends on the number of objects modified
    • Pointer recording with duplicates: depends on the number of pointer writes

Hash Tables

  • For pointer / object deduplication, remembered set must be a set.
  • Adding items must be fast
  • Must have one per generation
  • Example: Circular hash table for each generation, with linear hashing
  • Strategy: With multiple generations/remembered sets, let mutator place interesting pointers in "scratch" location, and let collector sort them out.
    • Removes need for mutator to find correct target generation
  • Hash tables used by train algorithm

Sequential Store Buffers

  • Faster recording of pointers
  • Can have one per thread
  • "Bump a pointer" insertion
    • Can avoid out of bounds check using hardware paging.
      • Very costly to handle trap. Must be faster than running the check at each iteration.
    • Can also use architecture-specific tricks
      • Solaris UTRAP for example.
  • May be used directly as remembered sets, or may be front-end to hash tables
    • Choice depends on whether sets are discarded at the end of each collection, on collection scheme.
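A sketch (C) of SSB insertion, with hypothetical names: an unconditional append of the interesting slot, with an explicit bounds check standing in for the guard-page trick mentioned above.

    #include <stddef.h>

    typedef struct Object Object;

    /* Hypothetical per-thread sequential store buffer of recorded slots. */
    typedef struct {
        Object ***next;     /* next free entry                */
        Object ***limit;    /* one past the end of the buffer */
    } SSB;

    /* Out of line: empty the buffer into a hash table, or grow it. */
    extern void ssb_overflow(SSB *ssb, Object **slot);

    /* Recording path of the write barrier: append the updated slot.
     * A real system might elide the bounds check by placing a no-access
     * guard page after the buffer and handling the trap instead. */
    static inline void ssb_record(SSB *ssb, Object **slot)
    {
        if (ssb->next == ssb->limit) {
            ssb_overflow(ssb, slot);
            return;
        }
        *ssb->next++ = slot;
    }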

Overflow Action

  • SSBs and Hash tables can overflow
    • Can enlarge or empty into another, larger structure if being used as frontend

Card Tables

  • Divides heap into fixed-sized cards
  • Marks a card dirty when an interesting pointer is written into a field in that card (a barrier sketch follows this list)
  • Moves costs from mutator to collector
  • May be implemented as a bit array (one bit per card), but that is not the best choice
    • Bit manipulations are not atomic, and software synchronization is costly.
    • Usually an array of bytes is used
      • There are efficient ways to zero an arbitrary byte, so a card is usually dirtied by writing 0.
    • Can reduce overhead by instead marking the card containing the object's standard reference (its header) rather than the card containing the updated slot.
  • Cards tables very small compared to heap usually.
  • Card size is a compromise between space usage and collector time
  • Promote objects in generational collector to clean card to avoid needless scanning next cycle(pg. 199 for clarification)
  • Since most cards will be clean, will waste time in sequential scan of cards looking for dirty cards
    • May want to create set of larger cards, and use it to find the indices of the array of smaller cards to sweep.
  • In generational collector with no interesting pointers in nursery, only needs cards for older generation.
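A sketch (C) of the byte-per-card marking barrier: the card index is the slot address shifted right by log2 of the card size, and the corresponding byte is unconditionally set to 0 ("dirty"). The card size and names are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Object Object;

    /* Assumed layout: 512-byte cards, one card-table byte per card. */
    #define LOG_CARD_SIZE 9
    #define CARD_DIRTY    0

    extern uint8_t   card_table[];   /* one entry per card in the heap      */
    extern uintptr_t heap_base;      /* lowest address covered by the table */

    /* Unconditional card-marking barrier: dirty the card containing the
     * updated slot; the collector scans dirty cards at the next collection. */
    static inline void card_mark_store(Object **slot, Object *new_value)
    {
        size_t card = ((uintptr_t)slot - heap_base) >> LOG_CARD_SIZE;
        card_table[card] = CARD_DIRTY;   /* simple byte store, no read-modify-write */
        *slot = new_value;
    }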

Crossing Maps

  • Start of card may not necessarily align with start of objects. Since very large objects can have fields many cards apart, can cause problems in scanning cards.
  • Need a crossing map that determines how objects cross cards.
  • One entry for every card. Stores the offset into the card at which the first object starts.
  • Depends on whether cards record slots or objects.
    • Slot-recording must store the offset of the last object that starts in the card, with a negative value if no object starts in the card
    • Given a card, move backwards through the array until a non-negative offset is found. That card contains the start of an object, at the offset recorded in the crossing map (a lookup sketch follows this list).
  • Object-recording can traverse relevant card from first object.
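A sketch (C) of that backward search in a slot-recording crossing map; heap_base, the card size and the map layout are assumptions, and the collector would then parse forward object by object from the returned address.

    #include <stddef.h>
    #include <stdint.h>

    #define LOG_CARD_SIZE 9
    #define CARD_SIZE     (1u << LOG_CARD_SIZE)

    /* Assumed crossing map: one entry per card, holding the byte offset of the
     * last object that starts in that card, or -1 if no object starts there. */
    extern int16_t   crossing_map[];
    extern uintptr_t heap_base;

    /* Find the start of the object that covers the first byte of `card`.
     * Assumes some earlier card contains an object start. */
    static uintptr_t object_overlapping_card_start(size_t card)
    {
        size_t c = card - 1;            /* the covering object starts earlier */
        while (crossing_map[c] < 0)     /* walk back to a card with an object start */
            c--;
        /* The last object starting in card c is the one that spans toward `card`;
         * the collector parses forward from here. */
        return heap_base + c * CARD_SIZE + (size_t)crossing_map[c];
    }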

Summarising Cards

  • If not all objects promoted en masse, some cards will remain dirty between collections.
  • Rather than re-scan these cards for pointers, why not store the interesting pointers in a hash table?
  • May also store information related to generation of referent in table, other information the collector must compute each sweep.
    • Led to an 80% improvement in a 5-generation SML collector

Hardware Virtual Memory Techniques

  • May intercept page eviction and collect 'dirty' bit at this time.
  • May also intercept any modifications of a non-nursery page, and set dirty bit at this point.
    • Poor choice on modern machines, but done historically
    • Using virtual memory protection this way can lead to 'trap storms' when many pages (tospace, a newly grown older generation) need to have their attributes modified at the end of a collection.
  • Works without compiler support.

Write Barrier Mechanisms in Summary

  • No winner in performance comparison between different strategies.
  • Write barriers usually have costs less than 2%.
  • For each platform, a different technique might lend itself to easier implementation.

Chunked Lists

  • A common data structure used throughout runtimes/compilers is the chunked list
  • Consists of a linked list of nodes containing arrays.
    • Typically makes sizes and starting position of blocks a value which is a power of two.
  • Desirable for parallel collectors because can take individual chunks. Less contention for lock to modify linked list.

11.9 Managing Address Space

Some algorithms require large, contiguous memory regions.

  • Some 32-bit machines can't provide that space.
  • Dynamic libraries can be located anywhere in address space.

Many systems do it so that latency-sensitive code can determine which space an address belongs to with simple pointer comparisons.

Better to not require contiguous memory, if table lookup cost is acceptable.

Solution is to work with power-of-two sized frames. May mark frames with generations or other attributes.

Can use software emulation of virtual memory, or simply use the OS's virtual memory hardware.

Terminology

  • Reserving = Using the managed system's notion of requesting address space for a use.
  • Allocating = Requesting memory from the operating system.

11.10 Applications of Virtual Memory Page Protection

Prologue

  • Many GC-semantic checks can be done cheaply through configuration of the virtual paging hardware. A violation will be caught as a memory protection fault.
  • Modifying protections can be expensive
  • Useful when dealing with uncooperative code.

Double Mapping

  • System maps same page at two addresses, with different protections at each address.
    • Ex: Preventing fromspace access in concurrent copying collection by setting address mutator knows as "no access" and then mapping somewhere else for collector to use.
  • May be harder in 32-bit systems
  • Some caches index by virtual address. Cache coherency can be threatened by double mapping.

Application of No-Access Pages

  • Uses:
    • Unconditional read barrier
    • Detect null pointer dereferences (trying to access address 0, or near it)
    • Guard Paging
      • Doesn't need to check that the allocator pointer is in bounds if the buffer is followed by a no-access page (a guard-page sketch follows this list).
      • Can use the same idea for stack or heap overflows.
    • Can allow large logical space but small virtual memory space
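A sketch (POSIX-flavored C) of setting up such a guard page with mmap/mprotect; usable_bytes is assumed to be a multiple of the page size, and this is illustrative rather than how any particular runtime does it.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Reserve a buffer followed by a no-access guard page.  Running off the
     * end faults, so a fast path can omit its explicit limit check.
     * usable_bytes is assumed to be a multiple of the page size. */
    static void *alloc_with_guard(size_t usable_bytes)
    {
        size_t page  = 4096;              /* assumed page size */
        size_t total = usable_bytes + page;
        char  *base  = mmap(NULL, total, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        if (mprotect(base + usable_bytes, page, PROT_NONE) != 0) {  /* guard page */
            munmap(base, total);
            return NULL;
        }
        return base;
    }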

11.11 Choosing Heap Size

Larger heap sizes generally lead to less thrashing and more throughput.

Too big a heap can lead to excessively large working set, take up too much data in bookkeeping data structures to perform well.

Schemes:

  • After each collection, collect system and heap usage statistics to decide how much heap space is safe to use before next collection
  • If collection frees less than a certain fraction of the heap, expand the heap. This has no provisions for shrinking the heap.
  • If the user can inform the program of the target working set size, the collector can grow the heap until that size is met. Good observed performance.
    • Also marked fromspace as discardable to virtual memory manager
  • Can rely on special OS syscalls to inform program how much memory it may use
  • Can use number of allocation stalls as indicator of memory load
    • Tries to grow nursery every collection until stalls, and then shrinks heap to last observed size that didn't have stalls.
  • If we have knowledge of all programs running on the system, we can figure out heap sizes that give maximum throughput for all processes.
  • Dynamic heap sizing is also possible. Change heap size in reaction to page faults

11.12 Issues to Consider

Initialization/allocation requirements for stability and performance must be considered above all

Degree that allocation sequence can be inlined.

Whether we can spare a single pointer-register for bump allocation

  • Not an option with too few registers or if register allocation would be too difficult.

Whether more information(pointer location, etc) can be passed on by compiler or whether conservative collection can be used.

Chapter 12: Language-Specific Concerns

12.1 Finalization

Prologue

  • Some resources which lie outside the scope of the collector (file handles, network connections, etc) require cleanup actions once the program is done with them
  • Need a way to perform these freeing actions when the objects encapsulating the resource are freed.
  • Store table of finalisers for objects. An object is finaliser-reachable if it can only be reached from the finalizer table.

When Do Finalisers Run?

  • Problem: If they are run during collection, they may need to allocate on the (currently full) heap.
  • Solution: Run after collection
  • Create a queue of finalisers to run and process after collection
  • To run before mutator threads resumed or not?
    • If so, possible deadlocks if a finaliser expects a response from a mutator thread or needs to acquire locks. Finalization is prompter, but throughput is lower.
  • Ideally, finalizer semantics should not limit optimizations in runtime.

Which Thread Runs a Finalizer?

  • Idea: Use a separate thread to run finalisers.
    • Finalisers must now be reentrant.
    • Constructors and finalisers for the same type must be safe to run concurrently.

Can Finalisers Run Concurrently with Each Other?

  • If they must be safe to run concurrently with arbitrary non-finalizer code, they should be safe to run concurrently with each other.

Can Finalisers Access the Object Which Became Unreachable?

  • Finalizer needs ability to specialize for given resource. Needs ability to take arguments(object fields) or to access the object.
  • If so, then collector must trace references of garbage object and ensure they survive

When are finalized objects reclaimed?

  • Some more dynamic languages allow a finalizer to run more than once. Others may not guarantee that it runs only once, though the programmer may not explicitly call it multiple times.

What happens if there is an error in the finalizer?

  • Finalisers will have access to the language's error recovery mechanism. Not a GC problem.

Is there a guaranteed order to finalization

  • A sane default is to guarantee finalization in order of reachability. If A and B are garbage, and A encapsulates or references B while B only references the raw resource, run A's finaliser first (it may still need B).
  • Cyclical references will require programmer understanding of semantics of finalisers in question.

The Finalization Race Problem

  • If the compiler optimizes away the last reference to an object with a finaliser while code is still using the underlying resource, the collector may finalize the object and free the resource out from under that code.
  • Solution is to provide a keep-alive function that the compiler won't optimize away (a sketch follows this list).
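A sketch (GCC/Clang-specific C) of such a keep-alive, comparable in spirit to Java's Reference.reachabilityFence: an empty asm statement that the optimizer must treat as using the reference.

    /* Keep `obj` considered live up to this point, even if the compiler could
     * otherwise discard the last reference before an operation on the
     * underlying resource has finished.  GCC/Clang-specific. */
    static inline void keep_alive(void *obj)
    {
        __asm__ volatile("" : : "g"(obj) : "memory");
    }

    /* Usage sketch (hypothetical file object wrapping an OS descriptor):
     *
     *     int fd = file->fd;
     *     ssize_t n = read(fd, buf, len);   // resource still in use here
     *     keep_alive(file);                 // file must not be finalized before this
     */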

Finalisers and locks

  • Finalizers may modify global data structures related to resources, but appear to run concurrently to programmer-provided code.
  • Solutions include:
    • Locking everything, always
    • Adding finalization calls to a queue, calling them at a programmer or compiler-defined "safe point", and running all finalizers at that point
      • Pain in making sure location is safe.
      • If the thread running the finalizer already holds a shared lock, the finalizer will run without the lock-acquiring semantics that the programmer designed.
      • Ex: this is why Java will not run finalizers while the thread holds any user-visible locks.

12.2 Weak References

Prologue

  • Determining which objects to keep purely from ordinary (strong) pointer reachability can have suboptimal behavior
  • There may be data that we want a reference to, in order to use when there are other objects associated with it. If no other objects refer to it, we have no use for it.
  • These are called weak pointers, or weak references.
  • May be useful to be notified on nullifying these references, so that we can remove them from data structures.
  • Tracing collectors can implement weak references by not chasing weak pointers during the marking phase, and then either updating them with the new object address or nullifying them during the collection phase (a sketch follows this list).
  • Collector must be able to identify the reference as weak.
  • Solutions:
    • May alter pointer's less significant bits to record information
      • Requires de-alteration to dereference
    • May create weak reference object. Only special to collector, can behave like normal object to user code.
  • System must give programmer way to specify reference as weak
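A sketch (C) of processing a weak-reference table after the mark phase, with hypothetical collector hooks (is_marked, forwarding_address, enqueue_for_notification): unmarked referents are cleared and optionally queued for notification, surviving referents are updated if the collector moved them.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Object Object;

    typedef struct WeakRef {
        Object *referent;   /* not traced during marking */
    } WeakRef;

    /* Hypothetical collector hooks. */
    extern bool    is_marked(Object *obj);
    extern Object *forwarding_address(Object *obj);   /* for moving collectors */
    extern void    enqueue_for_notification(WeakRef *w);

    /* Run after marking, before the space is reclaimed. */
    static void process_weak_refs(WeakRef **refs, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            WeakRef *w = refs[i];
            if (w->referent == NULL)
                continue;
            if (is_marked(w->referent)) {
                /* Survivor: update to the new address if the object was moved. */
                w->referent = forwarding_address(w->referent);
            } else {
                w->referent = NULL;             /* clear the weak pointer */
                enqueue_for_notification(w);    /* optional notification queue */
            }
        }
    }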

Additional Motivations

  • Caches, other programs which use a space-memory tradeoff. Getting "low space" feedback from the runtime can help simplify cache expiry logic.

Supporting Multiple Pointer Strengths

  • Prologue
    • We use the phrase x-reachable if it can be reached from the roots by pointers of strength x, and no higher. x* reachable is reachable by at least level x.
    • Strengths:
        • Strong: the collector never clears
        • Soft: the collector clears if space is at a premium
        • Weak: the collector always clears if the object is only weakly-reachable
        • Finaliser: the reference from the table of finalisers to an object with a finalizer
        • Phantom: the collector must clear; the program cannot dereference it
    • Harder to implement with reference counting. Need to store more metadata, have consistent way to decide when to clear soft references.
    • Detecting cycles must account for these semantics
  • Using Phantom Objects to control finalization order
    • Phantom objects can have a phantom reference to one object, but a strong reference to another. When the first object runs through finalization, the phantom reference is cleared. Then the second can run.
    • Phantoms purposefully designed to have weaker reachability than finalizers for this reason.
  • Race in weak pointer clearing
    • Finalizer race condition involving compiler optimizing references away can happen here
  • Notification of Weak Pointer Clearing
    • Placing cleared pointers in a queue allows for such notification

12.3 Issues to Consider

Finalization and weak pointer semantics determined at language design time.

Both of these features add complexity to tracing that should be considered from the outset. Do not bolt them on later.

Concurrent access and clearing of weak references is an issue

  • May be better to clear during infrequent stop-the-world of otherwise concurrent collector.
