Monday, September 1, 2014

Notes on Chapters 5-8 of Garbage Collection Handbook

Chapter 5: Reference Counting
- Invariant: An object is alive iff the number of references to it is greater than 0
- Each object has a reference count
- When a reference is mutated, the count is incremented or decremented. When it reaches 0, the object is freed.
- Requires a write barrier.
- The mutator uses barriers to keep the state of the object graph consistent when changes are non-atomic with respect to the GC
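A minimal sketch of the write barrier in C++ (all names here are mine, not from the book, and the single-field object is a simplification):

```cpp
#include <cstddef>

// Naive RC object with a single pointer field, for brevity.
struct RcObject {
    int rc = 0;
    RcObject* field = nullptr;
};

void rc_retain(RcObject* o) { if (o) ++o->rc; }

void rc_release(RcObject* o) {
    if (o && --o->rc == 0) {
        rc_release(o->field);   // drop references held by the dead object
        delete o;               // reclaimed immediately, no tracing needed
    }
}

// The write barrier: every pointer store is routed through here so the
// counts always match the object graph.
void rc_write(RcObject* src, RcObject* newref) {
    rc_retain(newref);          // increment first: safe if newref == src->field
    rc_release(src->field);
    src->field = newref;
}
```

Note the increment-before-decrement order: decrementing first could free the very object being stored if both sides refer to it.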

5.1 Advantages + Disadvantages:
Advantages:
- Costs are spread throughout computation time
- Operates well without much headroom
- Potentially free memory immediately
- Locality may be no worse than the program itself, since only actively used or modified objects are referenced
- Does not need runtime assistance, doesn’t need to know roots
- Heavily used today
- Available in C++ through “smart pointers” or pointers with special logic
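In C++, `std::shared_ptr` maintains the count automatically; a tiny sketch (`count_with_copy` is my name):

```cpp
#include <memory>

// Observes the reference count while an extra shared_ptr copy is alive.
long count_with_copy(const std::shared_ptr<int>& p) {
    std::shared_ptr<int> q = p;   // copy: the shared count is incremented
    return p.use_count();         // p and q currently share ownership
}                                 // q destroyed: count drops back down
```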





Disadvantages:
- Imposes mutator overhead
- Naive implementation requires frequent, unnecessary, costly reference modification
- Reference count updates must be atomic with respect to pointer loads and stores; it is not sufficient for the RC system to be consistent only with respect to itself. 
- Makes read-only operations require a reference-count change, which busts caches
- Cannot reclaim cycles in references 
- Storing a full pointer-sized count for every object (a smaller field cannot naively be used, since an object could be referenced by everything in the heap) is costly when compared to the size of the pointed-to object.
- May still pause during large deallocations. Amortization of this can lead to higher space usage

5.2 Improving Efficiency
- Two improvements: reduction of barriers and/or use of cheaper synchronization

Solutions:
Deferral: Delay freeing and bookkeeping to some deferred reclamation phase
Coalescing: Avoid unnecessary reference updates by only altering references at checkpoints in the program
Buffering: Collects all reference count modifications into a list and lets a collector thread apply the changes

- All share common idea: No changes made until end of the current “epoch” of collection. 
- “Buffered” counting is covered in chapter 18

5.3 Deferred Reference Counting
- RC mutator overhead is large compared to generational/concurrent algorithms
- Removing unnecessary count mutations is an effective compiler optimization; hard to do by hand

- Deferred used by many high-performance RC systems
- Idea is to track only references from the heap. Fast, but counts are out of date until the end of the epoch, when a stop-the-world collector corrects them.
- When reference hits 0, add object to “Zero Count Table” to check and free or not at end of epoch
- ZCT contains objects which may be alive but have a 0 count

- Very short collection compared to tracing collectors
- Collection consists of “marking” everything accessible from roots and freeing anything in ZCT not marked
- Alternatively, can increment the count of each object reachable from stack references by chasing the roots
- If “marking” consists of incrementing counts, must afterwards decrement everything reachable from the roots and add zero-count objects back to the ZCT 
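A toy sketch of the ZCT mechanism (all names invented; a real system buffers the decrements and scans the actual stack):

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Obj { int rc = 0; };     // only heap references are counted

std::set<Obj*> zct;             // zero count table: possibly-dead objects

void heap_release(Obj* o) {
    if (o && --o->rc == 0) zct.insert(o);  // maybe dead: decide at epoch end
}

void heap_retain(Obj* o) {
    if (o && o->rc++ == 0) zct.erase(o);   // revived by a new heap reference
}

// End of epoch: with the world stopped, free ZCT entries no root reaches.
std::size_t collect(const std::set<Obj*>& stack_roots) {
    std::vector<Obj*> dead;
    for (Obj* o : zct)
        if (!stack_roots.count(o)) dead.push_back(o);
    for (Obj* o : dead) { zct.erase(o); delete o; }
    return dead.size();
}
```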

- Can reduce cost of pointer mutations by 80% or more.
- Still requires updates when a heap object is changed

5.4 Coalesced Reference Counting
- Handles overhead of modifying pointers

- If you increment and then decrement, the pair can be removed with the same net effect, provided no collection happens in between. 
- On mutation, a new entry is added to a “log” and the object is marked dirty. The log records the object and the pointer update.
- Safe for concurrent use
- During collection cycle, objects in log are modified, avoiding duplicates and avoiding doing work if the pointer wasn’t actually updated

- Adds need for (shorter) pauses and space overhead for ZCT and logs

5.5 Cyclic Reference Counting
- RC can’t free cycles in object references

Solutions: 
1) Combine with occasional tracing
- Really just less frequent tracing collection, with less garbage

2) Discriminate between pointers that close cycles and ones that don't
- Strong references don’t close cycles. Weak ones do.
- Reference count only the strong cycles
- Needs a way to automatically ensure that strong pointers that would close a cycle become weakened, and that weak pointers created this way become strong again if a mutation would no longer create a cycle. Known schemes are either unsafe or not guaranteed to terminate. Difficult to get right.
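C++'s `shared_ptr`/`weak_ptr` pair illustrates the strong/weak idea, with the caveat that here the programmer, not the runtime, chooses which edge is weak:

```cpp
#include <memory>

// A doubly-linked pair: the back-reference is weak, so it is not counted
// and the pair does not keep itself alive.
struct Node {
    std::shared_ptr<Node> next;   // strong: owns its target
    std::weak_ptr<Node> prev;     // weak: does not keep the target alive
};

// Builds the pair, drops the stack references, and reports whether
// the nodes were actually reclaimed.
bool pair_is_collected() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;                   // strong edge a -> b
    b->prev = a;                   // weak back-edge b -> a: no count cycle
    std::weak_ptr<Node> watch = a;
    a.reset();
    b.reset();                     // last strong references gone
    return watch.expired();        // true iff a (and with it b) was freed
}
```

With two strong edges instead, `watch.expired()` would stay false forever: the classic uncollectable cycle.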

3) Trial deletion
- Two invariants allow this:
— In any garbage internal pointer structure, all counts must be due to other pointers in the structure.
— Only a pointer deletion that leaves the count greater than 0 can be a step in which a garbage cycle was created

Algorithm:
- Decrement all counts maintained by references held by objects in suspected cycle
- If the count of any objects in the cycle is greater than 0, the structure is not garbage.
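A sketch of the test itself (the graph representation is mine; the real Recycler interleaves this with coloring and runs concurrently):

```cpp
#include <vector>

struct N { int rc; std::vector<N*> out; };  // out = references this node holds

// Trial deletion: subtract every reference that originates inside the
// candidate set. If all counts fall to 0, every reference was internal
// and the set is a garbage cycle. The trial is then undone.
bool is_garbage_cycle(const std::vector<N*>& candidates) {
    for (N* n : candidates)
        for (N* t : n->out) --t->rc;        // trial-delete internal refs
    bool garbage = true;
    for (N* n : candidates)
        if (n->rc > 0) garbage = false;     // an external reference remains
    for (N* n : candidates)
        for (N* t : n->out) ++t->rc;        // restore the real counts
    return garbage;
}
```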

- The effective, concurrent “Recycler” algorithm from the book does this
- Some objects can never be members of cycles. Can be skipped.

5.6 Limited Field Reference Counting
- In reality, most objects have small counts; a whole pointer-sized field is unnecessary.
- While some objects can be very popular, it is possible to cap the reference count at a low value; when an object exceeds this cap, its count becomes ‘sticky’.
- Any sticky object has its count restored by a tracing collector that runs periodically.
- May even use single bit for “shared vs unshared”. May get good performance in some languages

5.7 Issues to Consider
Good: Can reference fewer objects, lower space overhead
Bad (for naive): Cannot reclaim cycles, adds work to all reads and writes, requires synchronization in concurrent use, may tightly couple the mutator to the GC runtime.

Environment:
- Despite downsides, may be perfect for many objects without pointers to other objects (images and assets) and for large objects (a pointer-sized count in the header is proportionally cheap; expensive to compact or copy)
- Can be implemented as library, does not need runtime support if tracing/epochs not used.
- Hard to get right in concurrent environment

Advanced solutions:
- Advanced RC variants (deferred/coalesced) fix these faults, but often require a stop-the-world pause, which is what people sometimes use RC to escape.
- Tracing collectors for limited field counting need headroom, in ballpark of adding pointer fields to objects.
- Deferred/coalesced still avoid synchronization overheads, but add space overhead for the ZCT/logs
- RC variants still scale very well with large heaps, can be used in hybrid collectors

Chapter 6: Comparing Garbage Collectors
- Choice of algorithm depends on needed behavior
- No algorithm fastest for all memory access patterns
- Most algorithms have a performance impact relative to the heap size, further muddles benchmarks

6.1 Throughput
- Collection itself *should* only account for small fraction of execution time.
- Moving cost from the collector to the mutator usually has a negative impact on overall performance.
- Mutator overhead can include incidental costs besides barriers, such as worsened cache locality due to relocation in copying/compacting collectors
- Synchronization costs high, mostly affects naive RC
- Need to consider big-O algorithmic complexity as well as number of instructions and locality
- Locality important. Cost of chasing pointers dominates many algorithms

6.2 Pause time
- Modern GC may still pause for over a second on some systems
- RC a solution, but higher-performance variants will reintroduce a shorter pause for collection at the end of epochs

6.3 Space
- All GC has a space overhead
- All non-compacting algorithms need memory for their own structures
- Tracing / deferred RC requires headroom
- Needs about 3-6x as much memory as minimum for optimal performance.
- Naive RC has good space performance. Recently-freed objects can allow their space to be used for new allocations.
- Not all dead objects are freed every cycle, for performance (execution time) reasons.

6.4 Implementation
- Hard to implement correctly
- Errors frequently detected outside of collection phase
- Moving collectors must find every reference to a structure. Non-moving need only find one to prove the object alive.
- Conservative collectors must guess whether a value is a reference or arbitrary data when they lack runtime support or accurate knowledge of the object graph and stack layout. They accept a space leak in exchange for safety.
- RC can be implemented as a library, but coupling to mutator introduces overheads. All operations must be correct.
- Programmer must ensure that inlining is done correctly by the compiler and that the common case is the inlined path.

6.5 Adaptive systems
- VMs frequently offer different collectors, each with tunable parameters. Hard to tune correctly.
- Self-tuning systems have been used in the field. Some try to meet parameters such as a maximum pause time (sunspot), while others use the heap size or a machine-learning model fed the application to determine the collector and its parameters

6.6 A unified theory of garbage collection 
Abstract GC
- Can define GC as a function which returns all objects reachable by the roots
- Use work list W of objects to process

Tracing GC
- Paints tracing as form of RC. Tracing counts all references during collection phase.
- Tracing computes the least fixed-point of the worklist W: the lowest counts that satisfy it.
- Abstract tracing can also be seen as partitioning of nodes into two sets via the tricolor abstraction: the white(count=0) nodes and the black(count > 0) nodes.

RC GC
- Can see similarity with tracing by making all reference changes into a deferred increment/decrement(buffering).
- All deferred increments processed before all deferred decrements. If decremented to zero, is freed.

- Both algorithms similar. 
- Difference example: a 2-object cyclic garbage graph has two fixed-points, with counts of 0 or 1 for each object. Tracing computes the least fixed-point and frees the cycle; RC computes the greatest and does not. Partial tracing starts from the greatest fixed-point and derives the least.

Chapter 7: Allocation
- How memory is freed places restriction on allocation
- GC traits:
— GC systems free space all at once. Moving collectors free large regions at once.
— GC systems have more knowledge of objects than explicit memory managed systems usually.
— GC systems encourage a different style. The cheapness of heap use means it gets used more often.

7.1 Sequential allocation
- Requires a free pointer and a limit pointer into a large free chunk of memory
- “Bumps” the free pointer up as you allocate

Attributes:
- Simple
- Efficient
- Better cache locality
- Non-moving collectors might have issue with it due to fragmentation
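A sketch of the bump allocator (names mine): the whole fast path is one compare and one add.

```cpp
#include <cstddef>
#include <cstdint>

struct BumpAllocator {
    std::uintptr_t free_p;   // next free byte
    std::uintptr_t limit;    // end of the chunk

    BumpAllocator(void* start, std::size_t size)
        : free_p(reinterpret_cast<std::uintptr_t>(start)),
          limit(free_p + size) {}

    void* alloc(std::size_t n) {
        if (free_p + n > limit) return nullptr;   // out of space: time to GC
        void* cell = reinterpret_cast<void*>(free_p);
        free_p += n;                              // "bump" past the new object
        return cell;
    }
};
```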

7.2 Free-list allocation
- Holds onto a list of free “cells” of memory and allocates them via some policy

7.2.1 First-fit allocation
- Provides first cell large enough to fill request
- May split a cell if the remainder isn’t too small

Attributes:
- Small remainder cells accumulate near front of list, slowing down allocation
- May behave like best-fit since cells end up ordered more or less smallest-to-largest
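A first-fit sketch over a list of (address, size) cells; `MIN_CELL` and the failure value are my choices:

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

struct Cell { std::size_t addr, size; };

constexpr std::size_t MIN_CELL = 8;   // don't create unusably small remainders

// First fit: return the address of an allocated cell, or SIZE_MAX on failure.
std::size_t first_fit(std::list<Cell>& free_list, std::size_t n) {
    for (auto it = free_list.begin(); it != free_list.end(); ++it) {
        if (it->size < n) continue;       // first cell that fits wins
        std::size_t addr = it->addr;
        if (it->size - n >= MIN_CELL) {   // split, keep the remainder listed
            it->addr += n;
            it->size -= n;
        } else {
            free_list.erase(it);          // hand out the whole cell
        }
        return addr;
    }
    return SIZE_MAX;                      // no cell is large enough
}
```

The small-remainder cells it skips over are exactly the ones the notes say accumulate at the front of the list.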

7.2.2 Next-fit allocation
- To skip over the collection of remainder cells, the search starts from where the last one left off, wrapping around to the beginning when it hits the end of the list

Attributes:
- Objects created at different times are intermingled. Has locality and fragmentation problems.
- Access through the moving pointer has poor locality

7.2.3 Best-fit allocation
- Allocate the cell with the size closest to the request’s size
- Good performance

7.2.4 Speeding free-list allocation
- Allocating from a single monolithic list scales poorly to large memory
- Alternatives:
1) Use of a balanced binary tree of free-cells
- For first fit or next-fit, need to use Cartesian tree that indexes by both address and size
- Makes worst-case logarithmic time, not linear
2) Bitmap-fits allocation
- Use bitmap on the side to determine free-space on the heap
- Less vulnerable to corruption
- Minimize constraints on cell size by removing requirement that they have space to describe their size.

7.3 Fragmentation
- Can prevent a request from succeeding if the memory is available, but not available contiguously.
- Can cause a higher use of cache lines, pages, and overall address space 
- Optimal allocation is NP-hard
- Only complete solution is compacting or copying collection

7.4 Segregated-fits allocation
- Because most time is spent finding a cell of the correct size, segregated fits uses lists of different sizes rather than allocating from one list
- One list per size class, with one for the largest objects (the s(k) class).
- Allocation from all classes but s(k) takes O(1) time.
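A sketch of the O(1) size-to-class mapping (the grain and class count are arbitrary choices):

```cpp
#include <cstddef>

constexpr std::size_t GRAIN = 16;        // classes: 16, 32, ..., 256 bytes
constexpr std::size_t NUM_CLASSES = 17;  // 16 small classes + one large class

// O(1) mapping from request size to size class; the last class is the
// variable-sized "large object" class, s(k) in the notes.
std::size_t size_class(std::size_t n) {
    if (n == 0) n = 1;
    std::size_t grains = (n + GRAIN - 1) / GRAIN;   // round up to a grain
    return grains < NUM_CLASSES ? grains - 1 : NUM_CLASSES - 1;
}
```

Allocation is then just "pop the head of that class's free list"; only s(k) needs a search.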

7.4.1 Fragmentation
- external fragmentation = Free cells that are too small to satisfy a request
- internal fragmentation = Space inside allocated cells is wasted because of the need to round the size up to a larger allocation unit
- Segregated-fits trades internal fragmentation against the number of size classes.

7.4.1.1 Populating size classes
- How to populate the size classes for segregated fits?
- Solutions:
1) Big bag of pages
- Choose a block size B, a power of two
- Allocate blocks of size B, with support for allocating a multiple of B
- If a block of size n*B+s is requested, allocate n blocks of size B, create a size class for s, and split the next contiguous block into B/s cells of size s. That block is tagged as containing cells of size s.
- Better to store this information external to the cells
- Able to associate information like types with cells through the given free-list, can save space
- Easy to recombine blocks: only promote the B/s cells of size s back into a block of size B once all B/s cells are free.

2) Splitting
- If we restrict allocation to certain size classes, then splitting produces remainders of known sizes. These remainders, together with the initial restricted size classes, become the classes.
- Ex: Buddy system. Required allocation in powers of two. Historical, now abandoned because it created too much internal fragmentation
- Modern buddy-system variants exist that aren’t powers of two. Ex: Fibonacci. A smaller ratio between adjacent sizes means less internal fragmentation.

7.5 Combining segregated-fits with first-, best-, and next-fit
- Can use first, best, or next fit to allocate from s(k), the largest, variable-sized size class

7.6 Additional considerations
Alignment
- Some types must be aligned on a given boundary
- Rounding up is wasteful
- May be able to find an allocation that satisfies boundaries of fields in objects without wasting space
— Ex: Placing an object with three words in the header on an odd alignment so that the following data is double-word aligned
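The usual rounding helper, plus the header trick from the example (the 8-byte word and 3-word header are assumptions):

```cpp
#include <cstdint>

// Round addr up to the next multiple of align (align must be a power of two).
std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t align) {
    return (addr + align - 1) & ~(align - 1);
}

// The trick from the example: with 8-byte words and a 3-word header,
// starting the object at an odd word makes the data after the header
// land on a 16-byte (double-word) boundary with no padding.
std::uintptr_t data_start(std::uintptr_t obj_start,
                          std::uintptr_t header_words) {
    return obj_start + header_words * 8;
}
```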

Size Constraints
- Collection schemes sometimes need data in the object header, making a minimum size

Boundary tags
- Specify the size and vacancy of each cell
- May be possible to use a bitmask/bit table
- Not all algorithms and memory models need boundary tags 

Heap Parsability
- Need ability to traverse cells on heap
- Need to store object’s header
- Placed before data, needed to know how to interpret data
- If an object is overwritten in-place with a smaller object, a heap scanner may have problems parsing data over the end of the new object
- If object must be placed at offset for size alignment reasons, must figure out how to mark gap as filler data. 
— One solution is to place zeroes there and not allow headers to start with 0
— Another solution is to place objects storing information about the gap in the header position of the cell.

Locality
- Sequential allocation leads to good cache behavior with references
- Objects located near each other that are freed in short succession will coalesce and reduce fragmentation
— Objects allocated at around the same time tend to be freed at around the same time, so non-moving a good default

Wilderness Preservation
- Ask the OS for new heap memory for allocation only as a last resort

Crossing maps
- For segregated fits, must store the address of the last object in each segment
- Makes it easier to find the start of an object from an address inside it, and makes heap parsability easier.

7.7 Allocation in concurrent systems
- Since atomicity is required for allocation steps, use of locks or atomic operations is needed for allocation from single list by multiple threads
- Allocation can become bottleneck this way
- Give each thread its own allocation area, grown by allocating from a shared pool
- Only interaction with pool must be atomic
- Can give larger chunks to threads allocating at a higher rate, require less global pool use
- Idea: Use processor-local, not thread-local allocation buffers
— Use transactions. If pre-empted in middle of allocation, restart allocation process from new processor’s pool.
— More instructions, but same latency. Better for many threads
- Idea: Let each thread pick up new pieces of memory through incremental sweeping. Cells allocated by a thread are moved into the thread’s free-lists.
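A sketch of thread-local buffers refilled from a shared pool (chunk/pool sizes invented; byte offsets stand in for real addresses):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CHUNK = 1024;        // refill unit handed to a thread
constexpr std::size_t POOL_SIZE = 1 << 20; // total shared pool, in bytes

std::atomic<std::size_t> pool_cursor{0};   // next free offset in the pool

struct Tlab {                              // one per thread
    std::size_t free_p = 0, limit = 0;     // private bump region (offsets)

    // Returns the offset of the allocation, or SIZE_MAX if the pool is full.
    std::size_t alloc(std::size_t n) {
        if (free_p + n > limit) {          // chunk exhausted: refill
            std::size_t start = pool_cursor.fetch_add(CHUNK); // only sync point
            if (start + CHUNK > POOL_SIZE) return SIZE_MAX;
            free_p = start;
            limit = start + CHUNK;
        }
        std::size_t cell = free_p;
        free_p += n;                       // private bump: no synchronization
        return cell;
    }
};
```

Only the refill touches shared state, which is why a faster-allocating thread benefits from being handed larger chunks.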

7.8 Issues to Consider
- Requirements of the collector constrain the allocator
— Non-moving like mark-sweep => free-list
— Copying/compacting collectors => Sequential allocation 
- Bit tables on the side for recording metadata are very helpful, improve performance. Adds some time to allocation
- Block-based allocation can reduce per-object overheads by associating properties with blocks. It slightly worsens internal fragmentation. Good solution to collectors with multiple spaces.
- Segregated-fits is generally faster than other free-list schemes
- Because collector frees in batches, free-cell recombining techniques less relevant. 

Chapter 8: Partitioning the Heap
- There are benefits to partitioning the heap and varying the algorithm, parameters, and collection frequency between partitions

8.1 Terminology
- space: A logical set of objects that receive similar treatment. Uses one or more chunks of address space
- chunk: A contiguous range of memory, often power-of-two sized and aligned.

8.2 Why to partition
Partitioning by mobility:
- Some objects may not be moved because of lack of communication between runtime and compiler, or because the resource is provided by the OS.
- If a library does not expect GC (expects a constant address), we must pin the object.
- Moving objects requires more work from the tracer than leaving them immobile.
- A conservative collector cannot move objects referred to by references that cannot be changed. Sometimes this means heap references can be altered but stack references cannot. A mostly-copying collector can be the solution, allowing movement for all “non-ambiguous” objects and objects not referenced by pinned objects.

Partitioning by size
- Moving large objects is costly
- Large objects may be placed in a separate “large object space”
- Typically placed on separate pages
- Managed by non-moving collector

Partitioning for space
- Objects with a higher expected rate of allocation and mortality (where fragmentation is more likely to be a problem) can use a non-fragmenting algorithm (usually slower) while other objects use an algorithm such as mark-sweep.

Partitioning by kind
- Advantages of determining kind by address:
1) Cache advantage of not needing to load type field
2) Placement of objects with a property into a single space allows for property to be associated with space, and not stored in header
3) Avoids scanning objects without pointers, and avoids cycle-detecting tracing in RC for objects that can’t be part of garbage cycles.

Partitioning for yield
- Weak generational hypothesis: “Most objects die young”
- Better to frequently collect the subset of the heap most likely to contain garbage
- Allows some garbage in less-frequently collected spaces to hang around slightly longer, but provides much better throughput.

Partitioning to reduce pause time
- Restrict the size of the condemned space
- Ex: Use of regions with large number of internal pointers, can be freed if entire region has no external references
- Improves expected pause times, but not the worst case, if not enough garbage is freed to satisfy allocation

Partitioning for locality
- Most algorithms touch many objects and have higher than necessary memory bandwidth requirements
- Tracing touches every live object
- GC can improve program locality by collecting newly-allocated objects and objects likely to contain many references to one another together.
- Generational GC can improve locality of both collector and mutator

Partitioning by thread
- Easier to handle the concurrency of collection if we can collect only the objects created by, and reachable from, a single thread.

Partitioning by availability
- Ex: Because remote object access is more costly than local access, and because remote objects tend to live longer than local objects, it is more performant to handle them differently
- Memory speed also varies(cache hit vs miss), so this reasoning can enforce good cache behavior by tracing collectors

Partitioning by mutability
- Recently created, mutable objects are mutated frequently, while immutable objects are frequently either tenured or discarded
- These different behaviors reward different allocator behavior

8.3 How to Partition
- Easiest way is to separate into non-overlapping, power-of-2-aligned chunks.
- Problem: Allocating all of these spaces contiguously is inefficient with regard to use of memory on 32-bit systems.
— Alternative is to implement spaces as separate chunks of address space. May require object’s address to be looked up in table / complicate access.
- May be easier to simply add bits to the object header and to determine space as a property than to physically separate objects in memory.
— Has the upside of allowing cheap space changes at runtime. Ex: unpinning an object

8.4 When to Partition
- Options include: compile time, at allocation time, at collection time, and at mutator access time.
- Best known is generational, partitions by age, does so dynamically
- Collector can partition. Ex: Mostly-copying collector not moving pinned data
- Allocator can partition. Ex: thread-local data, allocation into the large object space
- Compiler can partition. Ex: immortal data, pre-tenuring objects based on knowledge of future behavior
- Mutator can partition. Ex: moving objects on read/write barriers
