Monday, August 25, 2014

Readings of chapters 1-4

Chapter 1: Intro
- The first GC demo went awry: the program had exhausted all memory on the IBM machine, and the GC slowly printed out memory statistics, using up the rest of the demo's time.

- Runtimes allow allocation on heap rather than stack or static allocation.
-- Needed for factories, closures/returned functions

1.1 Explicit deallocation
- Heap memory can be explicitly deallocated like in C.
- Runs risk of programmer error. 
- Two kinds of errors:
-- Pointer to a deallocated region = dangling pointer
--- Can be mitigated with fat pointers, but they aren't performant, so they're only used for debugging
-- Region with no remaining pointers to it that is never deallocated = memory leak
-- Incorrect deallocation can cause both problems
- Difficult to get right under concurrent access
- Solutions: allocation from a pool which is freed as a whole, or "ownership" semantics that prevent sharing and passing by reference, ex: smart pointers like unique_ptr and shared_ptr (see the sketch below)
-- Do all libraries used by the programmer use the same approach? Does the API require mixing styles?
- “Liveness is a global property but freeing is a local choice”
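
A minimal C++ sketch of both failure modes and of the ownership-based fix (the Node type is mine, purely for illustration):

    #include <memory>

    struct Node { int value; };

    int useAfterFree() {
        Node* n = new Node{42};
        delete n;          // explicit deallocation...
        return n->value;   // ...dangling pointer: undefined behaviour
    }

    void leak() {
        Node* n = new Node{7};
        (void)n;  // last pointer goes out of scope without delete: leak
    }

    int ownership() {
        // unique_ptr encodes single ownership: the Node is freed exactly
        // once, when the owning pointer goes out of scope
        auto n = std::make_unique<Node>();
        n->value = 42;
        return n->value;
    }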



1.2 Automatic dynamic memory management
- GC prevents errors in 1.1
- Guarantees to free all garbage, with two caveats:
-- Tracing collectors may not free some unreachable memory
-- Some algorithms may deliberately leave some garbage unreclaimed for efficiency reasons
- Good for software engineering, because it removes the need to know the semantics/implementation/lifespan of other loosely coupled components.
- Evidence that GC reduces development costs
- GC cannot prevent "space leaks": structures that are reachable but grow without bound, or reachable but never used again
- GC has no natural parallel for other explicitly managed resources (files, locks, etc.), but those domains sometimes need collectors anyway

1.3 Comparing algorithms
- No “best algorithm.” Each is best for different metrics. Design choice of algorithm.
-- The JVM switches collectors on the fly

Attributes:
- Safety: The collector must never free live objects. This is difficult for "conservative" collectors without runtime support, which must figure out what is a pointer from compiler output.
- Throughput: Overall application speed (mutator + collector). Since most time is spent in the mutator, not the collector, it is often worth shifting costs onto the collector.
- Completeness / promptness: May not want to, or be able to, collect all garbage in the heap every cycle, ex: generational collectors.
- Pause time: GC pause times can be important in some contexts, and the different ways to minimize them have their own problems; a few impose mutator overheads. Maximum pause time isn't everything: the usual measures are MMU and BMU, which track the mutator's CPU utilization, and so show both mutator overhead and pause time (utilization is 0 during a GC pause). See the formulas after this list.
- Space overhead: Sometimes GC imposes a per-object cost, other times it hides in the existing object layout. Copying collectors need a second semispace, and the collector's own data structures take up space.
- Good fit for a given language: Certain language features can make one choice much better than another, ex: mutability, declarative vs imperative style, or the ability to hook the collector with object finalizers.
- Scalability/portability: Use or requirement of hardware features, and appropriateness for massive heaps as well as small ones.
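
These measures can be written down precisely; a common formulation (my notation, the book may define them differently) is minimum mutator utilisation over all time windows of a given width, with BMU as its monotone closure:

    \mathrm{MMU}(w) = \min_{t} \frac{\text{mutator time in } [t,\, t+w]}{w},
    \qquad
    \mathrm{BMU}(w) = \min_{w' \ge w} \mathrm{MMU}(w')

A long pause drives MMU to 0 for windows no wider than the pause, while high mutator overhead lowers the whole curve.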

1.4 Performance impact
- GC has been called slow. Is it?
- Overhead (execution time) depends on memory use patterns and on hardware, which makes comparison difficult
- Experimental evidence suggests that with a large enough heap the performance overhead is small; with a smaller heap (around three times the minimum) it can approach 17%

1.5 Experimental methodology
- Many GC studies lack sufficiently rigorous methodology
- “Toy” benchmarks bad at imitating actual memory use patterns for GC
- Need to differentiate between steady-state and startup
- Need to differentiate between interpreted and JIT-compiled code
- Need to record the distribution of performance over steady-state executions for real accuracy
- Needs to consider time with respect to size
- Small configuration changes lead to large performance changes, and small changes in memory access patterns can decide when and how often collection happens. You won't observe this jitter without recording the distribution and reporting confidence intervals.

1.6 Terminology and notation

Memory:
- The heap = A collection of memory, either contiguous or composed of chunks
- Granule = The smallest size of memory that can be allocated. Usually word or double-word
- Chunk = Contiguous group of granules
- Cell = A smaller group of granules
- Object = Area of memory broken down into slots or fields
- Slot/field = A scalar or reference 
- Derived pointer = Pointer created by adding offset to base
- Interior pointer = Derived pointer to object field
- Block = An aligned chunk of a particular size, usually a power of two
- Frame = A large, 2^k-sized and aligned chunk of address space
- Page = The unit of the virtual memory system
- Card = A 2^k-sized chunk smaller than a page, used in schemes for remembering cross-space pointers
- Cache line = The unit of memory transferred between the cache and memory

Mutator and collector:
- Mutator: Allocates new objects and mutates the object graph by changing reference fields. References may lie in heap objects as well as in other "roots."
- Roots: A finite set of pointers held in storage that is directly accessible to the mutator without having to traverse other objects. Memory unreachable from the roots is garbage. Usually consists of static/global storage and thread-local storage.
- Collector: Executes the GC code. Once memory has been reclaimed it can no longer be accessed by the mutator; finalizers are the exception.

Notation for references / fields
- Treated as index into object, 0-indexed.
- Traditional C */& address and dereference operations
-- ex: &N[i] = address of the i-th field in N

Liveness
- live = Will be accessed in future by mutator. 
- Liveness is undecidable: we don't know whether the mutator will ever access what we hold onto.
- Can be approximated with pointer reachability. This is conservative, but acceptable.
- M ->f N denotes that field f of M holds a pointer to N
- Liveness = reachability as far as a collector is concerned

Operations
- Assume allocate/free are provided functions for memory allocation
- New / Read and Write exist for the mutator
- Sometimes there are read/write barriers that require the mutator to communicate with the collector somehow
- New and Allocate both return address of allocated memory
- Will mark things that must execute atomically with the “atomic” keyword

Sets
- A set has the usual mathematical definition
- |S| is the cardinality, or number of elements, in the set S
- A multiset allows for repeated membership. Cardinality counts repeats
- A sequence is an ordered list of elements
- A tuple is a fixed-length sequence.

Chapter 2: Mark-Sweep GC

All GC algorithms have three tasks:
1. Allocate space for new objects
2. Identify live objects
3. Reclaim space occupied by dead objects

- Marking = Traversal of object graph from roots, setting flag of all live objects
- Tracing = Traversal of object graph from roots 
- Sweeping = Free all objects not marked as live

2.1 The mark-sweep algorithm
- If mutator cannot allocate new memory, collector is called
- Collector is atomic, operates in stop-the-world mode

- Must prime work list / tabulate all objects
- Trace from roots, so roots placed in work list first
- Marking can be done by setting a bit/byte in side table or in object’s header
-- A side table can cause more cache misses
- Marking is typically done during the traversal for efficiency, but not always (ex: concurrent collectors)
- Work list can be stack => depth-first traversal

Mark algorithm:
- Add roots to work list
- Add all objects they reference without mark bit set. Set their mark bit
- Terminate when none can be added because all live objects have mark bit set
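
A rough C++ sketch of the marking loop with a stack as the work list (the Object layout is my toy model, not the book's):

    #include <vector>

    struct Object {
        bool marked = false;
        std::vector<Object*> fields;  // reference slots only, for brevity
    };

    void markFromRoots(const std::vector<Object*>& roots) {
        std::vector<Object*> worklist;  // stack => depth-first traversal
        for (Object* r : roots)
            if (r != nullptr && !r->marked) {
                r->marked = true;
                worklist.push_back(r);
            }
        while (!worklist.empty()) {
            Object* obj = worklist.back();
            worklist.pop_back();
            for (Object* child : obj->fields)
                if (child != nullptr && !child->marked) {
                    child->marked = true;   // reached: will be traversed
                    worklist.push_back(child);
                }
        }
    }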

Sweep algorithm:
- Free all dead objects
- Either clear mark bits during the sweep or flip the meaning of the mark bit for the next collection cycle
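
Continuing the same toy model, a sweep that resets mark bits as it goes (the heap is modelled as a simple list of objects):

    // Reclaim unmarked objects; unmark the rest so the mark bits are
    // clear for the next collection cycle.
    void sweep(std::vector<Object*>& heap) {
        for (Object*& slot : heap) {
            if (slot->marked)
                slot->marked = false;   // live: reset for next cycle
            else {
                delete slot;            // dead: reclaim
                slot = nullptr;
            }
        }
    }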

Limits:
- Can cause heap fragmentation
- Sweeper must be able to find next object. May have to parse heap.

2.2 The tricolor abstraction
- All objects are either black (marked and fully scanned), white (not yet visited; garbage if still white at the end), or grey (reached, but with children still being traversed).
- All objects start out white, are shaded grey when first encountered, and become black once they and their references have been processed.
- Fields are also colored
- Invariant: No black objects point to white objects.

2.3 Improving mark-sweep
- Smart cache use is paramount to performance. 
- Mark-sweep has poor temporal locality: reference-following accesses RAM out of order, and hardware prefetching works poorly on pointer chasing.

2.4 Bitmap marking
- Can either store bit in object header or in side table.

- A side bitmap can be kept per block, or a single one can be used for the whole heap.
- Placing the bitmap at the same position in every block is bad for set-associative caches.
- Bit-level marking is not threadsafe without atomic operations.
- Much denser than a bit in every object header.
- Better cache performance, because marking dirties fewer lines than touching every object.
- Conservative collectors with no runtime support may have no place to store the bit in objects. Storing it in a side bitmap is less messy than prepending a header to every object.
- Per-block bitmaps make it easy to tell if an entire block is garbage.
- The density of the bitmap enables fancier algorithms, like marking with a bounded stack: on overflow, the collector can re-traverse the bitmap afterwards to process the remaining grey objects and fields.
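
A sketch of a per-block side bitmap; the block size, granule size, and layout are all assumptions:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kBlockSize = 1 << 16;  // 64 KiB blocks (assumed)
    constexpr std::size_t kGranule   = 8;        // one bit per 8-byte granule

    struct BlockBitmap {
        uint8_t bits[kBlockSize / kGranule / 8] = {};  // 1 KiB per block
    };

    // Index of the bit covering an address within its block.
    inline std::size_t bitIndex(std::uintptr_t addr) {
        return (addr % kBlockSize) / kGranule;
    }

    // Note: this read-modify-write is not threadsafe; parallel markers
    // would need an atomic OR here.
    inline void setMark(BlockBitmap& bm, std::uintptr_t addr) {
        std::size_t i = bitIndex(addr);
        bm.bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }

    inline bool isMarked(const BlockBitmap& bm, std::uintptr_t addr) {
        std::size_t i = bitIndex(addr);
        return (bm.bits[i / 8] >> (i % 8)) & 1u;
    }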

2.5 Lazy sweeping
- Mark is O(L), where L is the size of the live data in the heap; sweep is O(H), where H is the size of the heap. Even though H > L, sweeping is usually faster than marking, because it proceeds linearly through the heap instead of chasing pointers.
- Can prefetch objects to improve sweep performance.
- Can also lay out similarly-sized objects sequentially for easy base + index*size access. Great for hardware and software prefetching.
- Can sweep concurrently with the mutator, because garbage can't be accessed again once it is unreachable.

Lazy sweeping
- Reduces the time that allocators can’t run
- Have the allocator advance the sweep pointer, skipping marked objects and allocating over unmarked ones (see the sketch below).
- Commonly keep a list of size classes from which you've allocated; each has a "reclaim list" of blocks not yet swept.
- If no free block, must either allocate from system or call collector to sweep and free memory.
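
A sketch of lazy-sweep allocation over one block of fixed-size cells (the block and reclaim-list structure is assumed):

    #include <cstddef>
    #include <vector>

    struct Cell { bool marked = false; /* ...payload... */ };

    struct Block {
        std::vector<Cell> cells;      // one size class per block
        std::size_t sweepCursor = 0;  // how far this block has been swept
    };

    // Advance the sweep pointer past marked (live) cells, unmarking them
    // for the next cycle; allocate over the first unmarked cell found.
    Cell* lazyAllocate(Block& b) {
        while (b.sweepCursor < b.cells.size()) {
            Cell& c = b.cells[b.sweepCursor++];
            if (c.marked) c.marked = false;  // live: skip it
            else return &c;                  // garbage: reuse immediately
        }
        return nullptr;  // block exhausted: take another from the reclaim
    }                    // list, ask the OS, or trigger a collection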

Problems with lazy sweeping:
- Garbage in unswept blocks will never be freed
- If we group by size, less-used unswept blocks won’t be used to fulfill requests for commonly-needed sizes
- Solution: Mark objects/blocks with the number of the current cycle (or the count modulo 2^K) and free everything tagged with a previous cycle's number when the new cycle begins

- Lazy sweep is best when heap is mostly empty. 
- Makes complexity proportional to amount of live data in heap.

2.6 Cache misses in the marking loop
- Even if you use a mark bitmap, you get cache misses when you access an object to traverse its references. But some objects don't have references.
- If we store objects of similar type together (Lisp uses a "big bag of pages"), we can avoid touching objects of types that contain no references.
- Marking dominates collection time, and roughly a third of it is spent fetching the first reference of each object.
- Solution: Prefetch an object when it first turns grey. Only works if timed correctly.

Problem: Prefetched cache lines arrive in FIFO order, but a mark stack is popped LIFO (depth-first), so prefetches are consumed in the wrong order.
Solution: Place a FIFO queue in front of the mark stack: prefetch an object as it enters the queue, so that by the time it leaves the queue to be processed it is likely already in cache.

- Tracing edges rather than nodes (pushing children without first checking their mark bits) grows the work list and executes more instructions, but has better cache behaviour.

2.7 Issues to consider
Mutator overhead: Mark-sweep has no mutator overhead in its simplest form. More complex collectors are typically based on mark-sweep, and those sometimes do have overheads.
Throughput: The mutator can't run during collection, so large stop-the-world pauses are possible.
Space usage: Much better than copying. Mark bits stored in objects can have zero overhead. More complex allocators are important to limit fragmentation.

To move or not to move?
- Non-moving collection allows for non-cooperative runtimes; conservative collectors can't safely modify objects' references.
- Heap fragmentation forces more frequent collection. Generational collection and segregated-fits allocation are solutions to this.

Chapter 3: Mark-Compact GC
- Allows for fast sequential allocation

Compaction orders:
- Arbitrary: Objects are relocated without regard for relationship or original order.
- Linearising: Objects are relocated so they are adjacent to related objects (objects they point to, members of the same data structure, etc.)
- Sliding: Objects are moved toward one end of the heap, maintaining allocation order.

3.1 Two-finger compaction
- Doesn’t allow for objects of varying sizes
- Knowledge of the volume of live data gives us the high-water mark of the region after compaction.

- Move a low pointer and a high pointer toward each other until they pass.
- When the low pointer hits an empty cell, the object at the high pointer is moved down into it. All objects must be the same size because of this; otherwise the fragmentation would become unwieldy. When an object is moved to a lower address, its new location is written into the old copy as a forwarding address.
- The low pointer ends at the high-water mark. All pointers into the evacuated region can then be corrected by reading the forwarding address stored where they point.

- Downside is that the compaction is arbitrary and that the size of all objects must be the same.
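
A sketch over an array of equal-sized cells (the Cell layout and the separate pointer-fixing pass are my assumptions):

    #include <cstddef>

    struct Cell {
        bool live = false;
        Cell* forward = nullptr;  // new location, once the cell has moved
        // ...fixed-size payload...
    };

    // Returns the high-water mark: the first free slot after compaction.
    std::size_t twoFinger(Cell* heap, std::size_t n) {
        std::size_t low = 0, high = n;
        while (true) {
            while (low < high && heap[low].live) ++low;         // find gap
            while (low < high && !heap[high - 1].live) --high;  // find live
            if (low >= high) break;
            heap[low] = heap[high - 1];           // move the cell down
            heap[high - 1].live = false;
            heap[high - 1].forward = &heap[low];  // forwarding address
        }
        // A second pass (not shown) replaces every pointer whose target
        // is at or above the high-water mark with the forward field
        // stored at that old target.
        return low;
    }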

3.2 The Lisp2 algorithm
- Allows for objects of varying sizes
- Three passes over the heap, each doing little work
- Fastest compaction, not accounting for page/cache effects
- Requires room for an address in each object.

3 passes:
1. Compute new location of each object
2. Update references
3. Compact objects

Since objects only move toward the start of the region, every destination is memory that has already been vacated.
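
A toy C++ model of the three passes; addresses are granule offsets and the heap is a list of objects in address order (all names are mine):

    #include <cstddef>
    #include <vector>

    struct Object {
        bool marked = false;
        std::size_t addr = 0, size = 0;   // address and size in granules
        std::size_t forward = 0;          // the extra slot Lisp2 requires
        std::vector<std::size_t> fields;  // references, as target addresses
    };

    Object* objectAt(std::vector<Object*>& heap, std::size_t addr) {
        for (Object* o : heap)
            if (o->addr == addr) return o;  // linear lookup, for brevity
        return nullptr;
    }

    void lisp2Compact(std::vector<Object*>& heap /* address order */) {
        // Pass 1: compute each live object's new location.
        std::size_t free = 0;
        for (Object* o : heap)
            if (o->marked) { o->forward = free; free += o->size; }

        // Pass 2: update references (roots would be updated here too).
        // Live objects only reference live objects, so forward is valid.
        for (Object* o : heap)
            if (o->marked)
                for (std::size_t& f : o->fields)
                    f = objectAt(heap, f)->forward;

        // Pass 3: slide objects down and clear marks. (A real heap would
        // also drop the dead objects from the model here.)
        for (Object* o : heap)
            if (o->marked) { o->addr = o->forward; o->marked = false; }
    }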

3.3 Threaded compaction
- Lisp2 requires three complete passes and each object needs space for a forwarding address.
- These costs arise because Lisp2 moves objects non-destructively; two-finger avoids them only because it knows all pointers needing repair point past the high-water mark.
- The solution is threading. It needs no dedicated forwarding slot, but requires enough room in each header to temporarily hold an address, and that pointers can be distinguished from other values.

- Works by reversing pointers: all references to N can be found through N itself.
1. The address-sized space in N's header is made to point to some object M that refers to N.
2. The field of M that used to point to N now points to the next object that refers to N.
3. All objects that refer to N are chained together in this way.

After N is moved, all pointers to N can be found by traversing this list and updating all pointers to point to the new N.
- Very cache unfriendly
- Multiple pointer changes / object accesses needed

3.4 One-pass algorithms
- To reduce the number of passes, forwarding addresses must be stored in a side table.

Commonalities:
- Marking uses a bitmap per block, setting a bit for every granule of a live object.
- A table is used for forwarding addresses. It stores, for each block, where the block's first live object will move.
- Arbitrary forwarding addresses can then be computed from the bitmap and the offset: tally the live bits before the object, moving left to right; that count (one bit per live granule) gives the displacement past the block's entry (see the sketch below).
- After sweep, these are used to recompute new addresses.
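
A sketch of that address computation, assuming a 64-bit-word bitmap per block and C++20's std::popcount:

    #include <bit>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kGranule        = 8;     // bytes per granule
    constexpr std::size_t kGranulesPerBlk = 1024;  // granules per block

    struct BlockInfo {
        uint64_t bitmap[kGranulesPerBlk / 64] = {};  // 1 bit per granule
        std::size_t forward = 0;  // where the block's first live data moves
    };

    // New address of a live object starting at granule g of its block:
    // the block's forwarding address plus the live granules before it.
    std::size_t forwardingAddress(const BlockInfo& b, std::size_t g) {
        std::size_t liveBefore = 0;
        for (std::size_t w = 0; w < g / 64; ++w)
            liveBefore += std::popcount(b.bitmap[w]);
        uint64_t below = (uint64_t{1} << (g % 64)) - 1;  // bits below g
        liveBefore += std::popcount(b.bitmap[g / 64] & below);
        return b.forward + liveBefore * kGranule;
    }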

3.5 Issues to consider
- Is compaction necessary?: Mark-sweep uses less memory than the alternatives and can be a good choice when memory is tight or when type information for references is lacking.
- Throughput costs of compaction: Compacted heaps allow fast allocation, use half the memory of copying, and are good when the ratio of heap size to total memory is large. But compaction itself is slower than copying or mark-sweep.
- Long-lived data: Immortal or long-lived data is needlessly copied again and again by copying collectors. Generations are sometimes a bad way to handle this in constrained memory; one can simply find where this 'sediment' gathers and never collect it.
- Locality: Compaction may preserve allocation order on the heap or move objects arbitrarily; the mutator's locality is hurt by arbitrary scrambling. Sliding compaction leaves free space contiguous, allowing constant-time allocation by pointer bumping.

Limitations of mark-compact algorithms:
- Not threadsafe because pointer fields are temporarily destroyed
- Some can only manage fixed-size objects
- Some place restrictions on mutator

Chapter 4: Copying Garbage Collection
- Faster than compacting, without added mutator cost.

4.1 Semispace copying collection
- Move all live objects from the fromspace to the front of the tospace, recording each object's new address in its fromspace copy.
- Fix references held in tospace objects by consulting the forwarding addresses left in fromspace.
- Then discard fromspace.

- No extra space in object headers is required; the forwarding address can overwrite the fromspace copy.

Work list implementations:
- Stack as in mark-sweep.
- Cheney scanning: Uses the grey objects in tospace as a FIFO and only requires a single extra pointer to traverse the unscanned objects (see the sketch below). http://en.wikipedia.org/wiki/Cheney's_algorithm
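
A sketch of Cheney scanning with fixed-shape objects (real collectors handle variable sizes; everything here is a toy model):

    #include <cstddef>
    #include <cstring>

    constexpr std::size_t kFields = 2;

    struct Object {
        Object* forward = nullptr;  // non-null once copied to tospace
        Object* fields[kFields] = {};
    };

    // Copy obj to tospace if it hasn't been already; return its new home.
    Object* copyObject(Object* obj, Object*& free) {
        if (obj == nullptr) return nullptr;
        if (obj->forward == nullptr) {
            std::memcpy(free, obj, sizeof(Object));
            free->forward = nullptr;
            obj->forward = free++;  // leave forwarding address in fromspace
        }
        return obj->forward;
    }

    void collect(Object* tospace, Object** roots, std::size_t nRoots) {
        Object* scan = tospace;  // [scan, free) holds the grey objects: the FIFO
        Object* free = tospace;
        for (std::size_t i = 0; i < nRoots; ++i)
            roots[i] = copyObject(roots[i], free);
        while (scan < free) {
            for (Object*& f : scan->fields)
                f = copyObject(f, free);  // copy children, fix the field
            ++scan;                       // scan pointer chases free pointer
        }
        // fromspace can now be discarded wholesale
    }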

4.2 Traversal order and locality
- Small heaps (relative to total memory size) are faster for compacting collectors.
- Enabling sequential allocation makes copying work well for large heaps
- Copying collector can completely rearrange the object layout in heap, improving locality and making the mutator faster.
- Problems: Hard to tell future use, and optimum layout would be NP-complete to find.
- Solutions: Heuristics. Profiling, preserve allocation order, place children next to parents
- Copying orders have performance impacts.
-- Many of the more clever copying orders show disappointing performance for non-tree-like structures.

4.3 Issues to consider
- Uses twice as much memory as mark-sweep or mark-compact.
- Easier to implement

- Allocation: Allocation is cheap; simply check for exhaustion and bump a pointer (see the sketch below). Good cache performance because of the linear motion. Sequential allocation is slightly faster than free-list allocation, but so little time is spent in allocation that it may not be noticeable.
- Space and locality: Only half the heap is usable, so it collects twice as frequently. Copying is faster than mark-sweep as long as the heap is sufficiently large.
- Moving objects: Some environments make moving objects impossible or costly, ex: conservative GC. Pointer finding can be easier in mark-sweep. Hard to do concurrently. Also, some objects (large ones, or ones with many incoming pointers) are expensive to move.
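
The sequential (bump-pointer) allocation path mentioned above, as a sketch (alignment and exhaustion policy are assumptions):

    #include <cstddef>
    #include <cstdint>

    struct BumpAllocator {
        std::uintptr_t free;   // next free address in the current space
        std::uintptr_t limit;  // end of the space

        // Returns nullptr on exhaustion, signalling that a collection
        // (or a larger heap) is needed.
        void* allocate(std::size_t bytes) {
            std::size_t aligned = (bytes + 7) & ~std::size_t{7};  // 8-byte
            if (limit - free < aligned) return nullptr;
            void* result = reinterpret_cast<void*>(free);
            free += aligned;  // just bump the pointer
            return result;
        }
    };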


Questions from reading:
1. A number of mentions have been made of the ability to change collectors on the fly. How much does dynamically switching the collector improve performance, compared to the weight of the code required to check whether it's time to change mode and to handle the swap during a collection cycle? Where does that cost land, in the mutator or the collector?
