Friday, October 31, 2014
A Great DWARF 101
http://www.dwarfstd.org/doc/Debugging%20using%20DWARF.pdf
Thursday, October 30, 2014
DWARF resources
http://www.dwarfstd.org/doc/DWARF4.pdf
Parsing DWARF is not very fun. There are two main libraries which will do that for you: libdwarf and libbfd. Because libbfd (the Binary File Descriptor library) seems more arcane and more rigidly suited to working with ELF/DWARF files external to the current program, I'm electing to use libdwarf. Documentation seemed hard to find, but it appears to be well-documented on SourceForge:
http://sourceforge.net/p/elftoolchain/wiki/libdwarf/
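For reference, here is a minimal sketch of walking compilation units with the classic libdwarf consumer API (dwarf_init / dwarf_next_cu_header / dwarf_siblingof). The elftoolchain variant documented above keeps largely the same entry points, but treat the exact signatures below as assumptions and check the man pages before relying on them.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <dwarf.h>
#include <libdwarf.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    Dwarf_Debug dbg = NULL;
    Dwarf_Error err = NULL;
    if (dwarf_init(fd, DW_DLC_READ, NULL, NULL, &dbg, &err) != DW_DLV_OK)
        return 1;

    /* Iterate over every compilation unit in .debug_info. */
    Dwarf_Unsigned cu_length, abbrev_offset, next_cu;
    Dwarf_Half version, addr_size;
    while (dwarf_next_cu_header(dbg, &cu_length, &version, &abbrev_offset,
                                &addr_size, &next_cu, &err) == DW_DLV_OK) {
        Dwarf_Die cu_die = NULL;
        /* Passing NULL for the current DIE yields the CU's root DIE. */
        if (dwarf_siblingof(dbg, NULL, &cu_die, &err) == DW_DLV_OK) {
            char *name = NULL;
            if (dwarf_diename(cu_die, &name, &err) == DW_DLV_OK)
                printf("compilation unit: %s\n", name);
            dwarf_dealloc(dbg, cu_die, DW_DLA_DIE);
        }
    }

    dwarf_finish(dbg, &err);
    close(fd);
    return 0;
}
```

From the CU DIE one would then recurse with dwarf_child/dwarf_siblingof to reach the type DIEs, which is where the layout information a collector cares about would come from.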
Proposal: Not-So Conservative Collection
I would use the type interface in their types.h, and would use libbfd or libdwarf to parse the DWARF data.
A caveat is that casting a pointer to an integer may lead to freeing memory that is still in use. As this is a purposeful defeat of the type system, I'm not sure if there is a way to track this and compensate for it.
In more depth:
Proposal: Static Memory-Management Cost Reasoning in Garbage Collected Languages
Wednesday, October 22, 2014
Read Barrier Elimination Through Concurrency
Table of Contents
- 1. Eliminating Read Barriers through Procrastination and Cleanliness
- 1.1. Abstract
- 1.2. Introduction
- 1.2.1. Use of parallelism and thread-local heaps with a single shared heap uses write barriers to enforce the heap invariants (no pointers into a local heap from the shared heap or another local heap).
- 1.2.2. Terminology: exporting writes = writes which incur a write barrier and export object to shared heap and set up forwarding pointer.
- 1.2.3. Needs a read barrier because reference might refer to newly-exported object.
- 1.2.4. Read barrier cost is high, but removal is nontrivial.
- 1.2.5. Idea: Delay operation
- 1.2.6. Idea: Avoid stalling using language semantics
- 1.3. Motivation
- 1.3.1. Implementation considered is multimlton.
- 1.3.2. Has userspace concurrency, many green threads mapped to single kernel thread.
- 1.3.3. Each core has local heap, and there's a single shared heap.
- 1.3.4. Max heap set at 3x min heap.
- 1.3.5. Extensive optimization to avoid heap usage
- 1.3.6. On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.
- 1.3.7. If object has been moved, read barrier returns new address. Therefore conditional read barrier.
- 1.3.8. Mutator spends ~20% of execution time in read barrier.
- 1.3.9. A Brooks-style unconditional barrier would reduce time, but since most objects are 3 words or less, the forwarding pointer would add significant space overhead.
- 1.3.10. Less than 1/100 of a percent of reads actually hit a forwarding pointer, so the read barrier is mostly wasted cost.
- 1.4. GC Design and Implementation
- 1.5. Cleanliness Analysis
- 1.6. Write barrier
- 1.7. Target Architectures
- 1.8. Results
1 Eliminating Read Barriers through Procrastination and Cleanliness
1.1 Abstract
1.1.1 Goal is to avoid need for a read barrier
1.1.2 Use concurrency and stalling to avoid need for immediate copy
1.1.3 Can avoid the need to stall by using language semantics to allow references into fromspace, or by copying immediately when the small set of references to an object is known
1.2 Introduction
1.2.1 Use of parallelism and thread-local heaps with a single shared heap uses write barriers to enforce the heap invariants (no pointers into a local heap from the shared heap or another local heap).
1.2.2 Terminology: exporting writes = writes which incur a write barrier and export object to shared heap and set up forwarding pointer.
1.2.3 Needs a read barrier because reference might refer to newly-exported object.
1.2.4 Read barrier cost is high, but removal is nontrivial.
1.2.5 Idea: Delay operation
- Read barriers are only needed if forwarding pointers can exist between a move and the next collection
- Can delay the operations that create them
- We can stall threads that need to cause a copy to the shared heap, procrastinating the move
- The GC is informed of these procrastinated operations at the start of collection
- As long as there are enough threads, the cost of procrastination is simply the cost of a context switch.
1.2.6 Idea: Avoid stalling using language semantics
1.3 Motivation
1.3.1 Implementation considered is multimlton.
1.3.2 Has userspace concurrency, many green threads mapped to single kernel thread.
1.3.3 Each core has local heap, and there's a single shared heap.
1.3.4 Max heap set at 3x min heap.
1.3.5 Extensive optimization to avoid heap usage
1.3.6 On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.
1.3.7 If object has been moved, read barrier returns new address. Therefore conditional read barrier.
1.3.8 Mutator spends ~20% of execution time in read barrier.
1.3.9 A Brooks-style unconditional barrier would reduce time, but since most objects are 3 words or less, the forwarding pointer would add significant space overhead.
1.3.10 Less than 1/100 of a percent of reads actually hit a forwarding pointer, so the read barrier is mostly wasted cost.
1.4 GC Design and Implementation
1.4.1 Threading system
1.4.2 Baseline collector (stop-the-world)
1.4.3 Local collector (split heap)
1.4.4 Remembered stacks
1.5 Cleanliness Analysis
1.5.1 Heap Session
1.5.2 Reference count
- We steal 4 bits in each object header to record the number of references within the current session.
- 4 bits encode zero, one, localmany, and globalmany
- Zero => Only references from stacks, globalmany => at least one reference outside current session.
- Implemented as part of write barrier. Non-decreasing, so represents a maximum.
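A minimal sketch (my own, not the MultiMLton code) of what packing this saturating 4-bit reference state into the header and bumping it from the write barrier could look like; the bit positions and names are assumptions:

```c
#include <stdint.h>

/* The four states described above, in increasing order. */
enum ref_state { REF_ZERO = 0, REF_ONE = 1, REF_LOCAL_MANY = 2, REF_GLOBAL_MANY = 3 };

#define REF_SHIFT 60u                          /* assume the top 4 header bits are free */
#define REF_MASK  ((uint64_t)0xF << REF_SHIFT)

static inline unsigned ref_state(uint64_t header) {
    return (unsigned)((header & REF_MASK) >> REF_SHIFT);
}

/* Called from the write barrier when a new reference to the object is
   created. The state never decreases, so it records a maximum. */
static inline uint64_t bump_ref_state(uint64_t header, int ref_outside_session) {
    unsigned cur = ref_state(header);
    unsigned next;
    if (ref_outside_session)      next = REF_GLOBAL_MANY;
    else if (cur == REF_ZERO)     next = REF_ONE;
    else if (cur == REF_ONE)      next = REF_LOCAL_MANY;
    else                          next = cur;          /* already saturated */
    if (next < cur) next = cur;                        /* non-decreasing */
    return (header & ~REF_MASK) | ((uint64_t)next << REF_SHIFT);
}
```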
1.6 Write barrier
1.6.1 If the written value is not primitive, the write is an exporting write, and the object is clean, copy its transitive closure to the shared heap (sketched below).
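The sketch below is my reading of that rule combined with the procrastination idea from 1.2.5, not the paper's actual code; every helper is hypothetical.

```c
/* Exporting write barrier: *slot lives in the shared heap, val may live in
   the local heap. Clean objects are exported eagerly; dirty ones cause the
   (green) thread to procrastinate until the next local collection. */
extern int   is_primitive(void *val);
extern int   in_shared_heap(void *obj);
extern int   in_local_heap(void *val);
extern int   is_clean(void *val);                         /* from the 4-bit state above */
extern void *copy_transitive_closure_to_shared(void *val);
extern void  procrastinate_current_thread(void **slot, void *val);

void exporting_write(void *obj, void **slot, void *val) {
    if (!is_primitive(val) && in_shared_heap(obj) && in_local_heap(val)) {
        if (is_clean(val)) {
            val = copy_transitive_closure_to_shared(val);  /* export now */
        } else {
            procrastinate_current_thread(slot, val);       /* suspend; GC finishes it */
            return;
        }
    }
    *slot = val;                                           /* ordinary store */
}
```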
1.6.2 Delaying writes
1.6.3 Lifting objects to shared heap
1.6.4 Remote spawns
1.7 Target Architectures
1.7.1 Used a 16-core Linux machine, the 48-core Intel Single-chip Cloud Computer, and an 864-core Azul Vega 3.
1.8 Results
1.8.2 Performance
- Ported the Boehm-Demers-Weiser conservative GC to test against
- Consecutive tests done by shrinking the maximum allowed heap size and recording running time and actual heap utilization.
- The read-barrier-less collector had better performance, but a higher minimum heap size.
- The stop-the-world GC had faster mutator time than the alternatives, but scalability suffered because collection was serial while the world was stopped.
- The Boehm GC was slow.
1.8.3 Impact of cleanliness
1.8.4 Impact of Immutability
1.8.5 Impact of heap session
A Control-Theory Approach to Heap Resizing
Table of Contents
1 Heap Sizing
1.1 Control Theory for Principled Heap Sizing
1.1.1 Intro
- Programs frequently have separate phases with very different memory behavior, so no single static heuristic is acceptable
- Control theory offers more formalism than trying to glue random, effective heuristics together.
- Paper describes the idea of having a GC controller and increasing the heap size until the heap thrashing decreases and a throughput goal is hit.
1.1.2 Sweet spots
1.1.3 Heap sizing in existing VMs.
- Existing VMs tend to increase heap size too optimistically, and decrease it more slowly than could be desired.
- Jikes RVM
- After each GC, the manager determines a new "resize ratio" based on short-term GC overhead (the ratio of GC time to time since the last GC) and the fraction of the heap that was live
- Not explicitly goal-oriented in terms of throughput or heap size
- Doesn't take history into account, can flip-flop between states needlessly when live data size fluctuates back and forth quickly.
- There is no evidence that different GC algorithms will match these hard-coded heuristic thresholds
- Hotspot
- Has a lot of tunable "ergonomics," and will resize heap to make best-effort to meet all priorities.
- The resizing rate is much less flexible: the ratio is bounded between 0.95 and 1.2. Early in execution, the heap is allowed to grow at a ratio of 2.
- Vengerov has questioned whether the resize policies actually ensure progress towards goals.
1.1.4 Heap sizing as a control problem
- Should use control theory
- Formulating the Problem
1.1.5 Designing a Heap Size Controller
- Chose to treat the entire GC system as a black box, tuned using a Proportional-Integral-Derivative (PID) controller (a sketch follows this list)
- Use a target GC overhead to resize heap
- Only consider resizing after GC to avoid confounding variables in controller training.
- PID Controller
- PID Controller Theory
- Implementation
- Built on Jikes/MMTK
- Uses bytes allocated as time proxy
- Heap growth manager was modified to use the PID controller
- Source: http://sourceforge.net/p/jikesrvm/research-archive/40/
- Tuning:
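A hedged sketch of the idea (my own naming and gains, not the linked Jikes patch): the error is the gap between observed and target GC overhead, and "time" is measured in bytes allocated, as the notes above describe.

```c
/* PID controller that maps GC overhead error to a new heap size. */
typedef struct {
    double kp, ki, kd;        /* controller gains: would need tuning */
    double target_overhead;   /* e.g. 0.10 for "10% of time in GC" */
    double integral;
    double prev_error;
} heap_pid;

/* Called after each GC; dt is bytes allocated since the previous GC,
   used as the time proxy. Returns the proposed new heap size. */
double pid_resize(heap_pid *c, double observed_overhead,
                  double current_heap_bytes, double dt) {
    double error = observed_overhead - c->target_overhead;
    c->integral += error * dt;
    double derivative = (error - c->prev_error) / dt;
    c->prev_error = error;

    double u = c->kp * error + c->ki * c->integral + c->kd * derivative;
    /* Positive error means too much time in GC, so grow the heap;
       negative error shrinks it. */
    return current_heap_bytes * (1.0 + u);
}
```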
1.1.6 Evaluation
1.1.7 Related work
- Heuristic Approaches
- Mathematical Models
- Control Theory Papers
Page Fault Driven Heap Size
Table of Contents
1 A Page Fault Equation for Dynamic Heap Sizing
http://www.math.nus.edu.sg/~mattyc/HeapSizing.pdf
This is an interesting paper. It more or less presents an equation that predicts the number of page faults as a function of the heap size and the amount of available main memory.
1.1 Introduction
H is the heap size, M is the size of available main memory
1.1.1 Suggested policy outperforms JikesRVM's heap size manager
1.1.2 The Problem:
- Frequent GC pauses in a small heap lead to significant performance degradation due to cache pollution.
- A large heap faces frequent page faults.
- Tuning intelligently can possibly lead to better performance than manual memory management can provide.
- Heap cannot be static, must be able to scale to very small and very large heaps.
- The number of page faults is not simply a function of actual memory and heap size; it also depends on mutator and collector behavior.
1.1.3 Heap-Aware Page Fault Equation
1.1.4 General Page Fault Equation:
1.1.5 Heap sizing Rule:
- Uses the parameters for the page fault equation to know how to resize.
- Doesn't require OS or hardware modification or costly setup of polling, handlers, and callbacks for stalls and page references.
- Can automatically respond to changes in main memory (cloud hosting?)
- Allows for hotswapping of collectors if parameters found.
1.2 Heap-Aware Page Fault Equation:
1.2.2 Paper describes parameters in general equation
1.2.4 Experimental Validation
1.2.5 Can derive an Hmin, the smallest reasonable heap.
1.3 Heap Sizing
1.3.2 Experiments with static memory size
Wednesday, October 15, 2014
MMTK/Jikes RVM
jikes
MMTK
A garbage collection framework similar to the plai2 framework, but designed less as a "toy", is MMTk (the Memory Management Toolkit) for the Jikes RVM.
A quick overview of the interaction between the virtual machine and the MMTk framework can be found here: http://jikesrvm.org/Memory+Allocation+in+JikesRVM
Jikes is surprisingly both non-frightening and full-featured. One of the surprising things about MMTK is the presence of such optimizations as thread-local allocation interfaces while still retaining the simplicity of the memory manager interface.
If you're planning on making a collector for jikes, I believe that the best overview of the way that the collector is expected to be structured and how Jikes will use it can be found here: http://jikesrvm.org/Anatomy+of+a+Garbage+Collector
It comes with a large number of garbage collectors. They all satisfy a simple interface which captures the information that a collector needs.
Many of them will implement this interface, so it's worth looking at for the curious: http://jikesrvm.org/docs/api/org/mmtk/plan/Simple.html
There is a fairly straightforward tutorial on adding a simple mark-sweep collector and a quasi-generational hybrid copying/mark-sweep collector here: http://jikesrvm.org/MMTk+Tutorial
And I was glad to see that there was a comprehensive test harness for jikes. One of the most difficult things with plai2 is figuring out how to test things in isolation and together, and how to track test program misbehavior to a specific collector fault. http://jikesrvm.org/The+MMTk+Test+Harness The harness looks to be invaluable.
There's a canonical document here: http://cs.anu.edu.au/~Robin.Garner/mmtk-guide.pdf. While it appears to be unfinished and very empty, it does helpfully explain the naming convention used throughout the API.
Plans
Plans are ways of bringing together collection, allocation, and heap policies into something that can be used by MMTk. A plan is a package with classes for each of these.
Spaces
A collection of regions of memory, each with a given allocation and collection policy.
Barriers
MMTK currently supports write, but not read barriers.
Policies
A policy is a region of memory with a collection and allocation strategy.
Date: 2014-10-16T00:11-0400
Author: Alex Kyte
Hotspot Notes
Table of Contents
Resources:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/gc_interface
http://www.slideshare.net/waterfox1/an-introduction-to-jvm-internals-and-garbage-collection-in-java
http://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-150215.pdf
Interface
How does Hotspot interact with the garbage collector?
CollectedHeap
- Components
- _gc_cause, _gc_lastcause, _set_gc_cause: Record the causes when collections are triggered. Needs to be set to a "no cause" state between collections.
- _total_collections, _total_full_collections: Running counts of collections and of full collections.
- allocate_new_tlab(size), accumulate_statistics_all_tlabs, resize_all_tlabs, allocate_from_tlab: Interact with thread-local allocation buffers.
- common_mem_allocate_noinit, common_mem_allocate_init: Allocate a block of the given size. Initialization means zeroing it out.
- fill_with_array, fill_with_object_impl: Initialize an array or object that has been allocated.
- Name: The heap strategy being used. It's an enum mostly named for collector types.
- Initialize, post_initialize: Initialize the "universe" and do other setup tasks before allocation.
- _is_maximal_no_gc: Whether the heap has reached its maximum size allowed without a GC.
- max_capacity: The amount of heap memory the VM can allocate for Java objects created by user code, not counting bookkeeping.
- is_in_reserved, is_in_reserved_or_null: Whether the pointer is in the reserved heap region, optionally allowing null.
- is_in_closed_subset, is_in_closed_subset_or_null: Whether the pointer is in a closed subset (all objects in the subset point only to other objects in the subset).
- is_scavengable: Whether the object can move during a scavenge (partial collection).
- n_par_threads, set_par_threads: Number of threads currently working on GC tasks.
- Class_obj_allocate: Allocate and initialize an instance of a class.
- fill_with_objects, fill_with_object, etc.: Fill allocated raw memory with object(s) and arrays.
- ensure_parseability: Place the heap in a parseable state if it isn't one (i.e., in the middle of a GC).
- unsafe_max_alloc: Estimate of what can be allocated without possibly causing a collection.
- supports_tlab_alloc, tlab_capacity, tlab_max_alloc: Related functions for TLAB interaction.
- can_elide_store_barriers: Whether the compiler can avoid a store barrier. If so, it must later call a deferring function to defer the store action, and must also later trigger a flush of these deferred stores.
- collect, do_full_collection, collect_as_vm_thread: Various ways to request a collection, requiring various levels of locks.
- is_active?: Whether the world is stopped to collect; false for the concurrent part of concurrent collectors.
- CollectorPolicy, AdaptiveSizePolicy: The policies the heap has for collection and resizing, with associated costs for actions.
- millis_since_last_gc: Time since the last GC.
- print_gc_threads, print_all_gc_threads, trace_all_threads: Print / trace / debug threads.
- use_parallel_gc_threads: Whether parallel threads are being used.
- SharedHeap
A SharedHeap is defined as a heap for the Java language. It's an abstract class, but contains common infrastructure.
- G1CollectedHeap
- A SharedHeap which uses G1's partition strategy
- GenCollectedHeap
- A generationally-partitioned SharedHeap
- ParallelScavengeHeap
ParallelScavengeHeap allocates pages from a survivor space in round-robin fashion. It differs from all other collection heaps in that it doesn't use the SharedHeap structure. It is NUMA-aware.
Collectors
Generational
Interaction
- Why can't parNew and parallelOld work together?
"ParNew" (and "Serial") implements space_iterate() which will apply an operation to every object in the young generation. When collecting the tenured generation with either "CMS" or "Serial Old", the GC can use space_iterate() to do some work on the objects in the young generation. parallelOld does not implement this interfact. (Source: https://blogs.oracle.com/jonthecollector/entry/our\_collectors)
Young Generation
- ParallelScavenge
- "Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.
- "Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.
- ParNew
- The newer parallel collector. It will eventually replace the parallelScavenge collector
- Difference
It differs from "Parallel Scavenge" in that it has enhancements that make it usable with CMS. For example, "ParNew" does the synchronization needed so that it can run during the concurrent phases of CMS. (Source: https://blogs.oracle.com/jonthecollector/entry/our_collectors)
- Serial
- "Serial" is a stop-the-world, copying collector which uses a single GC thread.
- "Serial" is a stop-the-world, copying collector which uses a single GC thread.
Tenured Generation
- ConcurrentMarkSweep
- "" is a mostly concurrent, low-pause collector.
- "" is a mostly concurrent, low-pause collector.
- SerialOld(Mark-sweep-compact)
- "Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.
- "Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.
- ParallelOld
- "Parallel Old" is a compacting collector that uses multiple GC threads.
- "Parallel Old" is a compacting collector that uses multiple GC threads.
G1(Garbage-first Garbage Collection)
Summary:
- Divides the heap into regions and collects as many as it can while abiding by the desired pause time.
- Specifies region to act as the new "younger" generation, which will be collected first.
- Collects regions with most garbage first("Garbage-first")
- Compacts to avoid fragmentation issues, but can still get internal fragmentation due to many under-filled regions.
- Because there isn't a static divide between the young and tenured generations, there can't be a problematic imbalance between generation sizes.
Paper:
- 1 Intro
- Designed to be a soft real-time system, where a maximum percent x of the next y-duration timeslice is specified
- All inter-region pointers are tracked, not simply old-to-young
- A write barrier logs changes, and a concurrent thread updates remembered sets with this log
- Snapshot-at-the-beginning
- At beginning of marking period, mutator threads paused and roots collected. This snapshot pause is relatively short.
- Used to guarantee completeness
- Finds percentage of each heap that's garbage, to allow garbage-first collection to work
- Hard real-time abandonment
- Rather than making collection interruptible at the granularity of each object moved, use the granularity of each region
- Can estimate collection time for each region (a function of the amount of live data), and try to fit within soft real-time bounds
- Allows higher throughput and less synchronization than allowing interruption at arbitrary points would
- 2 Data structures
- 2.1 Heap layout/regions/allocation
- All allocation done from same allocation region
- Allocation is "bump-a-pointer" of the "top" boundary marker in the current allocation region
- All non-large objects are allocated into thread-local allocation buffers, or TLABs (see the sketch below)
- TLABs are allocated from a linked list of free regions. Each region consists of many TLAB-sized subregions.
- Large and humongous objects are allocated separately into heap regions outside of TLABs.
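A minimal sketch of the bump-a-pointer TLAB allocation described above; the struct and the refill path are assumptions for illustration, not the G1 code.

```c
#include <stddef.h>

typedef struct {
    char *top;   /* next free byte: the "top" boundary that gets bumped */
    char *end;   /* end of this thread's buffer */
} tlab;

void *tlab_alloc(tlab *t, size_t size) {
    size = (size + 7) & ~(size_t)7;      /* word-align the request */
    if (t->top + size > t->end)
        return NULL;                     /* exhausted: caller refills the TLAB
                                            from the free-region list */
    void *obj = t->top;
    t->top += size;                      /* bump the pointer */
    return obj;
}
```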
- 2.2 Remembered Set Maintenance
- Each region has one hash-based card-table set per parallel GC thread, to avoid contention
- Write barrier which checks if the change creates an inter-region pointer. If not, or if the new reference is NULL, it does nothing (see the sketch after this list).
- Otherwise it logs the write into a write log. If the write log is full, it gets a new one.
- Concurrent Remembered Set Thread
- Has an "initiating threshold"
- When the RS buffer size exceeds the threshold, the concurrent RS thread processes buffers until the count falls back below the threshold.
- Uses a separate card table to determine which cards are "hot" by incrementing a count as cards are processed. Hot cards are only processed during evacuation pauses.
- Has an "initiating threshold"
- Only one remembered set thread runs. If the RS buffer is too full, mutator threads are forced to do RS work themselves in order to allocate.
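A sketch of the filtering write barrier described in 2.2: stores that stay within a region, or that store NULL, are ignored; everything else logs the card holding the updated field. Region and card sizes, and the enqueue function, are assumptions, not G1's actual constants.

```c
#include <stddef.h>
#include <stdint.h>

#define REGION_SHIFT 20u    /* assume 1 MiB regions for illustration */
#define CARD_SHIFT    9u    /* assume 512-byte cards */

extern void enqueue_card(uintptr_t card_index);   /* hypothetical write log */

void rs_write_barrier(void **field, void *new_val) {
    *field = new_val;                                  /* the store itself */
    if (new_val == NULL)
        return;                                        /* NULL needs no tracking */
    uintptr_t f = (uintptr_t)field, v = (uintptr_t)new_val;
    if ((f >> REGION_SHIFT) == (v >> REGION_SHIFT))
        return;                                        /* intra-region pointer */
    enqueue_card(f >> CARD_SHIFT);                     /* log the card holding the field */
}
```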
- 2.3 Evacuation Pauses
- Stops mutator threads
- Choose a set of regions, and evacuate the live data in regions into other heap regions.
- Object movement must appear atomic, so world is stopped to allow incremental compaction.
- Parallelized
- Threads compete to claim sub-tasks in collection. No synchronization used after ensuring each object has only one thread working on it.
- Algorithm:
- Copy the object into a GC-thread TLAB, and race to install a Brooks-style forwarding pointer
- Winner scans contents, and processes.
- Work-stealing load balances
- 2.4 Generational Garbage-First
- Can take advantage of the generational hypothesis with heuristic
- Designate a region as young when it is a mutator allocation region
- Young regions always part of collection set
- Remembered set processing not required for young regions
- Can be toggled between "pure garbage-first" and "generational"
- 2.5 Concurrent Marking
- Uses snapshot-at-beginning. Considers objects allocated during concurrent marking as alive. Guaranteed to find all garbage that exists during snapshot.
- 2.5.1 Marking Data Structures
- A bitmap is used to mark. "Previous" bitmap was bitmap for last marking period. "Next" is under construction. Two bitmaps switch when marking completed.
- 2.5.2 Initial Marking Pause
- Clears next marking bitmap.
- Stops all mutator marking threads, marks everything reachable from roots
- Each region records top-at-mark-start (TAMS) values for previous and next.
- Allows determining whether an object was allocated during marking, because such objects lie above TAMS.
- 2.5.3 Concurrent Marking Write Barrier
- If mutator removes pointers that exist at start, SATB guarantee can be violated.
- Needs a write barrier to record the original value of fields before they are overwritten (sketched below)
- It's only necessary to log writes to fields that were not previously null, and most writes are initializing writes, so recording is cheaper than one would think
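A minimal sketch of that SATB pre-write barrier, assuming a per-thread log and a global marking flag (both hypothetical names):

```c
#include <stddef.h>

extern int  marking_in_progress;           /* set during concurrent marking */
extern void satb_enqueue(void *old_value); /* hypothetical per-thread log */

void satb_write_barrier(void **field, void *new_value) {
    if (marking_in_progress) {
        void *old = *field;
        if (old != NULL)                  /* only previously non-null fields matter */
            satb_enqueue(old);            /* preserve the snapshot */
    }
    *field = new_value;                   /* the actual store */
}
```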
- 2.5.4 Final Marking Pause
- Mutator threads paused and changes since initial pause are processed by marking threads. Marking is thus finished.
- 2.5.5 Live Data Counting and Cleanup
- When marking finished, GC thread counts amount of live data in each region.
- Finds it by using the marking bitmap, not by traversing the object graph
- Since concurrent modification can make counts imprecise, a final stop-the-world cleanup pause must finish counting process.
- Cleanup also swaps next and previous bitmap roles
- Next TAMS value copied into previous
- Cleanup phase sorts heap regions by expected GC efficiency.
- Is the ratio of amount of garbage to free over cost of collecting it.
- Garbage-only regions immediately reclaimed in this phase.
- For some programs, may recover significant space in this phase.
- 2.6 Evacuation Pauses and Marking
- Never evacuates objects proven dead in last complete marking pass.
- Needs to ensure that the next/previous bitmaps are valid with respect to movements.
- Evacuation may occur while the marking thread's marking stack is non-empty; waiting for marking to finish could delay evacuation.
- Treats marking stack as source of roots.
- 2.7 Popular Object Handling
- Popular objects handled differently in order to ensure smaller remembered sets and more efficient barrier.
- Popular objects placed in regions in small prefix of heap
- Prefix never part of general collection sets
- When popular regions reach a threshold, create a "popularity pause" to empty region.
- Popularity pause creates reference count for all popular objects and moves objects with counts below a threshold to ordinary heap regions.
- Benchmarks show that this behavior removes the majority of remembered set entries
- 3 Heuristics
- 3.1 User Inputs
- Two user inputs: Upper bound on space usage, and the maximum portion of a time slice given to stop-the-world GC
- Whether to use generational mode
- Soft realtime goal of time-slice only counts stop-the-world, since concurrent collection can be seen as a constant tax on mutator.
- 3.2 Satisfying a Soft Real-time goal
- 3.2.1 Predicting evacuation pause times
- Need to accurately predict the added cost of including a region in order to determine how many regions to evacuate (see the sketch after this list)
- Keep adding regions until we would exceed pause time
- Use statistics to estimate the fraction of the data allocated since last marking which is live.
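A sketch of the greedy choice this implies (my own types and cost model, not the paper's): rank regions by expected efficiency, then keep adding them while the predicted pause stays inside the budget.

```c
#include <stddef.h>

typedef struct {
    double predicted_cost_ms;    /* estimated from live data and remembered-set size */
    double reclaimable_bytes;    /* expected garbage freed */
} region;

/* `regions` is assumed sorted by efficiency
   (reclaimable_bytes / predicted_cost_ms), descending. Returns how many of
   the first regions form the collection set. */
size_t choose_collection_set(const region *regions, size_t n,
                             double pause_budget_ms) {
    double pause = 0.0;
    size_t chosen = 0;
    for (size_t i = 0; i < n; i++) {
        if (pause + regions[i].predicted_cost_ms > pause_budget_ms)
            break;               /* adding this region would miss the deadline */
        pause += regions[i].predicted_cost_ms;
        chosen++;
    }
    return chosen;
}
```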
- 3.2.2 Scheduling Pauses to meet a realtime goal.
- If we would otherwise miss deadlines, we can simply delay phases.
- Can cause mutator to allocate into regions which wouldn't be collected, to avoid exceeding pause times in future collections.
- 3.3 Collection Set Choice
- Decide on the "GC efficiency" as the amount of garbage which will be freed over the cost of collecting the region.
- Can have low amount of live data and still have low efficiency because of a large remembered set
- In time between ending marking and starting the evacuation pause, costs may change due to changes in remembered set size. Need to recompute for most attractive subset at start of evacuation pause.
- Decide on the "GC efficiency" as the amount of garbage which will be freed over the cost of collecting the region.
- 3.4 Evacuation Pause Initiation
- Determine the hard margin, or portion of the heap that must be empty for there to be a to-space for collection. When this gets threatened, we always trigger a collection.
- In fully-young generational configuration:
- We know the number of regions that would cause the desired pause time
- When the number of allocation regions which have filled up is equal to this number, we collect.
- In partially-young mode, we do evacuation pauses at the maximum allowed frequency to minimize pause time.
- Generational starts in fully-young, and then switches to partially-young after marking pass complete in order to collect attractive regions.
- Always use partially-young if heap very full, until heap occupancy drops.
- 3.5 Concurrent Marking Initiation
- In generational configuration, we define a soft margin such that when the margin is exceeded, we start a new marking pass.
- Allows us to try to collect more frequently to avoid hitting the hard margin, where a soft real-time goal can be violated because of collection time required to free enough of the heap to continue.
- In pure garbage-first, we mark and then keep track of the efficiency of our evacuation pauses.
- Once they begin to decline (freeing less-attractive regions after more-attractive ones), we begin a new marking pass to mark more garbage and increase attractiveness.
- 5 Related Work
- The idea of dividing collection of the entire heap into incremental collection of subheaps stems from the train algorithm.
- Differs from train in the use of more remembered sets to allow choice of collecting any region.
- Differs from metronome in that it sacrifices hard-realtime performance for higher throughput.
Shenandoah
Presentations:
http://rkennke.files.wordpress.com/2014/02/shenandoahtake4.pdf
Strangeloop: https://www.youtube.com/watch?v=QcwyKLlmXeY
- New alternative to G1
- Eventual goal: An entirely pauseless collector for large-memory systems
- Current goal: <10ms pause times for 100+GB heaps.
- Not currently pauseless
- Not yet using round-robin root updating, because it makes it difficult to ensure that the object graph is maintained, and difficult to prove both systems correct simultaneously before either is done.
- All other collectors stop-the-world during compaction to avoid problems of concurrent mutation
- Two solutions: trap on accesses of objects in motion, or Brooks-style forwarding pointers
- They chose forwarding pointers
- Forwarding pointers remove need for remembered sets
- Remembered sets have scalability problems
- Large systems which pre-allocate space for scalability reasons frequently thrash cache lines containing card table
- G1 had massive remembered sets. Fell back on a "generational" configuration eventually which only had to remember old-to-young references.
- Forced generational paradigm, not truly garbage first.
- Forwarding pointers allow arbitrary choice of region collection without remembered sets. Truly garbage first.
- Forwarding pointer lives in indirection object
- Divorces collection from allocation, doesn't risk randomly interrupting mutator threads.
- Reading an object in a from-region doesn't trigger evacuation. Barrier is unconditional since always chases forwarding pointer.
- Writing an object does.
- Also ensures that when any reference to an object in fromspace is written to an object in tospace, it is updated to point to a newly copied version of that item in tospace.
- System has invariant that writes only occur in to-regions.
- In order to avoid races in copying/installing the forwarding pointer, GC and mutator threads copy and then compare-and-set the forwarding pointer (see the sketch after this list)
- Loser undoes its allocation.
- If the mutator out-races the collector thread, the collector must use the mutator's copy as the object's address and perform the forwarding process again.
- Needs to ensure that anything alive at beginning of marking is still considered live, SATB is used.
- To avoid mutation while scanning the graph causing breakage, write barrier is used to mark references in overwritten field values
- Needs to ensure that atomic compare-and-swaps of references by mutator threads only swap references to tospace objects, because otherwise the CAS could fail when a GC thread has modified the previous reference, and because no tospace object should point to a fromspace object.
- Solution is to copy all referred-to objects into tospace.
- Costs are bounded because of finite region size. Can't prevent progress for unbounded amount of time.
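A hedged sketch of the Brooks-forwarding-plus-CAS race described above: the read barrier unconditionally chases the forwarding word, and on a write both GC and mutator may copy, then compete to install the pointer, with the loser undoing its allocation. Helpers and layout are assumptions, not Shenandoah's code.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct object object;
struct object {
    _Atomic(object *) forwardee;   /* Brooks indirection word: self, or the to-space copy */
    /* ... payload ... */
};

extern int     in_from_region(object *obj);           /* hypothetical region test */
extern object *to_space_alloc(size_t size);           /* hypothetical allocator */
extern void    undo_alloc(object *obj, size_t size);  /* give the space back */

/* Unconditional read barrier: always chase the forwarding word. */
static inline object *resolve(object *obj) {
    return atomic_load_explicit(&obj->forwardee, memory_order_acquire);
}

/* Write path: guarantee the object is in a to-region before mutating it. */
object *evacuate_for_write(object *obj, size_t size) {
    object *cur = resolve(obj);
    if (!in_from_region(cur))
        return cur;                                   /* already evacuated */
    object *copy = to_space_alloc(size);
    memcpy(copy, cur, size);
    atomic_store_explicit(&copy->forwardee, copy, memory_order_relaxed);
    object *expected = cur;
    if (atomic_compare_exchange_strong(&cur->forwardee, &expected, copy))
        return copy;                                  /* we installed the forwarding pointer */
    undo_alloc(copy, size);                           /* lost the race: discard our copy */
    return expected;                                  /* use the winner's copy */
}
```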
- Use work-stealing
- Compete to claim roots
- Process one child, push rest of children onto work-list.
- Overview
- Heap divided into regions. Concurrent marking tracks each region's live data.
- Regions with most garbage will join collection set.
- GC threads concurrently evacuate live data.
- At start of next round of concurrent marking, references in fromspace are updated.
- Fromspace is freed.
- Downsides
- Forwarding pointers add a time and space overhead.
- Generational hypothesis doubts
- Suspicion that generational collection doesn't improve performance because most objects die young, but rather because remembered sets are smaller.