Wednesday, October 15, 2014

Hotspot Notes

notes

notes

Interface

How does Hotspot interact with the garbage collector?

CollectedHeap

The CollectedHeap interface defines the behavior of a heap which is garbage collected.
  • Components
    NameDescription
    _gc_cause and _gc_lastcause, and _set_gc_causeRecords the causes when collections are triggered. Needs to be set to a "no cause" state between collections
    _total_collections and _total_full_collections
    allocate_new_tlab(size), accumulate_statistics_all_tlabs, resize_all_tlabs, allocate_from_tlabInteract with thread-local allocation buffers.
    common_mem_allocate_noinit, common_mem_allocate_initAllocate a block with the given size. Initialization means zeroing out
    fill_with_array, fill_with_object_implInitialize array or object that's been allocated.
    NameThe heap strategy being used. It's an enum mostly named for collector types.
    Initialize, post_initializeInitialize the "universe", and do other setup tasks before allocation
    _is_maximal_no_gcIf the heap has reached it's maximum size allowed without a GC
    max_capacityThe amount of memory the heap that the vm can allocate for java objects made by the user's code, and not used for bookkeeping
    is_in_reserved, is_in_reserved_or_nullWhether the pointer is in the reserved heap region, and optionally is null
    is_in_closed_subset, is_in_closed_subset_or_nullIs in a closed subset(all objects in subset point only to other objects in subset)
    is_scavengableWhether object can move during a scavenge(partial collection)
    n_par_threads, set_par_threadsNumber of threads currently working on gc tasks
    Class_obj_allocateAllocate and initialize an instance of a class
    fill_with_objects, fill_with_object, etcFill allocated raw memory with object(s) and arrays
    ensure_parseabilityPlace heap in a parseable state if not(in middle of GC)
    unsafe_max_allocEstimate what can be allocated without possibly causing a collection
    supports_tlab_alloc, tlab_capacity, tlab_max_allocRelated functions for tlab interaction
    can_elide_store_barriersWhether the compiler can avoid a store barrier. If so, it must later call a deferring function to defer the store action. It must also later trigger a flush of these deferred stores.
    collect, do_full_collection, collect_as_vm_threadVarious ways to request a collection, requiring various levels of locks
    is_active?Whether the world is stopped to collect, false for concurrent part of concurrent collectors
    CollectorPolicy, AdaptiveSizePolicyThe policies the heap has for collection and resizing, with associated costs for actions.
    millis_since_last_gcTime since last GC
    print_gc_threads, print_all_gc_threads, trace_all_threadsPrint / trace / debug threads
    use_parallel_gc_threadsWhether parallel threads are being used
  • SharedHeap
    A SharedHeap is defined as a heap for the java language. It's an abstract class, but contains common infrastructure.
    • G1CollectedHeap
      • A sharedheap which uses g1's parition strategy
    • GenCollectedHeap
      • A generationally-partitioned sharedheap
  • ParallelScavengeHeap
    ParallelScavengeHeap allocates pages from a survivor space in round-robin fashion. It differs from all other collection heaps in that it doesn't use the sharedheap structure. It is NUMA-aware.

Collectors

Generational

Interaction

https://blogs.oracle.com/jonthecollector/resource/Collectors.jpg
  • Why can't parNew and parallelOld work together?
    "ParNew" (and "Serial") implements space_iterate() which will apply an operation to every object in the young generation. When collecting the tenured generation with either "CMS" or "Serial Old", the GC can use space_iterate() to do some work on the objects in the young generation. parallelOld does not implement this interfact. (Source: https://blogs.oracle.com/jonthecollector/entry/our\_collectors)

Young Generation

  • ParallelScavenge
    • "Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.
  • ParNew
    • The newer parallel collector. It will eventually replace the parallelScavenge collector
  • Serial
    • "Serial" is a stop-the-world, copying collector which uses a single GC thread.

Tenured Generation

  • ConcurrentMarkSweep
    • "" is a mostly concurrent, low-pause collector.
  • SerialOld(Mark-sweep-compact)
    • "Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.
  • ParallelOld
    • "Parallel Old" is a compacting collector that uses multiple GC threads.

G1(Garbage-first Garbage Collection)

Summary:

  • Divides the heap into regions and collects as many as it can while abiding by the desired pause time.
  • Specifies region to act as the new "younger" generation, which will be collected first.
  • Collects regions with most garbage first("Garbage-first")
  • Compacts to avoid fragmentation issues, but can still get internal fragmentation due to many under-filled heaps.
  • Because there isn't a static divide between young/tenured generations, can't have a problematic imbalance in sizes of heaps.

Paper:

http://dl.acm.org/citation.cfm?doid=1029873.1029879
  • 1 Intro
    • Designed to be a soft real-time system, where a maximum percent x of the next y-duration timeslice is specified
    • All inter-region pointers are tracked, not simply old-to-young
    • A write barrier logs changes, and a concurrent thread updates remembered sets with this log
    • Snapshot-at-the-beginning
      • At beginning of marking period, mutator threads paused and roots collected. This snapshot pause is relatively short.
      • Used to guarantee completeness
      • Finds percentage of each heap that's garbage, to allow garbage-first collection to work
    • Hard real-time abandonment
      • Rather than making collection interruptable at granularity of object moved, use granularity of each region
      • Can estimate collection time for each region(function of the amount of live data), and try to fit within soft real-time bounds
      • Allows higher throughput and less synchronization than requiring allowing arbitrary pauses
  • 2 Data structures
    • 2.1 Heap layout/regions/allocation
      • All allocation done from same allocation region
      • Allocation is "bump-a-pointer" of the "top" boundary marker in the current allocation region
      • All non-large-objects are allocated into thread-local allocation buffers or tlabs
      • Tlabs are allocated from a linked-list of free regions. Each region consists of many tlabs-sized subregions.
      • Large and humongous objects are allocated seperately into heap regions outside of tlabs.
    • 2.2 Remembered Set Maintenance
      • Each region has one hash-based card-table set per parallel GC thread, to avoid contention
      • Write barrier which checks if change creates inter-region pointer. If not, or if new reference is NULL, does nothing.
      • Else logs write into write log. If write log is empty, gets a new one.
      • Concurrent Remembered Set Thread
        • Has an "initiating threshold"
        • When RS buffer size exceeds threshold, CRST processes them until it reduces back down to the RST.
        • Use a seperate card table to determine which cards are "hot" by incrementing cards as processed. Hot cards only processed during evacuation pauses.
      • Only one remembered set thread runs. If the RS buffer is too full, mutator threads are forced to do RS work themselves in order to allocate.
    • 2.3 Evacuation Pauses
      • Stops mutator threads
      • Choose a set of regions, and evacuate the live data in regions into other heap regions.
      • Object movement must appear atomic, so world is stopped to allow incremental compaction.
      • Parallelized
      • Threads compete to claim sub-tasks in collection. No synchronization used after ensuring each object has only one thread working on it.
      • Algorithm:
        • Copy object into GC thread TLAB, and race to install brooks-style forwarding pointer
        • Winner scans contents, and processes.
        • Work-stealing load balances
    • 2.4 Generational Garbage-First
      • Can take advantage of the generational hypothesis with heuristic
      • Designate region young when mutator allocation region
      • Young regions always part of collection set
      • Remembered set processing not required for young regions
      • Can be toggled between "pure garbage-first" and "generational"
    • 2.5 Concurrent Marking
      • Uses snapshot-at-beginning. Considers objects allocated during concurrent marking as alive. Guaranteed to find all garbage that exists during snapshot.
      • 2.5.1 Marking Data Structures
        • A bitmap is used to mark. "Previous" bitmap was bitmap for last marking period. "Next" is under construction. Two bitmaps switch when marking completed.
      • 2.5.2 Initial Marking Pause
        • Clears next marking bitmap.
        • Stops all mutator marking threads, marks everything reachable from roots
        • Each region records top at mark start(TAMS) for previous and next.
        • Allows one to determine when objects are allocated during marking because they are above TAMS.
      • 2.5.3 Concurrent Marking Write Barrier
        • If mutator removes pointers that exist at start, SATB guarantee can be violated.
        • Needs a write barrier to store origional version of fields before overwriting
        • It's only necessary to log writes to non-previously-null fields, and most writes are initialization writes so recording is cheaper than one would think
      • 2.5.4 Final Marking Pause
        • Mutator threads paused and changes since initial pause are processed by marking threads. Marking is thus finished.
      • 2.5.5 Live Data Counting and Cleanup
        • When marking finished, GC thread counts amount of live data in each region.
        • Find it by using marking bitmap, not traversing object graph
        • Since concurrent modification can make counts imprecise, a final stop-the-world cleanup pause must finish counting process.
        • Cleanup also swaps next and previous bitmap roles
        • Next TAMS value copied into previous
        • Cleanup phase sorts heap regions by expected GC efficiency.
          • Is the ratio of amount of garbage to free over cost of collecting it.
          • Garbage-only regions immediately reclaimed in this phase.
          • For some programs, may recover significant space in this phase.
    • 2.6 Evaucation Pauses and Marking
      • Never evacuates objects proven dead in last complete marking pass.
      • Needs to ensure that next/previous bitmasks are valid with respect to movements.
      • May only evacuate when marking thread's marking stack is non-empty. Otherwise marking could delay evacuation.
      • Treats marking stack as source of roots.
    • 2.7 Popular Object Handling
      • Popular objects handled differently in order to ensure smaller remembered sets and more efficient barrier.
      • Popular objects placed in regions in small prefix of heap
      • Prefix never part of general collection sets
      • When popular regions reach a threshold, create a "popularity pause" to empty region.
      • Popularity pause creates reference count for all popular objects and moves objects with counts below a threshold to ordinary heap regions.
      • Benchmarks show that this behavior removes majority of remembered set enteries
  • 3 Heuristics
    • 3.1 User Inputs
      • Two user inputs: Upper bound on space usage, and the maximum portion of a time slice given to stop-the-world GC
      • Whether to use generational mode
      • Soft realtime goal of time-slice only counts stop-the-world, since concurrent collection can be seen as a constant tax on mutator.
    • 3.2 Satisfying a Soft Real-time goal
      • 3.2.1 Predicting evacuation pause times
        • Need to accurately predict added cost of adding a region to determine how many regions to evacuate.
        • Keep adding regions until we would exceed pause time
        • Use statistics to estimate the fraction of the data allocated since last marking which is live.
      • 3.2.2 Scheduling Pauses to meet a realtime goal.
        • If we would otherwise miss deadlines, we can simply delay phases.
        • Can cause mutator to allocate into regions which wouldn't be collected, to avoid exceeding pause times in future collections.
    • 3.3 Collection Set Choice
      • Decide on the "GC efficiency" as the amount of garbage which will be freed over the cost of collecting the region.
      • Can have low amount of live data and still have low efficiency because of a large remembered set
      • In time between ending marking and starting the evacuation pause, costs may change due to changes in remembered set size. Need to recompute for most attractive subset at start of evacuation pause.
    • 3.4 Evacuation Pause Initiation
      • Determine the hard margin, or portion of the heap that must be empty for there to be a to-space for collection. When this gets threatened, we always trigger a collection.
      • In fully-young generational configuration:
        • We know the number of regions that would cause the desired pause time
        • When the number of allocation regions which have filled up is equal to this number, we collect.
      • In partially-young mode, we do evacuation paues at maximum allowed frequency to minimize pause time.
      • Generational starts in fully-young, and then switches to partially-young after marking pass complete in order to collect attractive regions.
      • Always use partially-young if heap very full, until heap occupancy drops.
    • 3.5 Concurrent Marking Initiation
      • In generational configuration, we define a soft margin such that when the margin is exceeded, we start a new marking pass.
        • Allows us to try to collect more frequently to avoid hitting the hard margin, where a soft real-time goal can be violated because of collection time required to free enough of the heap to continue.
      • In pure garbage-first, we mark and then keep track of the efficiency of our evacuation pauses.
        • Once they begin to decline(freeing less-attractive regions after more-attractive) we begin a new marking pass to mark more garbage and increase attractiveness.
  • 5 Related Work
    • Idea of dividing collection of entire heap into incremental collection of subheaps steams from train algorithm.
      • Differs from train in the use of more remembered sets to allow choice of collecting any region.
    • Differs from metronome in that it sacrifices hard-realtime performance for higher throughput.

Shenandoah

Presentations http://rkennke.files.wordpress.com/2014/02/shenandoahtake4.pdf Strangeloop: https://www.youtube.com/watch?v=QcwyKLlmXeY

  • New alternative to G1
  • Eventual goal: An entirely pauseless collector for large-memory systems
  • Current goal: <10ms pause times for 100+GB heaps.
  • Not currently pauseless
    • Not yet using round-robin root updating because makes difficult to ensure that object graph is maintained, and difficult to prove both systems simultaneously before done.
  • All other collectors stop-the-world during compaction to avoid problems of concurrent mutation
    • Two solutions: Trap on access of objects in motion, or brooks-style forwarding pointers
    • They chose forwarding pointers
  • Forwarding pointers remove need for remembered sets
  • Remembered sets have scalability problems
    • Large systems which pre-allocate space for scalability reasons frequently thrash cache lines containing card table
    • G1 had massive remembered sets. Fell back on a "generational" configuration eventually which only had to remember old-to-young references.
      • Forced generational paradigm, not truly garbage first.
    • Forwarding pointers allow arbitrary choice of region collection without remembered sets. Truly garbage first.
  • Forwarding pointer lives in indirection object
  • Divorces collection from allocation, doesn't risk randomly interrupting mutator threads.
    • Reading an object in a from-region doesn't trigger evacuation. Barrier is unconditional since always chases forwarding pointer.
    • Writing an object does.
      • Also ensures that when any reference to an object in fromspace is written to an object in tospace, it is updated to point to a newly copied version of that item in tospace.
      • System has invariant that writes only occur in to-regions.
      • In order to avoid races in copying/installing forwarding pointer, GC and mutator threads copy and then compare-and-set the forwarding pointer.
        • Loser undoes allocation.
      • If mutator out-races the collector thread, collector must use thread's new location as object address and perform forwarding process again.
      • Needs to ensure that anything alive at beginning of marking is still considered live, SATB is used.
        • To avoid mutation while scanning the graph causing breakage, write barrier is used to mark references in overwritten field values
      • Needs to ensure that atomic compare and swap of reference by mutator threads only swap refereneces to tospace objects because otherwise CAS would fail because GC thread modified the previous reference, and because no tospace object should point to a fromspace object.
        • Solution is to copy all referred-to objects into tospace.
      • Costs are bounded because of finite region size. Can't prevent progress for unbounded amount of time.
  • Use work-stealing
    • Compete to claim roots
    • Process one child, push rest of children onto work-list.
  • Overview
    • Heap divided into regions. Concurrent marking tracks each region's live data.
    • Regions with most garbage will join collection set.
    • GC threads concurrently evacuate live data.
    • At start of next round of concurrent marking, references in fromspace are updated.
    • Fromspace is freed.
  • Downsides
    • Forwarding pointers add a time and space overhead.
  • Generational hypothesis doubts
    • Suspicion that generational hypothesis doesn't improve performance because most objects die young, but because remembered sets are smaller.
Date: 2014-10-16T00:13-0400
Author: Alex Kyte
Org version 7.9.3f with Emacs version 24
Validate XHTML 1.0

No comments:

Post a Comment