Thursday, December 4, 2014

M3: Mixed Collection for Simplicity

1 M3: High Performance Memory Management from Off-the-shelf components

1.1 Abstract

1.1.1 Investigation into whether the complexity of existing collectors is necessary
1.1.2 May simplify if the programmer determines the collection strategy to use (tracing/refcounting)
1.1.3 Observes performance competitive with C when this strategy is used

Tuesday, December 2, 2014

Schism Real-Time Collection

1.1 Abstract

1.1.1 Provides fragmentation tolerance with worst case time/space guarantees

1.1.2 Implementation in the Fiji VM, a JVM for mission-critical systems

Garbage Collection for FPGAs

1.1 Abstract

1.1.1 FPGAs are complex, programming at a higher level of abstraction is desirable

1.1.2 First complete GC in hardware, not hardware-assisted

1.1.3 Reasonable performance

  • Never more than 1% of resources on most high-end FPGAs
  • Uses a completely concurrent snapshot algorithm
  • Single-cycle access to the heap; the mutator is not stalled for even a single cycle
  • Can achieve 100% MMU with heaps as small as 1.01-1.04× the absolute minimum

The DieHard Safe Memory Allocator

1 DieHard: Probabilistic Memory Safety for Unsafe Languages

1.1 Intro

  • Memory errors such as buffer overflows, use-after-free, and uninitialized memory usage can compromise C/C++ applications
  • Tools like valgrind can help track down errors, but only work on errors observed during execution
  • Programs which detect errors and fail fast can be undesirable if the end user prioritizes uptime
  • Failure-oblivious programs, which drop illegal writes and return sentinels on illegal reads, keep the program running but make no promises about correctness to the programmer.
  • Probabilistic memory safety is a suggested solution to this; DieHard is one such implementation
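The intuition behind probabilistic safety can be sketched with a toy simulation (my own illustration, not DieHard's actual allocator): if live objects are placed uniformly at random in an over-provisioned heap, a small buffer overflow only corrupts live data when the adjacent slot happens to be occupied, and that probability shrinks as the heap grows.

```python
import random

def overflow_survival_rate(live_objects, heap_slots, trials=10_000, seed=1):
    """Estimate how often a one-slot overflow past a random live object
    lands in an *empty* slot when objects are placed uniformly at random
    in an over-provisioned heap (a simplification of DieHard's idea)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        slots = rng.sample(range(heap_slots), live_objects)  # random placement
        occupied = set(slots)
        victim = rng.choice(slots)                  # object we overflow past
        if (victim + 1) % heap_slots not in occupied:
            survived += 1                           # overflow hit a gap
    return survived / trials

# Growing the heap from 2x to 8x the live size makes the overflow far
# more likely to land harmlessly in an unused slot.
print(overflow_survival_rate(100, 200))
print(overflow_survival_rate(100, 800))
```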

Wednesday, November 19, 2014

Status on the Bug Collector


My last few weeks have been spent working on the DWARF-based root scanner for C and C++ which was proposed in a previous post.

How far have I come, and what have I learned?


Sunday, November 16, 2014

MMTK today

MMTk, a project typically associated with the Jikes RVM, is also used as part of the LLVM-backed VMkit project, whose goal of a common virtual machine framework seems very much inspired by MMTk. It was a pleasant surprise to see something helping to keep bitrot from reaching the project.

Thursday, November 13, 2014

Polyhedral GC

Looking at polyhedral optimization in LLVM, I cannot help but wonder whether it might be used to optimize graph traversal in the marking phase of collection.

Sunday, November 9, 2014

Migration from Conservative GC

https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/

Sadly, conservative collection has a fatal flaw: GC things cannot move. If something moved, you would need to update all pointers to that thing — but the conservative scanner can only tell you what might be a pointer, not what definitely is a pointer. Your program might be using an integer whose value happens to look like a pointer to a GC thing that is being moved, and updating that “pointer” will corrupt the value. Or you might have a string whose ASCII or UTF8 encoded values happen to look like a pointer, and updating that pointer will rewrite the string into a devastating personal insult.
So why do we care? Well, being able to move objects is a prerequisite for many advanced memory management facilities such as compacting and generational garbage collection. These techniques can produce dramatic improvements in collection pause times, allocation performance, and memory usage. SpiderMonkey would really like to be able to make use of these techniques. Unfortunately, after years of working with a conservative collector, we have accumulated a large body of code both inside SpiderMonkey and in the Gecko embedding that assumes a conservative scanner: pointers to GC things are littered all over the stack, tucked away in foreign data structures (i.e., data structures not managed by the GC), used as keys in hashtables, tagged with metadata in their lower bits, etc.

Their solution?

Over a year ago, the SpiderMonkey team began an effort to claw our way back to an exact rooting scheme. The basic approach is to add a layer of indirection to all GC thing pointer uses: rather than passing around direct pointers to movable GC things, we pass around pointers to GC thing pointers (“double pointers”, as in JSObject**). These Handles, implemented with a C++ Handle<T> template type, introduce a small cost to accesses since we now have to dereference twice to get to the actual data rather than once. But it also means that we can update the GC thing pointers at will, and all subsequent accesses will go to the right place automatically.

This is an interesting solution to the consistency problem. It would in theory allow a realtime collector to race to copy live data, using some kind of compare-and-set. The problem with this in the context of a C/C++ conservative collector is that it changes the semantics of pointer dereferencing and the type of references returned by malloc. For a language with complete control of the collector usage, it's a nice vindication of the trite adage that "all problems in CS can be handled with another layer of indirection."
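The handle scheme can be sketched in a few lines; Python stands in for the C++ Handle&lt;T&gt; machinery here, and the class and method names are invented for illustration. The mutator only ever holds a stable handle; every access pays the double dereference, and in exchange the collector can move an object by updating one table entry.

```python
class Heap:
    """Toy moving heap: objects live in a list and may be relocated.
    Handles hold a stable key; every access indirects through a table,
    so the collector can move objects and fix exactly one entry."""
    def __init__(self):
        self._table = {}      # handle id -> current address (list index)
        self._store = []      # the "heap"
        self._next = 0

    def alloc(self, value):
        self._table[self._next] = len(self._store)
        self._store.append(value)
        handle = self._next
        self._next += 1
        return handle          # mutator only ever holds this

    def read(self, handle):
        return self._store[self._table[handle]]   # double dereference

    def move(self, handle, new_addr):
        """Collector relocates an object; mutator handles stay valid."""
        addr = self._table[handle]
        while len(self._store) <= new_addr:       # grow the store if needed
            self._store.append(None)
        self._store[new_addr] = self._store[addr]
        self._store[addr] = None
        self._table[handle] = new_addr            # the one fix-up

heap = Heap()
h = heap.alloc("gc thing")
heap.move(h, 5)               # compaction-style move
print(heap.read(h))           # still reachable through the handle
```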

One might think cache locality would suffer, because the indirection entry may end up quite far from the data it points to; in the worst case, every memory access then requires fetching two different locations into the cache.

If all references are stored in a single place, different processors may contend for the cache line holding the entries. This is false sharing, so called because the processors fight over a cache line that merely happens to hold the address each wants alongside ones it does not; they share the cache line without actually sharing any addresses.

That said, the change was a step toward the runtime and collection optimizations that led to the team's recent massive performance increases and the reduction in resident memory used by the Firefox web browser.

Wednesday, November 5, 2014

Status Report for GC through Self-Introspection through DWARF

DWARF makes finding the functions executing based on the given program counter fairly painless. I was able to get a call stack, and figure out the variables inside the function and the associated types.

Actually getting the value of the pointers requires running a simple state machine. There are a few locations that a variable could be in, and all of the relevant ones require a degree of stack walking.

Tuesday, November 4, 2014

How I know that polishing dwarf-based location finding will take a while

http://www.dwarfstd.org/doc/040408.1.html

There are many ways that DWARF can encode variable locations; in fact, evaluating them requires an entire state machine. That said, C doesn't use most of them. After getting the basic parsing working, this will be something to return to and polish up.
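As a sketch of what evaluating a location expression involves, here is a toy stack machine over a few common operations. It takes already-decoded (op, operand) pairs rather than raw LEB128 bytes, and the operand shapes are simplified assumptions of mine, not libdwarf's API.

```python
def eval_location(expr, frame_base=0, regs=None):
    """Tiny stack-machine evaluator for a handful of DWARF location
    operations, taken as already-decoded (op, operand) pairs rather
    than raw bytes -- a sketch of the idea, not a parser."""
    regs = regs or {}
    stack = []
    for op, arg in expr:
        if op == "DW_OP_addr":          # absolute address (e.g. a static)
            stack.append(arg)
        elif op == "DW_OP_fbreg":       # offset from the frame base
            stack.append(frame_base + arg)
        elif op == "DW_OP_breg":        # offset from a register's value
            reg, off = arg
            stack.append(regs[reg] + off)
        elif op == "DW_OP_plus":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        else:
            raise NotImplementedError(op)
    return stack[-1]                    # the location is the top of stack

# a local at frame_base - 8, typical for C locals on x86-64
print(eval_location([("DW_OP_fbreg", -8)], frame_base=0x7ffe0000))
```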

Monday, November 3, 2014

Previous space optimizations of the BDW GC

http://www.cs.rice.edu/~javaplt/311/Readings/pldi93.pdf

- Conservative collectors have come in many shapes and sizes. Some track all pointers in heap while others are more conservative.

- Severe space leaks in conservative collectors have been noted in the past.

- Negative performance results associated with collectors likely due to space leakage and associated problems such as increased memory pressure and likelihood of thrashing.

- Removal of low-hanging fruit such as static values which point into heap area can lead to great improvements.

- The only paper (at the time of this paper) analyzing pitfalls of conservative GC: http://dl.acm.org/citation.cfm?id=79028


Friday, October 31, 2014

A Great Dwarf 101

While documentation for DWARF is plentiful, there are few guides on using it to get high-level information from memory addresses rather than the other way around. This PDF was fairly helpful in figuring out how to do so.

http://www.dwarfstd.org/doc/Debugging%20using%20DWARF.pdf

Thursday, October 30, 2014

DWARF resources

DWARF is the binary format used by debuggers to specify the types and other information related to values in the running program.

http://www.dwarfstd.org/doc/DWARF4.pdf

Parsing DWARF is not very fun. There are two main libraries which will do it for you, libdwarf and libbfd. Because libbfd (binary file descriptor) seems more arcane and more rigidly suited to working with ELF/DWARF files external to the current program, I'm electing to use libdwarf. Documentation seemed hard to find, but it turns out to be well-documented on SourceForge:

http://sourceforge.net/p/elftoolchain/wiki/libdwarf/

Proposal: Not-So Conservative Collection


The Boehm GC has a number of faults stemming from the fact that it is a conservative collector: it cannot know whether it is looking at a pointer into the heap or at a value whose bits merely happen to look like one. The .NET framework saw performance gains it considered significant when it switched to the ‘precise’ mode of the Boehm GC. I believe this is evidence that greater access to type information could allow the Boehm GC to achieve higher performance for the many, many projects and runtimes which use it.

In a master’s thesis (http://rus.har.mn/files/project-report.pdf), Russell Harmon showed that the standard DWARF debug symbols for an executable can be leveraged to gain a level of introspection into C programs that would otherwise be nearly impossible. These symbols can be either compiled into the executable or provided as a separate file. They contain information about the types of the variables used by the program, and can be used to find the type of an object on the stack.

I believe that there exists a way to use these debugging symbols to increase the effectiveness of the Boehm-Demers-Weiser collector by allowing introspection of the types of otherwise ambiguous roots. There already exist two separate ways to communicate the type of objects being allocated to the collector, but they require that the program specify the type at each malloc site. Allowing the DWARF symbols to be provided as another argument to the collector at initialization would let the collector infer these types without modification of the program in use. This is more in line with the spirit of the GC, whose authors spent significant effort making it a drop-in replacement for the stdlib malloc/free the programmer previously used.

I would use the type interface in their types.h, and would use libbfd or libdwarf to parse the DWARF data.

A caveat would be that casting a pointer to an integer may lead to freeing memory in use. As this is a purposeful defeat of the type system, I’m not sure if there is a way to track this and compensate for it. 


In more depth: 


All of the variables on the stack, or within scope when the collector is called, are noted by the Boehm collector. Each variable's type can be found by looking at the DWARF debugging symbols.

We don't know the type of objects on the heap, but all allocation in Boehm currently goes through an alternative implementation of malloc. This implementation looks into a structure to find a large enough contiguous run of cells in the allocator's heap area, which it has obtained from the OS's standard allocator. It allocates these cells with a header which notes the start and end of the data handed to the program inside of it. Therefore we know the size of every object on the heap, but we do not know the type.

The pain point for Boehm is not knowing whether something is a pointer, and what its type is. This prevents moving objects, because the references to them cannot be fixed. It also means that an integer which would point into the heap if cast to a pointer is treated as a pointer, which leads to garbage not being collected.

At the start of a collection, Boehm currently takes the values on the stack and, for any that could be pointers into the heap, marks the pointed-to cells' headers with a bit signifying they are not garbage. It then scans through these objects, and if any of them contain values that might be pointers, it adds those to the work-list to mark and scan.

With DWARF introspection, the collector takes the values on the stack and, if any of them may be pointers, looks up their types in the DWARF data. It then writes that type into the header of the allocation cell-grouping pointed to. When traversing the object, the collector can consult the DWARF data to tell which offsets may be pointers. If one points into the heap, it looks up the pointer's type in the DWARF data, writes that into the header for the pointed-to cell, and adds the pointer to the work-list. In this way, all live data will be given a type.
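The proposed typed mark phase might look like the following sketch, where plain dicts stand in for the heap headers and for the DWARF type lookup; all names and structures here are invented for illustration.

```python
def mark(roots, heap, type_layout):
    """Sketch of the proposed precise mark phase.  `heap` maps address ->
    header {'marked', 'type', 'fields': {offset: raw_value}}.
    `type_layout[t]` maps each pointer-holding offset of type t to the
    pointee's type, standing in for a real DWARF lookup.  Roots are
    (value, type) pairs whose types DWARF gave us from the stack."""
    worklist = list(roots)
    while worklist:
        addr, typ = worklist.pop()
        if addr not in heap:              # an integer, not a heap pointer
            continue
        hdr = heap[addr]
        if hdr['marked']:
            continue
        hdr['marked'] = True
        hdr['type'] = typ                 # record inferred type in the header
        for off, pointee in type_layout.get(typ, {}).items():
            ptr = hdr['fields'].get(off)
            if ptr is not None:
                worklist.append((ptr, pointee))

# two linked nodes; 0xdead is an int root that merely looks like a pointer
heap = {
    0x100: {'marked': False, 'type': None, 'fields': {0: 0x200}},
    0x200: {'marked': False, 'type': None, 'fields': {0: None}},
}
mark([(0x100, 'node'), (0xdead, 'int')], heap, {'node': {0: 'node'}})
print([hex(a) for a, h in sorted(heap.items()) if h['marked']])
```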


Proposal: Static Memory-Management Cost Reasoning in Garbage Collected Languages

Features of dynamic languages tend to have hidden costs, and a large number of real analyses and optimizations have shown that large programs in languages such as Python, JavaScript, and Ruby are frequently much slower than they need to be, because the syntactic cheapness of rapid allocation hides its cost. This is in stark contrast to explicit memory management, where the operations likely to slow a given function egregiously require a significant amount of verbosity.

These problems can be caught by profiling, but that process tends to be time-consuming and prone to addressing only problems on the most frequent path.

Given information about the rate at which a language’s performance degrades as the heap fills, and the relative costs of read/write barriers and inter-generational pointers, the total cost of a series of allocations can be estimated. The call graph can be used to statically determine the relative ages of arguments, to infer inter-generational pointers. When a data structure is abnormally short-lived and has a higher allocation cost, this system could recommend that the programmer consider solving the problem with more primitive structures or find a way to persist some of the information between calls. When garbage cycles are created in the same block in a reference-counted language, the programmer could be gently informed that the cycle has a cost.
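As a toy version of this kind of reasoning, the following uses Python's ast module to flag container allocations that execute inside a loop; a real tool would then weigh these findings against collector cost parameters rather than just reporting line numbers.

```python
import ast

def allocations_in_loops(source):
    """Toy static check: flag list/dict/set displays and comprehensions
    that execute inside a loop, where the syntactic cheapness of
    allocation hides a per-iteration cost."""
    tree = ast.parse(source)
    alloc_nodes = (ast.List, ast.Dict, ast.Set,
                   ast.ListComp, ast.DictComp, ast.SetComp)
    findings = []
    for loop in ast.walk(tree):
        if isinstance(loop, (ast.For, ast.While)):
            for node in ast.walk(loop):
                if isinstance(node, alloc_nodes):
                    findings.append(node.lineno)
    return findings

src = """
total = 0
for i in range(1000):
    row = [i, i * i]        # fresh list allocated every iteration
    total += sum(row)
"""
print(allocations_in_loops(src))   # line of the in-loop allocation
```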

As this is a reasoning tool, static understanding of all memory access is not required for this to be useful or correct. Furthermore, I believe that it is a good platform for experimentation with static analysis of memory management activity. A compiler may one day use heuristics used here to determine refactors which could improve GC performance.

I would propose that this be implemented for CPython as a reference implementation, because as a reference-counted runtime it already pays a write-barrier-like cost on every reference update. A cycle collector is used to break cycles, and cycle-collecting GC pause times have been considered an almost deal-breaking headache when relying on asynchronous IO for lower latency. The added complexity of the collector's generational scheme might require a degree of ‘interpretation’ of the allocation/deallocation statements in a block to determine the likelihood that they would cause or hasten the next expensive collection.

In order to make this information most useful, I would create bindings to this plugin from emacs/vim. I’m familiar with the scripting languages of these editors, and believe this to be the easier part.

I believe that much is to be gained by adding automated reasoning about memory management to the workflow of developers.

Wednesday, October 22, 2014

Read Barrier Elimination Through Concurrency


1 Eliminating Read Barriers through Procrastination and Cleanliness

1.1 Abstract

1.1.1 Goal is to avoid need for a read barrier

1.1.2 Use concurrency and stalling to avoid need for immediate copy

1.1.3 Can avoid the need to stall by using language semantics to allow references into fromspace, or by copying when a small set of references to the object is known

1.2 Introduction

1.2.1 Use of parallelism and thread-local heaps with single shared heap uses write barriers to enforce

  1. No inter-local-heap pointers
  2. No pointers from shared heap to local heap

1.2.2 Terminology: exporting writes = writes which incur a write barrier, export the object to the shared heap, and set up a forwarding pointer.

1.2.3 Needs a read barrier because reference might refer to newly-exported object.

1.2.4 Read barrier cost is high, but removal is nontrivial.

1.2.5 Idea: Delay operation

  1. Read barriers only need to exist if there are forwarding pointers which exist between move and collection
  2. Can delay operations that create them
  3. We can stall threads which need to cause copy to shared heap, procrastinating the move
  4. GC is informed of these procrastinated operations at start of collection
  5. As long as there are enough threads, cost of procrastination simply cost of context switch.

1.2.6 Idea: Avoid stalling using language semantics

  1. Terminology: Cleanliness = A clean object can be copied without creating a forwarding pointer and without interfering with language semantics.
  2. Ex:
    1. Immutable object
    2. Objects only referenced by the stack in creating thread
    3. Objects with known set of incoming references which can inexpensively be fixed on copy.
      1. Specialized write barrier can be used to make this less costly.
  3. Effectiveness depends on
    1. Most object being clean
    2. Rarely have many popular, mutable objects
    3. Sufficient concurrency to make stalling acceptable.
  4. Implementation: MultiMLton. Doesn't require programmer intervention, and the change can be isolated to the runtime.

1.3 Motivation

1.3.1 Implementation considered is MultiMLton.

1.3.2 Has userspace concurrency, many green threads mapped to single kernel thread.

1.3.3 Each core has local heap, and there's a single shared heap.

1.3.4 Max heap set at 3x min heap.

1.3.5 Extensive optimization to avoid heap usage

1.3.6 On exporting an object, the stack is walked and all references are fixed. Therefore roots never point to forwarding pointer.

1.3.7 If object has been moved, read barrier returns new address. Therefore conditional read barrier.

1.3.8 Mutator spends ~20% of execution time in read barrier.

1.3.9 A Brooks-style unconditional barrier would reduce time, but since most objects are 3 words or fewer, the forwarding pointer would have significant overhead.

1.3.10 Less than 1/100 of a percent of reads end up being to forwarding pointers. Therefore the read barrier is a significant cost for little benefit.

1.4 GC Design and Implementation

1.4.1 Threading system

  1. Makes one thread per core, pins it to it.
  2. Lightweight threads have core affinity.
  3. Threads yield at GC safe point if preempted.

1.4.2 Baseline collector (stop the world)

  1. Single shared heap
  2. Tlabs
  3. GC requires thread synchronization on barrier
  4. Cheney two-space copying, but switches to mark-compact if it ends up using more than half of the heap.
  5. Uses Appel-style generational collection
  6. Runtime switches to generational if amount of live objects drops below a threshold.
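The baseline's Cheney two-space copy can be sketched over a toy heap; the representation here (dicts whose refs are indices into from-space) is invented for illustration, but the scan-pointer trick is the textbook algorithm.

```python
def cheney_collect(roots, from_space):
    """Textbook Cheney two-space copy over a toy heap.  Objects are
    {'data', 'refs'} where refs are indices into from_space; returns
    the new roots and to-space.  The scan pointer doubles as the
    breadth-first work list: everything before it is fully forwarded."""
    to_space, forward = [], {}

    def copy(idx):
        if idx not in forward:                   # not yet evacuated
            obj = from_space[idx]
            forward[idx] = len(to_space)         # forwarding entry
            to_space.append({'data': obj['data'], 'refs': list(obj['refs'])})
        return forward[idx]

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):                  # Cheney scan
        to_space[scan]['refs'] = [copy(r) for r in to_space[scan]['refs']]
        scan += 1
    return new_roots, to_space

# the 0 <-> 1 cycle is live; object 2 is garbage and is never touched
from_space = [{'data': 'a', 'refs': [1]},
              {'data': 'b', 'refs': [0]},
              {'data': 'garbage', 'refs': []}]
roots, to_space = cheney_collect([0], from_space)
print(len(to_space))    # only the live cycle survives
```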

1.4.3 Local collector (split heap)

  1. Similar to baseline collection, but doesn't need to synchronize for local GC
  2. Shared heap collected by single core after synchronization
  3. Shared-heap collections infrequent because shared objects tend to live much longer than a tenured object in a generational collection scheme

1.4.4 Remembered stacks

  1. Threads communicate over message-passing channels
  2. If no receiver, sender blocks on channel
  3. If channels in shared heap, would eventually pull most objects into shared heap. Undesirable
  4. Channel stacks never moved to shared heap, so references from exported objects are stored in a set of remembered stacks.

1.5 Cleanliness Analysis

1.5.1 Heap Session

  1. Local allocation is bump-a-pointer
  2. We have a reference into the heap called sessionStart. Before sessionStart is previous session. After sessionStart is current session.
  3. Only need to scan current session in cleanliness analysis.
  4. Sessions start on context switch, local GC, or export.

1.5.2 Reference count

  1. We steal 4 bits in each object header to record number of references within the current session.
  2. 4 bits encode zero, one, localmany, and globalmany
  3. Zero => Only references from stacks, globalmany => at least one reference outside current session.
  4. Implemented as part of write barrier. Non-decreasing, so represents a maximum.
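The counting half of the write barrier might be sketched as follows; the state transitions are my paraphrase of the paper's description (a saturating, never-decremented 4-bit summary), and the names are invented.

```python
ZERO, ONE, LOCALMANY, GLOBALMANY = range(4)   # 4-bit header summary

def note_reference(obj, source_in_current_session):
    """Sketch of the counting side of the write barrier: the per-object
    state only ever increases, so it is an upper bound on the number of
    in-heap references, and it distinguishes references originating
    inside the current session from those that do not."""
    if not source_in_current_session:
        obj['refstate'] = GLOBALMANY          # a reference from outside
    elif obj['refstate'] == ZERO:
        obj['refstate'] = ONE
    elif obj['refstate'] == ONE:
        obj['refstate'] = LOCALMANY
    # LOCALMANY and GLOBALMANY saturate: counts are never decremented
    return obj['refstate']

obj = {'refstate': ZERO}
print(note_reference(obj, True))    # 1 (ONE)
print(note_reference(obj, True))    # 2 (LOCALMANY)
print(note_reference(obj, False))   # 3 (GLOBALMANY)
```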

1.5.3 Cleanliness

  1. An object is clean if:
    1. It is immutable
      1. Trivially safe to copy
    2. Object is a root and has zero references
    3. Object is not a root and has one reference
    4. Object is not a root, is in the current session, and has localmany
      1. Can't know the number of references without walking the local heap (expensive), so we instead ask whether all pointers originate from the current session (LOCALMANY).
  2. Lifting and fixing references in transitive closure in current session cheaper than walking whole local heap.
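Putting the cases together, the cleanliness test and the write-barrier decision might look like this sketch; it is my paraphrase of the paper's rules for a single object (the real check covers the whole transitive closure being lifted), and all names are invented.

```python
ZERO, ONE, LOCALMANY, GLOBALMANY = range(4)   # the 4-bit refstate summary

def is_clean(obj, is_root, in_current_session):
    """Paraphrase of the cleanliness test for one object."""
    if obj['immutable']:
        return True                     # trivially safe to copy
    rc = obj['refstate']
    if is_root:
        return rc == ZERO               # only stack references
    if rc == ONE:
        return True                     # the single reference gets fixed
    return rc == LOCALMANY and in_current_session

def on_exporting_write(obj, is_root, in_current_session):
    """Sketch of the barrier's decision: copy now if clean, otherwise
    procrastinate by context-switching to another thread."""
    if is_clean(obj, is_root, in_current_session):
        return "lift to shared heap"
    return "procrastinate (stall thread, queue for next local GC)"

print(on_exporting_write({'immutable': True, 'refstate': GLOBALMANY}, False, False))
print(on_exporting_write({'immutable': False, 'refstate': GLOBALMANY}, False, False))
```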

1.6 Write barrier

1.6.1 If not primitive, and exporting, and clean, copy transitive closure to shared heap.

1.6.2 Delaying writes

  1. If source of exporting write not clean, context switch to another thread.
  2. Source is added to queue of objects to lift
  3. If all threads associated with core blocked, trigger local GC and perform all lifting and reference fixing
  4. Mutator never encounters forward reference.

1.6.3 Lifting objects to shared heap

  1. Triggers shared heap GC if lacks memory for copy
  2. Immutable objects copied, location added to immutable set, mutable objects forwarded
  3. Stack walked to fix references
  4. If localmany set for an object, the current session is walked to fix references.

1.6.4 Remote spawns

  1. Spawning a new thread can cause function closures to escape local heaps
  2. Environment of function moved to shared heap, function closure added to target core's scheduler.
  3. If closure clean, we move it immediately. Else, we procrastinate.

1.6.5 Barrier Implementation

  1. Two local collectors implemented. One with read barriers, one which instead uses cleanliness/procrastination.

1.7 Target Architectures

1.7.1 Used a 16-core Linux machine, a 48-core Intel Single-chip Cloud Computer, and an 864-core Azul Vega 3.

1.8 Results

1.8.1 Benchmarks

  1. Input size and threads tuned to ensure adequate core utilization

1.8.2 Performance

  1. Ported the Boehm-Demers-Weiser conservative GC to test against
  2. Consecutive tests done by shrinking maximum allowed heap size and recording running time and actual heap utilization.
  3. Read-barrier-less had better performance, but had higher minimum heap size.
  4. Stop-the-world gc has faster mutator time than alternatives, but impacted scalability because collection was serial when world stopped.
  5. Boehm GC slow.

1.8.3 Impact of cleanliness

  1. Without cleanliness:
    1. Had significantly more stalls/preemptions when making write
    2. 49% of local collections were forced when concurrency exhausted, as opposed to 1% otherwise.
    3. 28.2% slower execution time

1.8.4 Impact of Immutability

  1. Benchmark done with all objects treated as mutable
  2. 11.4% negative performance impact

1.8.5 Impact of heap session

  1. Without the local heap session (treat all localmany as reason to stall), significant increase in objects being tagged as non-clean

1.8.6 Related work

  1. GHC - Multicore GC with local heaps - SPJ
    1. Export only part of transitive closure to shared heap.
    2. Mandates need for read barrier, but GHC already has read barrier to test for thunk evaluation.
  2. Manticore

Created: 2014-10-22 Wed 17:05

Emacs 24.4.1 (Org mode 8.2.10)


A Control-Theory Approach to Heap Resizing

1 Heap Sizing

1.1 Control Theory for Principled Heap Sizing

1.1.1 Intro

  1. Programs frequently have separate phases with very different memory behavior; no static heuristic is acceptable
  2. Control theory offers more formalism than trying to glue random, effective heuristics together.
  3. Paper describes the idea of having a GC controller and increasing the heap size until the heap thrashing decreases and a throughput goal is hit.

1.1.2 Sweet spots

  1. If the heap is too large, paging will impact performance. If too small, the heap will thrash.
  2. Running time as a function of heap size forms a parabolic curve. Ahead-of-time choice of heap size by benchmarking tends not to work, though.

1.1.3 Heap sizing in existing VMs.

  1. Existing VMs tend to increase heap size too optimistically, and decrease it more slowly than could be desired.
  2. Jikes RVM
    1. After each GC, manager determines new "resize ratio" based on short-term GC overhead (ratio of GC time to time since last GC) and the ratio of the heap that was live
      1. Is an index into a lookup table of hard-coded values found by trial-and-error
    2. Not explicitly goal-oriented in terms of throughput or heap size
    3. Doesn't take history into account, can flip-flop between states needlessly when live data size fluctuates back and forth quickly.
    4. There is no evidence that different GC algorithms will match these hard-coded heuristic thresholds
  3. Hotspot
    1. Has a lot of tunable "ergonomics," and will resize heap to make best-effort to meet all priorities.
    2. Resizing rate much less flexible, ratio bounded between 0.95 and 1.2. Early in execution, it is allowed to grow at ratio 2.
    3. Vengerov has questioned whether the resize policies actually ensure progress towards goals.

1.1.4 Heap sizing as a control problem

  1. Should use control theory
    1. Control theory mature in engineering
    2. Responds to both inputs and current deviation from goal.
    3. Software is very much like such a system: software phase changes, and the sudden need to enter an expensive, less frequent mode, can be seen as a change to the goal.
    4. Definitions:
      1. Steady-state error: The error remaining after the system has settled in response to a change.
      2. Transient response characteristics: How quickly a system responds to an input.
  2. Formulating the Problem
    1. What's the input, what's the control variable, what's the feedback mechanism?
    2. Control variable = Heap resize ratio
    3. Measurement variable = Short-term GC overhead (ratio of GC pause time to time since last collection)
      1. Can use others as well, but not done in paper.
      2. Finds median overhead for past 5 collections. Dampens sensitivity and oscillation.

1.1.5 Designing a Heap Size Controller

  1. Chose to treat entire GC system as black box, tune using Proportional Integral Derivative controller
  2. Use a target GC overhead to resize heap
  3. Only consider resizing after GC to avoid confounding variables in controller training.
  4. PID Controller
    1. Why PID is appropriate
      1. Doesn't require a model of the system.
      2. Promises eventual zero steady-state error
      3. Takes history into account.
  5. PID Controller Theory
    1. PID takes in responsiveness configuration, responds by outputting a change or "gain" in the controller
  6. Implementation
    1. Built on Jikes/MMTK
    2. Uses bytes allocated as time proxy
    3. Heap growth manager was modified to use the PID controller
    4. Source: http://sourceforge.net/p/jikesrvm/research-archive/40/
  7. Tuning:
    1. Can use the "empirical tuning method" which is effective, but dangerous in physical engineering.
    2. Ideal for software tuning
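The shape of such a controller can be sketched in a few lines; the gains and the clamping below are invented placeholders, not the paper's tuned values, and the interface is my own simplification.

```python
class HeapSizePID:
    """Minimal PID controller in the spirit of the paper: after each GC,
    feed in the observed short-term GC overhead and get back a heap
    resize ratio that steers toward a target overhead."""
    def __init__(self, target, kp=2.0, ki=0.5, kd=0.1):
        self.target = target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def resize_ratio(self, observed_overhead):
        error = observed_overhead - self.target   # + means GC too costly
        self.integral += error                    # history term: drives steady-state error to zero
        derivative = error - self.prev_error      # reacts to the trend
        self.prev_error = error
        gain = (self.kp * error + self.ki * self.integral
                + self.kd * derivative)
        return max(0.5, 1.0 + gain)               # grow heap when overhead high

pid = HeapSizePID(target=0.10)
print(pid.resize_ratio(0.30))   # overhead 30% vs 10% goal -> grow
print(pid.resize_ratio(0.05))   # under goal -> allowed to shrink
```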

1.1.6 Evaluation

  1. See paper for charts, results.
  2. Generally positive
  3. What might be worrying is that frequent, drastic changes in live data cause frequent, drastic changes in heap size.

1.1.7 Related work

  1. Heuristic Approaches
    1. Brecht used the Boehm GC. Grows the heap by a variable, finely-grained amount. Never shrinks the heap.
    2. Most recent work has explicit goal to avoid paging
      1. Yang requires changes to OS
      2. Isla Vista system detects risk of paging by detecting allocation stalls
        1. Grows aggressively until allocation stall detected, and then shrinks aggressively
      3. Hertz has different VMs communicate relative stall/paging rates to optimize collection of multiple VMs.
  2. Mathematical Models
    1. Sun varies size of heaps of multiple applications to hit performance goals
    2. Tay and Zou use the number of page faults to direct the resizing strategy and reduce the number of faults.
      1. Threshold values determined by benchmarking, requires new derivation on a new system or new system behavior.
    3. Vengerov determines time not spent in GC for HotSpot. Uses a study of the HotSpot system to find a custom tuning algorithm.
  3. Control Theory Papers
    1. Storm deals with database configuration, uses control theory to handle ratio and size of sub-heaps for specific types of cached assets in database
    2. Gandhi uses control theory to optimize apache's CPU and memory utilization
    3. Growing research area

Created: 2014-10-22 Wed 16:56


Page Fault Driven Heap Size

1 A Page Fault Equation for Dynamic Heap Sizing

http://www.math.nus.edu.sg/~mattyc/HeapSizing.pdf

This is an interesting paper. It more or less presents an equation relating page faults to heap size and available main memory, then derives a heap sizing rule from it.

1.1 Introduction

H is the heap size, M is the size of available main memory

1.1.1 Suggested policy out-performs JikesRVM's heap size manager

1.1.2 The Problem:

  1. Frequent GC pauses in a small heap lead to significant performance degradation due to cache pollution.
  2. A large heap faces frequent page faults.
    1. GC actually causes more page faults than the mutator when the heap size exceeds main memory.
  3. Tuning intelligently can possibly lead to better performance than manual memory management can provide.
  4. Heap cannot be static, must be able to scale to very small and very large heaps.
  5. The number of page faults is not simply a function of actual memory and heap size; it includes mutator and collector behavior.
    1. Since it is hard to isolate the behavior's effects, the paper simply considers the rate of page faults.

1.1.3 Heap-Aware Page Fault Equation

  1. Equation takes parameters specific to mutator, GC, and OS
  2. See paper for equation.
  3. Refinement of the existing general "page fault equation"

1.1.4 General Page Fault Equation:

  1. "A page fault equation for modeling the effect of memory size" by Tay and Zou
  2. Note: Paper with restricted viewership rights

1.1.5 Heap sizing Rule:

  1. Uses the parameters for the page fault equation to know how to resize.
  2. Doesn't require OS or hardware modification or costly setup of polling, handlers, and callbacks for stalls and page references.
  3. Can automatically respond to changes in main memory (cloud hosting?)
  4. Allows for hotswapping of collectors if parameters found.

1.2 Heap-Aware Page Fault Equation:

1.2.1 Page Fault Equation

  1. Takes a number of variables that must be found or estimated.

1.2.2 Paper describes parameters in general equation

1.2.3 GC-specific parameters

  1. Centers around M*, the memory usage of the heap.
    1. Used in consideration of paging equation, which considers the page fault rate as a function of the actual memory and the program's memory usage.
  2. M* = aH + b
    1. We expect the memory footprint to grow as a linear function of the heap size
    2. b = Measure of space overhead of collector / VM outside of heap.

1.2.4 Experimental Validation

  1. No existing classical page fault analysis models are able to account for the thrashing effect of small heap sizes, and the respective increase in performance from allowing a slightly larger-than-necessary heap.
  2. Experimental evidence suggests that the page fault equation and heap resizing policy both hold.

1.2.5 Can derive a Hmin, the smallest reasonable heap.

1.3 Heap Sizing

1.3.1 Heap sizing rule

  1. Heap sizing rule falls out of the H > Hmin, H < Hmax requirements.
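As a rough sketch of how such a rule could be applied (an assumption-laden illustration, not the paper's actual policy): invert M* = aH + b to find the largest heap whose footprint still fits in available memory, then clamp to the Hmin/Hmax requirements.

```java
// Hypothetical sketch of a heap sizing rule derived from M* = a*H + b:
// pick H so that the predicted footprint matches available memory,
// clamped to the [Hmin, Hmax] requirements.
class HeapSizer {
    static double chooseHeapSize(double availableMemory, double a, double b,
                                 double hMin, double hMax) {
        double h = (availableMemory - b) / a; // footprint == availableMemory
        return Math.max(hMin, Math.min(h, hMax));
    }
}
```

Because the rule only needs a, b, and the current available memory, it can react to memory being added or removed (the cloud-hosting case above) without polling the OS for stall or page-reference events.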

1.3.2 Experiments with static memory size

  1. Consistently has fewer page faults than JikesRVM's default resizing policy.
  2. Since page faults dominate execution time, it is also faster.

1.3.3 Experiments with variable memory size

  1. Outperforms Jikes significantly.

Created: 2014-10-22 Wed 16:57

Emacs 24.4.1 (Org mode 8.2.10)


Wednesday, October 15, 2014

MMTK/Jikes RVM


MMTK

A garbage collection framework similar to the plai2 framework, but designed less as a "toy", is MMTk (the Memory Management Toolkit) for the Jikes RVM.

A quick overview of the interaction between the virtual machine and the mmtk framework can be found here: http://jikesrvm.org/Memory+Allocation+in+JikesRVM

Jikes is surprisingly both non-frightening and full-featured. One of the surprising things about MMTK is the presence of such optimizations as thread-local allocation interfaces while still retaining the simplicity of the memory manager interface.

If you're planning on making a collector for jikes, I believe that the best overview of the way that the collector is expected to be structured and how Jikes will use it can be found here: http://jikesrvm.org/Anatomy+of+a+Garbage+Collector

It comes with a large number of garbage collectors. They all satisfy a simple interface which captures the information that a collector needs.

Many of them will implement this interface, so it's worth looking at for the curious: http://jikesrvm.org/docs/api/org/mmtk/plan/Simple.html

There is a fairly straightforward tutorial on adding a simple mark-sweep collector and a quasi-generational hybrid copying/mark-sweep collector here: http://jikesrvm.org/MMTk+Tutorial

And I was glad to see that there was a comprehensive test harness for jikes. One of the most difficult things with plai2 is figuring out how to test things in isolation and together, and how to track test program misbehavior to a specific collector fault. http://jikesrvm.org/The+MMTk+Test+Harness The harness looks to be invaluable.

There's a canonical document here: http://cs.anu.edu.au/~Robin.Garner/mmtk-guide.pdf. While it appears to be unfinished and very empty, it does helpfully explain the naming convention used throughout the API.

Plans

Plans are ways of bringing together collection, allocation, and heap policies into something that can be used by MMTk. A plan is a package with classes for each of these.

Spaces

A collection of regions of memory, each with a given allocation and collection policy.

Barriers

MMTk currently supports write barriers, but not read barriers.

Policies

A policy is a region of memory with a collection and allocation strategy.
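A toy illustration of how these pieces fit together (this is NOT the real MMTk API; every class and method name here is invented for exposition): a plan routes each allocation to a space, and each space is governed by a policy.

```java
// Toy model of MMTk's plan/space/policy decomposition. A plan owns a set
// of spaces; each space pairs a region of the heap with a policy; the plan
// decides which space services a given allocation.
class ToyPlan {
    enum Policy { MARK_SWEEP, COPYING, IMMORTAL }

    // A space: a named region of memory with one allocation/collection policy.
    static class Space {
        final String name;
        final Policy policy;
        Space(String name, Policy policy) { this.name = name; this.policy = policy; }
    }

    final Space nursery = new Space("nursery", Policy.COPYING);
    final Space mature  = new Space("mature",  Policy.MARK_SWEEP);
    final Space los     = new Space("los",     Policy.MARK_SWEEP); // large object space

    static final int LOS_THRESHOLD = 8 * 1024; // bytes; invented cutoff

    // The plan's allocation routing decision.
    Space spaceFor(int sizeBytes) {
        return sizeBytes >= LOS_THRESHOLD ? los : nursery;
    }
}
```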

Date: 2014-10-16T00:11-0400

Author: Alex Kyte

Org version 7.9.3f with Emacs version 24


Hotspot Notes


Interface

How does Hotspot interact with the garbage collector?

CollectedHeap

The CollectedHeap interface defines the behavior of a heap which is garbage collected.
  • Components
    • _gc_cause, _gc_lastcause, _set_gc_cause: Record the causes when collections are triggered. Needs to be reset to a "no cause" state between collections.
    • _total_collections, _total_full_collections: Counters of collections performed so far.
    • allocate_new_tlab(size), accumulate_statistics_all_tlabs, resize_all_tlabs, allocate_from_tlab: Interact with thread-local allocation buffers.
    • common_mem_allocate_noinit, common_mem_allocate_init: Allocate a block of the given size. Initialization means zeroing it out.
    • fill_with_array, fill_with_object_impl: Initialize an array or object that has been allocated.
    • Name: The heap strategy being used. It's an enum mostly named for collector types.
    • initialize, post_initialize: Initialize the "universe", and do other setup tasks before allocation.
    • _is_maximal_no_gc: Whether the heap has reached its maximum size allowed without a GC.
    • max_capacity: The amount of heap memory the VM can allocate for Java objects made by the user's code, excluding bookkeeping.
    • is_in_reserved, is_in_reserved_or_null: Whether the pointer is in the reserved heap region (the _or_null variant also accepts null).
    • is_in_closed_subset, is_in_closed_subset_or_null: Whether the pointer is in a closed subset (all objects in the subset point only to other objects in the subset).
    • is_scavengable: Whether the object can move during a scavenge (partial collection).
    • n_par_threads, set_par_threads: Number of threads currently working on GC tasks.
    • Class_obj_allocate: Allocate and initialize an instance of a class.
    • fill_with_objects, fill_with_object, etc.: Fill allocated raw memory with object(s) and arrays.
    • ensure_parsability: Place the heap in a parseable state if it isn't in one (e.g. in the middle of a GC).
    • unsafe_max_alloc: Estimate of what can be allocated without possibly causing a collection.
    • supports_tlab_alloc, tlab_capacity, tlab_max_alloc: Related functions for TLAB interaction.
    • can_elide_store_barriers: Whether the compiler can avoid a store barrier. If so, it must later call a deferring function to defer the store action, and must also later trigger a flush of these deferred stores.
    • collect, do_full_collection, collect_as_vm_thread: Various ways to request a collection, requiring various levels of locks.
    • is_gc_active: Whether the world is stopped to collect; false during the concurrent part of concurrent collectors.
    • CollectorPolicy, AdaptiveSizePolicy: The policies the heap has for collection and resizing, with associated costs for actions.
    • millis_since_last_gc: Time since the last GC.
    • print_gc_threads, print_all_gc_threads, trace_all_threads: Print / trace / debug threads.
    • use_parallel_gc_threads: Whether parallel GC threads are being used.
  • SharedHeap
    A SharedHeap is defined as a heap for the java language. It's an abstract class, but contains common infrastructure.
    • G1CollectedHeap
      • A SharedHeap which uses G1's partition strategy
    • GenCollectedHeap
      • A generationally-partitioned SharedHeap
  • ParallelScavengeHeap
    ParallelScavengeHeap allocates pages from a survivor space in round-robin fashion. It differs from all other collection heaps in that it doesn't use the SharedHeap structure. It is NUMA-aware.
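The TLAB entry points in the table above suggest a two-level allocation scheme. A simplified sketch of that scheme (a toy model with invented names, not HotSpot's actual implementation): a bump-pointer fast path inside the current TLAB, with a slow path that retires it and grabs a fresh TLAB from the shared heap.

```java
// Toy model of TLAB allocation: addresses are plain longs, the "heap" is
// just a cursor, and a real VM would consider a GC where we return -1.
class TlabSketch {
    static final long TLAB_SIZE = 4096;

    private long top, end;        // current TLAB bounds; empty until first use
    private long heapCursor;      // next free address in the shared heap
    private final long heapEnd;

    TlabSketch(long heapStart, long heapEnd) {
        this.heapCursor = heapStart;
        this.heapEnd = heapEnd;
    }

    // Fast path: bump the TLAB pointer. Slow path: allocate a new TLAB.
    // Assumes size <= TLAB_SIZE.
    long allocate(long size) {
        if (top + size > end && !allocateNewTlab()) {
            return -1; // heap exhausted: a real VM would trigger a collection
        }
        long result = top;
        top += size;
        return result;
    }

    // Carve the next TLAB-sized chunk out of the shared heap.
    private boolean allocateNewTlab() {
        if (heapCursor + TLAB_SIZE > heapEnd) return false;
        top = heapCursor;
        end = heapCursor + TLAB_SIZE;
        heapCursor = end;
        return true;
    }
}
```

The point of the design is that the fast path is a couple of arithmetic instructions with no synchronization, since each thread owns its TLAB exclusively.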

Collectors

Generational

Interaction

https://blogs.oracle.com/jonthecollector/resource/Collectors.jpg
  • Why can't parNew and parallelOld work together?
    "ParNew" (and "Serial") implements space_iterate(), which will apply an operation to every object in the young generation. When collecting the tenured generation with either "CMS" or "Serial Old", the GC can use space_iterate() to do some work on the objects in the young generation. "Parallel Old" does not implement this interface. (Source: https://blogs.oracle.com/jonthecollector/entry/our_collectors)

Young Generation

  • ParallelScavenge
    • "Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.
  • ParNew
    • The newer parallel collector. It will eventually replace the ParallelScavenge collector
  • Serial
    • "Serial" is a stop-the-world, copying collector which uses a single GC thread.

Tenured Generation

  • ConcurrentMarkSweep
    • "Concurrent Mark Sweep" (CMS) is a mostly concurrent, low-pause collector.
  • SerialOld (mark-sweep-compact)
    • "Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.
  • ParallelOld
    • "Parallel Old" is a compacting collector that uses multiple GC threads.

G1(Garbage-first Garbage Collection)

Summary:

  • Divides the heap into regions and collects as many as it can while abiding by the desired pause time.
  • Designates regions to act as the "young" generation, which will be collected first.
  • Collects regions with the most garbage first ("garbage-first")
  • Compacts to avoid fragmentation issues, but can still suffer internal fragmentation from many under-filled regions.
  • Because there isn't a static divide between the young and tenured generations, there can't be a problematic imbalance in their sizes.
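The region-selection idea above can be sketched as a greedy loop (a simplified model assuming evacuation cost is linear in live bytes; G1's real pause predictor is more elaborate): sort regions by reclaimable garbage and keep adding them to the collection set while the predicted pause fits the budget.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified "garbage-first" collection-set selection.
class RegionSelector {
    static class Region {
        final long liveBytes, garbageBytes;
        Region(long live, long garbage) { liveBytes = live; garbageBytes = garbage; }
    }

    // costPerLiveByteMs is an assumed linear cost model, not G1's predictor.
    static List<Region> chooseCollectionSet(Region[] regions, double pauseBudgetMs,
                                            double costPerLiveByteMs) {
        Region[] sorted = regions.clone();
        // Most garbage first: these regions reclaim the most space per pause.
        Arrays.sort(sorted, Comparator.comparingLong((Region r) -> r.garbageBytes).reversed());
        List<Region> set = new ArrayList<>();
        double predictedPause = 0;
        for (Region r : sorted) {
            double cost = r.liveBytes * costPerLiveByteMs; // evacuation cost ~ live data
            if (predictedPause + cost > pauseBudgetMs) break; // would exceed the budget
            predictedPause += cost;
            set.add(r);
        }
        return set;
    }
}
```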

Paper:

http://dl.acm.org/citation.cfm?doid=1029873.1029879
  • 1 Intro
    • Designed to be a soft real-time system, where a maximum percent x of the next y-duration timeslice is specified
    • All inter-region pointers are tracked, not simply old-to-young
    • A write barrier logs changes, and a concurrent thread updates remembered sets with this log
    • Snapshot-at-the-beginning
      • At beginning of marking period, mutator threads paused and roots collected. This snapshot pause is relatively short.
      • Used to guarantee completeness
      • Finds the percentage of each region that's garbage, to allow garbage-first collection to work
    • Hard real-time abandonment
      • Rather than making collection interruptible at the granularity of each object moved, use the granularity of a region
      • Can estimate collection time for each region (a function of the amount of live data), and try to fit within soft real-time bounds
      • Allows higher throughput and less synchronization than permitting interruption at arbitrary points
  • 2 Data structures
    • 2.1 Heap layout/regions/allocation
      • All allocation done from same allocation region
      • Allocation is "bump-a-pointer" of the "top" boundary marker in the current allocation region
      • All non-large objects are allocated into thread-local allocation buffers (TLABs)
      • TLABs are allocated from a linked list of free regions. Each region consists of many TLAB-sized subregions.
      • Large and humongous objects are allocated separately into heap regions outside of TLABs.
    • 2.2 Remembered Set Maintenance
      • Each region has one hash-based card-table set per parallel GC thread, to avoid contention
      • Write barrier which checks if change creates inter-region pointer. If not, or if new reference is NULL, does nothing.
      • Otherwise, logs the write into a write log buffer; when the buffer fills, it is enqueued and a new one is obtained.
      • Concurrent Remembered Set Thread
        • Has an "initiating threshold"
        • When the RS buffer size exceeds the threshold, the CRST processes buffers until the size drops back below the threshold.
        • Uses a separate card table to determine which cards are "hot", incrementing a count as cards are processed. Hot cards are only processed during evacuation pauses.
      • Only one remembered set thread runs. If the RS buffer is too full, mutator threads are forced to do RS work themselves in order to allocate.
    • 2.3 Evacuation Pauses
      • Stops mutator threads
      • Choose a set of regions, and evacuate the live data in regions into other heap regions.
      • Object movement must appear atomic, so world is stopped to allow incremental compaction.
      • Parallelized
      • Threads compete to claim sub-tasks in collection. No synchronization used after ensuring each object has only one thread working on it.
      • Algorithm:
        • Copy the object into a GC-thread TLAB, and race to install a Brooks-style forwarding pointer
        • The winner scans and processes the contents.
        • Work-stealing balances the load
    • 2.4 Generational Garbage-First
      • Can take advantage of the generational hypothesis with heuristic
      • A region is designated young while it is the mutator allocation region
      • Young regions always part of collection set
      • Remembered set processing not required for young regions
      • Can be toggled between "pure garbage-first" and "generational"
    • 2.5 Concurrent Marking
      • Uses snapshot-at-beginning. Considers objects allocated during concurrent marking as alive. Guaranteed to find all garbage that exists during snapshot.
      • 2.5.1 Marking Data Structures
        • A bitmap is used to mark. "Previous" bitmap was bitmap for last marking period. "Next" is under construction. Two bitmaps switch when marking completed.
      • 2.5.2 Initial Marking Pause
        • Clears next marking bitmap.
        • Stops all mutator marking threads, marks everything reachable from roots
        • Each region records top at mark start(TAMS) for previous and next.
        • Allows one to determine when objects are allocated during marking because they are above TAMS.
      • 2.5.3 Concurrent Marking Write Barrier
        • If mutator removes pointers that exist at start, SATB guarantee can be violated.
        • Needs a write barrier to store the original version of fields before overwriting
        • It's only necessary to log writes to non-previously-null fields, and most writes are initialization writes so recording is cheaper than one would think
      • 2.5.4 Final Marking Pause
        • Mutator threads paused and changes since initial pause are processed by marking threads. Marking is thus finished.
      • 2.5.5 Live Data Counting and Cleanup
        • When marking finished, GC thread counts amount of live data in each region.
        • Find it by using marking bitmap, not traversing object graph
        • Since concurrent modification can make counts imprecise, a final stop-the-world cleanup pause must finish counting process.
        • Cleanup also swaps next and previous bitmap roles
        • Next TAMS value copied into previous
        • Cleanup phase sorts heap regions by expected GC efficiency.
          • Is the ratio of amount of garbage to free over cost of collecting it.
          • Garbage-only regions immediately reclaimed in this phase.
          • For some programs, may recover significant space in this phase.
    • 2.6 Evacuation Pauses and Marking
      • Never evacuates objects proven dead in last complete marking pass.
      • Needs to ensure that the next/previous bitmaps stay valid with respect to object movement.
      • Evacuation may occur while the marking thread's marking stack is non-empty; waiting for marking to finish could delay evacuation.
      • The marking stack is therefore treated as a source of roots.
    • 2.7 Popular Object Handling
      • Popular objects handled differently in order to ensure smaller remembered sets and more efficient barrier.
      • Popular objects placed in regions in small prefix of heap
      • Prefix never part of general collection sets
      • When popular regions reach a threshold, create a "popularity pause" to empty region.
      • Popularity pause creates reference count for all popular objects and moves objects with counts below a threshold to ordinary heap regions.
      • Benchmarks show that this behavior removes the majority of remembered set entries
  • 3 Heuristics
    • 3.1 User Inputs
      • Two user inputs: Upper bound on space usage, and the maximum portion of a time slice given to stop-the-world GC
      • Whether to use generational mode
      • Soft realtime goal of time-slice only counts stop-the-world, since concurrent collection can be seen as a constant tax on mutator.
    • 3.2 Satisfying a Soft Real-time goal
      • 3.2.1 Predicting evacuation pause times
        • Need to accurately predict added cost of adding a region to determine how many regions to evacuate.
        • Keep adding regions until we would exceed pause time
        • Use statistics to estimate the fraction of the data allocated since last marking which is live.
      • 3.2.2 Scheduling Pauses to meet a realtime goal.
        • If we would otherwise miss deadlines, we can simply delay phases.
        • Can cause mutator to allocate into regions which wouldn't be collected, to avoid exceeding pause times in future collections.
    • 3.3 Collection Set Choice
      • Decide on the "GC efficiency" as the amount of garbage which will be freed over the cost of collecting the region.
      • Can have low amount of live data and still have low efficiency because of a large remembered set
      • In time between ending marking and starting the evacuation pause, costs may change due to changes in remembered set size. Need to recompute for most attractive subset at start of evacuation pause.
    • 3.4 Evacuation Pause Initiation
      • Determine the hard margin, or portion of the heap that must be empty for there to be a to-space for collection. When this gets threatened, we always trigger a collection.
      • In fully-young generational configuration:
        • We know the number of regions that would cause the desired pause time
        • When the number of allocation regions which have filled up is equal to this number, we collect.
      • In partially-young mode, we do evacuation pauses at the maximum allowed frequency to minimize pause times.
      • Generational starts in fully-young, and then switches to partially-young after marking pass complete in order to collect attractive regions.
      • Always use partially-young if heap very full, until heap occupancy drops.
    • 3.5 Concurrent Marking Initiation
      • In generational configuration, we define a soft margin such that when the margin is exceeded, we start a new marking pass.
        • Allows us to try to collect more frequently to avoid hitting the hard margin, where a soft real-time goal can be violated because of collection time required to free enough of the heap to continue.
      • In pure garbage-first, we mark and then keep track of the efficiency of our evacuation pauses.
        • Once they begin to decline(freeing less-attractive regions after more-attractive) we begin a new marking pass to mark more garbage and increase attractiveness.
  • 5 Related Work
    • Idea of dividing collection of the entire heap into incremental collection of subheaps stems from the train algorithm.
      • Differs from train in the use of more remembered sets to allow choice of collecting any region.
    • Differs from metronome in that it sacrifices hard-realtime performance for higher throughput.
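The SATB write barrier from section 2.5.3 can be sketched as follows (a simplified single-threaded model with invented names; real G1 uses per-thread log buffers drained by the refinement machinery): before a reference field is overwritten while marking is active, the old value is logged so the marker can still trace the object graph as it existed at the snapshot.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy snapshot-at-the-beginning (SATB) write barrier.
class SatbBarrier {
    static class Obj { Obj field; } // an object with one reference field

    boolean markingActive;
    final Deque<Obj> satbLog = new ArrayDeque<>(); // pre-write values for the marker

    // Mutator write path: log the pre-write value, then store.
    void writeField(Obj holder, Obj newValue) {
        if (markingActive && holder.field != null) {
            // Only non-null old values need logging; initializing writes
            // (old value null) are free, which keeps the barrier cheap.
            satbLog.push(holder.field);
        }
        holder.field = newValue;
    }
}
```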

Shenandoah

Presentation: http://rkennke.files.wordpress.com/2014/02/shenandoahtake4.pdf Strange Loop talk: https://www.youtube.com/watch?v=QcwyKLlmXeY

  • New alternative to G1
  • Eventual goal: An entirely pauseless collector for large-memory systems
  • Current goal: <10ms pause times for 100+GB heaps.
  • Not currently pauseless
    • Not yet using round-robin root updating, because it makes it difficult to ensure the object graph is maintained, and difficult to prove both systems correct simultaneously before either is done.
  • All other collectors stop-the-world during compaction to avoid problems of concurrent mutation
    • Two solutions: trap on access of objects in motion, or Brooks-style forwarding pointers
    • They chose forwarding pointers
  • Forwarding pointers remove need for remembered sets
  • Remembered sets have scalability problems
    • Large systems which pre-allocate space for scalability reasons frequently thrash cache lines containing card table
    • G1 had massive remembered sets. Fell back on a "generational" configuration eventually which only had to remember old-to-young references.
      • Forced generational paradigm, not truly garbage first.
    • Forwarding pointers allow arbitrary choice of region collection without remembered sets. Truly garbage first.
  • Forwarding pointer lives in indirection object
  • Divorces collection from allocation, doesn't risk randomly interrupting mutator threads.
    • Reading an object in a from-region doesn't trigger evacuation. Barrier is unconditional since always chases forwarding pointer.
    • Writing an object does.
      • Also ensures that when any reference to an object in fromspace is written to an object in tospace, it is updated to point to a newly copied version of that item in tospace.
      • System has invariant that writes only occur in to-regions.
      • In order to avoid races in copying/installing forwarding pointer, GC and mutator threads copy and then compare-and-set the forwarding pointer.
        • Loser undoes allocation.
      • If a mutator out-races the collector thread, the collector must use the mutator's new copy as the object's address and perform the forwarding process again.
      • Needs to ensure that anything alive at beginning of marking is still considered live, SATB is used.
        • To avoid mutation while scanning the graph causing breakage, write barrier is used to mark references in overwritten field values
      • Needs to ensure that atomic compare-and-swaps of references by mutator threads only swap references to tospace objects: otherwise the CAS could fail because a GC thread modified the previous reference, and no tospace object should point to a fromspace object anyway.
        • Solution is to copy all referred-to objects into tospace.
      • Costs are bounded because of finite region size. Can't prevent progress for unbounded amount of time.
  • Use work-stealing
    • Compete to claim roots
    • Process one child, push rest of children onto work-list.
  • Overview
    • Heap divided into regions. Concurrent marking tracks each region's live data.
    • Regions with most garbage will join collection set.
    • GC threads concurrently evacuate live data.
    • At start of next round of concurrent marking, references in fromspace are updated.
    • Fromspace is freed.
  • Downsides
    • Forwarding pointers add a time and space overhead.
  • Generational hypothesis doubts
    • Suspicion that generational collection improves performance not because most objects die young, but because the remembered sets are smaller.
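The Brooks-style forwarding scheme described above can be sketched with an atomic forwarding slot (a toy model in ordinary Java objects; the real collector works on raw heap words): every object carries a forwarding pointer that initially refers to itself, reads unconditionally chase it, and evacuation races to compare-and-set it from the object to its copy.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of Brooks forwarding pointers and racy evacuation.
class BrooksSketch {
    static class Obj {
        final AtomicReference<Obj> forward = new AtomicReference<>();
        int payload;
        Obj(int payload) { this.payload = payload; forward.set(this); } // points to self
    }

    // Unconditional read barrier: always dereference the forwarding pointer.
    static Obj readBarrier(Obj o) { return o.forward.get(); }

    // Evacuate: copy first, then compete to install the forwarding pointer.
    static Obj evacuate(Obj o) {
        Obj current = o.forward.get();
        if (current != o) return current;  // already forwarded by someone else
        Obj copy = new Obj(o.payload);     // speculative copy into "to-space"
        if (o.forward.compareAndSet(o, copy)) {
            return copy;                   // we won the race
        }
        return o.forward.get();            // we lost: discard our copy, use theirs
    }
}
```

Both GC and mutator threads can call evacuate safely; the loser of the CAS simply abandons its copy, which matches the "loser undoes allocation" behavior noted above.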
Date: 2014-10-16T00:13-0400
Author: Alex Kyte
Org version 7.9.3f with Emacs version 24