Thursday, October 30, 2014

Proposal: Not-So Conservative Collection


The Boehm GC has a number of faults related to the fact that it is a conservative collector and cannot know if it is considering a pointer inside the heap or a value whose bits may be interpreted as such. The .NET framework gained performance gains which they considered significant when they made the switch to the ‘precise’ mode of the Boehm GC. I believe that this is evidence that greater access to type information could allow the Boehm GC to achieve higher performance for the many, many projects and runtimes which use it.

In a master’s thesis(http://rus.har.mn/files/project-report.pdf), student Russell Harmon showed that access to the standard DWARF debug symbols for an executable can be leveraged in order to get a level of introspection into C programs that would otherwise be near impossible to accomplish. These symbols can be either compiled into the executable or provided as a separate file. They contain information about the types of the variables used by the program, and can be used to find the type of an object on the stack.

I believe that there exists a way to use these debugging symbols to increase the effectiveness of the Boehm-Demers-Weiser collector by allowing for introspection of the types of otherwise ambiguous roots. There already exists two separate ways to communicate the type of objects being allocated to the collector, but they require that the program specify the type at each malloc site. Allowing the dwarf symbols be provided as another argument to the collector at initialization would allow for the collector to infer these types without modification of the program in use. This is more in line with the spirit of the GC, which spent significant effort to allow a programmer to include it rather than stdlib, and it would replace the malloc/free previously used by the programmer.

I would use the type interface in in their types.h, and would use libbfd or libdwarf to parse the dwarf files.

A caveat would be that casting a pointer to an integer may lead to freeing memory in use. As this is a purposeful defeat of the type system, I’m not sure if there is a way to track this and compensate for it. 


In more depth: 


All of the variables on the stack, or within scope when the collector is called, are noted by the Boehm collector. Each variable's type can be found through looking at the dwarf debugging symbols.

We don't know the type of objects on the heap, but all allocation in Boehm currently goes through an alternative implementation of malloc. This implementation looks into a structure to find a large enough contiguous list of cells of the allocator's heap area that it has gotten by using the os's standard allocator. It allocates these cells with a header which notes the start and end of the data allocated to the program inside of it. Therefore we know the size of every object on the heap, but we do not know the type.

The pain point for Boehm is not knowing whether something's a pointer, and what it's type is. It prevents moving the objects because it can't fix the references. It also means that an integer seen which would point into the heap if cast to a pointer, is considered a pointer. This leads to garbage not being collected.

At the start of a collection, Boehm currently takes the values on the stack and if any of them could be pointers into the heap, mark's the cells' header with a bit signifying it's not garbage. Then it sweeps through these objects, and if any of them contain values that might be pointers, it adds them to the work-list to mark and scan.

With dwarf introspection, the collector takes the values on the stack and if any of them may be pointers, it looks up their type in the dwarf data. It then writes that type into the header associated with the allocation cell-grouping pointed to. When traversing the object, the collector can look at the dwarf data to tell the offsets which may be pointers. If they point into the heap, it then looks up the type of the pointer in the dwarf data and writes that into the header for the pointed-to cell, and adds that pointer to the work-list. In such a way, all live data will be given a type.


No comments:

Post a Comment