Wednesday, November 19, 2014

Status on the Bug Collector


My last few weeks have been spent working on the DWARF-based root scanner for C and C++ which was proposed in a previous post.

Where have I come, and what I have I learned? 



I have gained a deeper familiarity with the DWARF specification, gdb internals, gcc debugging insertion, and stack unwinding. Many of the libraries in use might be called a bit 'dusty.' Development is not very active, and documentation is more or less the source code. I feel as if this necessity to learn through exploration has taught me a bit about quickly navigating compilers and language tools.

Furthermore, this project made me more familiar with the CDECL. While I was familiar with the stack handling, actually walking the stack, finding register values, and using DWARF's expression language to find the location of variables in lexical blocks such as if/while statements definitely made me more familiar with the common patterns of variable allocation on the stack.

How is the library designed?


Libdwarf works well for debuggers, but it has a flaw when considered for a performance-critical operation. It doesn't read the entire file into a memory-backed structure. It keeps a file pointer and seeks through the file. It's undesirable to have every running program keep an unnecessary file handler open, and file IO is significantly slower than memory access.

After getting location finding working, the next thing that I did was to create a structure to read the memory into. I currently have functions and types laid out differently. Functions are a reference to a scope. Each scope refers to multiple roots, and to multiple child scopes. The child scopes form a DAG. When trying to find references for a given program counter(gotten through stack unwinding), you simply move through a list of functions until the top scope has a low and high program counter range that includes the given program counter. You then get all of the variables of references in that scope, and append the references found in child scopes.

Each reference has an index into a list of types. Each type has a number of uniform attributes, but also includes type-specific attributes. These include information such as the contents of structure types, the type that a pointer points to, the alternatives in a union, etc, etc.

Where am I?


I can currently call a function provided by the library and have it return to me a list of roots, each of which has a location (the location of the reference itself, not the value held) and an index into the type table. A collector would use this for marking, and would then overwrite the root values to point to the new locations.

A description of the name is likely in order. The garbage collection interface being developed makes heavy use of debugging symbols.

What next?


My code can be found linked below. I'll admit that the header api is the next thing that I need to refactor. There isn't much of a distinction between what's intended for external use, and what is declared simply for mutual recursion to be possible. Higher testing coverage(an increase from 0) is definitely necessary for this to be considered a project worthy of a modicum of trust. I intend to get the coverage for the most delicate functions done during the next week or so, while I work on collector integration.

My next step from here is to do a more thorough look into the internal data structures of the BDW conservative GC, and figure out how to integrate this root scanning with the collector itself. It may be an awkward fit in places. There are many smaller optimizations for the library which may not be appropriate given knowledge (for instance knowledge of which data is static and which are really references makes the 'blacklist' unnecessary).

After the type information can be propagated into the collector, my next step would be to add the mark-compact step to the collector where possible. One of the flaws of the BDW collector is the inability to re-compact the heap. It would be desirable to attempt a compaction any time that memory fragmentation gets over a given bound.

Why do I suggest compaction over copying? If we have a (void *) reference, we can't know the type. If this points to memory anywhere in the heap, we can't move the cell owning that memory. Otherwise we'd have to overwrite the memory location, potentially making an integer or bitfield invalid. If we want to copy, we must be able to track and overwrite every single reference. If we simply want to compact, we can pin this 'unknown' data, and move cells around it until the referencing object disappears.

Possible areas of interest are in the interaction of the current incremental BDW mark-sweep collector with my type information. I feel confident that interesting corner cases will appear.

Project page: 


No comments:

Post a Comment