runtime: green tea garbage collector · Issue #73581 · golang/go

<aside> 📘 Series:

Relates:

Green Tea GC: How Go Stopped Wasting 35% of Your CPU Cycles

</aside>

Authors: Michael Knyszek, Austin Clements

This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce more detailed design document once we're ready to commit to a design.

For now, Green Tea is available as an experiment in the Go 1.25 release. We don't expect any correctness issues with the experiment. It's running within Google, and we feel that it is production-ready. We encourage teams to try it out. Your feedback is instrumental to producing a good final design!

Introduction

Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts offer more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.

Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.

As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.

In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,1 is at least memory-aware, in that it endeavors to process objects close to one another together.

This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.

Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.

Design

The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned within that block, such that when a block does get dequeued, it’s likely that scanning will be able to scan more than one object in that block. This, in turn, improves locality of memory access, in addition to better amortizing per-scan costs.

Prototype implementation

In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.

A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.

When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued so it will only be enqueued once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.

Limiting scope to small objects

The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.