When multiple threads need to share a lookup table, the traditional answer is simple: wrap it in a lock. One thread reads or writes at a time, everybody else waits in line. It works, but it doesn’t scale. As you add more threads and more cores, that lock becomes a bottleneck. Threads pile up waiting, context switches multiply, and your beautifully parallel application spends most of its time sitting around doing nothing.
kbmMW takes a different approach with its lock-free hash arrays — data structures that let multiple threads insert, look up, and remove values simultaneously without ever acquiring a traditional lock. An upcoming release brings significant performance improvements to these structures, with gains of up to 90% on dynamic workloads and 27% on multi-threaded inserts.
This post explains what they are, how they work under the hood, and how to use them in your own code.
What Problem Does This Solve?
Imagine you’re building a memory debugger. Every time your application allocates memory (which happens thousands of times per second from every thread), you need to record that allocation in a table. Every time it frees memory, you need to remove the record. And periodically, you need to scan the table to find leaks.
With a traditional locked dictionary, every allocation from every thread must wait for the lock. On a 32-core server, that means 31 threads idle while one thread writes a single record. The memory debugger ends up being slower than the application it’s debugging.
This is exactly the scenario kbmMW’s TkbmMWDebugMemory system faces, and it’s why the lock-free hash array exists. It allows all those threads to record and remove allocations concurrently, without waiting for each other.
But it’s not just for memory debugging. Any scenario where multiple threads need fast concurrent access to a shared key/value store — connection tracking, session management, caching, real-time counters — benefits from the same approach.
How It Works (The Simple Version)
If you’ve ever used a TDictionary, you already understand the basic idea: you have keys and values, you insert a pair, and later you look it up by key. The lock-free hash array works the same way from the outside. The difference is entirely in how it handles concurrent access.
Open Addressing with Linear Probing
Internally, the hash array is a flat array of slots. Each slot can hold a key, a value, and a flag that indicates the slot’s state (unused, in use, tombstoned, or being garbage collected).
When you insert a key, the array hashes the key to find a starting slot. If that slot is empty, the value goes there. If it’s occupied by a different key, the array simply moves to the next slot and tries again. This is called linear probing, and it’s fast because it accesses memory sequentially — modern CPUs love sequential access because their hardware prefetchers can predict the pattern and load the next cache line before you need it.
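To make the probing loop concrete, here is a small sketch of hash-then-probe lookup over a flat slot array. This is not kbmMW's actual code; the TSlot record and FindSlot function are simplified stand-ins for the real internals.

```delphi
type
  TSlotState = (ssUnused, ssInUse, ssTombstone);

  TSlot = record
    State: TSlotState;
    Key: NativeUInt;
    Value: Pointer;
  end;

// Look up AKey by probing forward from its hashed starting slot.
function FindSlot(const Slots: array of TSlot; AKey: NativeUInt): Integer;
var
  Capacity, Step, Index: Integer;
begin
  Capacity := Length(Slots);                     // assumed to be a power of 2
  Index := Integer(AKey) and (Capacity - 1);     // hash -> starting slot
  for Step := 0 to Capacity - 1 do
  begin
    case Slots[Index].State of
      ssUnused:
        Exit(-1);                                // empty slot ends the probe: key not present
      ssInUse:
        if Slots[Index].Key = AKey then
          Exit(Index);                           // found it
      ssTombstone:
        ;                                        // deleted entry: skip it and keep probing
    end;
    Index := (Index + 1) and (Capacity - 1);     // linear probe: next slot, sequential in memory
  end;
  Result := -1;                                  // wrapped the whole table without finding the key
end;
```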
Fencing Instead of Locking
Here’s where it gets clever. Instead of a global lock that blocks the entire table, each slot has its own flag that can be fenced. When a thread wants to modify a slot, it uses an atomic compare-and-swap (CAS) operation to set a fence bit on that slot’s flag. If it succeeds, it “owns” that slot for the duration of the update. Other threads wanting the same slot will see the fence and briefly wait, but threads accessing any other slot proceed without delay.
The beauty of this design is that the chance of two threads needing the exact same slot at the exact same nanosecond is astronomically small. In practice, the fence mechanism almost never causes any thread to wait. It’s there as a safety net, not as a regular occurrence.
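As a rough illustration (again, not the actual kbmMW implementation), a per-slot fence can be built from nothing more than a compare-and-swap loop on the slot's flag word. The FENCE_BIT constant and the procedure names here are hypothetical:

```delphi
uses
  System.SyncObjs;

const
  FENCE_BIT = $40000000;  // hypothetical: one bit of the slot's flag word marks it as fenced

// Spin with CAS until this thread manages to set the fence bit, "owning" the slot.
procedure FenceSlot(var ASlotFlag: Integer);
var
  Old: Integer;
begin
  repeat
    Old := ASlotFlag and not FENCE_BIT;  // expected value: current flag with the fence bit clear
    // The CAS only succeeds if no other thread holds the fence at this instant.
  until TInterlocked.CompareExchange(ASlotFlag, Old or FENCE_BIT, Old) = Old;
end;

// Clear the fence bit again, releasing the slot for other threads.
procedure UnfenceSlot(var ASlotFlag: Integer);
var
  Old: Integer;
begin
  repeat
    Old := ASlotFlag;
  until TInterlocked.CompareExchange(ASlotFlag, Old and not FENCE_BIT, Old) = Old;
end;
```

In the real implementation the retry loop does not spin hot like this; it backs off with a short sleep, which is exactly what the YIELDWAIT option later in this post tunes.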
Generational Garbage Collection
Over time, as you remove values, slots become “tombstoned” — they’re marked as deleted but still physically present in the array. Too many tombstones and linear probing becomes inefficient, as searches must skip over all of them.
The hash array solves this with a generational garbage collection scheme. It maintains two copies (generations) of the internal array. When tombstones accumulate beyond a threshold, a GC cycle kicks in: it allocates a fresh generation, copies all live entries into it (rehashing them to eliminate tombstones), and then atomically switches to the new generation. The old generation is freed once no threads are still accessing it.
This GC approach means that the table periodically cleans itself up without requiring any thread to stop and wait. Threads that happen to be accessing the table during GC are gently redirected to retry their operation after the switch completes.
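Conceptually, a GC cycle boils down to something like the sketch below, reusing the TSlot record from the probing example. This is a simplification, not the shipped code, and the helper names are made up. The direction of the copy pass is discussed in the optimization section further down.

```delphi
type
  TSlotArray = array of TSlot;

// Copy live entries from the old generation into a fresh one, rehashing them so
// tombstones disappear, then publish the new generation. In the real code the
// switch is atomic and the old generation is only freed once no thread is still
// reading from it.
procedure CollectGarbage(var ACurrent: TSlotArray);
var
  NewGen: TSlotArray;
  i, j, Mask: Integer;
begin
  SetLength(NewGen, Length(ACurrent));             // same capacity, zero tombstones
  Mask := Length(NewGen) - 1;                      // capacity is a power of 2
  for i := 0 to High(ACurrent) do                  // pass over every slot of the old generation
    if ACurrent[i].State = ssInUse then
    begin
      j := Integer(ACurrent[i].Key) and Mask;      // rehash into the new generation
      while NewGen[j].State = ssInUse do
        j := (j + 1) and Mask;                     // linear probe for a free slot
      NewGen[j] := ACurrent[i];
    end;
  ACurrent := NewGen;                              // switch generations
end;
```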
The Class Hierarchy
kbmMW provides several flavors of lock-free hash array, each suited to different use cases.
Static Hash Arrays (Fixed Capacity)
TkbmMWLockFreeHashArray<T> is the simplest variant. You create it with a fixed capacity, and it uses a native unsigned integer as the key. It’s perfect when you know the approximate number of entries in advance and don’t need the table to grow.
You subclass it and override the Zero method to tell it what a “blank” value looks like for your value type. Then you use it through the Value property (indexed by key), Contains for existence checks, Remove to delete, and GetAndRemove to atomically retrieve and delete in one step.
The key must be greater than 2, because values 0, 1, and 2 are reserved internally for the slot state flags.
TkbmMWLockFreeHashArray2<T> is the dual-key variant. It works the same way, but each entry is identified by two keys — a primary key and a secondary key (a TkbmNativeUInt). The two keys are XOR’d together during hashing to produce better distribution. This is useful when your lookup naturally involves two identifiers, such as an object address and a type code.
Dynamic Hash Arrays (Auto-Growing)
TkbmMWDynamicLockFreeHashArray<T> extends the static array with automatic resizing. When the table reaches 60% occupancy, it doubles in capacity. This is the variant used by the debug memory system, because you rarely know in advance how many allocations your application will make.
The dynamic variant adds a thin reader/writer lock around resize operations (readers can still access the table concurrently; only the actual resize takes a brief exclusive lock), and it maintains an accurate element count. You can also manually trigger a Resize or Grow if you anticipate a burst of insertions.
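For example, if you know a burst of insertions is coming, you can grow the table up front instead of letting it resize mid-burst. The sketch below uses TkbmMWDynamicLockFreeHashArrayU32 (one of the ready-made specializations listed in the next section); the constructor taking an initial capacity and the parameterless Grow call are assumptions based on the description above, so check the actual declarations in kbmMWLockFree.pas.

```delphi
uses
  kbmMWLockFree;

procedure BulkLoad;
var
  Table: TkbmMWDynamicLockFreeHashArrayU32;
  Key: NativeUInt;
begin
  // Assumed constructor form: a small initial capacity is fine, the table grows on its own.
  Table := TkbmMWDynamicLockFreeHashArrayU32.Create(1024);
  try
    Table.Grow;                    // optional: double capacity up front before a known burst
    for Key := 3 to 100000 do      // keys 0..2 are reserved, so start at 3
      Table.Value[Key] := 1;       // resizes automatically whenever occupancy passes 60%
  finally
    Table.Free;
  end;
end;
```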
Ready-Made Specializations
For common cases, kbmMW provides pre-built specializations so you don’t need to subclass anything:
- TkbmMWLockFreeHashArray32 — 32-bit integer values, with atomic Increment and Decrement operations built in. Great for counters.
- TkbmMWLockFreeHashArray64 — the same for 64-bit integers.
- TkbmMWLockFreeHashArrayNative — dual-key variant with native-sized integer values, also with atomic increment/decrement.
- TkbmMWDynamicLockFreeHashArrayU32 — dynamic (auto-growing) variant for unsigned 32-bit values.
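As a quick illustration, TkbmMWLockFreeHashArray32 works nicely as a set of per-key counters that many threads bump concurrently. The single-argument Increment/Decrement calls below are an assumed form; verify the exact signatures in kbmMWLockFree.pas.

```delphi
uses
  kbmMWLockFree;

procedure CountHits;
var
  Hits: TkbmMWLockFreeHashArray32;
begin
  // Fixed capacity; pick roughly 2x the expected number of distinct keys.
  Hits := TkbmMWLockFreeHashArray32.Create(4096);   // assumed constructor form
  try
    // Any number of threads can do this concurrently, no external locking needed.
    Hits.Increment(1001);        // assumed form: atomically bump the counter for key 1001
    Hits.Increment(1001);
    Hits.Decrement(1001);
    Writeln(Hits.Value[1001]);   // prints 1
  finally
    Hits.Free;
  end;
end;
```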
Using It in Practice
Using a lock-free hash array follows a simple pattern. First, you define your value type and create a subclass that implements the Zero method. This method tells the array what a “zeroed out” value looks like — it’s used when clearing slots and detecting empty values.
Then you create an instance with an initial capacity. For static arrays, choose a capacity roughly 2x the expected number of entries to keep the load factor healthy. For dynamic arrays, a smaller initial capacity is fine since the table will grow as needed.
Insertion and lookup are done through the Value property, which accepts a key and returns (or sets) the associated value. The Contains method checks whether a key exists. Remove deletes a key. GetAndRemove atomically retrieves and deletes a value — useful when you need the value but also want to clean up the entry.
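Putting those pieces together, a minimal (and deliberately simplified) example might look like this. The constructor, Zero and GetAndRemove signatures are assumptions based on the description above; treat it as a sketch and check kbmMWLockFree.pas for the real declarations.

```delphi
uses
  System.SysUtils, kbmMWLockFree;

type
  TSessionInfo = record
    UserID: Integer;
    LoginTime: TDateTime;
  end;

  // Subclass providing the "blank value" for our record type.
  TSessionArray = class(TkbmMWLockFreeHashArray<TSessionInfo>)
  protected
    function Zero: TSessionInfo; override;   // assumed signature
  end;

function TSessionArray.Zero: TSessionInfo;
begin
  FillChar(Result, SizeOf(Result), 0);       // an all-zero record counts as "empty"
end;

procedure Demo;
var
  Sessions: TSessionArray;
  Info: TSessionInfo;
begin
  Sessions := TSessionArray.Create(4096);    // fixed capacity, rounded up to a power of 2
  try
    Info.UserID := 42;
    Info.LoginTime := Now;
    Sessions.Value[1001] := Info;            // insert (remember: keys must be > 2)
    Sessions.Value[1002] := Info;

    if Sessions.Contains(1001) then
      Info := Sessions.Value[1001];          // lookup

    Sessions.Remove(1001);                   // plain delete
    Info := Sessions.GetAndRemove(1002);     // assumed form: fetch and delete in one step
  finally
    Sessions.Free;
  end;
end;
```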
If you need to iterate over all entries, use ResetEnum followed by repeated calls to Enum, which fills in the next key and value. Be aware that enumeration is not isolated from concurrent activity: if a GC cycle happens mid-iteration, the enumeration may restart from the beginning.
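Continuing the example above, a hedged sketch of the enumeration loop (the exact Enum signature is an assumption):

```delphi
// Walk all current entries of the TSessionArray from the previous example.
procedure DumpSessions(Sessions: TSessionArray);
var
  Key: NativeUInt;
  Info: TSessionInfo;
begin
  Sessions.ResetEnum;                    // start a fresh enumeration
  while Sessions.Enum(Key, Info) do      // assumed form: fills the next key/value pair,
  begin                                  // returns False when there is nothing left
    Writeln('Key ', Key, ' user ', Info.UserID);
    // Other threads may be inserting and removing entries while we iterate.
  end;
end;
```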
All of these operations are thread-safe without any external locking. That’s the whole point.
Important Caveats
There are a few things to keep in mind:
Keys must be greater than 2. The values 0, 1, and 2 are used internally to represent slot states (unused, tombstoned, and GC). If you pass keys of 0, 1, or 2, you’ll corrupt the internal state. If your natural key space includes these values, offset them (add 3 to every key, for instance).
Capacity is always rounded up to a power of 2. If you request a capacity of 1000, you’ll get 1024. This is required by the internal hashing mechanism, which uses bitwise AND for fast modulo operations.
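For illustration, the rounding behaves like this (not kbmMW's actual code):

```delphi
// Smallest power of 2 that is >= AValue; e.g. RoundUpToPowerOf2(1000) = 1024.
function RoundUpToPowerOf2(AValue: Cardinal): Cardinal;
begin
  Assert(AValue <= $80000000);   // largest power of 2 that fits in a Cardinal
  Result := 1;
  while Result < AValue do
    Result := Result * 2;
end;
```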
The MurmurHash3 option. By default, the UseMurMur3 property is true, which means keys are hashed using the MurmurHash3 algorithm for better distribution. If your keys are already well-distributed (like memory addresses), you can set this to false to skip the hash computation and use the key directly as an index.
GC happens automatically. When tombstones accumulate, the array triggers a GC cycle on its own. You don’t need to (and shouldn’t) call GC manually in normal use.
Performance Improvements in the Upcoming Release
The upcoming release includes a set of structural optimizations to the lock-free hash array internals. These changes were validated through extensive benchmarking on both a 4-core virtual machine and a 32-core bare-metal server, using a comprehensive test suite covering single-threaded operations, multi-threaded contention, dynamic resizing, and long-running stability soaks.
Here’s what changed and why:
Bitwise AND replaces integer division. Every time the array probes a slot, it needs to wrap the index around to stay within bounds. The original code used the MOD operator, which compiles to an integer division — a 20–90 cycle operation depending on the CPU. The new code uses AND (capacity - 1), which is a single-cycle bitwise operation. This works because the capacity is always a power of 2. This single change touches 23 sites in the codebase and shaves cycles off every single probe step.
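As a stand-alone illustration of the new wrap step (the names here are made up, not the actual kbmMW identifiers):

```delphi
// Wrap a hash into [0, ACapacity). Requires ACapacity to be a power of 2, so that
// (ACapacity - 1) is an all-ones bit mask and the AND is equivalent to
// AHash mod ACapacity, minus the 20-90 cycle integer division.
function WrapIndex(AHash, ACapacity: NativeUInt): NativeUInt; inline;
begin
  Result := AHash and (ACapacity - 1);
end;
```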
Virtual key comparison eliminates interface overhead. The original code compared keys through an IEqualityComparer<K> interface, which involves double indirection (first to the interface table, then to the actual method). For the common case of integer keys, this is massive overkill. The new code uses a virtual KeyEquals method that the integer-key specializations override with a direct register comparison. Same result, far fewer cycles.
Forward iteration in GC. The garbage collection phase iterates over the entire old generation to copy live entries to the new generation. The original code deliberately iterated backwards (downto 0). The reasoning was sound: during GC, processed slots are flagged, and any concurrent thread that hits a flagged slot must wait for the entire GC to finish. By iterating backward while probe chains run forward, the GC frontier moves away from most active lookups, giving concurrent threads more runway before they encounter a flagged slot. However, this benefit is small in practice — it only helps the fraction of threads whose hash lands near the moving frontier, and once a thread is blocked it waits for the full GC regardless of when it was caught. Meanwhile, the cost of backward iteration is paid on every GC cycle: it fights the CPU’s hardware prefetcher, which detects sequential forward access patterns and speculatively loads upcoming cache lines. On a large table (the GC pass touches every slot), the difference between prefetcher-assisted and prefetcher-defeated sequential access is substantial. The new code iterates forward (0 to n-1), working with the prefetcher, and benchmarks confirmed consistent improvement on GC-heavy workloads.
Power-of-2 enforcement in resize. The original InternalResize and Grow methods didn’t enforce power-of-2 capacity, which was fine when the code used MOD for index wrapping, but would have caused catastrophic failures with the new AND masking. The new code rounds all resize targets to the next power of 2, and Grow simply doubles the capacity (which is always a power of 2 to begin with).
Overflow-safe growth. The original Grow computed 150 * FCapacity div 100, which overflows a 32-bit cardinal at approximately 28.6 million entries. The new code uses FCapacity * 2, which is safe up to approximately 2.1 billion.
Drain wait safety. The InternalResize method now waits for all threads to finish accessing the old generation before freeing its memory. This closes a theoretical race condition where a thread could still be reading from a generation that’s been freed.
The Numbers
On a 32-core bare-metal server with 8 threads (the realistic deployment scenario), comparing the original code to the new release:
| Benchmark | Improvement |
|---|---|
| Single-threaded lookup | +3% to +5% |
| Single-threaded GetAndRemove | +5% to +7% |
| GC pressure (insert/remove churn) | +4% to +8% |
| Multi-threaded insert (8 threads) | +7% to +27% |
| Dynamic array grow (1K → 1M entries) | +87% to +93% |
| Dynamic multi-threaded mixed operations | +11% to +34% |
| Long-running stability soak | +10% to +26% |
The dynamic grow improvement is the standout — nearly doubling throughput when the table needs to resize under load. This directly benefits any use of TkbmMWDynamicLockFreeHashArray, including the debug memory system.
On a 4-core VM (the constrained scenario), the improvements are smaller but still positive across the board, with zero regressions.
The YIELDWAIT Tuning Option
The upcoming release also introduces a compile-time option for advanced tuning on many-core servers: KBMMW_LOCKFREEHASHARRAY_YIELDWAIT.
Background: What Happens When Two Threads Collide
When a thread tries to fence a slot and finds it already fenced by another thread, it needs to wait. The question is how it waits. The default is sleep(1), which tells the operating system to put the thread to sleep for at least 1 millisecond. On Windows, due to timer resolution, this actually means sleeping for about 15 milliseconds.
For most scenarios, this is fine: fence collisions are extremely rare, and when they do happen, sleeping for 15ms is harmless because the fence holder will have released the fence long before the sleeping thread wakes up.
But on many-core servers (say, 32 cores running 8 threads), the situation is different. The fence holder is almost certainly running on a different physical core and will release the fence within nanoseconds. Sleeping for 15ms wastes 15 million nanoseconds waiting for an event that took 100 nanoseconds.
Enabling YIELDWAIT
To enable the yield-wait strategy, uncomment the define at the top of kbmMWLockFree.pas:
{$DEFINE KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}
You’ll find it near the top of the file alongside other diagnostic defines, currently commented out with the {. prefix notation:
{.$DEFINE KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}
Remove the dot to activate it.
With this define active, the Fence method uses sleep(0) instead of sleep(1) during normal operations. sleep(0) tells the OS to yield the rest of the current timeslice if another thread is ready to run, and to return immediately otherwise. This means the thread retries the CAS almost immediately — perfect for the case where the fence holder is on another core and releases within nanoseconds.
The Resize-Aware Twist
Simply switching all waits to sleep(0) doesn’t work well in every situation. During a GC or resize operation, the array needs to rehash the entire table, which takes milliseconds. If seven threads are all sleep(0) polling a flag that won’t change for several milliseconds, they create a storm of context switches and cache-line bouncing that actually slows down the resize.
The implementation handles this intelligently. When KBMMW_LOCKFREEHASHARRAY_YIELDWAIT is defined, the Fence method checks whether a resize is currently in progress. During normal operations (no resize), it uses sleep(0) for fast recovery. During resize, it automatically falls back to sleep(1) to get out of the way and let the resize finish quickly.
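In sketch form, the wait strategy described above looks roughly like this; the procedure and parameter names are hypothetical, and the real logic lives inside the Fence method in kbmMWLockFree.pas.

```delphi
uses
  System.SysUtils;

// Back-off used when a CAS on a fenced slot fails and the thread must retry.
procedure FenceBackoff(AResizeInProgress: Boolean);
begin
{$IFDEF KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}
  if AResizeInProgress then
    Sleep(1)    // a GC/resize takes milliseconds; sleep and stay out of its way
  else
    Sleep(0);   // fence holder releases within nanoseconds; just yield and retry
{$ELSE}
  Sleep(1);     // default: always sleep; harmless because fence collisions are rare
{$ENDIF}
end;
```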
The GC drain waits and InternalResize drain waits always use sleep(1) regardless of the define, because they inherently wait for long-running operations where yielding aggressively provides no benefit.
When to Enable It
Enable YIELDWAIT when:
- You’re running on a many-core server (16+ cores)
- Your threads are fewer than your cores (undersubscribed)
- Your workload involves high-frequency concurrent access to the hash array
- You’ve profiled and confirmed that fence contention is measurable
Keep it disabled (the default) when:
- You’re running on 4–8 core machines
- Your thread count equals or exceeds your core count (oversubscribed)
- You’re not sure — the default is safe and performant everywhere
YIELDWAIT Performance Impact
On a 32-core server with 8 threads:
| Benchmark | Default | With YIELDWAIT |
|---|---|---|
| Multi-threaded insert | +27% vs original | +23% vs original |
| High-contention mixed ops | −3% vs original | +130% vs original |
| Dynamic grow | +87% vs original | +91% vs original |
| Stability soak | +26% vs original | +3% vs original |
The high-contention scenario is where YIELDWAIT shines — over 2x faster than the original code when 8 threads compete over the same key range. The trade-off is that long-running stability soaks see less benefit (though still positive versus the original).
On a 4-core VM with 8 threads (oversubscribed), YIELDWAIT still dramatically improves the high-contention case (+142%), but the stability soak shows a significant regression (−32%). This is why it’s disabled by default.
Summary
kbmMW’s lock-free hash arrays give you a thread-safe key/value store that scales with your hardware instead of fighting against it. The upcoming release makes them measurably faster across the board, with the biggest gains in the dynamic growing scenarios that matter most for real-world use. And for those running on many-core servers, the KBMMW_LOCKFREEHASHARRAY_YIELDWAIT option provides an additional boost for contention-heavy workloads.
For most users, the improvements are automatic — just update to the new release and your existing code runs faster. If you’re on a big server and want to squeeze out even more, flip the YIELDWAIT switch and benchmark your specific workload to see if it helps.