When multiple threads need to share a lookup table, the traditional answer is simple: wrap it in a lock. One thread reads or writes at a time, everybody else waits in line. It works, but it doesn’t scale. As you add more threads and more cores, that lock becomes a bottleneck. Threads pile up waiting, context switches multiply, and your beautifully parallel application spends most of its time sitting around doing nothing.

kbmMW takes a different approach with its lock-free hash arrays — data structures that let multiple threads insert, look up, and remove values simultaneously without ever acquiring a traditional lock. An upcoming release brings significant performance improvements to these structures, with gains of up to 90% on dynamic workloads and 27% on multi-threaded inserts.

This post explains what they are, how they work under the hood, and how to use them in your own code.

What Problem Does This Solve?

Imagine you’re building a memory debugger. Every time your application allocates memory (which happens thousands of times per second from every thread), you need to record that allocation in a table. Every time it frees memory, you need to remove the record. And periodically, you need to scan the table to find leaks.

With a traditional locked dictionary, every allocation from every thread must wait for the lock. On a 32-core server, that means 31 threads idle while one thread writes a single record. The memory debugger ends up being slower than the application it’s debugging.

This is exactly the scenario kbmMW’s TkbmMWDebugMemory system faces, and it’s why the lock-free hash array exists. It allows all those threads to record and remove allocations concurrently, without waiting for each other.

But it’s not just for memory debugging. Any scenario where multiple threads need fast concurrent access to a shared key/value store — connection tracking, session management, caching, real-time counters — benefits from the same approach.

How It Works (The Simple Version)

If you’ve ever used a TDictionary, you already understand the basic idea: you have keys and values, you insert a pair, and later you look it up by key. The lock-free hash array works the same way from the outside. The difference is entirely in how it handles concurrent access.

Open Addressing with Linear Probing

Internally, the hash array is a flat array of slots. Each slot can hold a key, a value, and a flag that indicates the slot’s state (unused, in use, tombstoned, or being garbage collected).

[Figure: Hash array structure, a flat array of slots. Each slot stores a Flag (state: unused, in use, tombstoned, or being GC'd), a Key (hash identifier), and a Value (your data record).]

When you insert a key, the array hashes the key to find a starting slot. If that slot is empty, the value goes there. If it’s occupied by a different key, the array simply moves to the next slot and tries again. This is called linear probing, and it’s fast because it accesses memory sequentially — modern CPUs love sequential access because their hardware prefetchers can predict the pattern and load the next cache line before you need it.

[Figure: Linear probing, inserting key 0xB7, which hashes to slot 3. Slots 3 and 4 are already occupied, so the key lands in slot 5; lookups for 0xB7 follow the same probe path. Sequential probing is cache-friendly: the CPU prefetcher loads the next slot before you need it.]
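If you prefer code to diagrams, here is a minimal, single-threaded sketch of open addressing with linear probing. It only illustrates the technique: the type names, the trivial stand-in hash, and the probe loop are invented for this example and are not kbmMW's internals.

type
  TSlotState = (ssUnused, ssInUse);
  TSlot = record
    State: TSlotState;
    Key: NativeUInt;
    Value: Pointer;
  end;

  TProbeTable = class
  private
    FSlots: array of TSlot;
    FMask: NativeUInt;                  // capacity - 1; capacity is a power of 2
  public
    constructor Create(ACapacityPowerOfTwo: NativeUInt);
    function Insert(AKey: NativeUInt; AValue: Pointer): Boolean;
    function Lookup(AKey: NativeUInt; out AValue: Pointer): Boolean;
  end;

constructor TProbeTable.Create(ACapacityPowerOfTwo: NativeUInt);
begin
  inherited Create;
  SetLength(FSlots, ACapacityPowerOfTwo);
  FMask := ACapacityPowerOfTwo - 1;
end;

function TProbeTable.Insert(AKey: NativeUInt; AValue: Pointer): Boolean;
var
  i, idx: NativeUInt;
begin
  idx := AKey and FMask;                // stand-in for a real hash such as MurmurHash3
  for i := 0 to FMask do                // visit each slot at most once
  begin
    if (FSlots[idx].State = ssUnused) or (FSlots[idx].Key = AKey) then
    begin
      FSlots[idx].State := ssInUse;
      FSlots[idx].Key := AKey;
      FSlots[idx].Value := AValue;
      Exit(True);
    end;
    idx := (idx + 1) and FMask;         // linear probe: next slot, wrapping around
  end;
  Result := False;                      // table is full
end;

function TProbeTable.Lookup(AKey: NativeUInt; out AValue: Pointer): Boolean;
var
  i, idx: NativeUInt;
begin
  idx := AKey and FMask;                // lookups follow the same probe path as inserts
  for i := 0 to FMask do
  begin
    if FSlots[idx].State = ssUnused then
      Exit(False);                      // an unused slot ends the probe chain
    if FSlots[idx].Key = AKey then
    begin
      AValue := FSlots[idx].Value;
      Exit(True);
    end;
    idx := (idx + 1) and FMask;
  end;
  Result := False;
end;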

Fencing Instead of Locking

Here’s where it gets clever. Instead of a global lock that blocks the entire table, each slot has its own flag that can be fenced. When a thread wants to modify a slot, it uses an atomic compare-and-swap (CAS) operation to set a fence bit on that slot’s flag. If it succeeds, it “owns” that slot for the duration of the update. Other threads wanting the same slot will see the fence and briefly wait, but threads accessing any other slot proceed without delay.

The beauty of this design is that the chance of two threads needing the exact same slot at the exact same nanosecond is astronomically small. In practice, the fence mechanism almost never causes any thread to wait. It’s there as a safety net, not as a regular occurrence.

[Figure: Per-slot fencing versus a global lock. Three threads fencing slots 0, 3, and 6 via CAS all proceed simultaneously, where a traditional global lock would leave two of them blocked. Threads only wait if they need the exact same slot at the exact same instant, which is extremely rare.]
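Here is a minimal sketch of the per-slot fence, written with Delphi's AtomicCmpExchange intrinsic. The flag layout, the FENCE_BIT constant, and the back-off are assumptions chosen for illustration; the real kbmMW flag values and Fence method look different.

// Needs System.SysUtils (for Sleep).
const
  FENCE_BIT = $40000000;                // assumed: one bit of the flag marks "fenced"

type
  TFencedSlot = record
    Flag: Integer;                      // slot state bits plus the fence bit
    Key: NativeUInt;
    Value: Pointer;
  end;

// Acquire exclusive ownership of one slot. Only a thread that needs this exact
// slot at the same moment ever waits; every other slot remains freely usable.
procedure FenceSlot(var ASlot: TFencedSlot);
var
  expected: Integer;
begin
  repeat
    expected := ASlot.Flag and not FENCE_BIT;   // we expect the fence bit to be clear
    // AtomicCmpExchange returns the previous value; if it equals "expected",
    // our compare-and-swap won and we now own the slot.
    if AtomicCmpExchange(ASlot.Flag, expected or FENCE_BIT, expected) = expected then
      Exit;
    Sleep(1);  // another thread holds this slot's fence; back off briefly
               // (or Sleep(0) under YIELDWAIT, covered later in the post)
  until False;
end;

procedure UnfenceSlot(var ASlot: TFencedSlot);
begin
  // We own the slot, so clearing the fence bit with one atomic write releases it.
  AtomicExchange(ASlot.Flag, ASlot.Flag and not FENCE_BIT);
end;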

Generational Garbage Collection

Over time, as you remove values, slots become “tombstoned” — they’re marked as deleted but still physically present in the array. Too many tombstones and linear probing becomes inefficient, as searches must skip over all of them.

The hash array solves this with a generational garbage collection scheme. It maintains two copies (generations) of the internal array. When tombstones accumulate beyond a threshold, a GC cycle kicks in: it allocates a fresh generation, copies all live entries into it (rehashing them to eliminate tombstones), and then atomically switches to the new generation. The old generation is freed once no threads are still accessing it.

This GC approach means that the table periodically cleans itself up without requiring any thread to stop and wait. Threads that happen to be accessing the table during GC are gently redirected to retry their operation after the switch completes.

[Figure: Generational garbage collection. Before GC, generation 0 has accumulated tombstones. During GC, live entries are copied and rehashed into a clean generation 1 while the GC front moves through generation 0. After the atomic switch, generation 1 is active with zero tombstones and optimal probe positions, and the old generation is freed once it is safe.]
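Conceptually, a GC cycle boils down to the sketch below: copy the live entries into a clean array, publish it with one atomic pointer swap, and free the old array once nobody is reading it anymore. All names are invented for illustration, and the coordination the real implementation does through per-slot flags is left out.

// Needs System.SysUtils (for Sleep).
type
  TGCSlotState = (gsUnused, gsInUse, gsTombstone);
  TGCSlot = record
    State: TGCSlotState;
    Key: NativeUInt;
    Value: Pointer;
  end;
  TGeneration = record
    Slots: array of TGCSlot;            // length is a power of 2
    ActiveReaders: Integer;             // readers increment/decrement this atomically (not shown)
  end;
  PGeneration = ^TGeneration;

var
  ActiveGen: PGeneration;               // the generation every thread currently uses

procedure RunGCCycle;
var
  OldGen, NewGen: PGeneration;
  i: Integer;
  idx, mask: NativeUInt;
begin
  OldGen := ActiveGen;
  New(NewGen);
  SetLength(NewGen^.Slots, Length(OldGen^.Slots));
  NewGen^.ActiveReaders := 0;
  mask := NativeUInt(Length(NewGen^.Slots)) - 1;

  // Copy only live entries into the fresh generation, rehashing each one.
  // Unused and tombstoned slots are skipped, which is what eliminates tombstones.
  for i := 0 to High(OldGen^.Slots) do
    if OldGen^.Slots[i].State = gsInUse then
    begin
      idx := OldGen^.Slots[i].Key and mask;          // rehash into the new generation
      while NewGen^.Slots[idx].State = gsInUse do    // linear probe for a free slot
        idx := (idx + 1) and mask;
      NewGen^.Slots[idx] := OldGen^.Slots[i];
    end;

  // Atomically publish the new generation; threads arriving after this point
  // only ever see the clean table.
  AtomicExchange(Pointer(ActiveGen), Pointer(NewGen));

  // Drain: wait until no thread still touches the old generation, then free it.
  while OldGen^.ActiveReaders > 0 do
    Sleep(1);
  Dispose(OldGen);
end;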

The Class Hierarchy

kbmMW provides several flavors of lock-free hash array, each suited to different use cases.

Static Hash Arrays (Fixed Capacity)

TkbmMWLockFreeHashArray<T> is the simplest variant. You create it with a fixed capacity, and it uses a native unsigned integer as the key. It’s perfect when you know the approximate number of entries in advance and don’t need the table to grow.

You subclass it and override the Zero method to tell it what a “blank” value looks like for your value type. Then you use it through the Value property (indexed by key), Contains for existence checks, Remove to delete, and GetAndRemove to atomically retrieve and delete in one step.

The key must be greater than 2, because values 0, 1, and 2 are reserved internally for the slot state flags.

TkbmMWLockFreeHashArray2<T> is the dual-key variant. It works the same way, but each entry is identified by two keys — a primary key and a secondary key (a TkbmNativeUInt). The two keys are XOR’d together during hashing to produce better distribution. This is useful when your lookup naturally involves two identifiers, such as an object address and a type code.
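Putting the single-key static variant to work might look roughly like this. The Zero signature, the constructor argument, and the indexed Value property are written as described above, but they have not been checked against the actual declarations in kbmMWLockFree.pas, so treat this as a sketch rather than copy-paste code.

// uses kbmMWLockFree, System.SysUtils;
type
  // Hypothetical value record: per-connection bookkeeping keyed by a handle.
  TConnInfo = record
    Socket: NativeUInt;
    ConnectedAt: TDateTime;
  end;

  TConnArray = class(TkbmMWLockFreeHashArray<TConnInfo>)
  protected
    // Tell the array what a blank TConnInfo looks like (assumed signature).
    function Zero: TConnInfo; override;
  end;

function TConnArray.Zero: TConnInfo;
begin
  Result.Socket := 0;
  Result.ConnectedAt := 0;
end;

procedure TrackConnection;
var
  Conns: TConnArray;
  Info: TConnInfo;
begin
  // Fixed capacity, roughly 2x the expected entry count, rounded up to a power of 2.
  Conns := TConnArray.Create(4096);
  try
    Info.Socket := 1234;
    Info.ConnectedAt := Now;
    Conns.Value[10001] := Info;          // insert (keys must be greater than 2)

    if Conns.Contains(10001) then
      Info := Conns.Value[10001];        // lookup

    Conns.Remove(10001);                 // delete
  finally
    Conns.Free;
  end;
end;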

Dynamic Hash Arrays (Auto-Growing)

TkbmMWDynamicLockFreeHashArray<T> extends the static array with automatic resizing. When the table reaches 60% occupancy, it doubles in capacity. This is the variant used by the debug memory system, because you rarely know in advance how many allocations your application will make.

The dynamic variant adds a thin reader/writer lock around resize operations (readers can still access the table concurrently; only the actual resize takes a brief exclusive lock), and it maintains an accurate element count. You can also manually trigger a Resize or Grow if you anticipate a burst of insertions.

[Figure: Dynamic grow, auto-doubling at 60% occupancy. With 5 of 8 slots in use (62.5%), the threshold is exceeded and capacity doubles to 16 slots; the 5 entries are rehashed (31.3% load), tombstones are eliminated, and probe chains get shorter.]
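As a quick illustration of the dynamic variant, the ready-made TkbmMWDynamicLockFreeHashArrayU32 specialization (described in the next section, so no subclassing is needed) can start small and resize itself as entries pour in. The constructor argument and method names are assumptions based on the description above.

procedure FillDynamic;
var
  Table: TkbmMWDynamicLockFreeHashArrayU32;
  i: Cardinal;
begin
  // Start small; the table doubles itself whenever it passes 60% occupancy.
  Table := TkbmMWDynamicLockFreeHashArrayU32.Create(1024);
  try
    for i := 3 to 100000 do              // keys must be greater than 2
      Table.Value[i] := i * 10;          // triggers several automatic grows along the way

    // If you know a burst of inserts is coming, you can also pre-size manually
    // with Grow (or Resize) before the burst starts.
  finally
    Table.Free;
  end;
end;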

Ready-Made Specializations

For common cases, kbmMW provides pre-built specializations so you don’t need to subclass anything:

  • TkbmMWLockFreeHashArray32 — 32-bit integer values, with atomic Increment and Decrement operations built in. Great for counters (see the sketch just after this list).
  • TkbmMWLockFreeHashArray64 — Same thing for 64-bit integers.
  • TkbmMWLockFreeHashArrayNative — Dual-key variant with native-sized integer values, also with atomic increment/decrement.
  • TkbmMWDynamicLockFreeHashArrayU32 — Dynamic (auto-growing) variant for unsigned 32-bit values.
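For example, a set of per-event counters could live in a TkbmMWLockFreeHashArray32 along these lines. The constructor argument and the exact Increment/Decrement signatures are assumptions based on the descriptions above.

const
  EVENT_LOGIN = 100;                     // application-defined keys; must be greater than 2
  EVENT_QUERY = 101;

procedure CountEvents;
var
  Counters: TkbmMWLockFreeHashArray32;
begin
  Counters := TkbmMWLockFreeHashArray32.Create(1024);  // rounded up to a power of 2
  try
    // Any number of threads can bump these concurrently without locking.
    Counters.Increment(EVENT_LOGIN);
    Counters.Increment(EVENT_QUERY);
    Counters.Decrement(EVENT_QUERY);
    WriteLn('Logins so far: ', Counters.Value[EVENT_LOGIN]);
  finally
    Counters.Free;
  end;
end;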

Using It in Practice

Using a lock-free hash array follows a simple pattern. First, you define your value type and create a subclass that implements the Zero method. This method tells the array what a “zeroed out” value looks like — it’s used when clearing slots and detecting empty values.

Then you create an instance with an initial capacity. For static arrays, choose a capacity roughly 2x the expected number of entries to keep the load factor healthy. For dynamic arrays, a smaller initial capacity is fine since the table will grow as needed.

Insertion and lookup are done through the Value property, which accepts a key and returns (or sets) the associated value. The Contains method checks whether a key exists. Remove deletes a key. GetAndRemove atomically retrieves and deletes a value — useful when you need the value but also want to clean up the entry.

If you need to iterate over all entries, use ResetEnum followed by repeated calls to Enum, which fills in the next key and value. Be aware that enumeration gives you a snapshot — concurrent modifications may cause the enumeration to restart if a GC cycle happens mid-iteration.
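A sketch of that enumeration pattern, again using the 32-bit specialization so no subclass is required. The exact ResetEnum, Enum, and GetAndRemove signatures are assumptions based on the description above (Enum is taken to fill in the next key and value and return False when the table is exhausted).

procedure DumpAndCleanUp(ATable: TkbmMWLockFreeHashArray32);
var
  Key: NativeUInt;
  Val: Integer;
begin
  // Walk the current entries. If a GC cycle runs mid-iteration the enumeration
  // may restart, so treat the output as a snapshot.
  ATable.ResetEnum;
  while ATable.Enum(Key, Val) do
    WriteLn('key=', Key, ' value=', Val);

  // GetAndRemove retrieves and deletes in one atomic step, handy when you need
  // the value but also want the entry gone.
  if ATable.GetAndRemove(42, Val) then
    WriteLn('removed key 42, last value was ', Val);
end;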

All of these operations are thread-safe without any external locking. That’s the whole point.

Important Caveats

There are a few things to keep in mind:

Keys must be greater than 2. The values 0, 1, and 2 are used internally to represent slot states (unused, tombstoned, and GC). If you pass keys of 0, 1, or 2, you’ll corrupt the internal state. If your natural key space includes these values, offset them (add 3 to every key, for instance).

Capacity is always rounded up to a power of 2. If you request a capacity of 1000, you’ll get 1024. This is required by the internal hashing mechanism, which uses bitwise AND for fast modulo operations.

The MurmurHash3 option. By default, the UseMurMur3 property is true, which means keys are hashed using the MurmurHash3 algorithm for better distribution. If your keys are already well-distributed (like memory addresses), you can set this to false to skip the hash computation and use the key directly as an index.
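Two of these caveats in code form: offsetting a key space that starts at zero, and turning off MurmurHash3 when the keys are already well spread. The property name is taken from the description above; the rest is an illustrative sketch.

const
  KEY_OFFSET = 3;                        // lift a zero-based key space clear of the reserved values 0..2

procedure CaveatExamples(ANaturalKey: NativeUInt);
var
  Table: TkbmMWLockFreeHashArray32;
begin
  Table := TkbmMWLockFreeHashArray32.Create(1000);   // actually gives 1024 slots
  try
    // Keys that are already well distributed (memory addresses, for instance)
    // can skip the MurmurHash3 step. Decide this up front, before inserting.
    Table.UseMurMur3 := False;

    // Keys 0, 1 and 2 are reserved for slot states, so shift the key space.
    Table.Value[ANaturalKey + KEY_OFFSET] := 1;
  finally
    Table.Free;
  end;
end;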

GC happens automatically. When tombstones accumulate, the array triggers a GC cycle on its own. You don’t need to (and shouldn’t) call GC manually in normal use.

Performance Improvements in the Upcoming Release

The upcoming release includes a set of structural optimizations to the lock-free hash array internals. These changes were validated through extensive benchmarking on both a 4-core virtual machine and a 32-core bare-metal server, using a comprehensive test suite covering single-threaded operations, multi-threaded contention, dynamic resizing, and long-running stability soaks.

Here’s what changed and why:

Bitwise AND replaces integer division. Every time the array probes a slot, it needs to wrap the index around to stay within bounds. The original code used the MOD operator, which compiles to an integer division — a 20–90 cycle operation depending on the CPU. The new code uses AND (capacity - 1), which is a single-cycle bitwise operation. This works because the capacity is always a power of 2. This single change touches 23 sites in the codebase and shaves cycles off every single probe step.
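In code, the change is as small as it sounds. Because the capacity is always a power of 2, the AND mask below computes exactly the same slot index as the old MOD expression, without the integer division (an illustrative fragment, not the actual kbmMW source):

function WrapIndex(AHash, ACapacity: NativeUInt): NativeUInt;
begin
  // Old approach: Result := AHash mod ACapacity;   (integer division, 20-90 cycles)
  // New approach, valid only because ACapacity is a power of 2:
  Result := AHash and (ACapacity - 1);              // single-cycle bitwise AND
end;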

Virtual key comparison eliminates interface overhead. The original code compared keys through an IEqualityComparer<K> interface, which involves double indirection (first to the interface table, then to the actual method). For the common case of integer keys, this is massive overkill. The new code uses a virtual KeyEquals method that the integer-key specializations override with a direct register comparison. Same result, far fewer cycles.

Forward iteration in GC. The garbage collection phase iterates over the entire old generation to copy live entries to the new generation. The original code deliberately iterated backwards (downto 0). The reasoning was sound: during GC, processed slots are flagged, and any concurrent thread that hits a flagged slot must wait for the entire GC to finish. By iterating backward while probe chains run forward, the GC frontier moves away from most active lookups, giving concurrent threads more runway before they encounter a flagged slot. However, this benefit is small in practice — it only helps the fraction of threads whose hash lands near the moving frontier, and once a thread is blocked it waits for the full GC regardless of when it was caught. Meanwhile, the cost of backward iteration is paid on every GC cycle: it fights the CPU’s hardware prefetcher, which detects sequential forward access patterns and speculatively loads upcoming cache lines. On a large table (the GC pass touches every slot), the difference between prefetcher-assisted and prefetcher-defeated sequential access is substantial. The new code iterates forward (0 to n-1), working with the prefetcher, and benchmarks confirmed consistent improvement on GC-heavy workloads.

Power-of-2 enforcement in resize. The original InternalResize and Grow methods didn’t enforce power-of-2 capacity, which was fine when the code used MOD for index wrapping, but would have caused catastrophic failures with the new AND masking. The new code rounds all resize targets to the next power of 2, and Grow simply doubles the capacity (which is always a power of 2 to begin with).

Overflow-safe growth. The original Grow computed 150 * FCapacity div 100, which overflows a 32-bit cardinal at approximately 28.6 million entries. The new code uses FCapacity * 2, which is safe up to approximately 2.1 billion.

Drain wait safety. The InternalResize method now waits for all threads to finish accessing the old generation before freeing its memory. This closes a theoretical race condition where a thread could still be reading from a generation that’s been freed.

The Numbers

On a 32-core bare-metal server with 8 threads (the realistic deployment scenario), comparing the original code to the new release:

Benchmark                                     Improvement
Single-threaded lookup                        +3% to +5%
Single-threaded GetAndRemove                  +5% to +7%
GC pressure (insert/remove churn)             +4% to +8%
Multi-threaded insert (8 threads)             +7% to +27%
Dynamic array grow (1K → 1M entries)          +87% to +93%
Dynamic multi-threaded mixed operations       +11% to +34%
Long-running stability soak                   +10% to +26%

The dynamic grow improvement is the standout — nearly doubling throughput when the table needs to resize under load. This directly benefits any use of TkbmMWDynamicLockFreeHashArray, including the debug memory system.

On a 4-core VM (the constrained scenario), the improvements are smaller but still positive across the board, with zero regressions.

The YIELDWAIT Tuning Option

The upcoming release also introduces a compile-time option for advanced tuning on many-core servers: KBMMW_LOCKFREEHASHARRAY_YIELDWAIT.

Background: What Happens When Two Threads Collide

When a thread tries to fence a slot and finds it already fenced by another thread, it needs to wait. The question is how it waits. The default is sleep(1), which tells the operating system to put the thread to sleep for at least 1 millisecond. On Windows, due to timer resolution, this actually means sleeping for about 15 milliseconds.

For most scenarios, this is fine — fence collisions are extremely rare, and when they do happen, sleeping for 15ms is harmless because the other thread released the fence millions of nanoseconds before the sleeping thread wakes up.

But on many-core servers (say, 32 cores running 8 threads), the situation is different. The fence holder is almost certainly running on a different physical core and will release the fence within nanoseconds. Sleeping for 15ms wastes 15 million nanoseconds waiting for an event that took 100 nanoseconds.

Enabling YIELDWAIT

To enable the yield-wait strategy, uncomment the define at the top of kbmMWLockFree.pas:

{$DEFINE KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}

You’ll find it near the top of the file alongside other diagnostic defines, currently commented out with the {. prefix notation:

{.$DEFINE KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}

Remove the dot to activate it.

With this define active, the Fence method uses sleep(0) instead of sleep(1) during normal operations. sleep(0) tells the OS to yield the current timeslice if another thread is ready to run, but return immediately if no one is waiting. This means the thread retries the CAS almost immediately — perfect for the case where the fence holder is on another core and releases within nanoseconds.

The Resize-Aware Twist

Simply switching all waits to sleep(0) doesn’t work well in every situation. During a GC or resize operation, the array needs to rehash the entire table, which takes milliseconds. If seven threads are all sleep(0) polling a flag that won’t change for several milliseconds, they create a storm of context switches and cache-line bouncing that actually slows down the resize.

The implementation handles this intelligently. When KBMMW_LOCKFREEHASHARRAY_YIELDWAIT is defined, the Fence method checks whether a resize is currently in progress. During normal operations (no resize), it uses sleep(0) for fast recovery. During resize, it automatically falls back to sleep(1) to get out of the way and let the resize finish quickly.

The GC drain waits and InternalResize drain waits always use sleep(1) regardless of the define, because they inherently wait for long-running operations where yielding aggressively provides no benefit.
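In pseudocode form, the resize-aware wait choice amounts to something like this. It is a conceptual sketch of the behaviour just described, not the actual Fence implementation:

// Needs System.SysUtils (for Sleep).
procedure WaitBeforeRetry(AResizeInProgress: Boolean);
begin
{$IFDEF KBMMW_LOCKFREEHASHARRAY_YIELDWAIT}
  if AResizeInProgress then
    Sleep(1)     // a resize/GC takes milliseconds anyway; get out of its way
  else
    Sleep(0);    // yield the timeslice and retry the CAS almost immediately
{$ELSE}
  Sleep(1);      // default: safe everywhere, and fence collisions are rare
{$ENDIF}
end;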

When to Enable It

Enable YIELDWAIT when:

  • You’re running on a many-core server (16+ cores)
  • Your threads are fewer than your cores (undersubscribed)
  • Your workload involves high-frequency concurrent access to the hash array
  • You’ve profiled and confirmed that fence contention is measurable

Keep it disabled (the default) when:

  • You’re running on 4–8 core machines
  • Your thread count equals or exceeds your core count (oversubscribed)
  • You’re not sure — the default is safe and performant everywhere

YIELDWAIT Performance Impact

On a 32-core server with 8 threads:

Benchmark                        Default               With YIELDWAIT
Multi-threaded insert            +27% vs original      +23% vs original
High-contention mixed ops        −3% vs original       +130% vs original
Dynamic grow                     +87% vs original      +91% vs original
Stability soak                   +26% vs original      +3% vs original

The high-contention scenario is where YIELDWAIT shines — over 2x faster than the original code when 8 threads compete over the same key range. The trade-off is that long-running stability soaks see less benefit (though still positive versus the original).

On a 4-core VM with 8 threads (oversubscribed), YIELDWAIT still dramatically improves the high-contention case (+142%), but the stability soak shows a significant regression (−32%). This is why it’s disabled by default.

Summary

kbmMW’s lock-free hash arrays give you a thread-safe key/value store that scales with your hardware instead of fighting against it. The upcoming release makes them measurably faster across the board, with the biggest gains in the dynamic growing scenarios that matter most for real-world use. And for those running on many-core servers, the KBMMW_LOCKFREEHASHARRAY_YIELDWAIT option provides an additional boost for contention-heavy workloads.

For most users, the improvements are automatic — just update to the new release and your existing code runs faster. If you’re on a big server and want to squeeze out even more, flip the YIELDWAIT switch and benchmark your specific workload to see if it helps.
