Lock-free programming is an approach to concurrency that uses low-level atomic operations to synchronize threads instead of familiar lock objects such as mutexes and critical sections. While it's probably not practical to write an entire multi-threaded application without a single lock, careful deployment of lock-free algorithms against selected problems can mean more scalable software. As AMD's line of processors evolves past two cores to quad-core Barcelona processors and beyond, the scaling performance of lock-free methods makes them increasingly interesting.
More Resources
Eliminating locks takes out a lot of problems. Without locks, threads don't block. Acquiring and releasing locks can introduce significant overhead. And most important, the class of problems that includes deadlocks and priority inversion, although often associated with multithreaded programs in general, are locking problems and a non-issue with lock-free synchronization. Avoiding these problems, notoriously difficult to debug and resolve in the presence of large numbers of threads, is a significant advantage for lock-free.
On the other hand, programming with locks can be easy. To add an item to a list, you lock the entire list (in the coarsest and simplest case), update some pointers, and release the lock. Except for the lock and release, it's the same code you would write for a single thread. That's not true for the lock-free alternative. A lock-free insert to a shared list is complex, and requires attention to keeping the list in a consistent and recoverable state.
The complexity of lock-free algorithms can be sidestepped by using standard algorithms with lock-free data structures. Such objects run lock-free internally but expose the expected operations through simple methods (like push and pop on a lock-free stack). One example of these is the SList object (interlocked singly-linked list) from the Win32 API.
A recent story in Ars Technica reported that Valve Software, makers of Half Life and other graphics-intensive first-person games, turned to lock-free methods in a deadlock-resolution effort. Adding lock-free was part of the company's overall push to multi-core, which made use of different synchronization strategies, including lock-free, coarse-grain and fine-grain locking, and multi-read exclusive write locks.
Whether a lock-free approach makes sense for your application will depend on the workload. Where parallelism is fine-grained (where each thread requires access to a shared resource for relatively short intervals) and there are a large number of contending threads, a well-written lock-free algorithm might make a better-performing choice, and will avoid deadlock and related conditions.
With a small number of threads, there probably won't be much difference in performance between a lock-free implementation and one that uses locks. But what might be a wash today, on a dual-core system with dozens of threads, might strongly favor lock-free on a 16-core future processor supporting thousands of threads.
Atomic Operations
Lock-free synchronization relies on an atomic update operation, ultimately supported on the processor, to maintain data integrity between threads without restricting access. The operation, known as CAS, for compare and swap, atomically updates memory, but only if its initial contents were some expected value. If the value is different, the update fails.
The simplest scenario for memory sharing using CAS is this: a thread loads a value from shared memory, storing that value locally. Then it does some calculations on that value, yielding a new value. Finally, it attempts to update the original memory location using CAS, submitting both the new value and the old value for comparison. If the contents of shared memory haven't changed, the update succeeds and we move on. If it fails, because another thread has updated that location in the interim, we re-fetch from shared memory and try again.
When there is no conflict between writing threads, CAS is cheap. Where there is a conflict, CAS requires a re-run of the calculation (most likely something cheap, like offsetting a pointer) and a retry of the CAS operation. Most of the time, we will need to retry the CAS until the operation succeeds, but it is possible to write lock-free algorithms where threads can continue past a failed CAS without spinning (see the queuing example below).
On AMD hardware, CAS is implemented by the cmpxchg family of instructions, which works with up to 128-bit operands (cmpxchg16b) on current AMD64 processors. CAS maps more-or-less directly to cmpxchg. The key instruction in an assembly implementation of a 64-bit atomic CAS operation boils down to (after some register manipulation to get our new and old operands in the right registers):
lock cmpxchg8b [edi]
After this instruction executes, the new value will be copied to the 64-bit target location and the zero flag set on success. On failure, the zero flag will be clear. The lock prefix is necessary for the operation to execute atomically, maintaining cache coherency among cores.
At the application level, you can find atomic CAS implementations built into the Windows API (InterlockedCompareExchange), .NET (System.Threading.Interlocked.CompareExchange), Java (the compareAndSet methods of the classes in java.util.concurrent.atomic), and by AIX and Solaris compilers. The does not provide high-level atomic functions.
Further Complications
One line of assembly isn't quite enough to address all of the real-world complications of a lock-free implementation. The most significant of these is that both the compiler and the processor may reorder instructions as a performance optimization. Reordering instructions around the CAS will introduce errors when the CAS is used as a gate to make data visible to other threads.
Say you have a section of code that initializes a structure, and then uses CAS to update a shared pointer to reference it. If some or all of the initialization were actually to occur after the CAS, threads running on other cores would see a partially-initialized structure.
Programmers need to take special care to make sure that all variables that may be shared through CAS are marked with the volatile keyword. This will keep the compiler from reordering instructions related to that variable. Visual C++ also includes _ReadBarrier and _WriteBarrier, which provide finer-grained control.
In Java and in Visual Studio 2005, volatile will also introduce code to prevent out-of-order execution by the processor (a memory fence or memory barrier). In other environments, you will need to use memory fence instructions (sfence, lfence, and mfence) to prevent hardware re-ordering. Note that some of the higher-level CAS APIs listed earlier include memory fences.
A second complication of CAS is the so-called ABA problem. One thread loads a value from shared memory (A). Then a second thread updates the value to (B), and then again back to (A). When the first thread completes its operation it will believe that the shared location has not changed and a CAS will wrongly succeed.
One solution to this problem is to extend CAS to track a second data item, a counter, as well as the desired data. The counter can be incremented with each change, and the updating thread can then make sure it is working with the correct generation. This requires the ability to work atomically with more than a pointer-sized object (and thus requires cmpxchg16b on 64-bit platforms).
Lock-Free Data Structures
With CAS as a starting point, researchers have shown that it's possible to develop complex lock-free data structures, including stacks, lists, and queues. One of these is the Michael-Scott non-blocking queue, developed by Maged M. Michael and Michael L. Scott, which we'll touch on here for illustration.
The queue is built as a singly linked-list. Here is the main part of the enqueuing code, with handling of the ABA problem left out for simplification:
for(;;)
{
tail = Q->Tail;
next = tail->next;
if (tail == Q->Tail)
{
if (next == NULL)
{
if (CAS( &tail->next, NULL, new_node))
break;
}
else
CAS( &Q->Tail, tail, next);
}
}
CAS( &Q->Tail, tail, new_node);
The queue uses CAS primitives for all shared-memory writes (the CAS function assumed here updates the first argument with the third, but only if the first and second match). But high-level functions composed of low-level atomic functions aren't atomic, of course. The complexity of lock-free algorithms comes from coordinating updates to the multiple elements that make up a high-level data structure to ensure that it is always in a consistent state.
Inside the queue's main loop, the calling thread gets a local copy of the tail and the tail's next pointers, retrying the read if necessary until it has consistent copies. Once the thread gets its copy, it attempts to CAS update the current tail's next pointer with a pointer to the new node. But before it can do this, it needs to verify that the next pointer is NULL as expected, indicating the true end of the list. It may not be, because another thread may have left the queue's tail pointer un-updated. If it's not, the calling thread uses CAS to move the queue's tail pointer further toward the end of the list before further retries. Finally, the calling thread tries to CAS the queue's tail pointer to the new node. If it fails, the thread just moves on, leaving the unfinished update for the next thread.
The dequeue operation takes a similar tack:
for(;;)
{
head = Q->Head;
tail = Q->Tail;
next = head->next;
if (head == Q->Head)
{
if (head == tail)
{
if (next == NULL)
return FALSE;
CAS( &Q->Tail, tail, next);
}
else
{
*pValue = next->value;
if (CAS( &Q->Head, head, next))
break;
}
}
}
free(head);
return TRUE;
Again, the queue retries until it can get consistent local copies of the pointers it needs. Once it does, it checks to see if the head and tail pointers are the same, which either indicates an empty queue or that the tail pointer isn't up to date. In the latter case, it uses CAS to advance the tail pointer toward the end of the queue and retries. If the head and tail pointers are different, that means there's something to return, so the queue copies the data value and then uses CAS to update the head. We're not allowed to leave the loop until that CAS succeeds, because we're about to delete the head node.
How could you use such a queue in your application? You might use it replace an inefficient lock-based design with serialization. For example, you could serialize socket write requests through a queue, with a thread that owns the socket dequeueing requests on the other side. There's nothing to keep you from doing the same thing with a queue that uses locks internally, but the lock-free queue will avoid any chance of deadlock. The lock-free implementation should also provide better performance as your application runs more threads on hardware with increasing numbers of processor cores.
Steve Apiki is senior developer at Appropriate Solutions, Inc., a Peterborough, NH consulting firm that builds server-based software solutions for a wide variety of platforms using an equally wide variety of tools. Steve has been writing about software and technology for over 15 years.