The first thing to know about Windows Vista™ is that all versions, all the way down to Home Basic, support multicore processors. Further, the Business, Enterprise, and Ultimate Editions – the ones most likely to be run and encountered by a developer – all support multiple processors on the motherboard.
Additionally, Vista's display model exploits the acceleration capabilities of modern GPUs. That's the unquestionably good news. Vista is not going to hold you back from exploiting the power of multiple cores, processors, and accelerated display pipelines. Vista's approach to the new era in hardware, though, is incremental. Whether you view that as good news or bad news depends on how far into the future you're looking.
If you think about Vista alongside the next few years, when common machines will have anywhere between 1 and possibly 8 cores, an incremental approach is going to be fine. There are a lot of background tasks that can be juggled to provide better performance and, from a programming standpoint, the relatively straightforward techniques of .NET's System.Threading namespace will give many programmers their first taste of actual concurrent speedups, as opposed to the "prevent freezing" benefit that characterizes single-core multithreading.
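The difference between "prevent freezing" and an actual speedup is easiest to see in a divide-and-combine computation. The following is a minimal sketch in portable C++ using std::thread rather than .NET's System.Threading (a substitution made purely for illustration); the function name parallel_sum and the chunking scheme are assumptions, not anything Vista or the .NET Framework prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by giving each worker thread one
// contiguous chunk, then combining the per-thread partial sums.
long long parallel_sum(const std::vector<int>& data,
                       unsigned workers = std::thread::hardware_concurrency()) {
    // hardware_concurrency() may return 0 when the core count is unknown.
    std::size_t n = (workers == 0) ? 1 : workers;
    if (n > data.size()) n = data.empty() ? 1 : data.size();

    std::vector<long long> partial(n, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + n - 1) / n;  // ceiling division

    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t begin = std::min(i * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        assert(begin <= end);
        threads.emplace_back([&data, &partial, i, begin, end] {
            // Each thread writes only its own slot, so no lock is needed.
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

On a single core the threads merely take turns, so the result arrives no sooner; with several cores the chunks can genuinely run at the same time, which is the speedup the paragraph above refers to.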
If you take an even longer view, more substantive changes in the OS architecture will undoubtedly prove necessary. The illusion that we're still running on hopped-up versions of the von Neumann architecture has already exploded at the processor and memory levels. As the number of cores goes up, every sub-system will eventually have to be designed to truly take advantage of an asynchronous world. In particular, the operating system must be increasingly modularized if it is to spread its burden among many cores.
The incremental improvements within the OS relating to multicore performance include: enhanced support for non-uniform memory access (NUMA) systems (such as the AMD Opteron™ family); pervasive prefetching, both at the low level (page-fault handling and system cache read-ahead) and in the user-visible Prefetch/SuperFetch feature, which loads oft-used programs in anticipation of user launch; improvements to internally used data structures and algorithms; an improved DLL loader that creates processes significantly faster; and a much-improved thread pool.
The thread-pool changes include generally improved performance and the ability to support multiple pools per process. This seems a good time to mention Vista's system-wide anti-convoy features. A "lock convoy" is a situation where performance tanks because a large number of threads are blocked waiting for a resource; when the resource becomes available, every thread is awoken, all but one fail to acquire the resource, and all but one are set back to waiting. When the number of threads is high enough and the work performed per thread is small, the time spent in overhead can overwhelm the time spent working. That lock convoys arise from attempts to make thread scheduling "fair" is just one of those surprising results that make concurrent programming such a challenge.
Vista uses a different lock handoff mechanism that's "unfair," in that there's now a race between a lock becoming available and its acquisition by the first thread in the wait list. This makes it possible for a thread to "sneak in" and grab the lock, sending the should-have-been-scheduled thread to the back of the wait list (and, at least theoretically, raising the potential for starvation). Lock convoys were, anecdotally at least, often the root problem of system-wide freezes that lasted for several seconds and then disappeared.
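The "unfair" handoff can be illustrated in user-mode code. The sketch below is not Vista's kernel mechanism; it is a toy lock, built on the standard C++ mutex and condition variable, whose release simply marks the lock free instead of handing ownership to a particular waiter. The names BargingLock and count_under_lock are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy "barging" lock: unlock() marks the lock free and wakes one waiter,
// but does not transfer ownership. A thread arriving at that moment can
// take the lock first, and the woken waiter then goes back to sleep.
class BargingLock {
public:
    void lock() {
        std::unique_lock<std::mutex> guard(state_);
        // New arrivals and woken waiters are treated alike: whoever
        // observes held_ == false first wins.
        free_cv_.wait(guard, [this] { return !held_; });
        held_ = true;
    }
    void unlock() {
        {
            std::lock_guard<std::mutex> guard(state_);
            assert(held_);
            held_ = false;
        }
        free_cv_.notify_one();  // wake one waiter, with no guarantee it wins
    }
private:
    std::mutex state_;
    std::condition_variable free_cv_;
    bool held_ = false;
};

// Hypothetical driver: many threads doing very little work per
// acquisition, the pattern in which convoys tend to form.
long count_under_lock(int threads, int iterations) {
    BargingLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The trade-off is the one described above: a running thread that barges in avoids a wake-up and context switch, at the cost of a waiter occasionally being passed over.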
From a developer's standpoint, there are three major areas where developing for Vista and multicore will play out. First, managed code, aka the .NET Framework. It's here to stay. Historically, some Win32 developers have viewed the .NET Framework as something akin to a souped-up version of the Visual Basic runtime: fine for the hoi polloi but not for the hardcore. Windows Presentation Foundation (WPF) comes close to reversing that position; the slickest user-facing applications are written in WPF, not (primarily) in Win32 or WinForms. Furthermore, with the upcoming release of Visual Studio 2010, developers will be able to write managed applications with parallelism in mind using the Task Parallel Library and Parallel LINQ. This will enable performant multiple-core execution using high-level languages such as C# and Visual Basic, and will provide enhanced synchronization primitives as well. For native C++ developers, the Parallel Patterns Library similarly enables the development of concurrent applications.
The second area is the improved support for NUMA systems. In contrast to Symmetric Multiprocessing (SMP), which views memory as a single shared resource, NUMA systems can associate a portion of memory with a single processing unit. When general memory thus becomes "local" to one core, scaling is easier. With shared memory, one must use locks extensively to ensure data integrity. But locks cause context switches, which reduce cache effectiveness and locality and, as general memory is updated, increase traffic over the memory bus, which can affect all access to memory.
So NUMA systems continue the trend towards increasing the number of "steps" in the speed with which data can be retrieved (registers being faster than L1 cache, which is faster than L2 cache, which is faster than L3 cache, which is faster than local NUMA memory, which is faster than shared memory on another node, which is faster than Vista's flash-based ReadyBoost cache, which is faster than the hard drive, and so on out to databases and the network).
Of course not all general memory on a NUMA system is necessarily localized to a single processor. Indeed, in practice the amount of localized memory is probably quite small compared to the amount of shared memory. For shared memory, it is important to maintain cache coherence; that is, all the processors have to agree on the value of a particular general memory location, even if that value is being manipulated in a processor-specific cache. Systems that do this, such as systems based on the AMD Opteron processor, are known as cache-coherent NUMA (ccNUMA) systems. (It is possible to architect a non-cache-coherent NUMA system, but obviously such systems make maintaining the integrity of shared memory very difficult.)
Ironically, given that the future emphasizes managed APIs, the NUMA-specific APIs are all unmanaged C functions (including VirtualAllocExNuma, CreateFileMappingNuma, MapViewOfFileExNuma, GetNumaProcessorNode, and AllocateUserPhysicalPagesNuma). This seems appropriate, since localizing memory to a particular processor is dramatically outside the memory abstractions of the CLR. Further, localizing memory is the sort of system-affecting behavior more generally tackled by low-level C and C++ programmers trying to get maximum performance for a particular task, although the CLR internals could use these functions directly.
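As a rough illustration of how two of these functions fit together, the sketch below asks which NUMA node the calling thread is currently running on and requests memory with that node as the preferred one. NodeBuffer, alloc_on_current_node, and free_node_buffer are hypothetical names; the fallback to malloc on non-Windows builds, or when the NUMA calls fail, is an assumption added so that the example compiles anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600  // Vista or later
#endif
#include <windows.h>
#endif

// Hypothetical wrapper recording how the memory was obtained.
struct NodeBuffer {
    void* ptr;
    std::size_t size;
    bool numa_local;  // true if VirtualAllocExNuma supplied the memory
};

NodeBuffer alloc_on_current_node(std::size_t bytes) {
    assert(bytes > 0);
#ifdef _WIN32
    // Find the node of the processor this thread is running on right now.
    // The thread may migrate later unless its affinity is also set.
    UCHAR node = 0;
    const DWORD cpu = GetCurrentProcessorNumber();
    if (GetNumaProcessorNode(static_cast<UCHAR>(cpu), &node) && node != 0xFF) {
        void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                     MEM_RESERVE | MEM_COMMIT,
                                     PAGE_READWRITE, node);
        if (p != nullptr) return NodeBuffer{p, bytes, true};
    }
#endif
    // Assumed fallback: ordinary heap memory with no node preference.
    return NodeBuffer{std::malloc(bytes), bytes, false};
}

void free_node_buffer(NodeBuffer& buffer) {
#ifdef _WIN32
    if (buffer.numa_local) {
        VirtualFree(buffer.ptr, 0, MEM_RELEASE);
        buffer.ptr = nullptr;
        return;
    }
#endif
    std::free(buffer.ptr);
    buffer.ptr = nullptr;
}
```

The node passed to VirtualAllocExNuma is a preference rather than a guarantee, so a careful program would still measure whether locality actually improved.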
The third area where developers interested in multiple cores should focus their attention is I/O. A few of the simple but very welcome improvements include I/O cancellation and prioritization. Synchronous I/O operations (such as opening a file) can now be cancelled from other threads. If you have the file handle, CancelIoEx() cancels the requests pending on that handle, regardless of which thread issued them; if you have the thread handle, CancelSynchronousIo() cancels the synchronous operation that thread is blocked in. The scenario we all know for this is the attempt to open a file over a no-longer-available network share.
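The network-share scenario can be sketched as follows: a worker thread attempts the open, and the caller cancels it if it has not finished within a timeout. OpenResult, OpenRequest, open_worker, and try_open are hypothetical names, and the Unsupported result on non-Windows builds is an assumption added so that the example compiles anywhere.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600  // Vista or later
#endif
#include <windows.h>
#endif

enum class OpenResult { Opened, Cancelled, Failed, Unsupported };

#ifdef _WIN32
struct OpenRequest {
    const wchar_t* path;
    HANDLE file;
    DWORD error;
};

// Runs on the worker thread; CreateFileW may block for a long time.
static DWORD WINAPI open_worker(LPVOID param) {
    OpenRequest* req = static_cast<OpenRequest*>(param);
    req->file = CreateFileW(req->path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    req->error = (req->file == INVALID_HANDLE_VALUE) ? GetLastError()
                                                     : ERROR_SUCCESS;
    return 0;
}
#endif

OpenResult try_open(const std::wstring& path, unsigned timeout_ms) {
    assert(!path.empty());
#ifdef _WIN32
    OpenRequest req = {path.c_str(), INVALID_HANDLE_VALUE, ERROR_SUCCESS};
    HANDLE worker = CreateThread(nullptr, 0, open_worker, &req, 0, nullptr);
    if (worker == nullptr) return OpenResult::Failed;

    DWORD wait = WaitForSingleObject(worker, timeout_ms);
    while (wait == WAIT_TIMEOUT) {
        // Cancellation can fail if the thread is not blocked in I/O at
        // this instant, so keep trying until the worker has exited.
        CancelSynchronousIo(worker);
        wait = WaitForSingleObject(worker, 100);
    }
    CloseHandle(worker);

    if (req.file != INVALID_HANDLE_VALUE) {
        CloseHandle(req.file);
        return OpenResult::Opened;
    }
    return (req.error == ERROR_OPERATION_ABORTED) ? OpenResult::Cancelled
                                                  : OpenResult::Failed;
#else
    (void)timeout_ms;
    return OpenResult::Unsupported;  // assumed stub for other platforms
#endif
}
```

A call such as try_open(L"\\\\server\\share\\report.txt", 2000) would give up after roughly two seconds instead of leaving the caller frozen; the path shown is only a placeholder.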
I/O prioritization allows background tasks such as antivirus scans to be deprioritized. Often this is done through a call on the process or thread. Calling SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN) puts the current process into background mode, in which all of its I/O switches to Very Low priority; calling it again with PROCESS_MODE_BACKGROUND_END restores the previous priority. SetThreadPriority() can be used similarly, with THREAD_MODE_BACKGROUND_BEGIN and THREAD_MODE_BACKGROUND_END. File-by-file priority can be set with SetFileInformationByHandle(), which takes a FILE_IO_PRIORITY_HINT_INFO argument whose priority, in addition to Very Low, can be set to Low or Normal (note that implementation is a driver responsibility and therefore I/O prioritization is just a "hint").
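Because background mode has to be entered and left in matched pairs, it fits naturally into a scope guard. The sketch below uses the thread-level variant, SetThreadPriority() with THREAD_MODE_BACKGROUND_BEGIN and THREAD_MODE_BACKGROUND_END, and adds a per-file hint helper using the FILE_IO_PRIORITY_HINT_INFO structure as it is spelled in the Windows SDK headers. BackgroundScope, run_in_background, and hint_low_io_priority are hypothetical names; on non-Windows builds the guard is assumed to do nothing.

```cpp
#include <cassert>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600  // Vista or later
#endif
#include <windows.h>
#endif

// Enters background mode for the current thread on construction and
// leaves it on destruction, even if the work in between throws.
class BackgroundScope {
public:
    BackgroundScope() : active_(false) {
#ifdef _WIN32
        active_ = SetThreadPriority(GetCurrentThread(),
                                    THREAD_MODE_BACKGROUND_BEGIN) != 0;
#endif
    }
    ~BackgroundScope() {
#ifdef _WIN32
        if (active_) {
            SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_END);
        }
#endif
    }
    bool active() const { return active_; }
    BackgroundScope(const BackgroundScope&) = delete;
    BackgroundScope& operator=(const BackgroundScope&) = delete;
private:
    bool active_;
};

// Hypothetical convenience wrapper: runs `work` at background priority.
template <typename Work>
auto run_in_background(Work work) -> decltype(work()) {
    BackgroundScope scope;
    return work();
}

#ifdef _WIN32
// Per-file hint; drivers are free to ignore it.
bool hint_low_io_priority(HANDLE file) {
    assert(file != INVALID_HANDLE_VALUE);
    FILE_IO_PRIORITY_HINT_INFO hint;
    hint.PriorityHint = IoPriorityHintLow;
    return SetFileInformationByHandle(file, FileIoPriorityHintInfo,
                                      &hint, sizeof(hint)) != 0;
}
#endif
```

A scan routine could then be invoked as run_in_background([&] { return scan(files); }), where scan and files stand in for the application's own code.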
Multimedia applications are the obvious users of another I/O improvement: reserved streaming bandwidth. SetFileBandwidthReservation() takes as inputs the desired nPeriodMilliseconds and nBytesPerPeriod and returns, through the out parameters lpTransferSize and lpNumOutstandingRequests, the request size and the number of outstanding requests the application should keep in flight to achieve that bandwidth. There is a corresponding GetFileBandwidthReservation(). These functions may not be multicore specific, but performance programmers may be able to exploit them cleverly.
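The two inputs are just a rate expressed as bytes per period, so a media player that knows its stream's bit rate has a small conversion to do first. The sketch below shows that arithmetic and, on Windows builds, passes the result to SetFileBandwidthReservation(). The names bytes_per_period, StreamPlan, and reserve_stream are hypothetical, and rounding up to a whole byte is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600  // Vista or later
#endif
#include <windows.h>
#endif

// Converts a stream bit rate into the byte count needed per period,
// rounding up so the reservation never falls short of the stream.
std::uint32_t bytes_per_period(std::uint64_t bits_per_second,
                               std::uint32_t period_ms) {
    assert(period_ms > 0);
    // bits/s * ms = bits * 1000; dividing by 8000 yields bytes.
    const std::uint64_t scaled = bits_per_second * period_ms;
    return static_cast<std::uint32_t>((scaled + 7999) / 8000);
}

#ifdef _WIN32
// What the system tells the application to do to hit the reserved rate.
struct StreamPlan {
    DWORD transfer_size;         // size of each read request, in bytes
    DWORD outstanding_requests;  // number of requests to keep in flight
};

bool reserve_stream(HANDLE file, std::uint64_t bits_per_second,
                    DWORD period_ms, StreamPlan* plan) {
    assert(plan != nullptr);
    return SetFileBandwidthReservation(
               file, period_ms, bytes_per_period(bits_per_second, period_ms),
               FALSE,  // bDiscardable: fail I/O rather than drop the reservation
               &plan->transfer_size, &plan->outstanding_requests) != 0;
}
#endif
```

For example, an 8 Mbit/s stream with a one-second period works out to 1,000,000 bytes per period; the application would then issue reads of transfer_size bytes and keep outstanding_requests of them pending.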
The incremental approach to multicore systems that Vista embodies demands relatively little from the application developer. There are improvements, such as the areas covered in this article, but it's hardly an overwhelming shift. Such a shift may necessarily be coming in the APIs of future operating systems because it is certainly the case that an overwhelming shift in hardware is upon us now.