Size matters! Like all kids, I was fascinated by T-Rex's massive jaws, but the unimaginably huge Brontosaurus [1] was always my favorite. Woolly mammoths, blue whales, the planet Jupiter, Everest vs. K2, Betelgeuse, aircraft carriers, Mack trucks, redwood trees, Hoover Dam, I loved them all. The bigger the better, no matter what.
That kid's still inside me. One reason 64-bit computing got my attention is that it's big and can handle big things: giant databases, huge address spaces, excessively large integers. I love them all. Add to that list something else big that the AMD Opteron processor can handle: large memory pages.
In this two-part series, we'll see how large pages can improve performance and how some Java Virtual Machines (JVMs) running on Opteron processors make it easy for you to use large pages in your Java applications.
Yes, the default 4KB virtual pages are wimpy, wimpy, wimpy. On the other hand, the 2MB pages that you can obtain by using some advanced JVMs are hefty, hefty, hefty, especially when mapping a virtual address to a physical address in the translation lookaside buffer (TLB). Say no more.
Well, I suppose we should say some more: What's a memory page? What's a translation lookaside buffer? Why are bigger pages better? And how do you enable them using advanced JVMs? We're going to cover the background in Part 1, and get into the step-by-step directions in Part 2. (Note that what Windows calls "large pages," Linux and Unix call "huge pages" or "huge TLB pages." We'll use the term "large pages" in this article for consistency's sake, but note that everything here pertains to both Linux and Windows.)
All Pages Large and Small
All x86 processors and modern 32-bit and 64-bit operating systems allocate physical and virtual memory in pages. The page table maps virtual addresses to physical addresses for each native application (or for the JVM, which looks like an application to the operating system), and "walking" it to look up address mappings takes time. To speed up that process, modern processors use the translation lookaside buffer to cache the most recently accessed mappings between virtual and physical memory.
Often, the physical memory assigned to an application or runtime isn't contiguous; that's because in a running operating system, the memory pages can become fragmented. But because the page table masks physical memory addresses from applications and JVMs, apps think that they do have contiguous memory. (By analogy, think about how fragmented disk files are invisible to applications; the operating system's file system hides all of it.)
When an application needs to read or write memory, the processor uses the page table to translate the virtual memory addresses used by the application to physical memory addresses. As mentioned above, to speed this process, the processor uses a cache system—the translation lookaside buffers. If the requested address is in the TLB cache, the processor can service the request quickly, without having to search the page table for the correct translation. If the requested address is not in the cache, the processor has to walk the page table to find the appropriate virtual-to-physical address translation before it can satisfy the request.
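The lookup sequence described above can be sketched as a toy model in Java. This is only an illustration, not how any real processor or operating system implements translation: the class name TlbSketch, the two-entry TLB, the least-recently-used eviction policy, and the flat HashMap standing in for the page table are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a TLB sitting in front of a page table (illustrative only). */
class TlbSketch {
    static final int PAGE_SHIFT = 12;                        // 4KB pages: 2^12 bytes
    static final long OFFSET_MASK = (1L << PAGE_SHIFT) - 1;  // low 12 bits = offset in page

    // Hypothetical flat page table: virtual page number -> physical frame number.
    private final Map<Long, Long> pageTable = new HashMap<>();
    // Small cache of recent translations; LRU eviction is an assumption of this sketch.
    private final Map<Long, Long> tlb;
    int hits = 0;
    int misses = 0;

    TlbSketch(final int tlbEntries) {
        // Access-ordered LinkedHashMap that drops the least recently used entry when full.
        this.tlb = new LinkedHashMap<Long, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > tlbEntries;
            }
        };
    }

    void map(long virtualPage, long physicalFrame) {
        pageTable.put(virtualPage, physicalFrame);
    }

    /** Translates a virtual address, consulting the TLB before the page table. */
    long translate(long virtualAddress) {
        long page = virtualAddress >>> PAGE_SHIFT;
        long offset = virtualAddress & OFFSET_MASK;
        Long frame = tlb.get(page);
        if (frame != null) {
            hits++;                          // fast path: mapping was cached
        } else {
            misses++;                        // slow path: stands in for the page-table walk
            frame = pageTable.get(page);
            if (frame == null) {
                throw new IllegalStateException("unmapped virtual page " + page);
            }
            tlb.put(page, frame);            // cache the mapping for next time
        }
        return (frame << PAGE_SHIFT) | offset;
    }

    public static void main(String[] args) {
        TlbSketch mmu = new TlbSketch(2);
        mmu.map(0, 7);                       // virtual page 0 lives in physical frame 7
        System.out.println(Long.toHexString(mmu.translate(0x0123))); // 7123 (miss)
        System.out.println(Long.toHexString(mmu.translate(0x0456))); // 7456 (hit)
        System.out.println("hits=" + mmu.hits + " misses=" + mmu.misses);
    }
}
```

The second access lands in the same 4KB page as the first, so it is served from the cache without touching the page table.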
The TLB's cache is important, because there are a lot of pages! In a standard 32-bit Linux, Unix, or Windows server with 4GB RAM, there would be a million 4KB small pages in the page table. That's big enough, but what about a 64-bit system with, oh, 32GB RAM? That system has 8 million 4KB memory pages.
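The page counts are simple division, and a few lines of Java confirm them. The class and method names here are hypothetical, chosen only for this sketch; the 2MB figure at the end is included for comparison with the large pages discussed in this article.

```java
/** Back-of-the-envelope page counts for a given amount of RAM (illustrative only). */
class PageCount {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    /** Number of pages needed to cover ramBytes, rounding up to whole pages. */
    static long pagesFor(long ramBytes, long pageBytes) {
        return (ramBytes + pageBytes - 1) / pageBytes;
    }

    public static void main(String[] args) {
        System.out.println(pagesFor(4 * GB, 4 * KB));   // 1048576  (about a million)
        System.out.println(pagesFor(32 * GB, 4 * KB));  // 8388608  (about 8 million)
        System.out.println(pagesFor(32 * GB, 2 * MB));  // 16384
    }
}
```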
Bridging that gap is where our large page scheme comes in.
Let's Look at the Big Ones
The small pages used by most JVMs under Linux and Windows are only 4KB in size. But if you're running a suitable 32-bit or 64-bit JVM on an Athlon 64 or Opteron processor, and have Java applications that access a lot of memory, you can change the memory page size to be 2MB. That's much, much more efficient.
Why is it better? Let's say that your application is trying to read 1MB (1024KB) of contiguous data that hasn't been accessed recently, and thus has aged out of the TLB cache. If memory pages are 4KB in size, that means you'll need to access 256 different memory pages. That means searching and missing the cache 256 times—and then having to walk the page table 256 times. Slow, slow, slow.
By contrast, if your page size is 2MB (2048KB), then the entire block of memory will only require that you search the page table once or twice—once if the 1MB area you're looking for is contained wholly in one page, and twice if it splits across a page boundary. After that, the TLB cache has everything you need. Fast, fast, fast.
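To make the 256-versus-one-or-two comparison concrete, here is a minimal sketch that counts how many distinct pages a contiguous read touches. The method name pagesTouched and the sample start addresses are assumptions for illustration. Note that the 256 figure assumes the 1MB block starts on a 4KB boundary; an unaligned block touches one page more.

```java
/** Counts the distinct pages spanned by a contiguous byte range (illustrative only). */
class PagesTouched {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    static long pagesTouched(long startAddress, long lengthBytes, long pageBytes) {
        if (lengthBytes <= 0) {
            return 0;
        }
        long firstPage = startAddress / pageBytes;
        long lastPage = (startAddress + lengthBytes - 1) / pageBytes;
        return lastPage - firstPage + 1;
    }

    public static void main(String[] args) {
        // 1MB read with 4KB pages, starting on a page boundary.
        System.out.println(pagesTouched(0, 1 * MB, 4 * KB));           // 256
        // Same read with 2MB pages, wholly inside one page.
        System.out.println(pagesTouched(0, 1 * MB, 2 * MB));           // 1
        // Same read starting 1.5MB into a 2MB page, so it straddles a boundary.
        System.out.println(pagesTouched(1536 * KB, 1 * MB, 2 * MB));   // 2
    }
}
```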
It gets better.
For small pages, the TLB mechanism contains 32 entries in the L1 cache, and 512 entries in the L2 cache. Since each entry maps 4KB, you can see that together these cover a little over 2MB of virtual memory.
For large pages, the TLB contains eight entries. Since each entry maps 2MB, the TLBs can cover 16MB of virtual memory. If your application is accessing a lot of memory, that's much more efficient. Imagine the benefits if your app is trying to read, say, 2GB of data. Wouldn't you rather it process a thousand buffed-up 2MB pages instead of half a million wimpy 4KB pages?
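The coverage figures above are just the number of entries multiplied by the page size. The sketch below reproduces that arithmetic using the entry counts quoted in this article (32 plus 512 for small pages, eight for large pages); the class and method names are hypothetical.

```java
/** Memory covered by a TLB: entries times page size (illustrative only). */
class TlbReach {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    static long reachBytes(int entries, long pageBytes) {
        return entries * pageBytes;
    }

    public static void main(String[] args) {
        long small = reachBytes(32 + 512, 4 * KB);  // small-page entries quoted above
        long large = reachBytes(8, 2 * MB);         // large-page entries quoted above
        System.out.println(small / KB + "KB");      // 2176KB, a little over 2MB
        System.out.println(large / MB + "MB");      // 16MB
    }
}
```

By this measure the eight large-page entries cover roughly seven and a half times as much memory as all 544 small-page entries combined.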
You've probably spotted the flipside of the large page system: the small page system has a TLB with 512 entries, and the large page system only has eight entries. That means if your JVM and applications are hopping all around main memory, there may be times when the smaller number of cache entries will cause excess TLB cache misses, and therefore require more access to the page table. But given that each page is much, much larger, and assuming your application has good locality, the use of large pages should be beneficial. Of course, the only way to find out whether there is any sort of performance hit from using the smaller TLB (but larger pages) is to do your own performance testing.
By the way, in order for the large page system to work, you also need to ensure that those large memory pages will be contiguous. If you launch a JVM into an operating system that's been running for a while, memory will be fragmented and you may not be able to use large pages. For this reason, the 64-bit JVMs that support large pages require that the application "pin," or reserve, all its memory up front. Therefore, you should launch the JVM as soon as possible after an operating system reboot.
Next Time: Making Large TLBs Work
The three Java Virtual Machines that can currently be configured for large memory pages are BEA's JRockit, IBM's Java SDK and JRE for Linux and Windows using Eclipse, and Sun's Java 2 Runtime Environment.
In Part 2 of this article, we'll run through the steps you go through to enable large memory pages under both Linux and Windows, discuss the specific ways you configure the BEA, IBM and Sun JVMs to use large pages, and finally present a program that will demonstrate the potential performance differences one can see with large pages.
[1] The Brontosaurus is now properly called the Apatosaurus, and it wasn't even the biggest dinosaur. But I didn't know that when I was a kid.
A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.