Applications need memory. The processor provides access to memory through memory pages; applications request memory using virtual addresses, and the processor translates that to physical addresses on-the-fly. The mapping between the physical and virtual addresses is maintained in the page table; speeding access to that large table is a cache, called the Translation Lookaside Buffer.
By default, the pages are small, only 4KB in size; on AMD's Opteron and Athlon 64 processors, the TLB caches the translations to 512 of these 4K pages. Thus, the TLB can only provide high-speed access to the mappings for a tiny fraction of the memory used in a modern server or workstation.
However, as we explained in Part 1, you can configure AMD64-based systems to use large memory pages, which are 2MB in size, and thereby provide a more efficient TLB caching system and, in many cases, faster application performance. In Part 2, we'll talk about how to use those large memory pages under 32-bit and 64-bit Windows and Linux, and configure three advanced Java Virtual Machines to take advantage of them.
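To put those TLB figures in perspective, here is a minimal sketch of the arithmetic. The 512-entry and 4KB figures come from the text above; the 2GB working set and the class and method names are assumptions made purely for illustration.

```java
// Back-of-the-envelope TLB "reach": how much memory the cached translations cover.
class TlbReach {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    // Memory covered by a TLB with the given number of entries and page size.
    static long reachBytes(int entries, long pageSizeBytes) {
        return entries * pageSizeBytes;
    }

    // How many small pages a single large page replaces.
    static long smallPagesPerLargePage(long smallPageBytes, long largePageBytes) {
        return largePageBytes / smallPageBytes;
    }

    public static void main(String[] args) {
        long reach = reachBytes(512, 4 * KB);   // figures quoted in the article
        long workingSet = 2048 * MB;            // assumed 2GB working set (illustrative)
        System.out.println("TLB reach with 4KB pages: " + reach / MB + " MB");
        System.out.println("Share of a 2GB working set: "
                + (100.0 * reach / workingSet) + "%");
        System.out.println("4KB pages replaced by one 2MB page: "
                + smallPagesPerLargePage(4 * KB, 2 * MB));
    }
}
```

In other words, 512 small-page entries cover only 2MB, which is the same amount of memory that a single large-page entry maps.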
Setting Up Linux and Windows
Before we get started, note that configuring a JVM for large pages will require the operating system to reserve, or pin, large blocks of contiguous memory. In other words, for each 2MB large page you request, Linux or Windows must find 2MB of contiguous memory, which is then pinned so that it will not be paged out.
Therefore, it's best to set up the JVM when your PC or server has just been rebooted; that way, big blocks of contiguous memory are available to be reserved. If you wait until the machine has been running for some time, memory will become fragmented, and the operating system may not be able to find as many 2MB contiguous blocks as you require, thereby giving your JVM less memory and somewhat reducing the effectiveness of the techniques.
The procedures for setting up Linux and Windows are quite different.
Linux : To set up Linux for large pages, log in as root and follow two steps: first, make sure your kernel supports large pages; then, allocate those pages.
To check that your kernel supports large pages, go to the shell and issue the command:
cat /proc/meminfo
If the output has lines that say "HugePages_Total," "HugePages_Free" and "Hugepagesize," then you're in business. Also, /proc/filesystems should show a file system of type "hugetlbfs," which means "huge translation lookaside buffer file system." (Remember, Linux uses the word "huge" instead of "large.")
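If you'd rather check from Java than eyeball the output, the following is a minimal sketch that pulls the "Huge..." entries out of /proc/meminfo. The class and method names are hypothetical, and it assumes the usual "Key: value" layout of that file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HugePageInfo {
    // Collects entries whose key starts with "Huge" from meminfo-style lines.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> result = new LinkedHashMap<String, Long>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0 || !line.startsWith("Huge")) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            // The value is the first whitespace-separated token after the colon.
            String[] tokens = line.substring(colon + 1).trim().split("\\s+");
            try {
                result.put(key, Long.valueOf(tokens[0]));
            } catch (NumberFormatException e) {
                // Not a numeric entry; ignore it.
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path meminfo = Paths.get("/proc/meminfo");
        if (!Files.isReadable(meminfo)) {
            System.out.println("No readable /proc/meminfo here (not Linux?)");
            return;
        }
        Map<String, Long> info = parse(Files.readAllLines(meminfo));
        if (info.containsKey("HugePages_Total")) {
            System.out.println("Large page support found: " + info);
        } else {
            System.out.println("No HugePages_Total entry; kernel may lack large page support");
        }
    }
}
```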
Once you've determined that your system supports large pages, you'll have to allocate them. Think about how much memory your JVM will need, because once you've allocated those pages, they're not available for any other purpose until you reboot, unless you deallocate them using:
echo 0 > /proc/sys/vm/nr_hugepages
Remember, each page is 2MB in size, so if your application will require about 2GB of memory, you will need to allocate about 1000 large pages (1024, to be exact, for a full 2GB; the examples here use the round figure). From the shell, issue the command:

Figure 1.
echo 1000 > /proc/sys/vm/nr_hugepages
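A quick way to size the pool is to divide the memory you need by the 2MB page size and round up. The sketch below does just that; the class and method names are illustrative. Note that 1000 pages hold 2000MB, slightly short of a binary 2GB (2048MB), which takes 1024 pages.

```java
class HugePageCount {
    static final long MB = 1024L * 1024L;

    // Number of pages needed to hold the given number of bytes, rounded up.
    static long pagesNeeded(long bytes, long pageSizeBytes) {
        return (bytes + pageSizeBytes - 1) / pageSizeBytes;
    }

    public static void main(String[] args) {
        long pageSize = 2 * MB;
        System.out.println("Pages for 2048MB: " + pagesNeeded(2048 * MB, pageSize));
        System.out.println("Memory in 1000 pages: " + (1000 * pageSize) / MB + " MB");
    }
}
```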
Now, you may not get all those pages. To see how many large pages were actually allocated, use:
cat /proc/sys/vm/nr_hugepages
Windows : What about Windows XP or Windows Server 2003? The process is, as you'd expect, entirely different. The first step is to authorize a specific user to be able to lock pages in memory. This is a one-time configuration change.
Make sure you're logged in with administrative privileges, and go to Start > Control Panel > Administrative Tools > Local Security Policy > Local Policies > User Rights Assignment, and then select Lock pages in memory (see Figure 1).
From that page, click on "Add User or Group" (see Figure 2), and put in your appropriate authorized user or admin account (see Figure 3). Then close all this out and logout/login or reboot to free up memory. Now, applications running as this user can access large pages merely by reserving the number of pages that they need.
Wasn't that easy? Remember, this is different from Linux, where you'll need to reallocate those pages each time you reboot. With Windows, this is a one-time change.
Note that on Windows, you can't reserve pages in advance the way you can with Linux. For this reason, the best policy is to start your JVM as soon as you can after rebooting.

Figure 2.
Setting Up Java Virtual Machines
To the best of my knowledge, the only JVMs that currently support large pages under Linux or Windows are from BEA, IBM, and Sun. If you know of others, please let me know! Here are the directions for setting them up.
Before we dive in, here's an important note: a JVM can't mix large pages and small pages. Even if you provide the appropriate "large page" option, if the JVM can't allocate the whole thing using large 2MB pages, it will revert to using small 4KB pages. Thus, you should explicitly allocate the maximum heap size using the -Xmx option, and make sure it's a multiple of 2MB, rather than leave it unbounded.
(The -Xmx flag specifies the maximum heap size; the -Xms flag sets the minimum. If you want to lock in a specific heap size, set -Xms and -Xmx to the same value. So, for example, for the 2GB heap size used above, you would put -Xms2G -Xmx2G on the JVM command line.)
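As a rough sketch of that rule, the helper below rounds a requested heap size up to the next multiple of 2MB and builds a matching pair of -Xms and -Xmx options. The class and method names are hypothetical; only the two flags themselves come from the text.

```java
class HeapFlags {
    static final long MB = 1024L * 1024L;
    static final long LARGE_PAGE = 2 * MB;

    // Rounds a requested heap size up to the next multiple of the large-page size.
    static long roundUpToLargePage(long heapBytes) {
        long pages = (heapBytes + LARGE_PAGE - 1) / LARGE_PAGE;
        return pages * LARGE_PAGE;
    }

    // Builds matching -Xms/-Xmx options (in megabytes) so the heap size is fixed.
    static String fixedHeapOptions(long heapBytes) {
        long mb = roundUpToLargePage(heapBytes) / MB;
        return "-Xms" + mb + "m -Xmx" + mb + "m";
    }

    public static void main(String[] args) {
        System.out.println(fixedHeapOptions(2048 * MB)); // a 2GB heap
        System.out.println(fixedHeapOptions(63 * MB));   // an odd size, rounded up
    }
}
```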
BEA JRockit : You set up the BEA Java Virtual Machine by using the switch -XXlargePages. You can learn more about this in BEA's documentation.
If you're running Windows, this is the only step you have to do (assuming that you made the "Lock pages in memory" policy change earlier). If you're running Linux, your root account will also need to create a mount point for mapping the large pages, and assign permissions to that mount point. This is also a one-time change:
mkdir /mnt/hugepages
mount -t hugetlbfs nodev /mnt/hugepages
chmod 777 /mnt/hugepages

Figure 3.
IBM's SDK and JRE : The IBM solution for Linux has a different mechanism for enabling large pages: the -Xlp switch. Before you start the JVM, however, you'll need to log in as root and change the SHMMAX value; that value defines the maximum size (in bytes) for a shared memory segment. You should set it to be the number of bytes (less one) you'll want in your shared memory. Using our example earlier, we wanted 1000 2MB large pages to provide 2GB of memory. The size for SHMMAX should thus be:
2 x 1024 x 1024 x 1024 - 1 = 2,147,483,648 - 1 = 2,147,483,647
To set that value:
echo 2147483647 > /proc/sys/kernel/shmmax
Do this before you run the page-allocation step described earlier:
echo 1000 > /proc/sys/vm/nr_hugepages
Then use the -Xlp switch to launch the JVM.
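The SHMMAX arithmetic above is easy to get wrong in code, because 2 x 1024 x 1024 x 1024 no longer fits in a 32-bit int. Here is a minimal sketch using long arithmetic; the class and method names are illustrative, and the "size less one" rule is simply the one stated above.

```java
class ShmmaxCalc {
    static final long MB = 1024L * 1024L;

    // SHMMAX value for a shared memory area of the given size: the size less one byte.
    static long shmmaxFor(long segmentBytes) {
        return segmentBytes - 1;
    }

    public static void main(String[] args) {
        long twoGb = 2L * 1024 * 1024 * 1024;     // long literal avoids int overflow
        System.out.println("SHMMAX for 2GB: " + shmmaxFor(twoGb));
        long thousandPages = 1000 * 2 * MB;       // exactly 1000 pages of 2MB each
        System.out.println("SHMMAX for 1000 pages: " + shmmaxFor(thousandPages));
    }
}
```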
If you're using the IBM SDK and JRE for Windows using Eclipse, just use the -Xlp parameter on JVM startup.
Sun's HotSpot JRE for Linux and Windows : The process for using the Sun JVM is similar to the IBM one, except that you use the flag -XX:+UseLargePages. Use this flag for both Linux and Windows.
With Linux, as with the IBM process above, you'll need to set the SHMMAX capacity. Follow the procedure described above.
YMMV
When it comes to any performance-tuning technique, it's important to test to see that the results pay off for your specific application, usage patterns, hardware, operating system, etc. Thus, you should benchmark your own critical applications using both large and small pages, as well as, possibly, different amounts of allocated space under Linux.
However, it doesn't take a full enterprise app to begin experimenting with the technique. To that end, Listing 1 contains a sample app, LinkListTest, that walks a large linked list repeatedly over ten passes, treats the first two passes as warm-up, and prints the elapsed time for each remaining pass along with the average. Each element in the linked list spans 4K of memory by default. Because the list spans a lot of memory, the code exercises the translation lookaside buffer.
Listing 1.
import java.util.Iterator;
import java.util.LinkedList;

class LinkListTest {
    public static void main(String[] args) {
        int arysize = 4096; // default element size, in bytes
        if (args.length != 0) {
            arysize = Integer.parseInt(args[0]);
        }
        LinkListTestbench testbench = new LinkListTestbench(arysize);
        testbench.run();
    }
}

class LinkListTestbench {
    static final int MB = 1 << 20;
    static final int numPasses = 10;
    int dummy;          // accumulates values read from the list
    int arySize;        // size of each list element, in bytes
    int numElems;       // number of elements in the list
    int iterations;     // list traversals per timed pass
    LinkedList myList;  // raw type, so the code compiles with -source 1.4

    public LinkListTestbench(int size) {
        arySize = size;
        // pick the number of elements so we fit in 75% of the heap;
        // divide before narrowing to int so large heaps don't overflow
        long maxmem = Runtime.getRuntime().totalMemory();
        numElems = (int) ((maxmem * 0.75) / arySize);
        System.out.println("List Elements: " + numElems + ", Array size " + arySize
                + ", uses " + showMB((long) numElems * arySize));
        // pick the iterations so we do a measurable amount of work (at least one)
        iterations = Math.max(1, 20000000 / numElems);
        System.out.println(iterations + " iterations per pass");
    }

    public void run() {
        System.out.println("\nPass\tTime");
        System.out.println("...Skipping First 2 Passes for Warmup...");
        setupList(arySize);
        long runTotal = 0;
        int runsUsed = 0;
        for (int pass = 1; pass < numPasses + 1; pass++) {
            long startTime = System.currentTimeMillis();
            accessList(iterations);
            long elapsedTime = System.currentTimeMillis() - startTime;
            if (pass > 2) { // ignore the first 2 passes (warmup)
                runTotal += elapsedTime;
                runsUsed++;
                System.out.println(pass + "\t" + elapsedTime + "ms");
            }
            // check the calculated value from accessList to avoid compiler optimizing it away
            if (dummy != 0) {
                System.out.println(" Unexpected Value, exiting");
                System.exit(1);
            }
        }
        System.out.println("Average runtime pass 3 thru " + numPasses + " in ms: "
                + runTotal / runsUsed);
    }

    // build a list of numElems byte arrays, each arraySize bytes long
    private void setupList(int arraySize) {
        myList = new LinkedList();
        for (int i = 0; i < numElems; i++) {
            byte a[] = new byte[arraySize];
            myList.addLast(a);
        }
    }

    // walk the whole list the given number of times, touching one byte per element
    private void accessList(int iterations) {
        for (int i = 0; i < iterations; i++) {
            Iterator it = myList.iterator();
            while (it.hasNext()) {
                byte a[] = (byte[]) it.next();
                dummy += a[1];
            }
        }
    }

    String showMB(long siz) {
        // return string representing # of MB rounded to nearest integer
        return (int) ((double) siz / MB + 0.5) + " MB";
    }
}
To compile the code, use BEA's JRockit 5.0 or later, IBM's JRE 1.4.2 or later, or Sun's 5.0 Update 5 or later. Be sure the JDK bin directory is in your path, since you'll need javac and java. Save the listing as LinkListTest.java and compile with javac -source 1.4 LinkListTest.java. You can then use different command-line arguments to launch JVMs.

Figure 4.
For this test, we'll take advantage of the fact that the test application looks at the heap size and uses three-quarters of it to create the linked-list elements. To see the benefit of large pages, let's use a heap of 64MB.
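To see why a 64MB heap is enough to stress the TLB, here is a small sketch that reproduces the sizing arithmetic from Listing 1. It assumes the JVM reports the full 64MB as total memory and ignores object headers and list nodes, so real numbers will differ slightly; the class and method names are illustrative.

```java
class BenchmarkSizing {
    static final long MB = 1L << 20;

    // Same sizing rule as Listing 1: fill 75% of the heap with arrays of arySize bytes.
    static int numElems(long heapBytes, int arySize) {
        return (int) ((heapBytes * 0.75) / arySize);
    }

    // Same iteration rule as Listing 1.
    static int iterations(int numElems) {
        return 20000000 / numElems;
    }

    // Pages of the given size needed to cover a number of bytes, rounded up.
    static long pagesFor(long bytes, long pageSize) {
        return (bytes + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        long heap = 64 * MB;                 // assumed: totalMemory() reports all 64MB
        int elems = numElems(heap, 4096);
        long data = (long) elems * 4096;
        System.out.println("Elements: " + elems + ", iterations: " + iterations(elems));
        System.out.println("4KB pages touched: " + pagesFor(data, 4096));
        System.out.println("2MB pages touched: " + pagesFor(data, 2 * MB));
    }
}
```

Under those assumptions the list data spans far more 4KB pages than the 512 translations the TLB can cache, but only a couple of dozen 2MB pages.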
To run the application with small pages, start the JVM and launch the application with:
java -Xms64m -Xmx64m LinkListTest
Now, run it again using large pages. Here are the startup commands for our three JVMs:
BEA: java -Xms64m -Xmx64m -XXlargePages LinkListTest
IBM: java -Xms64m -Xmx64m -Xlp LinkListTest
Sun: java -Xms64m -Xmx64m -XX:+UseLargePages LinkListTest
Try running the code with small and large pages, and with different heap sizes. The results should be strikingly different! (If you get the exact same results, or close enough, then something didn't configure properly; the application may still be using small pages, even if you want otherwise. In that case, please consult the JVM maker's support section or documentation; unfortunately, I don't have the bandwidth or resources to help troubleshoot.)
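As a first troubleshooting step on a HotSpot JVM, you can ask the JVM what it thinks the UseLargePages setting is. The sketch below is only a hint, not proof that large pages are in use: it reports the flag's current value and nothing more. It assumes a HotSpot-based JVM at Java 7 or later (newer than the releases discussed here), relies on the HotSpot-specific com.sun.management API, and uses a hypothetical class name; on other JVMs it reports "unknown" or may not compile at all.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class LargePageFlagCheck {
    // Returns "true" or "false" as reported by HotSpot, or "unknown" if unavailable.
    static String useLargePagesSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            return bean.getVMOption("UseLargePages").getValue();
        } catch (RuntimeException e) {
            // Not a HotSpot JVM, or the option is not recognized.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseLargePages = " + useLargePagesSetting());
    }
}
```

Run it with the same flags you use for the benchmark, for example java -XX:+UseLargePages LargePageFlagCheck, and compare the result with a run that omits the flag.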
Let's conclude with an exercise for the reader: When you're running the tests, either with the sample program or with your own applications, use AMD's CodeAnalyst performance analyzer to watch for specific processor events. The event to watch for would be 0x46, "L1 and L2 DTLB Miss" (see Figure 4). This event fires whenever the processor requests a page that's not cached in the TLB. The fewer of those cache misses, the better!
A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.