Azeem Jiva  4/1/2009 
Introduction

The number of flags that you can set on the Sun Java Virtual Machine (JVM) is astounding.  Most of them are garbage collection (GC)-related and have many dependencies, not only on the system you are using but also on each other.  Tuning GC flags seems like it should be difficult and error-prone, but following a few easy steps can help improve application performance in only a few minutes.  Before you get started, you’ll need to determine what sort of application you are trying to tune.  Is your application throughput-bound, needing to perform as many transactions as possible in the shortest time?  Or is your application latency-bound, requiring each transaction to finish within a specified time?  If your application is somewhere in between, then you’ll have to choose which requirement is more important.  The good news is that engineers at Sun (and elsewhere) are working hard to create a GC algorithm that can mostly meet latency requirements while providing good throughput.  In the meantime, there are some simple actions you can take to improve your application’s performance.  I’ll be using flags from Sun’s HotSpot JVM in my examples, but both Oracle’s JRockit and IBM’s J9 have similar flags (see the documentation on those companies’ sites).

Flags for Throughput Applications

Let’s start with the most common applications first – throughput applications.  Throughput-bound applications are typically not sensitive to GC pause times and are better suited to a fast collector in which the JVM pauses Java execution while it collects.  Since the Java application is paused, the garbage collector threads are free to iterate over the allocated objects quickly without fear that the objects will change behind them, which allows multiple threads to work on the collection at once.  While there is no hard limit on how many threads you can devote to collection, you’ll likely run into bandwidth issues with more than 16 threads.

Since we’re interested in maximum throughput at the expense of pause times, we’re going to enable the parallel garbage collector using -XX:+UseParallelOldGC.  This flag enables a parallel version of both the young- and old-generation collectors.  You can also try -XX:+UseParallelGC, which enables a parallel collector only for the young generation and keeps a serial collector for the old generation.  A serial collector is a single-threaded collector, while the parallel collector that we are enabling can use multiple threads to collect garbage.  Using -XX:+UseParallelOldGC is preferred on newer Java releases, although if you have a sufficiently old release you might be limited to -XX:+UseParallelGC.

By default, the number of GC threads is some subset of the number of processors, and the exact value changes depending on the version of Java.  We can change the number of threads to a reasonable value based on the number of cores available on your system and, more importantly, the number of cores you want to dedicate to this application.  In this example, I’m going to assume the system has two processors and eight cores.  Since the system is exclusively dedicated to the application I want to run, I’m also going to devote all eight cores to GC.  You should experiment with the number of threads based on your application and system.  If the JVM does not have the entire system to itself, it might be better for the system as a whole to decrease the number of threads to something more manageable.  Experiment as needed, but realize that going beyond 16 threads may hit a point of diminishing returns, where more threads will not necessarily get you faster collections.  To modify the number of threads, use the flag -XX:ParallelGCThreads=#, where # is the number of threads (in our case, eight).
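Putting those two flags together, a launch command might look something like the following.  Here MyApp.jar is just a placeholder for your own application, and the thread count assumes the eight-core system described above:

java -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -jar MyApp.jar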

Once you’ve enabled those flags, test the application and see how much performance you’ve gained.  Ideally, your application should now run faster and have shorter GC pause times.  One thing to realize is that the larger the heap for your application (especially if you’re running a 64-bit application with a multi-gigabyte heap), the more potential gain you can expect.

Here’s a quick example to show the effect of simple flag tuning.  I used the DaCapo Benchmark 2006-10-MR2 (https://dacapobench.org/) as a test case – specifically, the Xalan sub-benchmark.  My system is a two-processor, eight-core AMD Opteron™ 8389 machine running at 2.9 GHz.  The JVM is Sun’s Java 6u11 running in 32-bit mode on Windows Server® 2008 Enterprise.  I ran Xalan from DaCapo with the following flags:

java -jar dacapo-2006-10-MR2.jar -n 2 -s large xalan

The benchmark ran a warm-up in 64,178 msec and a testing run in 59,561 msec.  Not bad, actually: about a minute to run through the benchmark.  Let’s take what we’ve learned and attempt to improve that.  I reran the benchmark, this time adding the following flags:

-XX:+UseParallelOldGC -XX:ParallelGCThreads=8

The benchmark ran the warm-up in 22,807 msec and the testing run in 22,012 msec, an improvement of around 2.7 times from adding just two flags!

Flags for Latency Applications

What about applications that are latency-driven rather than throughput-driven?  There is another class of applications that requires a guarantee about how long GC pauses will take.  These applications are popular in the financial sector and related fields, although they are not limited to those areas.  Such applications require that the JVM never pause longer than a certain maximum, ensuring that individual transactions finish within a set time; this is usually less than 10 msec, but each application is different and has different pause-time requirements.  For those applications, we need to enable the low-pause, or concurrent, collector.  This collection algorithm can run concurrently with the Java application.  Since the GC threads run alongside the application, there is quite a bit of contention and checking required, which limits throughput.  The advantage, though, is that because the garbage collector can complete parts of the collection while the application is running, there are only short pauses, which you can control based on the application’s requirements.

To enable the concurrent collector, we use the flag -XX:+UseConcMarkSweepGC.  If your system has only one or two cores, you should rethink using the concurrent collector, since with so few cores the collector has to stop the application anyway.  If you still want to use the concurrent collector in that case, add the flag -XX:+CMSIncrementalMode, which lets the concurrent collector perform the different collection phases incrementally and improves performance on systems with a limited number of cores.
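As a starting point, a launch command for a latency-sensitive application might look like the following.  MyApp.jar is only a placeholder, and -XX:+CMSIncrementalMode should be added only on machines with very few cores:

java -XX:+UseConcMarkSweepGC -jar MyApp.jar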

One issue to consider with the concurrent collector is that it can slowly fragment the heap, because the heap is not compacted on every collection.  Instead, only when the allocator can’t find a section of the heap large enough to hold a new object does the collector attempt to compact.  This can create long pauses when the collector needs to compact and free space in the heap.  Since the compaction can happen at an unpredictable time, your application’s pause-time requirements might not be met.

The goal for optimal concurrent collection is to keep the number of full collections as low as possible.  Several flags can help.  The concurrent collector has a safety factor that is added to the duty cycle when deciding how to pace the collection.  The default is 10%, but it can be modified with -XX:CMSIncrementalSafetyFactor=#, where # is a number between 0 and 100.  Increasing this number adds a larger safety margin to the collector’s minor collection time; a larger value means that you are willing to give more time to a minor collection.  If that doesn’t help, you’ll have to increase the minimum duty cycle itself.  The flag to do that is -XX:CMSIncrementalDutyCycleMin=#, where # is also a number between 0 and 100.  The higher the number, the more minimum time you are willing to give to a minor collection.
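For example, a more aggressive incremental configuration might look like the following.  MyApp.jar is a placeholder, and the values are only illustrative starting points rather than recommendations for any particular workload; measure against your own pause-time requirements and adjust:

java -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSIncrementalSafetyFactor=20 -XX:CMSIncrementalDutyCycleMin=10 -jar MyApp.jar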

Rerun your application and see whether its response times have improved.  In most cases, while response times improve, throughput drops; whether that tradeoff is acceptable is a decision only you can make.

Heap Settings

At this point, your application might be running OK, and you’re getting some decent performance improvements.  But it feels like you could do more, if only you had a few more flags that you could tune.  Of course, there is one piece of the puzzle that’s missing: the number of times your application performs a collection!  If you could reduce the number of collections as much as possible, but balance that with the time required to perform a collection, your application performance would improve.  One way to do that would be to increase the total memory or heap size available to your application.

The easiest way to set up your heap is to set a maximum allowable size and let the JVM handle everything else.  The -Xmx flag sets the maximum heap size; for instance, -Xmx1024m and -Xmx1g both set the maximum heap size to 1 gigabyte.  Just realize that you don’t want to set the maximum heap larger than the memory available in your system; otherwise, performance will suffer because the OS will page the JVM process out to disk to satisfy the heap’s memory allocation.  One tip is to reduce the number of times the JVM has to resize the heap.  If you set only the maximum heap size, the JVM starts the heap small and grows it toward the maximum.  On the other hand, if you know you’ll need a significant portion of the heap anyway, you can set the minimum size as well.  With the -Xms flag, you can set the minimum heap size to the same value as the maximum to guarantee that the JVM won’t waste time growing the heap.  For example, -Xms1g -Xmx1g gives you a 1-gigabyte heap at application start.
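Combining the heap settings with the throughput collector from earlier, a complete command line might look something like this.  MyApp.jar and the 1-gigabyte heap are placeholders; size the heap to fit comfortably within your system’s physical memory:

java -Xms1g -Xmx1g -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -jar MyApp.jar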

Nursery Tuning

While that’s a good start, you can probably do better.  The JVM splits the heap into multiple sections.  One section is the nursery, or young generation, in which all allocations happen.  The other section is the tenured space, or old generation, into which long-lived objects are promoted from the nursery.  You can tune the sizes of these spaces for better performance, depending on how your application allocates objects.  The -Xmn flag sets the nursery size; the remaining heap is then allocated to the tenured space.
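For instance, on a 1-gigabyte heap you could reserve a quarter of it for the nursery like this.  MyApp.jar is a placeholder and the 256-megabyte nursery is just an illustrative value:

java -Xms1g -Xmx1g -Xmn256m -jar MyApp.jar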

This tuning of the nursery can be an important source of performance improvements, because a nursery collection is significantly faster than a full collection.  And, if your application creates many short-lived objects, then you would be better off with a larger nursery relative to the tenured space.  If only you could somehow measure how much nursery your application uses, you could tune it better.

The JVM has flags to help there as well.  The flag -verbose:gc prints a line for each nursery collection and each full GC, showing how much was collected and how much of the heap is free.  Adding -XX:+PrintGCDetails increases the level of detail.
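For example, you might launch your application like this (MyApp.jar is again just a placeholder name):

java -verbose:gc -XX:+PrintGCDetails -jar MyApp.jar

With those flags enabled, output similar to the following appears on the console: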

Minor Collection
[GC [DefNew: 910K->12K(960K), 0.0003947 secs] 1239K->341K(5056K), 0.0005048 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [DefNew: 908K->13K(960K), 0.0004003 secs] 1237K->341K(5056K), 0.0005126 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [DefNew: 909K->14K(960K), 0.0005241 secs] 1237K->342K(5056K), 0.0006367 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
Full Collection
[Full GC [Tenured: 3485K->4095K(4096K), 0.1745373 secs] 61244K->7418K(63104K), [Perm : 10756K->10756K(12288K)], 0.1762129 secs] [Times: user=0.19 sys=0.00, real=0.19 secs]

What does all of that mean?  The first three lines are young-generation, or minor, collections.  These are usually quick and happen quite frequently.  The young generation is made up of the nursery (shown as eden in the heap summary below), a From space, and a To space.  Allocations happen in the nursery, so applications that allocate heavily need a large nursery for good performance.  In this case, the nursery is 960k, with around 909k worth of data before the collection and around 13k after it.  The second set of numbers on each line describes the entire heap (young generation plus tenured space): it is 5056k in total, with around 1237k worth of data before the collection and about 341k after.  The last line shows a full collection.  The tenured space is 4096k and, in this case, objects were promoted into tenured space but none were collected.  Here you might have a different problem, in which the tenured space is too small and should be increased to get better performance.

You might have noticed that with -XX:+PrintGCDetails you’ll get a bit of output when your application ends that looks similar to:

Heap
def new generation   total 960K, used 16K [0x22990000, 0x22a90000, 0x22e70000)
eden space 896K,   0% used [0x22990000, 0x22990810, 0x22a70000)
from space 64K,  22% used [0x22a70000, 0x22a738b0, 0x22a80000)
to   space 64K,   0% used [0x22a80000, 0x22a80000, 0x22a90000)
tenured generation   total 4096K, used 328K [0x22e70000, 0x23270000, 0x26990000)
the space 4096K,   8% used [0x22e70000, 0x22ec2328, 0x22ec2400, 0x23270000)
compacting perm gen  total 12288K, used 152K [0x26990000, 0x27590000, 0x2a990000)
the space 12288K,   1% used [0x26990000, 0x269b6390, 0x269b6400, 0x27590000)
ro space 8192K,  63% used [0x2a990000, 0x2aea3ae8, 0x2aea3c00, 0x2b190000)
rw space 12288K,  53% used [0x2b190000, 0x2b7f83f8, 0x2b7f8400, 0x2bd90000)

This shows you the sizes of the various parts of the heap, how large they are, and the amount used at the time the application quit.  It’s a good snapshot of the application’s heap requirements at the end of the run and can help you tune your application.

For example, if the nursery usage is close to the total size of the nursery, you should increase the nursery size using the -Xmn flag.  The fewer collections the JVM has to perform, the faster your application runs.  Realize that a full collection is slower than a minor collection, so you should size your new generation as large as possible to let objects die out there.  If your nursery fills up quickly and objects get promoted to the tenured generation, you’ll get full collections.  By modifying the nursery (-Xmn) size so that the nursery is a larger percentage of the total heap, you’ll decrease the number of full collections, because objects will die out in the nursery instead.
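As an illustration, if a 1-gigabyte heap shows frequent full collections because the young generation fills quickly, you might try growing the nursery to half the heap and measuring again.  MyApp.jar and the specific sizes are placeholders; the right proportion depends entirely on your application’s allocation pattern:

java -Xms1g -Xmx1g -Xmn512m -jar MyApp.jar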

Conclusion

I hope that these tips have helped you with your Java application tuning, and that you’ve learned a little about how to get better performance out of your application.  By tuning a few flags, you should find that your application runs faster, pauses less often, and generally feels more responsive.  Remember that, as with anything, your mileage may vary; much of this depends on your application, system, and other factors.  These suggestions are meant as a general guideline for improved performance.

Supporting Data

C:\azeem>jre6\bin\java -jar dacapo-2006-10-MR2.jar -n 2 -s large xalan
===== DaCapo xalan starting warmup =====
Normal completion.
===== DaCapo xalan completed warmup in 64178 msec =====
===== DaCapo xalan starting =====
Normal completion.
===== DaCapo xalan PASSED in 59561 msec =====

C:\azeem>jre6\bin\java -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -jar dacapo
-2006-10-MR2.jar -n 2 -s large xalan
===== DaCapo xalan starting warmup =====
Normal completion.
===== DaCapo xalan completed warmup in 22807 msec =====
===== DaCapo xalan starting =====
Normal completion.
===== DaCapo xalan PASSED in 22012 msec =====