What I Learned at JavaOne - Part II, Performance Tuning


This post is the second in a series.

While I have worked professionally in security for about three years and feel that I now know a thing or two, I know very little in depth about performance tuning, so the sessions I attended were typically chock-full of new information for me.

JVM Bytecode for Dummies, by Charles Nutter

I was really excited to go to a session by Charles Nutter, one of the lead developers on JRuby, especially one that talked about something so low-level as JVM bytecode, which he certainly knows a lot about by now.

First, there are some n00b things that I somehow hadn't gleaned over the last 13 years of coding in Java.  Don't judge me.

Opcodes.  There are over 200 opcodes currently supported by the JVM.  Wow! Years ago, I was tasked with writing some FindBugs rules where I had to learn about ldc, ifeq, ifne, and several others, but I really had no idea that there were so many.  Apparently, they are one byte long, which means that there are only about 50 possible opcodes left.

CLR.  The CLR, .NET's virtual machine, does not interpret code at runtime (it compiles methods up front), which has the interesting consequence that it can't use runtime profiling to optimize out things like null checks.

Finally.  There are only two opcodes that have ever been deprecated:  jsr and ret.  Due to the deprecation of these opcodes, the contents of finally blocks are replicated at every possible exit point.  This could have footprint and optimization implications should your finally blocks be big (hopefully they are not; I don't recall ever having a finally block of more than a half-dozen cleanly-spaced lines).
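
As a quick illustration of my own (not from the talk; the class and helper names are made up, and the javap output in the comment is approximate), the duplicated finally body is easy to spot in the bytecode:

    class FinallyDemo {
        static int compute() { return 42; }
        static void cleanup() { /* release a resource */ }

        static int withCleanup() {
            try {
                return compute();
            } finally {
                cleanup();
            }
        }
    }

    // Roughly what javap -c prints for withCleanup(): the cleanup() call is
    // emitted twice, once on the normal return path and once on the
    // exceptional path that rethrows (athrow).
    //
    //    0: invokestatic  compute:()I
    //    3: istore_0
    //    4: invokestatic  cleanup:()V
    //    7: iload_0
    //    8: ireturn
    //    9: astore_1
    //   10: invokestatic  cleanup:()V
    //   13: aload_1
    //   14: athrow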

Double vs. AtomicLong.  Because JVM stack elements are 32 bits each, doubles take up two adjacent stack elements.  This means that a write like double d = 32d; is not guaranteed to be atomic.  So, use AtomicLong instead if you are in a highly concurrent app.
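
A minimal sketch of that workaround (my own example, not from the talk; the class name is hypothetical): store the double's bit pattern in an AtomicLong so that reads and writes are atomic 64-bit operations.

    import java.util.concurrent.atomic.AtomicLong;

    // Wrap a double's bits in an AtomicLong so concurrent readers never see a
    // torn value made of two different 32-bit halves.
    class AtomicDouble {
        private final AtomicLong bits = new AtomicLong(Double.doubleToLongBits(0d));

        void set(double value) {
            bits.set(Double.doubleToLongBits(value));   // atomic 64-bit write
        }

        double get() {
            return Double.longBitsToDouble(bits.get()); // atomic 64-bit read
        }
    }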

So, Nutter mostly talked about three things.  The first was a very quick overview of the most common opcodes and how a stack-based interpreter works.  This was an interesting review, though I was already familiar with the majority of them due to my FindBugs efforts (note:  I'm sure I don't have the deep understanding of them that language designers do as most of my work was running javap -c and representing the generated bytecode with the FindBugs API).  
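
To make the stack-machine idea concrete, here is a tiny example of my own (not from the session).  Compiling the method below and running javap -c on it produces roughly the bytecode in the comment: operands get pushed onto the operand stack, and iadd pops two of them and pushes their sum.

    class AddDemo {
        int add(int a, int b) {
            return a + b;
        }
    }

    // Approximate javap -c output for add(int, int):
    //
    //    0: iload_1     // push local variable 1 (a) onto the operand stack
    //    1: iload_2     // push local variable 2 (b)
    //    2: iadd        // pop both, push their sum
    //    3: ireturn     // pop the sum and return it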

The second was interlaced with the opcode overview, which was his demonstration of the bytecode DSL BiteScript.  This looked really interesting as kind of a bytecode++.  It looked a lot like bytecode (format and commands), but there were several enhancements like commands such as iftrue and iffalse instead of the ever-confusing ifeq and ifne.  It also allowed for macro definitions to make certain things like static invocations less wordy.  He also briefly mentioned ASM, Groovy and Scala support, and JiteScript.  I haven't had occasion to try any of these yet, but they definitely appear to be worth a shot.

The third was his rationale for writing bytecode in the first place.  First, generating bytecode can be a very good performance enhancement for heavily reflective code, since it makes reflection-less code possible.  Also, bytecode-generated data objects, like what Hibernate does, end up more efficient and cleaner at the bytecode level.  Another example is the ability to support language features that the JVM offers but that aren't yet available in Java (shameless plug for JRuby).
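
Here is a rough sketch of what the reflection-less approach can look like with ASM (my own example, not Nutter's code; the class and method names are made up).  Instead of reading a value reflectively at runtime, you emit a tiny class whose method does the work directly:

    import org.objectweb.asm.ClassWriter;
    import org.objectweb.asm.MethodVisitor;
    import static org.objectweb.asm.Opcodes.*;

    // Generates bytecode equivalent to:
    //   public class GeneratedGreeter { public String greeting() { return "hello"; } }
    // so callers can invoke greeting() directly instead of going through reflection.
    class GreeterGenerator {
        static byte[] generate() {
            ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
            cw.visit(V1_7, ACC_PUBLIC, "GeneratedGreeter", null, "java/lang/Object", null);

            // default constructor: just call super()
            MethodVisitor ctor = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
            ctor.visitCode();
            ctor.visitVarInsn(ALOAD, 0);
            ctor.visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V", false);
            ctor.visitInsn(RETURN);
            ctor.visitMaxs(1, 1);
            ctor.visitEnd();

            // public String greeting() { return "hello"; }
            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "greeting", "()Ljava/lang/String;", null, null);
            mv.visitCode();
            mv.visitLdcInsn("hello");
            mv.visitInsn(ARETURN);
            mv.visitMaxs(1, 1);
            mv.visitEnd();

            cw.visitEnd();
            return cw.toByteArray();
        }
    }

The generated bytes would then be handed to a ClassLoader (for example via defineClass) and instantiated once, after which every call to greeting() is a plain, inlinable method call rather than a reflective one.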

In the end, I was happy, but somewhat bummed because I thought there would be more about performance tuning.  I was elated, though, to find out that there was a companion session the following day that would address it in more detail.  Woohoo!

JVM JIT For Dummies, by Charles Nutter

Nutter, again, offered a wealth of deep understanding about how the JVM works.  Beforehand, I was familiar with the basics of profiling and optimizations, but almost everything was brand-new for me.

There are a number of things the compiler will do to unpack syntactic sugar and other high-level expressiveness where it can from a static-analysis standpoint.  A common case is loop unrolling: if it is statically apparent that the loop will always run a fixed number of times, and the source of the index is resolvable statically (like int i = 0), then the compiler can just remove the for loop and replace it with inline statements (see the sketch below).
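
A tiny sketch of my own to make that concrete: because the starting index and the bound below are compile-time constants, the loop control can be removed entirely.

    // Before: the bound (4) and the starting index are statically known.
    static int sum(int[] values) {
        int total = 0;
        for (int i = 0; i < 4; i++) {
            total += values[i];
        }
        return total;
    }

    // After unrolling: the loop is gone and the body appears once per iteration.
    static int sumUnrolled(int[] values) {
        int total = 0;
        total += values[0];
        total += values[1];
        total += values[2];
        total += values[3];
        return total;
    }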

The JIT does many of the same things and more via profiling.  Once a given method has been run over 10,000 times, the JIT considers it hot.  At that point, it starts looking at the common optimizations it can make based on which branches in the code are actually being executed.  For example, if after those 10,000 runs a certain if condition has never been hit, the JVM can optimize it away, inserting guard logic to allow an optimization rollback just in case.
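
A hypothetical example of my own (the class and method names are made up): if profiling shows the debug branch below is never taken, the JIT can compile a version of handle() without it, leaving an uncommon-trap guard that falls back to the interpreter if the assumption is ever violated.

    class RequestHandler {
        private final boolean debug;

        RequestHandler(boolean debug) {
            this.debug = debug;
        }

        void handle(String request) {
            if (debug) {
                // Never taken during the first ~10,000 profiled calls, so the
                // compiled version can drop this branch behind a guard.
                System.out.println("handling " + request);
            }
            process(request);
        }

        private void process(String request) {
            // ... the real work ...
        }
    }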

I won't go into a lot of detail here (though it was very intriguing), but here are some of the things that the JIT looks for:
  • Loop Unrolling:  Observing that a loop always completes in a known number of iterations and unrolling it into inline statements accordingly
  • Lock Eliding:  Observing locks where synchronization isn't adding benefit and removing them.
  • Escape Analysis:  Observing objects that never escape the method that creates them and eliminating the surrounding allocation
  • Call Site Analysis:  Observing monomorphic or bimorphic call sites and performing an according optimization (sketched after this list).  Briefly, monomorphic means that A calls B and A will only ever call B.  Bimorphic means that A could call B or C, e.g. two implementations of the same interface.
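
Here is a small sketch of my own (not from the talk) of what such call sites look like:

    interface Shape {
        double area();
    }

    class Circle implements Shape {
        private final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    class Areas {
        // If profiling shows that only Circle ever reaches shape.area(), the
        // call site is monomorphic and the JIT can inline Circle.area() behind
        // a cheap type guard.  If both Circle and Square show up, the site is
        // bimorphic and it can still inline both.  Beyond that, it generally
        // falls back to a true virtual dispatch.
        static double totalArea(Shape[] shapes) {
            double total = 0;
            for (Shape shape : shapes) {
                total += shape.area();
            }
            return total;
        }
    }
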
The JIT will ignore code that is too big for it to analyze.  Boo to long methods! The really, really interesting thing is that the JIT can be encouraged to optimize more complicated stuff just by breaking a big method up into smaller methods.  Nutter told of a case with the JRuby parser where performance was starting to tank.  They learned that a main parsing method had gotten too big, so the JIT had stopped optimizing it.  Simply by breaking the parsing method into smaller methods, they got a big performance boost.

So, what do you do if your performance is slow and you'd like to see whether the JIT is able to optimize your code?

Monitoring the JVM

  • -XX:+PrintCompilation - This prints out methods as the JIT is optimizing them.  The output details what the JIT is doing with certain methods:
    • "made zombie" - This method is just about to be optimized out by the JIT
    • "made non entrant" - This method is now optimized out by the JIT
    • "uncommon trap" - Woah! I had already optimized this, but apparently someone just called it unexpectedly!
    • "!" - exception handling - Nutter didn't go into this, but the JIT apparently does some interesting optimization by finding the actual catch block up the call stack that is ultimately called when a given exception is thrown
    • "%" - on-stack replacement (OSR) - Sometimes the JIT will come up with an optimization for an entire method that it will then compile and replace at runtime.  I'm not sure when this is applicable, but maybe for things like chip-arch-specific implementations?
There are several options that are so secret that they need two JVM options.  The first is -XX:+UnlockDiagnosticVMOptions.  The second is one of the options below (an example invocation follows the list):
  • -XX:+PrintInlining - More detail about specific inlining that the JIT is doing.  e.g. "intrinsic" means the JIT knows something special about this method and is going to replace it with best-known native code to support it.  Examples are Object#hashCode, String#equals, Atomics, and Math operations.
  • -XX:+LogCompilation - A lot more information, but specifically information about whether a method is too big for the JIT to optimize.  Use something like http://github.com/headius/logc to make the output of this option more readable.
  • -XX:MaxInlineSize, -XX:InlineSmallCode, -XX:FreqInlineSize, -XX:MaxInlineLevel, -XX:MaxRecursiveInlineLevel - Use these to tweak the default inlining thresholds, with the caveat that the defaults were set by the JVM guys after much research on your specific chip architecture.
  • -XX:+PrintOptoAssembly - Lots of detail, including the assembly to which the code is getting compiled.  (Wow!)  Nutter demonstrated with this tool how much assembly goes into calling a single method.  The unoptimized assembly was nine instructions vs. the one instruction of just inlining the contents of the (one-line) method.  Here, Nutter also talked about two important outputs:
    • CALL - This means that the JIT cannot find a way to optimize the method
    • LOCK - This means that the JIT is performing a lock in this part of the code.  This was an interesting one because Nutter explained that at one point, he was seeing an enormous number of LOCK instructions during object construction.  It turns out that there was a private static reference being accessed in all JRuby constructors, which was causing a volatile write on every construction.  Removing that line of code gave them a 4x performance gain.  Wow!
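
Putting the unlock flag together with the diagnostic options looks something like this (my own illustration; the flags are the ones above, and the class name is a placeholder):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+LogCompilation MyBenchmark
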
There was sort of an ominous note at the end that said that any kind of agent can seriously affect the JIT's ability to optimize.  Obviously this includes debuggers, but it also includes profiling products like Dynatrace or AppDynamics.  Something to look into.

Do your GC logs speak to you?, by Richard Warburton

This was a great overview of the Java memory model as well as several tips on how to evaluate how your garbage collection is performing.  Richard had a lot of information to impart, and I'm not completely sure that I understood all of it, but here goes:

First, a few JVM parameters:
  • -Xloggc:{logfile} - -verbose:gc doesn't come with timestamps, so use this parameter instead to get more useful information, written to a separate file altogether
  • -XX:+PrintGCDetails - Lots more detail
  • -XX:+PrintSafepointStatistics - A safepoint is a line in the sand that the GC draws where all threads stop and wait for the GC to say it's okay to go again.  You want to keep these to a minimum
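
Putting those together on a command line looks something like this (my own illustration; {logfile} and the main class are placeholders):

    java -Xloggc:{logfile} -XX:+PrintGCDetails -XX:+PrintSafepointStatistics MyApp
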
And now, an overview of what the heap looks like.  There are basically four parts:  Eden space, S0 (survivor space), S1 (survivor space), and Tenured space.  The first three can be together called "young memory" and the last one can be called "old memory".

When a GC runs, objects age.  Eden is completely evacuated on every GC (see the metaphor?), and its survivors are copied into whichever of the two survivor spaces is currently inactive.  The active survivor space is evacuated at the same time: its objects are either copied across to the newly active survivor space or, once they have aged enough, promoted to Tenured space.  All evacuated spaces are empty after each GC, and once an object makes it into Tenured, a full GC is required to get it out.

Because it requires a full GC to recover tenured memory, it is best to make sure that not too much is going into tenured.  A full GC will run once tenured is about 69% full, so it is good to try and keep it under that number.  Other indicators are a spiky CPU time graph, an average GC pause > 5% or a full GC-to-GC ratio of >30%.  These numbers can be garnered from the VisualVM product on java.net.

To tune the VM, then, it is typically a matter of making sure that Eden is big enough, the Eden-to-Survivor space ratio is right (to ensure the survivor spaces are big enough), and the max tenuring threshold is the right value (usually sits around 4).
  • -XX:NewRatio=x - The size of the Young memory vs. Old Memory.  Young memory will be 1/x the size of Old memory.  Richard recommended 1, though it probably has a lot to do with what you observe in the data.
  • -XX:SurvivorRatio=x - The size of the survivor spaces relative to eden space.  Each survivor space will be 1/(x+2) of the memory allocated in Young memory.  Richard recommended 1 here as well with the same caveat.
  • -XX:MaxTenuringThreshold=x - The number of collections that an object can survive before it is automatically promoted to tenured space.  You can lower this to make GCs less frequent since the JVM will be able to promote more memory to Tenured more quickly.  Raise it to keep things in Young memory for longer.  In Richard's case study, he set this to 5.
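
As a command line, using the values Richard recommended above (my own illustration; the main class name is a placeholder):

    java -XX:NewRatio=1 -XX:SurvivorRatio=1 -XX:MaxTenuringThreshold=5 MyApp
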
Richard's case study showed a dramatic improvement in GC pause times, GC ratios, etc.  I definitely want to try these out!

There were a couple of interesting notes at the end about GCing in general:
  • concurrent mode failure means that the Young memory is filling up too fast: the concurrently running tenured collection failed to complete before tenured was completely full
  • mark/sweep doesn't do memory compaction, meaning that the data can get fragmented
  • slab allocators are a way to allocate a distinct amount of memory on the Java heap.  One strategy against memory fragmentation if your data set is very large is to create several slab allocators of varying sizes at once so they are adjacent in memory and then draw from those memory allocations.  Richard warned that these will cause you to develop many of your own GC semantics (a later talk mentioned that this is what memcached does)

Big RAM, by Neil Ferguson

This was basically a case study on alternatives to using out-of-the-box JVM garbage collection for huge applications (250GB-1TB RAM).

The first approach was to use an alternative garbage collector like Azul Zing.  For Ferguson's benchmark, Azul performed better than G1, Java's new garbage collector.

The second approach was to use the Java APIs that allow you to allocate memory off-heap:  ByteBuffer.allocateDirect and slab allocators
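
A minimal sketch of the allocateDirect route (my own example, not Ferguson's code):

    import java.nio.ByteBuffer;

    class OffHeapDemo {
        public static void main(String[] args) {
            // Allocate 1 GiB outside the Java heap; this memory is not subject
            // to normal GC, and reads/writes go through the ByteBuffer API.
            ByteBuffer offHeap = ByteBuffer.allocateDirect(1024 * 1024 * 1024);

            offHeap.putLong(0, 42L);             // write a long at byte offset 0
            long value = offHeap.getLong(0);     // read it back
            System.out.println(value);
        }
    }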

The third approach was to pick from a list of off-heap vendors.  Ferguson talked about Cassandra, Terracotta, BigCache, and Apache DirectMemory.

This was really interesting, but I'll be honest that this was the last session of the day, and I wasn't paying really close attention.  I'll have to re-watch it.  :)

Summary

All in all, these were really motivating, largely because it is a new space that I know (knew?) nearly zero about.  I look forward to trying a few of these back at work when I get the chance.

Josh Cummings

"I love to teach, as a painter loves to paint, as a singer loves to sing, as a musician loves to play" - William Lyon Phelps
