A while ago, I was looking into the performance of a common pattern in Java,
getting the unqualified name of the class. For some reason, it wasn’t being
inlined. The method was implemented like this:
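The original listing would have been along these lines (a reconstruction from the description; the class and method names here are assumptions):

```java
public class Names {
    // Strip the package prefix from a class name using the
    // indexOf(String) overload of lastIndexOf.
    public static String unqualifiedName(Class<?> clazz) {
        String name = clazz.getName();
        int index = name.lastIndexOf(".");
        if (index != -1) {
            name = name.substring(index + 1);
        }
        return name;
    }
}
```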
This is compiled to 30 bytecodes:
Why does the size of the bytecode matter? Well, HotSpot's implementation uses
a fixed threshold on the bytecode length to decide whether to inline or not.
Specifically, methods under 35 bytecodes are considered inlineable, and
those that are larger aren’t:
Having said that, limits are a bit like final in Java: not really. If a method
is being called a lot, then it can be inlined if it is under the ‘hot’ limit:
So bytecode length is important, and if methods fall out of that boundary
then they don’t get in-lined. This can lead to sub-optimal performance for
methods that are being called frequently. We can see this
happen by running the JVM with -XX:+PrintInlining to see what happens:
The problem is that the caller of this method is too large to absorb another
30 bytes of bytecode. What can we do?
Shrinking the code
One possible improvement might be to replace the String lookup with
a char instead; in other words, from a String containing "."
with a character '.' instead. Once changed, the method looks like:
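Reconstructed with the same assumed names as before:

```java
public class Names {
    // Identical logic, but using the indexOf(char) overload, which
    // avoids referencing a String constant.
    public static String unqualifiedName(Class<?> clazz) {
        String name = clazz.getName();
        int index = name.lastIndexOf('.');
        if (index != -1) {
            name = name.substring(index + 1);
        }
        return name;
    }
}
```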
It doesn’t make any difference to the bytecode length of the method,
although it does reduce the use of a String constant, so the net effect is
that the .class file will be slightly smaller:
How do these compare when executed? Is there a speed difference between
them? Well, we can turn to our trusty JMH tooling to answer that question:
So there is a difference depending on whether the indexOf(String) or
indexOf(char) methods are used. Granted, it’s not much, but it could
be worthwhile if the call is on a critical path.
However, we can improve this class by noting a couple of properties of the
runtime. All of the classes used by this code path are in a package; in other
words, the lastIndexOf never returns -1 in our case. We can remove the
conditional block by observing that mathematically, if lastIndexOf returns -1
then we can add 1, and substring(0) will give us the whole string anyway.
In other words, these two pieces of code are semantically equivalent:
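A sketch of the two forms side by side (the method names are hypothetical):

```java
public class Names {
    // Guarded version: skip the substring when there is no '.' present.
    static String conditional(String name) {
        int index = name.lastIndexOf('.');
        if (index != -1) {
            name = name.substring(index + 1);
        }
        return name;
    }

    // Unconditional version: lastIndexOf returns -1 when there is no
    // '.', and -1 + 1 == 0, so substring(0) yields the whole string.
    static String unconditional(String name) {
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
```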
There is a minor behavioural change: substring is now called unconditionally,
even for classes in the default package. However, since we know that case
doesn't occur on our code path, it makes no difference.
What effect does it have on the generated bytecode?
This compiles to 21 bytecodes:
The speedup can be measured using JMH, both with a char and with a String:
These aren’t faster (or slower) than their original counterparts, but the
smaller bytecode permits more inlining opportunities for their callers.
Re-running with -XX:+PrintInlining shows that the method is now inlined:
Of course, one other way of optimising the code is to cache the values.
This can be done fairly trivially with a simple Map of Class to String
values, and in this case, the cost of calculating a missing key can be
amortised to zero:
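A sketch using ConcurrentHashMap.computeIfAbsent (the field and method names
are assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Names {
    // Cache of Class -> unqualified name; the substring cost is paid
    // only the first time each class is looked up.
    private static final Map<Class<?>, String> CACHE = new ConcurrentHashMap<>();

    public static String unqualifiedName(Class<?> clazz) {
        return CACHE.computeIfAbsent(clazz, c -> {
            String name = c.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        });
    }
}
```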
When tested under JMH, this has an improved performance profile, and avoids
recomputing the name on every call.
However, there’s a more optimised version (one that doesn’t require lambdas)
which can be used to cache a value per class. The java.lang.ClassValue class
was added in 1.7 to support dynamic languages, and we can press it into
service here:
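A sketch of the ClassValue approach (again, the names are assumptions):

```java
public class Names {
    // ClassValue computes the value once per class and caches it in a
    // structure the JVM optimises for per-class lookups.
    private static final ClassValue<String> UNQUALIFIED_NAME =
            new ClassValue<String>() {
        @Override
        protected String computeValue(Class<?> type) {
            String name = type.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        }
    };

    public static String unqualifiedName(Class<?> clazz) {
        return UNQUALIFIED_NAME.get(clazz);
    }
}
```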
What happens if we run this under JMH? It looks like this:
One of the reasons for this improvement is that the Map is a generic
structure that needs to scale to millions of values, but the ClassValue
is optimised for the smaller cases. The cache lookup code is a little bit
longer, but since we don’t need to allocate and copy the class names
we save some execution speed there as well. Here’s what the 14 bytecodes
look like:
The code gets inlined as expected:
Reproducing the benchmarks
The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of
memory. The full list of results and source code to reproduce is at
Being able to measure the different implementation choices, and focussing not
only on the speed of the bytecode but also the amount of bytecode, allows the
method to be in-lined in places where it wasn’t before. Even though they are
small changes, they can tip the balance between not being in-lined and being
in-lined. Other performance optimisations are possible as well but need to
be tested in-situ with the caller code; in other words, test the changes
not only against the implementation of the method, but also where it is used.