A while ago, I was looking into the performance of a common pattern in Java: getting the unqualified name of a class. For some reason, it wasn't being inlined. The method was implemented like this:
```java Getting the unqualified class name
public String getNameOriginal() {
    String name = getClass().getName();
    int index = name.lastIndexOf('.');
    if (index != -1) name = name.substring(index + 1);
    return name;
}
```
This is compiled to 30 bytecodes:
```java Getting the unqualified class name (bytecode)
public java.lang.String getNameOriginal();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: ldc #12 // String .
11: invokevirtual #13 // Method java/lang/String.lastIndexOf:(Ljava/lang/String;)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn
```
Why does the size of the bytecode matter? Well, HotSpot's implementation has a fixed threshold for whether to inline or not, based on the bytecode length. Specifically, methods under 35 bytecodes are considered inlineable, and those that are larger aren't:
```sh MaxInlineSize
$ java -XX:+PrintFlagsFinal -version | grep MaxInlineSize
intx MaxInlineSize = 35
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)
```
Having said that, limits are a bit like `final` in Java: not really. If a method
is being called a lot, then it can still be inlined if it is under the 'hot' limit:
```sh FreqInlineSize
$ java -XX:+PrintFlagsFinal -version | grep FreqInlineSize
intx FreqInlineSize = 325
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)
```
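These thresholds can also be read from inside a running JVM via the JDK-specific `HotSpotDiagnosticMXBean`. This is a sketch, and assumes a HotSpot-based JDK where the `jdk.management` module is available:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class InlineFlags {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Both thresholds are measured in bytes of bytecode
        System.out.println("MaxInlineSize  = " + hotspot.getVMOption("MaxInlineSize").getValue());
        System.out.println("FreqInlineSize = " + hotspot.getVMOption("FreqInlineSize").getValue());
    }
}
```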
So bytecode length is important: if a method falls outside that boundary,
it doesn't get inlined, which can lead to sub-optimal performance for
methods that are called frequently. We can see this
happen by running the JVM with `-XX:+PrintInlining`:
```text Callee is too large
@ 2 org.example.class::getNameOriginal (30 bytes) callee is too large
```
The problem is that the caller of this method is too large to absorb another
30 bytes of bytecode. What can we do?
Shrinking the code
------------------
One possible improvement is to replace the `String` lookup with
a `char`; in other words, searching for the character `'.'`
instead of a `String` containing `"."`. Once changed, the method looks like:
```java Looking up with a character
public String getNameOriginalChar() {
String name = getClass().getName();
int index = name.lastIndexOf('.');
if (index != -1) name = name.substring(index + 1);
return name;
}
```
It doesn't make any difference to the bytecode length of the method,
although it does remove the use of a `String` constant from the constant
pool, so the net effect is that the `.class` file will be slightly smaller:
```java Using a char instead of a String
public java.lang.String getNameOriginalChar();
Code:
0: aload_0
1: invokevirtual #18 // Method Object.getClass:()Ljava/lang/Class;
4: invokevirtual #22 // Method Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: bipush 46 // Only difference - 46 is '.' in ASCII
11: invokevirtual #45 // Method String.lastIndexOf:(I)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #35 // Method String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn
```
How do these compare when executed? Is there a speed difference between
them? Well, we can turn to our trusty JMH tooling to answer that question:
```text JMH comparison
NameTest.getNameOriginal      avgt   25  34.439 +- 1.093  ns/op
NameTest.getNameOriginalChar  avgt   25  30.602 +- 1.001  ns/op
```
So there is a difference depending on whether the `lastIndexOf(String)` or
`lastIndexOf(char)` method is used. Granted, it's not much, but it could
be worthwhile if the call is on a critical path.
However, we can improve this method by noting a couple of properties of the
runtime. All of the classes used by this code path are in a package; in other
words, the `lastIndexOf` never returns `-1` in our cases. We can remove the
conditional block by observing that mathematically, if `lastIndexOf` returns `-1`
then we can add `1`, and `substring(0)` will give us the whole string anyway.
In other words, these two pieces of code are semantically equivalent:
```java Before and After
// Before
if (i != -1) { name = name.substring(i+1); }
// After
name = name.substring(i+1);
```
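To convince ourselves the two forms agree, a quick check (a hypothetical snippet, not from the benchmark repository) covers both the packaged and the default-package cases:

```java
public class SubstringCheck {
    // The branch-free form: works whether or not a '.' is present
    static String shortName(String name) {
        return name.substring(name.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        // Packaged class: lastIndexOf finds the final '.', substring trims the package
        System.out.println(shortName("org.example.Widget")); // Widget
        // Default package: lastIndexOf returns -1, so substring(0) is the whole string
        System.out.println(shortName("Widget"));             // Widget
    }
}
```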
There is a minor behavioural change: we now always call `substring`, even for classes in the default package. However, since we know that doesn't happen in our code path, it makes no difference. What effect does it have on the generated bytecode?
```java Avoiding the branch, inlining lastIndexOf
public String getNameChar() {
    String name = getClass().getName();
    return name.substring(name.lastIndexOf('.') + 1);
}
```
This compiles to 21 bytecodes:
```java After transformation
public java.lang.String getNameChar();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: aload_1
10: bipush 46
12: invokevirtual #15 // Method java/lang/String.lastIndexOf:(I)I
15: iconst_1
16: iadd
17: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
20: areturn
```
The effect can be measured using JMH, both with a char and with a String:
```text After transformation results
NameTest.getNameChar    avgt   25  33.284 +- 0.334  ns/op
NameTest.getNameString  avgt   25  34.935 +- 0.464  ns/op
```
These aren't faster (or slower) than their original counterparts, but the smaller bytecode permits more inlining opportunities for their callers. Running again with `-XX:+PrintInlining` shows the method being inlined:
```text Inlined
@ 2 org.example.class::getNameChar (21 bytes) inline (hot)
```
Nice!
Going faster
------------
Of course, one other way of optimising the code is to cache the values.
This can be done fairly trivially with a simple `Map` of `Class` to `String`
values, and in this case, the cost of calculating a missing key can be
amortized to zero:
```java Using a Map to cache names
public String getNameMapCache() {
    return ClassMap.get(getClass());
}

class ClassMap {
    private static final Map<Class<?>, String> cache = new HashMap<Class<?>, String>();

    public static String get(Class<?> clazz) {
        return cache.computeIfAbsent(clazz, ClassMap::calculateShortName);
    }

    private static String calculateShortName(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
```
When tested under JMH, this has an improved performance profile, and avoids
creating garbage:
```text Testing name map cache
NameTest.getNameMapCache avgt 25 11.807 +- 0.059 ns/op
NameTest.getNameMapCache: gc.alloc.rate avgt 25 ≈ 10^-4 MB/sec
```
However, there's a more optimised mechanism (one that doesn't require
lambdas) for caching a value per class. `java.lang.ClassValue`
was added in Java 7 to support dynamic languages, but we can use it
for this purpose:
```java Using ClassValue to cache names
public String getNameClassValueCache() {
    return ClassName.DEFAULT.get(getClass());
}

class ClassName extends ClassValue<String> {
    static final ClassName DEFAULT = new ClassName();

    @Override
    protected String computeValue(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
```
What happens if we run this under JMH? It looks like this:
```text Testing class value cache
NameTest.getNameClassValueCache avgt 25 6.535 +- 0.045 ns/op
NameTest.getNameClassValueCache: gc.alloc.rate avgt 25 ≈ 10^-4 MB/sec
```
One of the reasons for this improvement is that the `Map` is a generic
structure that needs to scale to millions of values, but the `ClassValue`
is optimised for smaller cases. The cache lookup code is a little bit
longer, but since we don't need to allocate and copy the class names,
we save some execution speed there as well. Here's what the 14 bytecodes
look like:
```java Decompiled lookup of cache
public java.lang.String getNameClassValueCache();
Code:
0: getstatic #16 // Field com/bandlem/jmh/microopts/ClassName.DEFAULT:Ljava/lang/ClassValue;
3: aload_0
4: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
7: invokevirtual #17 // Method java/lang/ClassValue.get:(Ljava/lang/Class;)Ljava/lang/Object;
10: checkcast #2 // class java/lang/String
13: areturn
```
The code gets inlined as expected:
```text Inlining works
@ 2 org.example.class::getNameClassValueCache (17 bytes) inline (hot)
@ 4 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 7 java.lang.ClassValue::get (31 bytes) inline (hot)
@ 1 java.lang.ClassValue::getCacheCarefully (20 bytes) inline (hot)
@ 14 java.lang.ClassValue$ClassValueMap::getCache (5 bytes) accessor
@ 7 java.lang.ClassValue$ClassValueMap::probeHomeLocation (13 bytes) inline (hot)
@ 6 java.lang.ClassValue$ClassValueMap::loadFromCache (9 bytes) inline (hot)
@ 9 java.lang.ClassValue::castEntry (2 bytes) inline (hot)
@ 13 java.lang.ClassValue::match (21 bytes) inline (hot)
@ 5 java.lang.ref.Reference::get (5 bytes) (intrinsic)
@ 20 java.lang.ClassValue$Entry::value (9 bytes) inline (hot)
@ 1 java.lang.ClassValue$Entry::assertNotPromise (22 bytes) inline (hot)
@ 27 java.lang.ClassValue::getFromBackup (21 bytes) executed < MinInliningThreshold times
```
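Independently of the classes above, the compute-once behaviour of `ClassValue` is easy to demonstrate with a small self-contained sketch (the class and field names here are hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ClassValueDemo {
    static final AtomicInteger computations = new AtomicInteger();

    // computeValue runs once per Class; get() then returns the cached value
    static final ClassValue<String> SHORT_NAME = new ClassValue<String>() {
        @Override
        protected String computeValue(Class<?> type) {
            computations.incrementAndGet();
            String name = type.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        }
    };

    public static void main(String[] args) {
        String first = SHORT_NAME.get(String.class);
        String second = SHORT_NAME.get(String.class);
        System.out.println(first);               // String
        System.out.println(first == second);     // true: the same cached instance
        System.out.println(computations.get());  // 1: computed only once
    }
}
```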
Reproducing the benchmarks
--------------------------
The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of memory. The full list of results and the source code to reproduce them is at https://github.com/alblue/com.bandlem.jmh.microopts
Summary
-------
Being able to measure different implementation choices, focussing not only on the speed of the bytecode but also on the amount of bytecode, allows a method to be inlined in places where it wasn't before. Even small changes can tip the balance between not being inlined and being inlined. Other performance optimisations are possible as well, but they need to be tested in situ with the caller code; in other words, test changes not only against the implementation of the method, but also where it is used.