AlBlue’s Blog

Macs, Modularity and More

Micro optimising class.getName

2020 · Java · Performance · HotSpot · JMH

A while ago, I was looking into the performance of a common pattern in Java, getting the unqualified name of the class. For some reason, it wasn’t being inlined. The method was implemented like this:

Getting the unqualified class name
public String getNameOriginal() {
    String name = getClass().getName();
    int index = name.lastIndexOf(".");
    if (index != -1) name = name.substring(index + 1);
    return name;
}

This compiles to 30 bytes of bytecode:

Getting the unqualified class name (bytecode)
public java.lang.String getNameOriginal();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: ldc #12 // String .
11: invokevirtual #13 // Method java/lang/String.lastIndexOf:(Ljava/lang/String;)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn

Why does the size of the bytecode matter? Well, HotSpot's JIT compiler uses a fixed bytecode-length threshold when deciding whether to in-line a method. Specifically, methods under 35 bytes of bytecode are considered inlineable, and larger ones aren't:

MaxInlineSize
$ java -XX:+PrintFlagsFinal -version | grep MaxInlineSize
intx MaxInlineSize = 35
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

Having said that, these limits are a bit like final in Java: not really final. If a method is being called a lot, it can still be inlined provided it is under the larger 'hot' limit:

FreqInlineSize
$ java -XX:+PrintFlagsFinal -version | grep FreqInlineSize
intx FreqInlineSize = 325
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

So bytecode length is important, and methods that fall outside these limits don't get in-lined. This can lead to sub-optimal performance for methods that are called frequently. We can see this happen by running the JVM with -XX:+PrintInlining:

Callee is too large
@ 2 org.example.class::getNameOriginal (30 bytes) callee is too large

The problem is that the caller of this method is too large to absorb another 30 bytes of bytecode. What can we do?

Shrinking the code

One possible improvement is to replace the String argument to lastIndexOf with a char; in other words, searching for the character '.' instead of the String ".". Once changed, the method looks like this:

Looking up with a character
public String getNameOriginalChar() {
    String name = getClass().getName();
    int index = name.lastIndexOf('.');
    if (index != -1) name = name.substring(index + 1);
    return name;
}

It doesn’t make any difference to the bytecode length of the method, although it does reduce the use of a String constant, so the net effect is that the .class file will be slightly smaller:

Using a char instead of a String
public java.lang.String getNameOriginalChar();
Code:
0: aload_0
1: invokevirtual #18 // Method Object.getClass:()Ljava/lang/Class;
4: invokevirtual #22 // Method Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: bipush 46 // Only difference - 46 is '.' in ASCII
11: invokevirtual #45 // Method String.lastIndexOf:(I)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #35 // Method String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn

How do these compare when executed? Is there a speed difference between them? Well, we can turn to our trusty JMH tooling to answer that question:

NameTest.getNameOriginal avgt 25 34.439 +- 1.093 ns/op
NameTest.getNameOriginalChar avgt 25 30.602 +- 1.001 ns/op

So there is a difference depending on whether the lastIndexOf(String) or lastIndexOf(char) method is used. Granted, it's not much, but it could be worthwhile if the call is on a critical path.

However, we can improve this method further by noting a couple of properties of the runtime. All of the classes used on this code path are in a package; in other words, lastIndexOf never returns -1 in our case. We can remove the conditional block by observing that, mathematically, if lastIndexOf returns -1 then adding 1 gives 0, and substring(0) returns the whole string anyway. In other words, these two pieces of code are semantically equivalent:

Before and After
// Before
if (i != -1) { name = name.substring(i+1); }
// After
name = name.substring(i+1);

There is a minor behavioural change: we're now always calling substring for classes that are not in the default package. However, since we know that the default-package case doesn't arise on our code path, it makes no difference. What effect does it have on the generated bytecode?
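To convince ourselves of the equivalence, here's a small standalone sketch (the class and method names here are mine, for illustration only):

```java
public class SubstringDemo {
    // Mirrors the transformation above: no conditional needed, because
    // lastIndexOf returns -1 when there is no '.', and -1 + 1 == 0,
    // so substring(0) yields the whole string unchanged.
    static String shortName(String fqcn) {
        return fqcn.substring(fqcn.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(shortName("java.lang.String")); // String
        System.out.println(shortName("NoPackage"));        // NoPackage
    }
}
```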

Avoiding branch, inlining lastIndex
public String getNameChar() {
    String name = getClass().getName();
    return name.substring(name.lastIndexOf('.') + 1);
}

This compiles to 21 bytes of bytecode:

After transformation
public java.lang.String getNameChar();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: aload_1
10: bipush 46
12: invokevirtual #15 // Method java/lang/String.lastIndexOf:(I)I
15: iconst_1
16: iadd
17: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
20: areturn

The effect can be measured using JMH, both with a char and with a String:

NameTest.getNameChar avgt 25 33.284 +- 0.334 ns/op
NameTest.getNameString avgt 25 34.935 +- 0.464 ns/op

These aren’t measurably faster (or slower) than their original counterparts, but the smaller bytecode permits more in-lining opportunities for their callers. Running again with -XX:+PrintInlining shows the method now being inlined:

@ 2 org.example.class::getNameChar (21 bytes) inline (hot)

Nice!

Going faster

Of course, another way of optimising the code is to cache the values. This can be done fairly trivially with a simple Map of Class to String values, and in this case the cost of calculating a missing key is amortised away:

Using a Map to cache names:
public String getNameMapCache() {
    return ClassMap.get(getClass());
}

class ClassMap {
    private static final Map<Class<?>, String> cache
        = new HashMap<Class<?>, String>();
    public static String get(Class<?> clazz) {
        return cache.computeIfAbsent(clazz, ClassMap::calculateShortName);
    }
    private static String calculateShortName(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
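One design note worth flagging: HashMap.computeIfAbsent isn't safe under concurrent callers. If this cache might be hit from multiple threads, a ConcurrentHashMap variant would be the safer choice — a sketch under that assumption (SafeClassMap is my illustrative name, not part of the original code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A thread-safe variant of the ClassMap cache above; SafeClassMap is an
// illustrative name, not from the original code.
class SafeClassMap {
    private static final Map<Class<?>, String> cache = new ConcurrentHashMap<>();

    public static String get(Class<?> clazz) {
        // computeIfAbsent on ConcurrentHashMap is atomic per key
        return cache.computeIfAbsent(clazz, c -> {
            String name = c.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        });
    }
}

public class SafeClassMapDemo {
    public static void main(String[] args) {
        System.out.println(SafeClassMap.get(String.class));            // String
        System.out.println(SafeClassMap.get(ConcurrentHashMap.class)); // ConcurrentHashMap
    }
}
```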

When tested under JMH, this has an improved performance profile, and avoids creating garbage:

Testing name map cache
NameTest.getNameMapCache avgt 25 11.807 +- 0.059 ns/op
NameTest.getNameMapCache: gc.alloc.rate avgt 25 ≈ 10⁻⁴ MB/sec

However, there’s a more optimised version (one that doesn’t require lambdas) that can be used to cache a per-class value. java.lang.ClassValue was added in Java 7 to support dynamic languages, but we can use it here for our own purposes:

Using ClassValue to cache names:
public String getNameClassValueCache() {
    return ClassName.DEFAULT.get(getClass());
}

class ClassName extends ClassValue<String> {
    public static final ClassValue<String> DEFAULT = new ClassName();
    protected String computeValue(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
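As a quick sanity check, the ClassValue cache can be exercised as below (a standalone sketch that repeats the ClassName class so it compiles on its own; ClassNameDemo is an illustrative name). The contract of ClassValue is that computeValue runs at most once per class, with later gets served from the cache:

```java
// Standalone copy of the ClassName cache, for demonstration purposes.
class ClassName extends ClassValue<String> {
    public static final ClassValue<String> DEFAULT = new ClassName();

    @Override
    protected String computeValue(Class<?> clazz) {
        // Invoked at most once per class; the result is cached thereafter
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}

public class ClassNameDemo {
    public static void main(String[] args) {
        System.out.println(ClassName.DEFAULT.get(String.class));  // String
        // The second get hits the cache without calling computeValue again
        System.out.println(ClassName.DEFAULT.get(String.class));  // String
    }
}
```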

What happens if we run this under JMH? It looks like this:

Testing class value cache
NameTest.getNameClassValueCache avgt 25 6.535 +- 0.045 ns/op
NameTest.getNameClassValueCache: gc.alloc.rate avgt 25 ≈ 10⁻⁴ MB/sec

One of the reasons for this improvement is that the Map is a generic structure that needs to scale to millions of entries, whereas ClassValue is optimised for the smaller case. The cache lookup code is a little longer, but since we no longer need to allocate and copy the class names on each call, we save some execution time there as well. Here’s what the 14 bytes of bytecode look like:

Decompiled lookup of cache
public java.lang.String getNameClassValueCache();
Code:
0: getstatic #16 // Field com/bandlem/jmh/microopts/ClassName.DEFAULT:Ljava/lang/ClassValue;
3: aload_0
4: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
7: invokevirtual #17 // Method java/lang/ClassValue.get:(Ljava/lang/Class;)Ljava/lang/Object;
10: checkcast #2 // class java/lang/String
13: areturn

The code gets inlined as expected:

inlining works
@ 2 org.example.class::getNameClassValueCache (17 bytes) inline (hot)
@ 4 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 7 java.lang.ClassValue::get (31 bytes) inline (hot)
@ 1 java.lang.ClassValue::getCacheCarefully (20 bytes) inline (hot)
@ 14 java.lang.ClassValue$ClassValueMap::getCache (5 bytes) accessor
@ 7 java.lang.ClassValue$ClassValueMap::probeHomeLocation (13 bytes) inline (hot)
@ 6 java.lang.ClassValue$ClassValueMap::loadFromCache (9 bytes) inline (hot)
@ 9 java.lang.ClassValue::castEntry (2 bytes) inline (hot)
@ 13 java.lang.ClassValue::match (21 bytes) inline (hot)
@ 5 java.lang.ref.Reference::get (5 bytes) (intrinsic)
@ 20 java.lang.ClassValue$Entry::value (9 bytes) inline (hot)
@ 1 java.lang.ClassValue$Entry::assertNotPromise (22 bytes) inline (hot)
@ 27 java.lang.ClassValue::getFromBackup (21 bytes) executed < MinInliningThreshold times

Reproducing the benchmarks

The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of memory. The full list of results and the source code to reproduce them are at https://github.com/alblue/com.bandlem.jmh.microopts

Summary

Being able to measure the different implementation choices, and focussing not only on the speed of the bytecode but also on its size, allowed the method to be in-lined in places where it wasn’t before. Even small changes can tip the balance between not being in-lined and being in-lined. Other performance optimisations are possible as well, but need to be tested in situ with the caller code; in other words, test the changes not only against the implementation of the method, but also where it is used.