
AlBlue’s Blog

Macs, Modularity and More

Micro optimising class.getName

2020 · Java · Performance · Hotspot · JMH

A while ago, I was looking into the performance of a common pattern in Java: getting the unqualified name of a class. For some reason, the method wasn’t being inlined. It was implemented like this:

```java Getting the unqualified class name
public String getNameOriginal() {
  String name = getClass().getName();
  int index = name.lastIndexOf('.');
  if (index != -1) name = name.substring(index + 1);
  return name;
}
```

This is compiled to 30 bytecodes:

```java Getting the unqualified class name (bytecode)
public java.lang.String getNameOriginal();
  Code:
     0: aload_0
     1: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     4: invokevirtual #11    // Method java/lang/Class.getName:()Ljava/lang/String;
     7: astore_1
     8: aload_1
     9: ldc           #12    // String .
    11: invokevirtual #13    // Method java/lang/String.lastIndexOf:(Ljava/lang/String;)I
    14: istore_2
    15: iload_2
    16: iconst_m1
    17: if_icmpeq     28
    20: aload_1
    21: iload_2
    22: iconst_1
    23: iadd
    24: invokevirtual #14    // Method java/lang/String.substring:(I)Ljava/lang/String;
    27: astore_1
    28: aload_1
    29: areturn
```

Why does the size of the bytecode matter? Well, HotSpot uses a fixed bytecode-length threshold to decide whether or not to in-line a method. Specifically, methods under 35 bytecodes are considered inlineable, and larger ones aren’t:

```sh MaxInlineSize
$ java -XX:+PrintFlagsFinal -version | grep MaxInlineSize
  intx MaxInlineSize = 35
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)
```

Having said that, limits are a bit like final in Java: not really. If a method
is being called a lot, then it can be inlined if it is under the 'hot' limit:

```sh FreqInlineSize
$ java -XX:+PrintFlagsFinal -version | grep FreqInlineSize
  intx FreqInlineSize = 325
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)
```

So bytecode length matters: methods that fall outside that boundary don’t get in-lined, which can lead to sub-optimal performance for methods that are called frequently. We can see this happening by running the JVM with `-XX:+PrintInlining`:

```text Callee is too large
  @ 2   org.example.class::getNameOriginal (30 bytes)   callee is too large
```

The problem is that the caller of this method is too large to absorb another
30 bytes of bytecode. What can we do?

Shrinking the code
------------------

One possible improvement is to replace the `String` argument with a `char`; in
other words, searching for the character `'.'` rather than a `String`
containing `"."`. Once changed, the method looks like this:

```java Looking up with a character
public String getNameOriginalChar() {
  String name = getClass().getName();
  int index = name.lastIndexOf('.');
  if (index != -1) name = name.substring(index + 1);
  return name;
}
```

It doesn’t make any difference to the bytecode length of the method, although it does drop a `String` constant from the constant pool, so the net effect is that the `.class` file will be slightly smaller:

```java Using a char instead of a String
public String getNameOriginalChar();
  Code:
     0: aload_0
     1: invokevirtual #18    // Method Object.getClass:()Ljava/lang/Class;
     4: invokevirtual #22    // Method Class.getName:()Ljava/lang/String;
     7: astore_1
     8: aload_1
     9: bipush        46     // Only difference - 46 is '.' in ASCII
    11: invokevirtual #45    // Method String.lastIndexOf:(I)I
    14: istore_2
    15: iload_2
    16: iconst_m1
    17: if_icmpeq     28
    20: aload_1
    21: iload_2
    22: iconst_1
    23: iadd
    24: invokevirtual #35    // Method String.substring:(I)Ljava/lang/String;
    27: astore_1
    28: aload_1
    29: areturn
```

How do these compare when executed? Is there a speed difference between
them? Well, we can turn to our trusty JMH tooling to answer that question:

```text JMH comparison
NameTest.getNameOriginal        avgt   25   34.439 +- 1.093   ns/op
NameTest.getNameOriginalChar    avgt   25   30.602 +- 1.001   ns/op
```


So there is a difference depending on whether the `lastIndexOf(String)` or
`lastIndexOf(int)` overload is used. Granted, it's not much, but it could
be worthwhile if the call is on a critical path.
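As a quick sanity check that the two overloads agree on results, and so the swap changes speed rather than behaviour, here is a small sketch (the class name `LastIndexOfCheck` is mine, not from the benchmark repository):

```java Checking the overloads agree
// lastIndexOf(String) and lastIndexOf(int) locate the same final '.',
// including the -1 "not found" case for default-package-style names.
class LastIndexOfCheck {
    static int byString(String name) { return name.lastIndexOf("."); }
    static int byChar(String name)   { return name.lastIndexOf('.'); }
    public static void main(String[] args) {
        String[] samples = { "java.lang.String", "NoPackage", "a.b.C" };
        for (String s : samples) {
            if (byString(s) != byChar(s))
                throw new AssertionError("overloads disagree for " + s);
        }
    }
}
```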

However, we can improve this method further by noting a couple of properties of the
runtime. All of the classes used on this code path are in a package; in other
words, `lastIndexOf` never returns `-1` in our case. We can remove the
conditional block by observing that, mathematically, if `lastIndexOf` returns `-1`
then adding `1` gives `0`, and `substring(0)` returns the whole string anyway.
In other words, these two pieces of code are semantically equivalent:

```java Before and After
// Before
if (i != -1) { name = name.substring(i+1); }
// After
name = name.substring(i+1);
```
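The equivalence can also be checked directly. This standalone sketch (the class name `SubstringEquivalence` is made up for illustration) compares the branching and branchless forms across packaged and unpackaged names:

```java Checking the equivalence
// Confirms dropping the branch is behaviour-preserving: when lastIndexOf
// returns -1, adding 1 gives substring(0), which yields the whole string.
class SubstringEquivalence {
    static String withBranch(String name) {
        int index = name.lastIndexOf('.');
        if (index != -1) name = name.substring(index + 1);
        return name;
    }
    static String branchless(String name) {
        return name.substring(name.lastIndexOf('.') + 1);
    }
    public static void main(String[] args) {
        String[] samples = { "java.lang.String", "DefaultPackageClass", "a.b.C" };
        for (String s : samples) {
            if (!withBranch(s).equals(branchless(s)))
                throw new AssertionError("mismatch for " + s);
        }
    }
}
```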

There is a minor behavioural change: we now call `substring` unconditionally, even for classes in the default package. However, since we know that doesn’t occur on our code path, it makes no difference. What effect does it have on the generated bytecode?

```java Avoiding branch, inlining lastIndexOf
public String getNameChar() {
  String name = getClass().getName();
  return name.substring(name.lastIndexOf('.') + 1);
}
```


This compiles to 21 bytecodes:

```java After transformation
public java.lang.String getNameChar();
  Code:
     0: aload_0
     1: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     4: invokevirtual #11    // Method java/lang/Class.getName:()Ljava/lang/String;
     7: astore_1
     8: aload_1
     9: aload_1
    10: bipush        46
    12: invokevirtual #15    // Method java/lang/String.lastIndexOf:(I)I
    15: iconst_1
    16: iadd
    17: invokevirtual #14    // Method java/lang/String.substring:(I)Ljava/lang/String;
    20: areturn
```

The speedup can be measured using JMH, both with a char and with a String:

```text JMH comparison
NameTest.getNameChar      avgt   25   33.284 +- 0.334   ns/op
NameTest.getNameString    avgt   25   34.935 +- 0.464   ns/op
```

These aren’t faster (or slower) than their original counterparts, but the shorter bytecode permits more in-lining opportunities for their callers. Running the benchmark again with `-XX:+PrintInlining` shows the method being inlined:

```text Inline (hot)
  @ 2   org.example.class::getNameChar (21 bytes)   inline (hot)
```

Nice!

Going faster
------------

Of course, one other way of optimising the code is to cache the values. This can be done fairly trivially with a simple `Map` of `Class` to `String` values, and in this case, the cost of calculating a missing key can be amortised to zero:

```java Using a Map to cache names
public String getNameMapCache() {
  return ClassMap.get(getClass());
}

class ClassMap {
  private static final Map<Class<?>, String> cache = new HashMap<Class<?>, String>();

  public static String get(Class<?> clazz) {
    return cache.computeIfAbsent(clazz, ClassMap::calculateShortName);
  }

  private static String calculateShortName(Class<?> clazz) {
    String name = clazz.getName();
    return name.substring(name.lastIndexOf('.') + 1);
  }
}
```


When tested under JMH, this has an improved performance profile, and avoids
creating garbage:

```text Testing name map cache
NameTest.getNameMapCache                   avgt   25    11.807 +-  0.059   ns/op
NameTest.getNameMapCache: gc.alloc.rate    avgt   25     ≈ 10⁻⁴            MB/sec
```
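One caveat worth noting: `HashMap.computeIfAbsent` is not safe under concurrent mutation. If the cache might be hit from multiple threads, `ConcurrentHashMap` is a drop-in replacement; the following is a sketch of that variant (the name `ConcurrentClassMap` is mine, not from the repository):

```java A thread-safe variant of the cache
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// ConcurrentHashMap.computeIfAbsent is atomic per key, so
// calculateShortName runs at most once per class even when
// several threads miss the cache simultaneously.
class ConcurrentClassMap {
    private static final Map<Class<?>, String> CACHE = new ConcurrentHashMap<>();

    public static String get(Class<?> clazz) {
        return CACHE.computeIfAbsent(clazz, ConcurrentClassMap::calculateShortName);
    }

    private static String calculateShortName(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
```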

However, there’s a more optimised version (one that doesn’t require lambdas) which can be used to cache a per-class value. `java.lang.ClassValue` was added in Java 7 to support dynamic languages, but we can press it into service here:

```java Using ClassValue to cache names
public String getNameClassValueCache() {
  return ClassName.DEFAULT.get(getClass());
}

class ClassName extends ClassValue<String> {
  public static final ClassValue<String> DEFAULT = new ClassName();

  protected String computeValue(Class<?> clazz) {
    String name = clazz.getName();
    return name.substring(name.lastIndexOf('.') + 1);
  }
}
```
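A quick way to convince ourselves of `ClassValue`’s caching behaviour is to count how often the computation runs: repeated `get` calls for the same class reuse the stored value. This standalone sketch (class name mine) does exactly that:

```java Observing the ClassValue cache
import java.util.concurrent.atomic.AtomicInteger;

// Counts invocations of computeValue: after two get() calls for the
// same class, the counter should still be 1.
class ClassValueDemo {
    static final AtomicInteger COMPUTES = new AtomicInteger();
    static final ClassValue<String> SHORT_NAME = new ClassValue<String>() {
        @Override protected String computeValue(Class<?> clazz) {
            COMPUTES.incrementAndGet();
            String name = clazz.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        }
    };
    public static void main(String[] args) {
        String first = SHORT_NAME.get(String.class);
        String second = SHORT_NAME.get(String.class);
        if (!first.equals("String") || !second.equals("String") || COMPUTES.get() != 1)
            throw new AssertionError("unexpected caching behaviour");
    }
}
```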


What happens if we run this under JMH? It looks like this:

```text Testing class value cache
NameTest.getNameClassValueCache            avgt   25     6.535 +-  0.045   ns/op
NameTest.getNameClassValueCache: gc.alloc.rate avgt 25    ≈ 10⁻⁴           MB/sec
```

One of the reasons for this improvement is that `Map` is a generic structure that needs to scale to millions of entries, whereas `ClassValue` is optimised for the smaller case. The cache lookup code is a little longer, but since we no longer need to allocate and copy the class names, we save some execution time there as well. Here’s what the 14 bytecodes look like:

```java Decompiled lookup of cache
public java.lang.String getNameClassValueCache();
  Code:
     0: getstatic     #16    // Field com/bandlem/jmh/microopts/ClassName.DEFAULT:Ljava/lang/ClassValue;
     3: aload_0
     4: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     7: invokevirtual #17    // Method java/lang/ClassValue.get:(Ljava/lang/Class;)Ljava/lang/Object;
    10: checkcast     #2     // class java/lang/String
    13: areturn
```


The code gets inlined as expected:

```text inlining works
  @ 2   org.example.class::getNameClassValueCache (17 bytes)   inline (hot)
    @ 4   java.lang.Object::getClass (0 bytes)   (intrinsic)
    @ 7   java.lang.ClassValue::get (31 bytes)   inline (hot)
      @ 1   java.lang.ClassValue::getCacheCarefully (20 bytes)   inline (hot)
        @ 14   java.lang.ClassValue$ClassValueMap::getCache (5 bytes)   accessor
      @ 7   java.lang.ClassValue$ClassValueMap::probeHomeLocation (13 bytes)   inline (hot)
        @ 6   java.lang.ClassValue$ClassValueMap::loadFromCache (9 bytes)   inline (hot)
        @ 9   java.lang.ClassValue::castEntry (2 bytes)   inline (hot)
      @ 13   java.lang.ClassValue::match (21 bytes)   inline (hot)
        @ 5   java.lang.ref.Reference::get (5 bytes)   (intrinsic)
      @ 20   java.lang.ClassValue$Entry::value (9 bytes)   inline (hot)
        @ 1   java.lang.ClassValue$Entry::assertNotPromise (22 bytes)   inline (hot)
      @ 27   java.lang.ClassValue::getFromBackup (21 bytes)   executed < MinInliningThreshold times
```

Reproducing the benchmarks
--------------------------

The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of memory. The full list of results and the source code to reproduce them are at https://github.com/alblue/com.bandlem.jmh.microopts

Summary
-------

Being able to measure the different implementation choices, and focussing not only on the speed of the code but also on the amount of bytecode, allows the method to be in-lined in places where it wasn’t before. Even small changes can tip the balance between being in-lined and not. Other performance optimisations are possible as well, but they need to be tested in situ with the caller code; in other words, test the changes not only against the implementation of the method, but also where it is used.