
AlBlue’s Blog

Macs, Modularity and More

Measuring bundle startup time with JFR

2020, eclipse, java, osgi, performance

Several years ago, I gave a presentation at EclipseCon 2016 on “Optimising Eclipse Plug-ins” in which I talked about a few different ways of making Eclipse faster, at least from a start-up and memory usage perspective.

The presentation wasn’t recorded, but is available on SpeakerDeck as well as a narrated copy I recorded subsequently at Vimeo.

One of the techniques I discussed was using Java Flight Recorder as a means to record how long bundles take to start up, since the main cost of starting Eclipse (or other OSGi applications) is moving the bundles into the STARTED state. Bundles that have a Bundle-Activator can slow this down, and since the majority of OSGi frameworks start bundles sequentially, a single slow activator can degrade the user experience.

At the original time of writing, JFR was still a commercial feature in Java 8 (requiring the -XX:+UnlockCommercialFeatures flag and the wrath of licensing), and the API was evolving in Java 9 onwards. However, now that Java 11 has been released, we have a free (both as in beer and as in speech) implementation that we can code against.

A number of API changes meant that the code needed some TLC, and so I recently uploaded a new version of the JFRBundleListener project to GitHub. The project can be downloaded and installed into an OSGi runtime (or Eclipse application) to find out where the start-up costs lie. At the present time, I have not published it anywhere for binary consumption.

The README.md contains more information about how to use it, but the TL;DR is:

  • Build and install the bundle with a low start level (e.g. 1) so that it tracks all subsequent bundle activation
  • Start your OSGi framework or Eclipse application with the magic invocation to start JFR recording, e.g. -XX:StartFlightRecording=filename=/tmp/dump.jfr,dumponexit=true,disk=true
  • Load and inspect the recording in Java Mission Control, or use the embedded processor (java -jar JFRBundleListener.jar < /tmp/dump.jfr > /tmp/stacks.txt) to generate a stack list that can be processed by flamegraph.pl
  • Use Brendan Gregg’s flamegraph tool to create an interactive SVG: flamegraph.pl --hash --countname ms < /tmp/stacks.txt > /tmp/bundles.svg

The resulting SVG looks something like this:

[Bundle startup flame graph]

Custom JFR events

The API for interacting with JFR has improved significantly since its inception in Java 8. I am looking forward to the backport of JFR from Java 11 to Java 8, which should significantly improve the experience there too.

To create a custom event, subclass the jdk.jfr.Event class. Provided that it has public fields of known Java primitive or String types, it will be written out when committed. The JFRBundleEvent class is used to surface an OSGi BundleEvent, and persists the pertinent data from the BundleEvent into its fields upon creation.

These events are created when receiving (synchronous) events via the SynchronousBundleListener implemented in the JFRBundleListener class. To mark a span, begin() is called on the event when a bundle starts coming on-line, and end() once it has finished. Calling commit() ensures that the bundle event is recorded; provided that commit() occurs at some point afterwards, and that the object is not reused in the meantime, it can be deferred until later.
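A minimal sketch of this pattern looks like the following (the event and field names here are illustrative, not the actual JFRBundleEvent source):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical event class for illustration; the real JFRBundleEvent
// surfaces the OSGi bundle headers as its fields.
@Name("org.example.BundleStart")
@Label("Bundle Start")
class BundleStartEvent extends Event {
    @Label("Symbolic Name")
    public String symbolicName;
}

public class BundleStartDemo {
    public static void main(String[] args) {
        BundleStartEvent event = new BundleStartEvent();
        event.symbolicName = "org.example.bundle";
        event.begin();   // span start: bundle coming on-line
        // ... bundle activation work would happen here ...
        event.end();     // span end
        event.commit();  // record it, if a flight recording is active
    }
}
```

When no recording is in progress, begin(), end() and commit() are cheap no-ops, so a listener like this can be left installed permanently.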

The event information is recorded with specific key fields; they are identified using the field name in the class, or the one specified with the @Name annotation as defined in the event itself. For simplicity, I have chosen the OSGi bundle headers as the identifiers in the event, which may make post-processing it easier.

Processing JFR files

Similarly, the API for processing a JFR file has become significantly better. There is a helper method RecordingFile.readAllEvents() which translates a dump file at a given path into a list of RecordedEvent objects; alternatively, a RecordingFile instance can be used as an iterator to read through the events one at a time.

Events are identified by type; if not specified, the fully qualified name of the event class is used. This makes it easy to filter specific events out of the log, although it may be desirable to combine them with other JFR events (e.g. thread sleeping, class loading, garbage collection) in the log.
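A minimal reader using RecordingFile.readAllEvents might look like this (the argument handling is illustrative):

```java
import java.io.IOException;
import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Illustrative reader: prints the type name and duration of every event
// in a JFR dump, such as the /tmp/dump.jfr created earlier.
public class PrintEvents {
    public static void main(String[] args) throws IOException {
        for (RecordedEvent event : RecordingFile.readAllEvents(Path.of(args[0]))) {
            // getEventType().getName() is the @Name value, or the fully
            // qualified class name if no @Name was given
            System.out.println(event.getEventType().getName()
                    + " took " + event.getDuration().toMillis() + "ms");
        }
    }
}
```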

The JFRFlameGraph implementation provides a means to go through the log file and generate a stack of bundle start-ups, to allow investigation of why something is taking so long.

It prunes the list of events in the JFR recording to include only the JFRBundleEvent entries of interest, and then builds up a sorted data structure based on the earliest start-up time recorded. These are then iteratively popped off to write out events before the end time, so that the resulting file contains a list of the bundles starting up:

...
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.equinox.preferences 28
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.core.contenttype;org.eclipse.equinox.registry 877
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.core.contenttype 6
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.equinox.app 28
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime 14
Start-Level...;org.eclipse.ui.trace 10
Start-Level...;org.eclipse.update.configurator 977
...
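The prune-and-pop approach described above can be sketched in a simplified, standalone form (Span and StackSketch are hypothetical names; the real JFRFlameGraph also prepends thread names and works from the actual JFR event fields):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Spans sorted by start time are pushed onto a stack; spans that ended
// before the current one began are popped off first, so the nesting
// reflects which bundles started while others were still starting.
public class StackSketch {
    static final class Span {
        final String name;
        final long start, end; // milliseconds
        Span(String name, long start, long end) {
            this.name = name; this.start = start; this.end = end;
        }
    }

    static List<String> toStacks(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingLong(s -> s.start));
        Deque<Span> active = new ArrayDeque<>();
        List<String> lines = new ArrayList<>();
        for (Span span : sorted) {
            // pop spans that finished before this one began
            while (!active.isEmpty() && active.peekLast().end <= span.start) {
                active.removeLast();
            }
            active.addLast(span);
            StringBuilder path = new StringBuilder();
            for (Span s : active) {
                if (path.length() > 0) path.append(';');
                path.append(s.name);
            }
            lines.add(path + " " + (span.end - span.start));
        }
        return lines;
    }
}
```

Each output line is a semicolon-separated stack followed by a duration, which is exactly the folded format flamegraph.pl consumes.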

These can then be post-processed with FlameGraph.pl to generate the interactive SVG above.

The code prepends the Java thread name to the stack shown; this allows some threads (such as background Worker-n threads) to be ignored, while focussing on specific threads (such as the main thread and the Start-Level... thread). Since the SVG is interactive, clicking on the thread name will focus the view on just those bundles, which may provide more detailed analysis.

Summary

Java Flight Recorder is available as open source in Java 11, and can be used to export timing information using a custom event format. With sufficient post-processing of the data, it can generate flame graphs that can be more expressive than just looking at numbers.

The API for programmatically interacting with JFR recordings has been significantly improved, and being able to log timely information in conjunction with other events can be a significant bonus in performance analysis.

Micro optimising class.getName

2020, hotspot, java, jmh, performance

A while ago, I was looking into the performance of a common pattern in Java, getting the unqualified name of the class. For some reason, it wasn’t being inlined. The method was implemented like this:

Getting the unqualified class name
public String getNameOriginal() {
  String name = getClass().getName();
  int index = name.lastIndexOf(".");
  if (index != -1) name = name.substring(index + 1);
  return name;
}

This is compiled to 30 bytecodes:

Getting the unqualified class name (bytecode)
public java.lang.String getNameOriginal();
  Code:
     0: aload_0
     1: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     4: invokevirtual #11    // Method java/lang/Class.getName:()Ljava/lang/String;
     7: astore_1
     8: aload_1
     9: ldc           #12    // String .
    11: invokevirtual #13    // Method java/lang/String.lastIndexOf:(Ljava/lang/String;)I
    14: istore_2
    15: iload_2
    16: iconst_m1
    17: if_icmpeq     28
    20: aload_1
    21: iload_2
    22: iconst_1
    23: iadd
    24: invokevirtual #14    // Method java/lang/String.substring:(I)Ljava/lang/String;
    27: astore_1
    28: aload_1
    29: areturn

Why does the size of the bytecode matter? Well, HotSpot decides whether or not to in-line a method based on a fixed bytecode-length threshold. Specifically, methods under 35 bytecodes are considered inlineable, and those that are larger aren’t:

MaxInlineSize
$ java -XX:+PrintFlagsFinal -version | grep MaxInlineSize
  intx MaxInlineSize = 35
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

Having said that, limits are a bit like final in Java: not really. If a method is being called a lot, then it can be inlined if it is under the ‘hot’ limit:

FreqInlineSize
$ java -XX:+PrintFlagsFinal -version | grep FreqInlineSize
  intx FreqInlineSize = 325
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

So bytecode length is important, and methods that fall outside that boundary don’t get in-lined. This can lead to sub-optimal performance for methods that are called frequently. We can see this happen by running the JVM with -XX:+PrintInlining:

Callee is too large
 @ 2 org.example.class::getNameOriginal (30 bytes) callee is too large

The problem is that the caller of this method is too large to absorb another 30 bytes of bytecode. What can we do?

Shrinking the code

One possible improvement is to replace the String lookup with a char; in other words, searching for the character '.' instead of a String containing ".". Once changed, the method looks like:

Looking up with a character
public String getNameOriginalChar() {
  String name = getClass().getName();
  int index = name.lastIndexOf('.');
  if (index != -1) name = name.substring(index + 1);
  return name;
}

It doesn’t make any difference to the bytecode length of the method, although it does reduce the use of a String constant, so the net effect is that the .class file will be slightly smaller:

Using a char instead of a String
public String getNameOriginalChar();
  Code:
    0: aload_0
    1: invokevirtual #18     // Method Object.getClass:()Ljava/lang/Class;
    4: invokevirtual #22     // Method Class.getName:()Ljava/lang/String;
    7: astore_1
    8: aload_1
    9: bipush        46      // Only difference - 46 is '.' in ASCII
   11: invokevirtual #45     // Method String.lastIndexOf:(I)I
   14: istore_2
   15: iload_2
   16: iconst_m1
   17: if_icmpeq     28
   20: aload_1
   21: iload_2
   22: iconst_1
   23: iadd
   24: invokevirtual #35     // Method String.substring:(I)Ljava/lang/String;
   27: astore_1
   28: aload_1
   29: areturn

How do these compare when executed? Is there a speed difference between them? Well, we can turn to our trusty JMH tooling to answer that question:

Using a char instead of a String
NameTest.getNameOriginal                   avgt   25    34.439 ±  1.093   ns/op
NameTest.getNameOriginalChar               avgt   25    30.602 ±  1.001   ns/op

So there is a difference depending on whether the lastIndexOf(String) or lastIndexOf(int) method is used. Granted, it’s not much, but it could be worthwhile if the call is on a critical path.

However, we can improve this further by noting a couple of properties of the runtime. All of the classes used by this code path are in a named package; in other words, lastIndexOf never returns -1 in our case. We can remove the conditional block by observing that, mathematically, if lastIndexOf returns -1 then adding 1 gives 0, and substring(0) returns the whole string anyway. In other words, these two pieces of code are semantically equivalent:

Before and After
// Before
if (i != -1) { name = name.substring(i+1); }
// After
name = name.substring(i+1);

There is a minor behavioural change: the substring call now happens for every class, including those in the default package. However, since that doesn’t occur in our code path, it makes no difference. What effect does it have on the generated bytecode?
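The default-package edge case can be checked in isolation (shortName here is a hypothetical standalone helper, not part of the original class):

```java
public class ShortNameCheck {
    // Same transformation as the branch-free version, applied to a class name
    static String shortName(String name) {
        return name.substring(name.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        System.out.println(shortName("java.lang.String")); // String
        // In the default package, lastIndexOf returns -1, so
        // substring(-1 + 1) == substring(0) is the whole name
        System.out.println(shortName("TopLevel"));         // TopLevel
    }
}
```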

Avoiding the branch, inlining the lastIndexOf call
public String getNameChar() {
  String name = getClass().getName();
  return name.substring(name.lastIndexOf('.') + 1);
}

This compiles to 21 bytecodes:

After transformation
public java.lang.String getNameChar();
  Code:
     0: aload_0
     1: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     4: invokevirtual #11    // Method java/lang/Class.getName:()Ljava/lang/String;
     7: astore_1
     8: aload_1
     9: aload_1
    10: bipush        46
    12: invokevirtual #15    // Method java/lang/String.lastIndexOf:(I)I
    15: iconst_1
    16: iadd
    17: invokevirtual #14    // Method java/lang/String.substring:(I)Ljava/lang/String;
    20: areturn

The speedup can be measured using JMH, both with a char and with a String:

NameTest.getNameChar                       avgt   25    33.284 ±  0.334   ns/op
NameTest.getNameString                     avgt   25    34.935 ±  0.464   ns/op

These aren’t faster (or slower) than their original counterparts, but the shorter bytecode permits more in-lining opportunities for their callers. Running again with -XX:+PrintInlining shows the method being inlined:

 @ 2 org.example.class::getNameChar (21 bytes) inline (hot)

Nice!

Going faster

Of course, one other way of optimising the code is to cache the values. This can be done fairly trivially with a simple Map of Class to String values, and in this case, the cost of calculating a missing key can be amortised to zero:

Using a Map to cache names:
public String getNameMapCache() {
  return ClassMap.get(getClass());
}

class ClassMap {
  private static final Map<Class<?>, String> cache
    = new HashMap<Class<?>, String>();

  public static String get(Class<?> clazz) {
    return cache.computeIfAbsent(clazz, ClassMap::calculateShortName);
  }

  private static String calculateShortName(Class<?> clazz) {
    String name = clazz.getName();
    return name.substring(name.lastIndexOf('.') + 1);
  }
}

When tested under JMH, this has an improved performance profile, and avoids creating garbage:

Testing name map cache
NameTest.getNameMapCache                   avgt   25    11.807 ±  0.059   ns/op
NameTest.getNameMapCache:·gc.alloc.rate    avgt   25    ≈ 10⁻⁴           MB/sec

However, there’s a more optimised approach (one that doesn’t require lambdas) for caching a value per class. java.lang.ClassValue was added in Java 7 to support dynamic languages, but we can use it here for our benefit:

Using ClassValue to cache names:
public String getNameClassValueCache() {
  return ClassName.DEFAULT.get(getClass());
}

class ClassName extends ClassValue<String> {
  public static final ClassValue<String> DEFAULT = new ClassName();
  protected String computeValue(Class<?> clazz) {
    String name = clazz.getName();
    return name.substring(name.lastIndexOf('.') + 1);
  }
}

What happens if we run this under JMH? It looks like this:

Testing class value cache
NameTest.getNameClassValueCache            avgt   25     6.535 ±  0.045   ns/op
NameTest.getNameClassValueCache:·gc.alloc.rate avgt 25   ≈ 10⁻⁴           MB/sec

One of the reasons for this improvement is that Map is a generic structure that needs to scale to millions of entries, whereas ClassValue is optimised for the smaller cases. The cache lookup code is a little longer, but since we don’t need to allocate and copy the class names, we save some execution time there as well. Here’s what the 14 bytecodes look like:

Decompiled lookup of cache
public java.lang.String getNameClassValueCache();
  Code:
     0: getstatic     #16    // Field com/bandlem/jmh/microopts/ClassName.DEFAULT:Ljava/lang/ClassValue;
     3: aload_0
     4: invokevirtual #10    // Method java/lang/Object.getClass:()Ljava/lang/Class;
     7: invokevirtual #17    // Method java/lang/ClassValue.get:(Ljava/lang/Class;)Ljava/lang/Object;
    10: checkcast     #2     // class java/lang/String
    13: areturn

The code gets inlined as expected:

inlining works
  @ 2   org.example.class::getNameClassValueCache (17 bytes)   inline (hot)
    @ 4   java.lang.Object::getClass (0 bytes)   (intrinsic)
    @ 7   java.lang.ClassValue::get (31 bytes)   inline (hot)
      @ 1   java.lang.ClassValue::getCacheCarefully (20 bytes)   inline (hot)
  @ 14   java.lang.ClassValue$ClassValueMap::getCache (5 bytes)   accessor
      @ 7   java.lang.ClassValue$ClassValueMap::probeHomeLocation (13 bytes)   inline (hot)
  @ 6   java.lang.ClassValue$ClassValueMap::loadFromCache (9 bytes)   inline (hot)
  @ 9   java.lang.ClassValue::castEntry (2 bytes)   inline (hot)
      @ 13   java.lang.ClassValue::match (21 bytes)   inline (hot)
  @ 5   java.lang.ref.Reference::get (5 bytes)   (intrinsic)
      @ 20   java.lang.ClassValue$Entry::value (9 bytes)   inline (hot)
  @ 1   java.lang.ClassValue$Entry::assertNotPromise (22 bytes)   inline (hot)
      @ 27   java.lang.ClassValue::getFromBackup (21 bytes)   executed < MinInliningThreshold times

Reproducing the benchmarks

The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of memory. The full list of results, and the source code to reproduce them, is at https://github.com/alblue/com.bandlem.jmh.microopts

Summary

Being able to measure different implementation choices, and focussing not only on the speed of the bytecode but also on its size, allows a method to be in-lined in places where it wasn’t before. Even small changes can tip the balance between not being in-lined and being in-lined. Other performance optimisations are possible as well, but they need to be tested in-situ with the caller code; in other words, test the changes not only against the implementation of the method, but also where it is used.

AlBlue is Back

2020

For the last three and a half years, I’ve been working at Apple on Swift (and other internal projects), which has meant that I’ve been quieter on both my blog and in the open source community than I’d have liked. I’ve still been tweeting as @alblue, although mainly about Brexit rather than technology.

The good news is that I have now left Apple, and am about to start work at Santander on their cloud platform strategy. As a result, I’m able to be a bit more open in the open source community and on my blog. I’m even able to write for InfoQ again and I expect that to continue as before.

One of the things I’m happy to announce is that not only will I be presenting at QConLondon 2020 on Understanding CPU Microarchitecture to Increase Performance, but I’ll also be hosting the Evolving Java track with some great speakers. If you’re coming along to QConLondon, look me up in either of those two rooms.

Finally, I want to thank all of my ex-colleagues at Apple, who have been inspiring and supportive over the last few years. I’ll miss you, but the time was right for me to move on to bigger and better things!