
AlBlue’s Blog

Macs, Modularity and More

QCon London 2020 Day 1

2020 qcon conference

QCon London is one of my favourite conferences. I don’t know exactly how many I have been to, but I’m into double figures now. This year’s 14th QCon London is something I’ve been looking forward to for some time, not least because I’m speaking on performance on Tuesday and hosting the Java track on Wednesday, unless anything changes.

This year, of course, QCon London is being held against the backdrop of the coronavirus starting its worldwide tour. We’ve already seen a few global conferences cancelled (such as GDC and the Mobile one), and I had feared that QCon London would be next on the list. Fortunately, the planning staff for both the conference itself and the QEII centre have been remarkably well prepared; there were hand sanitisers on the way in, personal sanitisers handed out, and instead of the usual help-yourself lunches that have been a staple of prior conferences, this year there are pre-packed bento boxes. In fact, the only unpackaged food has been the bananas and oranges on the fruit stands, and arguably they come in their own packaging. Together with the code of conduct and the pronoun badges, I think QCon is showing other conferences the way to go.

That said, global health has been a concern, and people are adapting. Gone are the handshakes one might have expected, replaced with a friendly ‘elbow bump’ or other non-contact greeting. Of course, it probably didn’t help that a conference held at the QEII a couple of weeks ago had a case of coronavirus, but with that foreknowledge I think the preparations have been handled admirably.

The only other problem to report, before writing up what I went to see, was the fire alarm. A vape-related incident on the 2nd floor (note to people who vape: do it outside, please) caused the entire building to be evacuated. There was a brief interlude while a lot of people huddled outside – fortunately, in the sunshine – and then we went back in. The alarm on the 2nd floor unfortunately kept ringing, but the rest of the floors were OK. I feel sorry for the presenters who were mid-flow when the alarm sounded; it’s not great being interrupted like that.

So, on to today’s topics. The morning keynote was by @aallan on the problems with today’s Internet of Things. There aren’t any slides available, but the content was excellent; the first half talked about how a failure in a consumer product shouldn’t have a failure mode of a dead cat (unlike one commercial provider whose pet-feeding devices failed to dispense food for over a week). In particular, he looked at running TensorFlow on low-power devices, like the Raspberry Pi. He has a blog post on running TensorFlow on various devices, highlighting that the new Raspberry Pi 4 is sufficient for machine learning tasks, and that the even lower-power SparkFun Edge can do machine learning while powered by a CR2032 button battery. Seriously impressive stuff!

The next talk I went to was @samnewman on journeys from monolith to micro-services, and strategies for achieving that. He discussed refactoring in place from a monolith to a modular monolith, and using an HTTP(S) proxy between the client and the back-end implementation as a control and dispatch point for diverting traffic between new and old versions, whether in a live-live, red/green or phased deployment. His slides are available, and he referenced Michael Feathers’ “Working Effectively with Legacy Code” as well as James Governor’s “Towards Progressive Delivery”.

One talk you should watch when it becomes available is @deniseyu21’s presentation on why building distributed systems is so hard. The style of presentation (caution: cats included) and the takeaways were among the best I’ve seen; in particular, viewing the traditional CAP theorem as a see-saw rather than a triangle (you don’t get to ignore partition tolerance) means you have to decide whether you want availability at the cost of consistency, or vice versa. She referenced Jepsen’s views on consistency and is writing a children’s book on the subject.

The next talk I went to was Paul Hammant’s view on full-cycle developers; essentially, how to ensure that developers and support staff are aligned with the delivery of a service. His slides weren’t available, but the idea was to make sure that L1 and L2 support staff are properly enabled through runbooks produced by the developer team, and that developers rotate through L2 support so they gain empathy and see how they can improve the product. He referred to a couple of blog posts he’d written: hacking with JavaScript and like a used sofa.

Alexandra Noonan’s talk on “From monolith to micro-services and back again” was entertaining, and talked about the difficult issues they had in the transition, the subsequent explosion of services, and then the requirement to bring them back together again. The issues they had in the first instance were due to slow development; moving to micro-services sped that up, but then the explosion meant that library updates weren’t consistent, and so bringing them back to a monolith fixed that problem. Slides aren’t yet available, but I’ll update this post with the links when they are.

The final talk I went to was @hatofmonkeys talking about what can go wrong in the launch of a new massively multiplayer online game, Monopoly City Streets. Essentially, an underfunded project with a massive marketing budget ended up attracting rather more than the 20,000 expected gamers – 1.7m, to be exact. A lot of this talk was about how to survive under pressure: how 56-hour stints are not to be recommended, how a load average of over 400 on a dual-core system is not great, and how PHP is full of facepalm. But the takeaway was a plea to allow others who have experienced failure (and failure at scale) to come and share their stories so that everyone can learn. I hope that Colin gives this talk again, and that it was recorded; his style of delivery was both entertaining and comedic.

All in all, the 14th QCon London first day was excellent, despite the prevailing conditions. Looking forward to tomorrow – and my talk, which is at 16:10 in St James. See you there!

Measuring bundle startup time with JFR

2020 java performance osgi eclipse

Several years ago, I gave a presentation at EclipseCon 2016 on “Optimising Eclipse Plug-ins” in which I talked about a few different ways of making Eclipse faster, at least from a start-up and memory usage perspective.

The presentation wasn’t recorded, but is available on SpeakerDeck as well as a narrated copy I recorded subsequently at Vimeo.

One of the techniques I discussed was using Java Flight Recorder as a means to record how long bundles take to start up, since the main cost of starting up Eclipse (or other OSGi applications) is how long it takes for the bundles to move into the STARTED state. Bundles that have a Bundle-Activator can slow this down, and since most OSGi frameworks start bundles sequentially, a slow activator can lead to a degraded user experience.

At the original time of writing, JFR was still a commercial feature in Java 8 (requiring the -XX:+UnlockCommercialFeatures flag and the wrath of licensing), and the API was evolving in Java 9 onwards. However, now that Java 11 has been released, we have a free (both as in beer and as in speech) implementation that we can code against.

A number of API changes meant that the code needed some TLC, and so I recently uploaded a new version of the JFRBundleListener project to GitHub. The project can be downloaded and installed into an OSGi runtime (or Eclipse application) to find out where the start-up costs lie. At the present time, I have not published it anywhere for binary consumption.

The README.md contains more information about how to use it, but the TL;DR is:

  • Build and install the bundle with a low start level (e.g. 1) so that it tracks all subsequent bundle activations
  • Start your OSGi framework or Eclipse application with the magic invocation to start JFR recording, e.g. -XX:StartFlightRecording=filename=/tmp/dump.jfr,dumponexit=true,disk=true
  • Load the recording into Java Mission Control for inspection, or use the bundled java -jar JFRBundleListener.jar < /tmp/dump.jfr > /tmp/stacks.txt to generate a stack list that can be processed by FlameGraph.pl
  • Use Brendan Gregg’s flamegraph tool to create an interactive SVG: flamegraph.pl --hash --countname ms < /tmp/stacks.txt > /tmp/bundles.svg

The resulting SVG looks something like this:

Bundle startup

Custom JFR events

The API for interacting with JFR has improved significantly since its inception in Java 8. I am looking forward to the backport of JFR from Java 11 to Java 8, which should improve the experience significantly there.

To create a custom event, subclass the jdk.jfr.Event class. Provided that it has public fields of known Java primitive or String types, they will be written out when the event is committed. The JFRBundleEvent class is used to surface an OSGi BundleEvent, and persists the pertinent data into the JFRBundleEvent instance upon creation.
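
As a sketch of what such an event class might look like (this is an illustrative example, not the actual JFRBundleEvent source; the event and field names are invented):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Illustrative custom JFR event: public fields of primitive or String
// types are written out automatically when the event is committed.
@Name("osgi.BundleStart")
@Label("Bundle Start")
class BundleStartEvent extends Event {
    @Label("Bundle-SymbolicName")
    public String symbolicName;
    @Label("Bundle-Version")
    public String version;
}
```

Calling commit() on an instance records it if a flight recording is active, and is a cheap no-op otherwise.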

These are created when receiving (synchronous) events via the SynchronousBundleListener implemented in the JFRBundleListener class. To mark a span, the event has begin() called at the start of a bundle coming on-line, and end() once it has finished. The commit() call ensures that the bundle event is recorded; provided that commit() occurs at some point later, and that the object is not reused in the meantime, it can be deferred.
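
The begin()/end()/commit() span pattern can be sketched in isolation like this (the event type and task here are invented for illustration; the real listener brackets OSGi bundle activation instead):

```java
import jdk.jfr.Event;
import jdk.jfr.Name;

// Illustrative span event: begin() and end() bracket the work being
// measured, and commit() records the event (it may also be deferred).
@Name("example.Task")
class TaskEvent extends Event {
    public String taskName;
}

class SpanExample {
    static String runTask(String name) {
        TaskEvent event = new TaskEvent();
        event.taskName = name;
        event.begin();                      // span starts
        String result = name.toUpperCase(); // the work being measured
        event.end();                        // span ends
        event.commit();                     // record; could happen later instead
        return result;
    }
}
```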

The event information is recorded with specific key fields; they are identified using the field name in the class, or the one specified with the @Name annotation on the event itself. For simplicity, I have chosen the OSGi bundle headers as the identifiers in the event, which may make post-processing easier.

Processing JFR files

Similarly, the API for processing a JFR file has become significantly better. There is a helper method RecordingFile.readAllEvents(), which translates a dump file at a given path into a list of RecordedEvent objects; alternatively, a RecordingFile can be used as an iterator to read through the events one at a time.

Events are identified by type; if not specified then the fully qualified name of the event class is used. This makes it easy to filter out specific events from the log, although it may be desirable that they are combined with other JFR events (e.g. thread sleeping, class loading, garbage collections etc.) in the log.
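
The two steps above (reading a dump file and filtering by event type name) can be sketched as follows. This is an illustrative example rather than the JFRFlameGraph code: the example.Marker event is invented, and jdk.jfr.Recording is used purely to produce a file to read back:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.Event;
import jdk.jfr.Name;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

class DumpReader {
    @Name("example.Marker")
    static class MarkerEvent extends Event {
        public String note;
    }

    // Record a single custom event into a temporary dump file.
    static Path recordOne(String note) {
        try {
            Path dump = Files.createTempFile("dump", ".jfr");
            try (Recording recording = new Recording()) {
                recording.start();
                MarkerEvent event = new MarkerEvent();
                event.note = note;
                event.commit();
                recording.stop();
                recording.dump(dump);
            }
            return dump;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // readAllEvents translates the whole dump into RecordedEvent objects;
    // events are then filtered by their fully qualified type name.
    static List<String> notes(Path dump) {
        List<String> result = new ArrayList<>();
        try {
            for (RecordedEvent event : RecordingFile.readAllEvents(dump)) {
                if (event.getEventType().getName().equals("example.Marker")) {
                    result.add(event.getString("note"));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return result;
    }
}
```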

The JFRFlameGraph implementation provides a means to go through the log file and generate a stack of bundle start-ups, to allow investigation of why something is taking so long.

It prunes the list of events in the JFR recording to include only the JFRBundleEvent entries of interest, and then builds up a sorted data structure based on the earliest recorded start-up time. These are then iteratively popped off to write out events before the end time, so that the resulting file contains a list of the bundles starting up:

...
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.equinox.preferences 28
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.core.contenttype;org.eclipse.equinox.registry 877
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.core.contenttype 6
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime;org.eclipse.equinox.app 28
Start-Level...;org.eclipse.ui.trace;org.eclipse.core.runtime 14
Start-Level...;org.eclipse.ui.trace 10
Start-Level...;org.eclipse.update.configurator 977
...

These can then be post-processed with FlameGraph.pl to generate the interactive SVG above.
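
The folding step can be sketched in a much-simplified form (invented types; the real JFRFlameGraph works directly on RecordedEvent timestamps and also prepends the thread name, as described below):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified sketch: fold nested (name, start, end) spans into
// FlameGraph's "semicolon-separated-stack count" line format.
class StackFolder {
    record Span(String name, long start, long end) {}

    // Assumes spans are sorted by start time and properly nested.
    static List<String> fold(List<Span> spans) {
        List<String> lines = new ArrayList<>();
        Deque<Span> open = new ArrayDeque<>();
        for (Span span : spans) {
            // Close any spans that ended before this one starts.
            while (!open.isEmpty() && open.peek().end() <= span.start()) {
                open.pop();
            }
            open.push(span);
            // Emit the stack from outermost to innermost, plus the duration.
            StringBuilder stack = new StringBuilder();
            for (var it = open.descendingIterator(); it.hasNext();) {
                stack.append(it.next().name());
                if (it.hasNext()) stack.append(';');
            }
            lines.add(stack + " " + (span.end() - span.start()));
        }
        return lines;
    }
}
```

For example, spans A(0,100), B(10,50) and C(60,90) fold to "A 100", "A;B 40" and "A;C 30".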

The code prepends the Java thread name to the stack shown; this allows some threads (such as background Worker-n threads) to be ignored, while focussing on specific threads (such as the main thread and the Start-Level... thread). Since the SVG is interactive, clicking on the thread name will focus the view on just those bundles, which may provide more detailed analysis.

Summary

Java Flight Recorder is available as open source in Java 11, and can be used to export timing information using a custom event format. With sufficient post-processing of the data, it can generate flame graphs that can be more expressive than just looking at numbers.

The API for programmatically interacting with JFR recordings has been significantly improved, and being able to log timely information in conjunction with other events can be a significant bonus in performance analysis.

Micro-optimising class.getName

2020 java performance hotspot jmh

A while ago, I was looking into the performance of a common pattern in Java, getting the unqualified name of the class. For some reason, it wasn’t being inlined. The method was implemented like this:

Getting the unqualified class name
public String getNameOriginal() {
    String name = getClass().getName();
    int index = name.lastIndexOf(".");
    if (index != -1) name = name.substring(index + 1);
    return name;
}

This is compiled to 30 bytecodes:

Getting the unqualified class name (bytecode)
public java.lang.String getNameOriginal();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: ldc #12 // String .
11: invokevirtual #13 // Method java/lang/String.lastIndexOf:(Ljava/lang/String;)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn

Why does the size of the bytecode matter? Well, HotSpot decides whether or not to in-line a method based on a fixed bytecode-length threshold. Specifically, methods under 35 bytecodes are considered inlineable, and those that are larger aren’t:

MaxInlineSize
$ java -XX:+PrintFlagsFinal -version | grep MaxInlineSize
intx MaxInlineSize = 35
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

Having said that, limits are a bit like final in Java: not really limits. If a method is being called a lot, then it can still be inlined if it is under the ‘hot’ limit:

FreqInlineSize
$ java -XX:+PrintFlagsFinal -version | grep FreqInlineSize
intx FreqInlineSize = 325
openjdk version "11.0.6" 2020-01-14 LTS
OpenJDK Runtime Environment Zulu11.37+17-CA (build 11.0.6+10-LTS)
OpenJDK 64-Bit Server VM Zulu11.37+17-CA (build 11.0.6+10-LTS, mixed mode)

So bytecode length is important, and if methods fall outside that boundary then they don’t get in-lined. This can lead to sub-optimal performance for methods that are called frequently. We can see what happens by running the JVM with -XX:+PrintInlining:

Callee is too large
@ 2 org.example.class::getNameOriginal (30 bytes) callee is too large

The problem is that the caller of this method is too large to absorb another 30 bytes of bytecode. What can we do?

Shrinking the code

One possible improvement is to replace the String lookup with a char; in other words, replacing the String "." with the character '.'. Once changed, the method looks like:

Looking up with a character
public String getNameOriginalChar() {
    String name = getClass().getName();
    int index = name.lastIndexOf('.');
    if (index != -1) name = name.substring(index + 1);
    return name;
}

It doesn’t make any difference to the bytecode length of the method, although it does reduce the use of a String constant, so the net effect is that the .class file will be slightly smaller:

Using a char instead of a String
public String getNameOriginalChar();
Code:
0: aload_0
1: invokevirtual #18 // Method Object.getClass:()Ljava/lang/Class;
4: invokevirtual #22 // Method Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: bipush 46 // Only difference - 46 is '.' in ASCII
11: invokevirtual #45 // Method String.lastIndexOf:(I)I
14: istore_2
15: iload_2
16: iconst_m1
17: if_icmpeq 28
20: aload_1
21: iload_2
22: iconst_1
23: iadd
24: invokevirtual #35 // Method String.substring:(I)Ljava/lang/String;
27: astore_1
28: aload_1
29: areturn

How do these compare when executed? Is there a speed difference between them? Well, we can turn to our trusty JMH tooling to answer that question:

NameTest.getNameOriginal avgt 25 34.439 +- 1.093 ns/op
NameTest.getNameOriginalChar avgt 25 30.602 +- 1.001 ns/op

So there is a difference depending on whether the lastIndexOf(String) or lastIndexOf(char) method is used. Granted, it’s not much, but it could be worthwhile if the call is on a critical path.

However, we can improve this further by noting a couple of properties of the runtime. All of the classes used by this code path are in a package; in other words, lastIndexOf never returns -1 in our case. We can remove the conditional block by observing that, mathematically, if lastIndexOf returns -1 then adding 1 gives 0, and substring(0) will give us the whole string anyway. In other words, these two pieces of code are semantically equivalent:

Before and After
// Before
if (i != -1) { name = name.substring(i+1); }
// After
name = name.substring(i+1);
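
This equivalence is easy to check mechanically. Here is a small self-contained sketch (the method names are invented for illustration) comparing the guarded and branchless forms:

```java
class ShortName {
    // Before: guard against lastIndexOf returning -1
    static String guarded(String name) {
        int index = name.lastIndexOf('.');
        if (index != -1) name = name.substring(index + 1);
        return name;
    }

    // After: -1 + 1 == 0, and substring(0) returns the whole string
    static String branchless(String name) {
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
```

Both return "String" for "java.lang.String", and both return a default-package name such as "TopLevel" unchanged.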

There is a minor behavioural change: substring() is now called unconditionally, although for classes in the default package substring(0) simply returns the whole string. Since that doesn’t happen in our code path, it makes no difference. What effect does it have on the generated bytecode?

Avoiding branch, inlining lastIndex
public String getNameChar() {
    String name = getClass().getName();
    return name.substring(name.lastIndexOf('.') + 1);
}

This compiles to 21 bytecodes:

After transformation
public java.lang.String getNameChar();
Code:
0: aload_0
1: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
4: invokevirtual #11 // Method java/lang/Class.getName:()Ljava/lang/String;
7: astore_1
8: aload_1
9: aload_1
10: bipush 46
12: invokevirtual #15 // Method java/lang/String.lastIndexOf:(I)I
15: iconst_1
16: iadd
17: invokevirtual #14 // Method java/lang/String.substring:(I)Ljava/lang/String;
20: areturn

The speedup can be measured using JMH, both with a char and with a String:

NameTest.getNameChar avgt 25 33.284 +- 0.334 ns/op
NameTest.getNameString avgt 25 34.935 +- 0.464 ns/op

These aren’t faster (or slower) than their original counterparts, but the smaller bytecode permits more in-lining opportunities for their callers. Running again with -XX:+PrintInlining shows the method being inlined:

@ 2 org.example.class::getNameChar (21 bytes) inline (hot)

Nice!

Going faster

Of course, one other way of optimising the code is to cache the values. This can be done fairly trivially with a simple Map of Class to String values, in which case the cost of calculating a missing key is amortised towards zero:

Using a Map to cache names
public String getNameMapCache() {
    return ClassMap.get(getClass());
}
class ClassMap {
    private static final Map<Class<?>, String> cache
        = new HashMap<Class<?>, String>();
    public static String get(Class<?> clazz) {
        return cache.computeIfAbsent(clazz, ClassMap::calculateShortName);
    }
    private static String calculateShortName(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}

When tested under JMH, this has an improved performance profile, and avoids creating garbage:

Testing name map cache
NameTest.getNameMapCache avgt 25 11.807 +- 0.059 ns/op
NameTest.getNameMapCache: gc.alloc.rate avgt 25 = 10^-4 MB/sec

However, there’s a more optimised version (one that doesn’t require lambdas) that can be used to cache a value per class. The java.lang.ClassValue class was added in Java 1.7 to support dynamic languages, but we can use it here for our purposes:

Using ClassValue to cache names
public String getNameClassValueCache() {
    return ClassName.DEFAULT.get(getClass());
}
class ClassName extends ClassValue<String> {
    public static final ClassValue<String> DEFAULT = new ClassName();
    protected String computeValue(Class<?> clazz) {
        String name = clazz.getName();
        return name.substring(name.lastIndexOf('.') + 1);
    }
}
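
As a quick demonstration of the behaviour (using an anonymous subclass rather than the ClassName class above, so the sketch is self-contained): computeValue runs at most once per class, and subsequent get() calls return the cached value:

```java
// ClassValue caches one computed value per Class; computeValue is
// invoked lazily on first access and the result is then reused.
class ShortNameCache {
    static final ClassValue<String> SHORT_NAME = new ClassValue<String>() {
        @Override
        protected String computeValue(Class<?> clazz) {
            String name = clazz.getName();
            return name.substring(name.lastIndexOf('.') + 1);
        }
    };
}
```

For example, ShortNameCache.SHORT_NAME.get(java.util.HashMap.class) yields "HashMap", and repeated calls return the identical cached String instance.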

What happens if we run this under JMH? It looks like this:

Testing class value cache
NameTest.getNameClassValueCache avgt 25 6.535 +- 0.045 ns/op
NameTest.getNameClassValueCache: gc.alloc.rate avgt 25 = 10^-4 MB/sec

One of the reasons for this improvement is that the Map is a generic structure that needs to scale to millions of entries, whereas ClassValue is optimised for the smaller cases. The cache lookup code is a little longer, but since we don’t need to allocate and copy the class names, we save some execution time there as well. Here’s what the 14 bytecodes look like:

Decompiled lookup of cache
public java.lang.String getNameClassValueCache();
Code:
0: getstatic #16 // Field com/bandlem/jmh/microopts/ClassName.DEFAULT:Ljava/lang/ClassValue;
3: aload_0
4: invokevirtual #10 // Method java/lang/Object.getClass:()Ljava/lang/Class;
7: invokevirtual #17 // Method java/lang/ClassValue.get:(Ljava/lang/Class;)Ljava/lang/Object;
10: checkcast #2 // class java/lang/String
13: areturn

The code gets inlined as expected:

inlining works
@ 2 org.example.class::getNameClassValueCache (17 bytes) inline (hot)
@ 4 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 7 java.lang.ClassValue::get (31 bytes) inline (hot)
@ 1 java.lang.ClassValue::getCacheCarefully (20 bytes) inline (hot)
@ 14 java.lang.ClassValue$ClassValueMap::getCache (5 bytes) accessor
@ 7 java.lang.ClassValue$ClassValueMap::probeHomeLocation (13 bytes) inline (hot)
@ 6 java.lang.ClassValue$ClassValueMap::loadFromCache (9 bytes) inline (hot)
@ 9 java.lang.ClassValue::castEntry (2 bytes) inline (hot)
@ 13 java.lang.ClassValue::match (21 bytes) inline (hot)
@ 5 java.lang.ref.Reference::get (5 bytes) (intrinsic)
@ 20 java.lang.ClassValue$Entry::value (9 bytes) inline (hot)
@ 1 java.lang.ClassValue$Entry::assertNotPromise (22 bytes) inline (hot)
@ 27 java.lang.ClassValue::getFromBackup (21 bytes) executed < MinInliningThreshold times

Reproducing the benchmarks

The numbers were taken on a MacBook Pro 8,2 with a 2.3GHz Core i7 and 8GB of memory. The full list of results, and the source code to reproduce them, is at https://github.com/alblue/com.bandlem.jmh.microopts

Summary

Being able to measure different implementation choices, and focussing not only on the speed of the bytecode but also on its size, allows the method to be in-lined in places where it wasn’t before. Even small changes can tip the balance between being in-lined and not. Other performance optimisations are possible as well, but need to be tested in situ with the caller code; in other words, test the changes not only against the implementation of the method, but also where it is used.