
AlBlue’s Blog

Macs, Modularity and More

StringBuffer and StringBuilder performance with JMH

2016 Java Eclipse Jmh Performance Optimisation

Last week, Doug Schaefer wished on Twitter that other Eclipse projects were getting the same kind of contribution love as Platform UI. Lars Vogel attributed that to the effort in cleaning up the codebase and the focus on new contributions and contributors.

I thought I’d spend some time helping out CDT in assisting with this effort, and over the past week or so have been sending a few patches that way. Fortunately Sergey Prigogin has been an excellent reviewer, turning around my patches in a matter of hours in some cases, and that in turn has meant that I’ve been able to make further and faster progress than on some of the other projects I’ve tried contributing improvements to.

Most recently I’ve been looking into optimising some of the StringBuffer code and thought I’d go into a little bit of detail about the performance aspects of these changes.

The TL;DR of this post is:

  • StringBuilder is better than StringBuffer
  • StringBuilder.append(a).append(b) is better than StringBuilder.append(a+b)
  • StringBuilder.append(a).append(b) is better than StringBuilder.append(a); StringBuilder.append(b);
  • StringBuilder.append() and + are only equivalent provided that they are not nested and you don’t need to pre-size the builder
  • Pre-sizing the StringBuilder is like pre-sizing an ArrayList; if you know the approximate size you can reduce the garbage by specifying a capacity up-front

Most of this may be common knowledge but I hope that I can back this up with data using JMH.

Introduction to JMH

The Java Microbenchmark Harness or JMH is the tool to use for performance testing microbenchmarks. In the same way that JUnit is the de facto standard for testing, JMH is the de facto standard for performance measurement. There’s a great thread that goes into the details behind some of JMH’s evolution and the choices that were made; and the fact that since then it seems to have edged out other performance testing benchmark tools like Caliper seems to be a good indicator of its future existence.

JMH projects can be bootstrapped from mvn and then compiled/post annotated with the launcher to generate a benchmarks.jar file, which contains the code under test as well as a copy of the JMH code in an uber JAR. It also helpfully sets up a command line interface that you can use to test your code, and is the simplest way to generate a project.

You can create a stub JMH project using the steps on the JMH homepage:

```sh Generating a JMH project with mvn
$ mvn archetype:generate
```

From the command line, the sample project can be run by executing:

```sh Compiling and Running the JMH benchmark
$ mvn clean package
$ java -jar target/benchmarks.jar
```

There are a lot of flags that can be passed on the command line; passing -h will show the full list.

Using JMH in Eclipse

If you’re trying to run JMH in Eclipse, you will need to ensure that annotation processing is enabled. That’s because JMH uses annotations not only to mark the benchmarks, but also uses an annotation processing tool to transform the benchmarked code into executable units. If you don’t have annotation processing enabled and try to run it, you’ll see a cryptic message like Unable to read /META-INF/BenchmarkList.

If you’ve created a Maven project (and presumably, therefore, have m2e installed) the easiest way is to install JBoss’ m2e-apt connector, which allows you to configure the project for JDT’s support for APT. This can be installed from Eclipse → Preferences → Discovery and choosing the m2e-apt connector. After restart this can be used to enable the JDT support automatically by going to Window → Preferences → Maven → Annotation Processing and then choosing the “Automatically configure JDT APT” option.

If you’re not using Maven then you can add the jmh-generator-annprocess JAR (along with its dependencies) to the project’s Java Compiler → Annotation Processing → Factory Path, and ensure that the annotation processing is switched on.

Tests can then be run by creating a launch configuration to run the main class org.openjdk.jmh.Main or by using the JMH APIs.

StringBuilder vs StringBuffer benchmark

So having got the basis for benchmarking set up, it’s time to look at the performance of the StringBuilder vs the StringBuffer. It’s a good idea to see what the performance of the empty buffers is like before we start adding content to them:

```java
import org.openjdk.jmh.annotations.Benchmark;

public class StringBenchmark {
    @Benchmark
    public String testEmptyBuffer() {
        StringBuffer buffer = new StringBuffer();
        return buffer.toString();
    }

    @Benchmark
    public String testEmptyBuilder() {
        StringBuilder builder = new StringBuilder();
        return builder.toString();
    }

    @Benchmark
    public String testEmptyLiteral() {
        return "";
    }
}
```

Two things are worth calling out: the first is that the resulting expression
you're using always has to be returned to the caller, otherwise the JIT will
optimise the code away. The second is that it's worth testing the empty case
first of all so that it sets a baseline for measurement.

We can run it from the command line by doing:

$ mvn clean package
$ java -jar target/benchmarks.jar Empty \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                      Mode  Cnt  Score   Error  Units
StringBenchmark.testEmptyBuffer   avgt   20  8.306 +- 0.497  ns/op
StringBenchmark.testEmptyBuilder  avgt   20  8.253 +- 0.416  ns/op
StringBenchmark.testEmptyLiteral  avgt   20  3.510 +- 0.139  ns/op

The flags used here are -wi (warmup iterations), -tu (time unit; nanoseconds), -f (number of forked JVMs) and -bm (benchmark mode; in this case, average time).

Somewhat unsurprisingly the values are relatively similar, with the return literal being the fastest.

What if we’re concatenating two strings? We can write a method to test that as well:

```java
@Benchmark
public String testHelloWorldBuilder() {
    StringBuilder builder = new StringBuilder();
    builder.append("Hello");
    builder.append("World");
    return builder.toString();
}

@Benchmark
public String testHelloWorldBuffer() {
    StringBuffer buffer = new StringBuffer();
    buffer.append("Hello");
    buffer.append("World");
    return buffer.toString();
}
```

When run, it looks like:

$ mvn clean package
$ java -jar target/benchmarks.jar Hello \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                           Mode  Cnt   Score   Error  Units
StringBenchmark.testHelloWorldBuffer   avgt   20  25.747 +- 1.188  ns/op
StringBenchmark.testHelloWorldBuilder  avgt   20  25.411 +- 1.015  ns/op

Not much difference there, although the Buffer is marginally slower than the Builder (by less than the reported error margin). That shouldn’t be too surprising; they are both subclasses of AbstractStringBuilder anyway, which has all the logic.
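The shared superclass can be confirmed with a little reflection. The following is a minimal sketch rather than part of the benchmark; the class name `SharedSuperclass` and its helper methods are made up for illustration. It also reports whether `append(String)` is declared `synchronized`, which is the main difference between the two classes.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class SharedSuperclass {
    // Name of the direct superclass of the given class
    static String parentOf(Class<?> type) {
        return type.getSuperclass().getName();
    }

    // Whether the public append(String) method is declared synchronized
    static boolean appendIsSynchronized(Class<?> type) {
        try {
            Method append = type.getMethod("append", String.class);
            return Modifier.isSynchronized(append.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parentOf(StringBuilder.class));
        System.out.println(parentOf(StringBuffer.class));
        System.out.println(appendIsSynchronized(StringBuilder.class));
        System.out.println(appendIsSynchronized(StringBuffer.class));
    }
}
```

Both classes report the same parent, so any measured difference has to come from the thin wrappers around it rather than from the append logic itself.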

Job done?

Are we all done yet? Well, no, because there are other things at play.

Firstly, JMH is a benchmarking tool to find the highest possible value of performance under load. What happens in Java is that by default HotSpot uses a tiered compilation model; it starts off interpreted, then once a method has been executed a number of times it gets compiled. In fact, there are different levels of compilation that kick in after different numbers of calls. You can see these if you look at the various *Threshold* flags generated by -XX:+PrintFlagsFinal from an OpenJDK installation.

When a method has been called thousands of times, it will be compiled using the Tier 3 (client) or Tier 4 (server) compiler. This generally involves optimisations such as in-lining methods, dead code elimination and the like. This gives the best possible code performance for the application.

But what if the method is called infrequently, or puts memory pressure on the garbage collector instead? It won’t be JIT compiled and so will take longer. We can see the effect of running in interpreted mode by running the generated benchmark code with -jvmArgs -Xint to force the forked JVM used to run the benchmarks to only use the interpreter:

```sh Running benchmarks in interpreted mode
$ mvn clean package
$ java -jar target/benchmarks.jar Empty Hello \
   -wi 5 -tu ns -f 1 -bm avgt -jvmArgs -Xint
…
Benchmark                              Mode  Cnt     Score    Error  Units
StringBenchmark.testEmptyBuffer        avgt   20  1102.609 +- 66.596  ns/op
StringBenchmark.testEmptyBuilder       avgt   20   769.682 +- 27.962  ns/op
StringBenchmark.testEmptyLiteral       avgt   20   184.061 +- 13.587  ns/op
StringBenchmark.testHelloWorldBuffer   avgt   20  2299.749 +- 70.087  ns/op
StringBenchmark.testHelloWorldBuilder  avgt   20  2381.348 +- 38.726  ns/op
```

A better option is to use the JMH specific annotation
`@CompilerControl(Mode.EXCLUDE)` which prevents benchmarking methods from being
JIT compiled, while allowing the other Java classes to be JIT compiled as
usual. This is akin to having other classes call the `StringBuffer` (so that it is
sufficiently well exercised) while emulating code that isn't called all that
frequently. It can be added at the class level or at the method level.

$ grep -B2 class
public class StringBenchmark {
$ mvn clean package
$ java -jar target/benchmarks.jar Empty Hello \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                              Mode  Cnt    Score   Error  Units
StringBenchmark.testEmptyBuffer        avgt   20  144.745 +- 4.561  ns/op
StringBenchmark.testEmptyBuilder       avgt   20  122.477 +- 3.273  ns/op
StringBenchmark.testEmptyLiteral       avgt   20   91.139 +- 1.685  ns/op
StringBenchmark.testHelloWorldBuffer   avgt   20  236.223 +- 7.679  ns/op
StringBenchmark.testHelloWorldBuilder  avgt   20  222.462 +- 5.733  ns/op

Either way, calling the code before the JIT compilation has kicked in magnifies the difference between the two classes to around 10%. So for methods that are called fewer than 1000 times – such as during start-up or when invoked from a user interface – the difference will exist.

Different calling patterns

What about different calling patterns? One example I came across was using an implicit String concatenation inside a StringBuilder or StringBuffer. This might be the case when generating a buffer to represent an e-mail, for example.

To test this, and to prevent Strings being concatenated by the javac compiler, we need to use non-final instance variables. However, to do that with the benchmark requires that the class be annotated with @State(Scope.Benchmark). (As with public static void main(String args[]) it’s best to just learn that this is necessary when you’re getting started, and then understand what it means later.)

```java
@State(Scope.Benchmark)
public class StringBenchmark {
    private String from = "Alex";
    private String to = "Readers";
    private String subject = "Benchmarking with JMH";
    …
    @Benchmark
    public String testEmailBuilderSimple() {
        StringBuilder builder = new StringBuilder();
        builder.append("From");
        builder.append(from);
        builder.append("To");
        builder.append(to);
        builder.append("Subject");
        builder.append(subject);
        return builder.toString();
    }

    @Benchmark
    public String testEmailBufferSimple() {
        StringBuffer buffer = new StringBuffer();
        buffer.append("From");
        buffer.append(from);
        buffer.append("To");
        buffer.append(to);
        buffer.append("Subject");
        buffer.append(subject);
        return buffer.toString();
    }
}
```

You can selectively run the benchmarks by putting one or more regular
expressions on the command line:

$ mvn clean package
$ java -jar target/benchmarks.jar Simple \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                               Mode  Cnt   Score   Error  Units
StringBenchmark.testEmailBufferSimple   avgt   20  88.149 +- 1.014  ns/op
StringBenchmark.testEmailBuilderSimple  avgt   20  88.277 +- 1.201  ns/op

These obviously take a lot longer to run. But what about other forms of the code? What if a developer has used + to concatenate the fields together in the append calls?

```java
@Benchmark
public String testEmailBuilderConcat() {
    StringBuilder builder = new StringBuilder();
    builder.append("From" + from);
    builder.append("To" + to);
    builder.append("Subject" + subject);
    return builder.toString();
}

@Benchmark
public String testEmailBufferConcat() {
    StringBuffer buffer = new StringBuffer();
    buffer.append("From" + from);
    buffer.append("To" + to);
    buffer.append("Subject" + subject);
    return buffer.toString();
}
```

Running this again shows why this is a bad idea:

$ mvn clean package
$ java -jar target/benchmarks.jar Simple Concat \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                               Mode  Cnt    Score   Error  Units
StringBenchmark.testEmailBufferConcat   avgt   20  105.424 +- 3.704  ns/op
StringBenchmark.testEmailBufferSimple   avgt   20   91.427 +- 2.971  ns/op
StringBenchmark.testEmailBuilderConcat  avgt   20  100.295 +- 1.985  ns/op
StringBenchmark.testEmailBuilderSimple  avgt   20   90.884 +- 1.663  ns/op

Even though these calls do the same thing, the cost of having an embedded implicit String concatenation is enough to add a penalty of 10% or more to the time taken for the methods to return.

This shouldn’t be too surprising; the cost of doing the in-line concatenation means that it’s generating a new StringBuilder, appending the two String expressions, converting it to a new String with toString() and finally inserting that resulting String into the outer StringBuilder/StringBuffer.
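The expansion described above can be written out by hand. This is a minimal sketch, assuming a Java 8 era javac (newer compilers may translate + differently); the class name `NestedConcat` and the method names are made up for illustration.

```java
public class NestedConcat {
    // What the source looks like: + nested inside append()
    static String nested(String from) {
        StringBuilder builder = new StringBuilder();
        builder.append("From" + from);
        return builder.toString();
    }

    // Roughly what the compiler generated for it: a temporary StringBuilder
    // and a temporary String for every nested concatenation
    static String expanded(String from) {
        StringBuilder builder = new StringBuilder();
        String temp = new StringBuilder().append("From").append(from).toString();
        builder.append(temp);
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(nested("Alex"));   // FromAlex
        System.out.println(expanded("Alex")); // FromAlex
    }
}
```

Both methods return the same value; the difference is only in the temporary objects created along the way.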

This should probably be a warning in the future.

Chaining methods

Finally, what about chaining the methods instead of referring to a local variable? That can’t make any difference; after all, this is equivalent to the one before, right?

```java
@Benchmark
public String testEmailBuilderChain() {
    return new StringBuilder()
        .append("From")
        .append(from)
        .append("To")
        .append(to)
        .append("Subject")
        .append(subject)
        .toString();
}

@Benchmark
public String testEmailBufferChain() {
    return new StringBuffer()
        .append("From")
        .append(from)
        .append("To")
        .append(to)
        .append("Subject")
        .append(subject)
        .toString();
}
```

What's interesting is that you do see a significant difference:

$ java -jar target/benchmarks.jar Simple Concat Chain \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                               Mode  Cnt    Score   Error  Units
StringBenchmark.testEmailBufferChain    avgt   20   38.950 +- 1.120  ns/op
StringBenchmark.testEmailBufferConcat   avgt   20  103.151 +- 4.197  ns/op
StringBenchmark.testEmailBufferSimple   avgt   20   89.685 +- 2.041  ns/op
StringBenchmark.testEmailBuilderChain   avgt   20   38.113 +- 1.012  ns/op
StringBenchmark.testEmailBuilderConcat  avgt   20  102.193 +- 2.829  ns/op
StringBenchmark.testEmailBuilderSimple  avgt   20   89.117 +- 2.658  ns/op

In this case, chaining the calls together has resulted in a speed up of more than 50% for the method call after JIT. One possible reason this may occur is that the length of the method’s bytecode has been significantly reduced:

$ javap -c StringBenchmark.class | egrep "public|areturn"
  public java.lang.String testEmailBuilder();
      60: areturn
  public java.lang.String testEmailBuffer();
      60: areturn
  public java.lang.String testEmailBuilderConcat();
      84: areturn
  public java.lang.String testEmailBufferConcat();
      84: areturn
  public java.lang.String testEmailBuilderChain();
      46: areturn
  public java.lang.String testEmailBufferChain();
      46: areturn

Simply chaining the .append() calls together has resulted in a smaller method, and thus a faster call site when compiled to native code. The other advantage (though not demonstrated here) is that the size of the bytecode affects the caller’s ability to in-line the method; smaller than 35 bytes (-XX:MaxInlineSize) means the method can be trivially in-lined, and if it’s smaller than 325 bytes (-XX:FreqInlineSize) then it can be in-lined if it’s called enough times.

Finally, what about ordinary String concatenation? Well, as long as you don’t mix and match it, then you’re fine – it works out as being identical to the testEmailBuilderChain methods.

```java
@Benchmark
public String testEmailLiteralConcat() {
    return "From" + from + "To" + to + "Subject" + subject;
}
```

Running it shows:

$ java -jar target/benchmarks.jar EmailLiteral \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                         Mode  Cnt   Score   Error  Units
StringBenchmark.testEmailLiteral  avgt   20  38.033 +- 0.588  ns/op

And for comparative purposes, running the lot with @CompilerControl(Mode.EXCLUDE) (simulating an infrequently used method) gives:

$ java -jar target/benchmarks.jar Email \
   -wi 5 -tu ns -f 1 -bm avgt
Benchmark                               Mode  Cnt    Score    Error  Units
StringBenchmark.testEmailBufferChain    avgt   20  416.745 +-  9.087  ns/op
StringBenchmark.testEmailBufferConcat   avgt   20  764.726 +-  9.535  ns/op
StringBenchmark.testEmailBufferSimple   avgt   20  462.361 +- 15.091  ns/op
StringBenchmark.testEmailBuilderChain   avgt   20  384.936 +-  9.173  ns/op
StringBenchmark.testEmailBuilderConcat  avgt   20  752.375 +- 19.544  ns/op
StringBenchmark.testEmailBuilderSimple  avgt   20  414.372 +-  6.940  ns/op
StringBenchmark.testEmailLiteral        avgt   20  417.772 +-  9.515  ns/op

What a lot of rubbish

The other aspect that affects the performance is how much garbage is created during the program’s execution. Allocating new data in Java is very, very fast these days, regardless of whether it’s interpreted or JIT compiled code. This is especially true of the new G1 collector (-XX:+UseG1GC), which is available in Java 8 and will become the default in Java 9. (Hopefully it will also become a part of the standard Eclipse packages in the future.) That being said, there are certainly cycles that get wasted, both by the CPU and by the GC, when using concatenation.

The StringBuffer and StringBuilder are implemented like an ArrayList (except dealing with an array of characters instead of an array of Object instances). When you add new content, if there’s capacity, then the content is added at the end; if not, a new array is created with double-plus-two size, the contents of the backing store are copied to the new array, and then the old array is thrown away. As a result an append can take between O(1) and O(n) time, depending on whether the current capacity is exceeded.
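The growth rule can be simulated in a few lines. This is a simplified sketch of the double-plus-two rule only, not the JDK’s actual implementation (which also handles a single append larger than the doubled size); the class name `GrowthSketch` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class GrowthSketch {
    // Capacities the backing array passes through while growing from
    // `initial` until it can hold `needed` chars, using the 2n+2 rule
    static List<Integer> capacities(int initial, int needed) {
        List<Integer> steps = new ArrayList<>();
        int capacity = initial;
        steps.add(capacity);
        while (capacity < needed) {
            capacity = capacity * 2 + 2;
            steps.add(capacity);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(capacities(16, 45)); // [16, 34, 70]
        System.out.println(capacities(48, 45)); // [48]
    }
}
```

Every capacity in the list except the last is an array that gets allocated, copied out of and then discarded.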

By default both classes start with a capacity of 16 characters (and thus the implicit String concatenation also uses that number); but the constructors also accept an explicit starting capacity.

JMH also comes with a garbage profiler that can provide (in my experience, fairly accurate) estimates of how much garbage is collected per operation. It does this by hooking into some of the serviceability APIs in the OpenJDK runtime (so other JVMs may find this doesn’t work) and then provides a normalised estimate for how much garbage is attributable per operation. Since garbage is a JVM wide construct, any other threads executing in the background will cause the numbers to be inaccurate.

By modifying the creation of the StringBuffer with a JMH parameter, it’s possible to provide different values at run-time for experimentation:

```java
public class StringBenchmark {
    @Param({"16"})
    private int size;
    …
    public void testEmail… {
        StringBuilder builder = new StringBuilder(size);
    }
}
```

It's possible to specify multiple parameters; JMH will then iterate over each
and give the results separately. Using `@Param({"16","48"})` would run first
with `16` and then `48` afterwards.

$ java -jar target/benchmarks.jar EmailBu \
   -wi 5 -tu ns -f 1 -bm avgt -prof gc
Benchmark                                               (size)  Mode  Cnt     Score     Error   Units
StringBenchmark.testEmailBufferChain                        16  avgt   20    37.593 +-   0.595   ns/op
StringBenchmark.testEmailBufferChain: gc.alloc.rate.norm    16  avgt   20   136.000 +-   0.001    B/op
StringBenchmark.testEmailBufferConcat                       16  avgt   20   155.290 +-   2.206   ns/op
StringBenchmark.testEmailBufferConcat: gc.alloc.rate.norm   16  avgt   20   576.000 +-   0.001    B/op
StringBenchmark.testEmailBufferSimple                       16  avgt   20   136.341 +-   3.960   ns/op
StringBenchmark.testEmailBufferSimple: gc.alloc.rate.norm   16  avgt   20   432.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderChain                       16  avgt   20    37.630 +-   0.847   ns/op
StringBenchmark.testEmailBuilderChain: gc.alloc.rate.norm   16  avgt   20   136.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderConcat                      16  avgt   20   153.879 +-   2.699   ns/op
StringBenchmark.testEmailBuilderConcat: gc.alloc.rate.norm  16  avgt   20   576.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderSimple                      16  avgt   20   136.587 +-   3.146   ns/op
StringBenchmark.testEmailBuilderSimple: gc.alloc.rate.norm  16  avgt   20   432.000 +-   0.001    B/op

Running this shows that the normalised allocation rate for the various methods (gc.alloc.rate.norm) varies between 136 bytes and 576 for both classes. This shouldn’t be a surprise; the implementation of the storage structure is the same between both classes. It’s more noteworthy to observe that there is a variation between using the chained implementation and the simple allocation (136 vs 432).

The 136 bytes is the smallest value we can expect to see; the resulting String in our test method works out at 45 characters, or 90 bytes. Considering a String instance takes 24 bytes and a character array has a 16 byte header, 90 + 24 + 16 = 130. However, the character array is aligned on an 8 byte boundary, so its 90 bytes of data are rounded up to 96 bytes, giving 96 + 24 + 16 = 136. In other words, the code for the *Chain methods has been JIT optimised to produce a single String with the exact data in place.

The *Simple methods have additional data generated by the increasing size of the internal character backing array. 136 of the bytes are the returned String value, so that can be taken out of the equation, leaving 296 bytes to account for. These turn out to be the character arrays; a StringBuilder starts off with a size of 16 chars, then grows to 34 chars and then 70 chars, following a 2n+2 growth. Since each char[] has an overhead of 16 bytes (12 for the header, 4 for the length), chars are stored as 16 bit entities, and each array is rounded up to an 8 byte boundary, this results in 48, 88 and 160 bytes. Perhaps unsurprisingly the growth (and subsequently discarded char[] arrays) adds up to 296 bytes. That leaves no room for the 24 bytes that a StringBuilder instance would take, so presumably the JIT has avoided allocating it. The growth of both *Simple methods is equivalent here.
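The byte counts above can be reproduced with some simple arithmetic. This sketch assumes the object layout used in this post (a 24 byte String instance, a 16 byte char[] header, 2 bytes per char and 8 byte alignment); other JVMs and later Java versions lay these objects out differently. The class name `AllocationSketch` and its constants are made up for illustration.

```java
public class AllocationSketch {
    static final int ARRAY_HEADER = 16;  // assumed: 12 byte header + 4 byte length
    static final int STRING_OBJECT = 24; // assumed size of a String instance

    // Round up to the next multiple of 8 bytes
    static int align(int bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // Bytes for a char[] of the given length: header + 2 bytes per char, aligned
    static int charArrayBytes(int chars) {
        return align(ARRAY_HEADER + 2 * chars);
    }

    // Bytes for a String of the given length, including its backing array
    static int stringBytes(int chars) {
        return STRING_OBJECT + charArrayBytes(chars);
    }

    public static void main(String[] args) {
        int result = stringBytes(45);
        int discarded = charArrayBytes(16) + charArrayBytes(34) + charArrayBytes(70);
        System.out.println(result);             // 136
        System.out.println(discarded);          // 296
        System.out.println(result + discarded); // 432
    }
}
```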

The larger values in the *Concat methods show the additional garbage caused by the temporary internal StringBuilder elements.

To test a different starting size of the buffer, passing the -p size=48 JMH argument will allow us to test the effect of initialising the buffers with 48 characters:

$ java -jar target/benchmarks.jar EmailBu \
   -wi 5 -tu ns -f 1 -bm avgt -prof gc -p size=48
Benchmark                                               (size)  Mode  Cnt     Score     Error   Units
StringBenchmark.testEmailBufferChain                        48  avgt   20    38.961 +-   1.732   ns/op
StringBenchmark.testEmailBufferChain: gc.alloc.rate.norm    48  avgt   20   136.000 +-   0.001    B/op
StringBenchmark.testEmailBufferConcat                       48  avgt   20   106.726 +-   4.118   ns/op
StringBenchmark.testEmailBufferConcat: gc.alloc.rate.norm   48  avgt   20   392.000 +-   0.001    B/op
StringBenchmark.testEmailBufferSimple                       48  avgt   20    93.455 +-   2.702   ns/op
StringBenchmark.testEmailBufferSimple: gc.alloc.rate.norm   48  avgt   20   248.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderChain                       48  avgt   20    39.056 +-   1.723   ns/op
StringBenchmark.testEmailBuilderChain: gc.alloc.rate.norm   48  avgt   20   136.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderConcat                      48  avgt   20   103.264 +-   2.404   ns/op
StringBenchmark.testEmailBuilderConcat: gc.alloc.rate.norm  48  avgt   20   392.000 +-   0.001    B/op
StringBenchmark.testEmailBuilderSimple                      48  avgt   20    88.175 +-   2.442   ns/op
StringBenchmark.testEmailBuilderSimple: gc.alloc.rate.norm  48  avgt   20   248.000 +-   0.001    B/op

By initialising the StringBuffer/StringBuilder instances with a capacity of 48 characters, we can reduce the amount of garbage generated as part of the concatenation process. The garbage from Java’s implicit String concatenation is outside our control, and is a result of the underlying character array resizing itself.

Here, the *Simple methods have dropped from 432 to 248 bytes, which represents the 136 byte String result and a copy of the 112 byte array (corresponding to a 41-48 character array with the 16 byte header). Presumably in this case the JIT has managed to avoid the creation of the StringBuilder instance in the *Simple methods, but the array copy has leaked through. However, other than these two values, there is no additional garbage created.


Running benchmarks is a good way of finding out what the cost of a particular operation is, and JMH makes it easy to generate such benchmarks. Checking that the benchmarks are correct is a little harder, as is knowing what effect other processes have on them. Of course, different machines will give different results to these, and you’re encouraged to replicate this on your own setup.

Although the fully JIT compiled methods for StringBuffer and StringBuilder perform very similarly, there is an underlying trend for the StringBuilder to be at least as fast as its older cousin StringBuffer. In any case, implicit String concatenation (with +) creates a StringBuilder under the covers, so it’s likely that the StringBuilder code will become hot and be compiled before StringBuffer’s.

The most efficient way of concatenating strings is to have a single expression which uses either implicit String concatenation ( + + + + ) or a series of chained calls (e.g. .append().append().append()) without any intermediate reference to a local variable. If you’ve got a lot of constants then using + will also have the advantage of using constant folding of the String literals ahead of time.
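Constant folding can be observed directly, because a folded constant is shared through the string pool while a run-time concatenation always creates a new String. This is a minimal sketch; the class name `ConstantFolding` and its fields are made up for illustration, and the reference comparison with == is used here only to expose the difference, not as a recommended way of comparing strings.

```java
public class ConstantFolding {
    static final String GREETING = "Hello";
    static final String TARGET = "World";

    // Both operands are compile-time constants, so javac folds them into
    // the single literal "HelloWorld", shared via the string pool
    static boolean foldedAtCompileTime() {
        return (GREETING + TARGET) == "HelloWorld";
    }

    // A parameter is not a constant, so this concatenation happens at
    // run time and produces a newly created String
    static boolean foldedAtRunTime(String target) {
        return (GREETING + target) == "HelloWorld";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime());    // true
        System.out.println(foldedAtRunTime("World")); // false
    }
}
```

This is also why the benchmarks in this post use non-final instance fields: with constants, there would be nothing left to measure at run time.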

Mixing + and .append() is a bad idea though, because there will be extra pressure on the memory as the String instances are created and then immediately thrown away.

Finally, although using + + + + is easy, it doesn’t let you pre-size the StringBuilder array, which starts off with 16 characters by default. If the StringBuilder is used to create large Strings then avoiding multiple resizes is a relatively simple optimisation technique as far as reducing garbage is concerned. In addition, the array copy operation will grow larger as the size of the data set increases.
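When the parts are known up front, the capacity can be computed from their lengths. This is a minimal sketch using the same values as the benchmarks; the class name `PresizedEmail` and its methods are made up for illustration.

```java
public class PresizedEmail {
    // Builds the same string as the benchmarks, but sizes the builder from
    // the lengths of the parts so the backing array never has to grow
    static String build(String from, String to, String subject) {
        int size = "From".length() + from.length()
                 + "To".length() + to.length()
                 + "Subject".length() + subject.length();
        return new StringBuilder(size)
                .append("From").append(from)
                .append("To").append(to)
                .append("Subject").append(subject)
                .toString();
    }

    // Capacity of a builder created with `initial` chars after appending `text`
    static int capacityAfter(int initial, String text) {
        StringBuilder builder = new StringBuilder(initial);
        builder.append(text);
        return builder.capacity();
    }

    public static void main(String[] args) {
        String email = build("Alex", "Readers", "Benchmarking with JMH");
        System.out.println(email.length());           // 45
        System.out.println(capacityAfter(48, email)); // 48
    }
}
```

A capacity that is unchanged after the append means no intermediate array was allocated and thrown away.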

Update 2020

I have uploaded this code, along with an updated version of the results, to a repository.

One of the significant changes in the results was that the JVM has now learnt how to do indification of string concatenation (compiling + into an invokedynamic call), which has improved both the speed and the garbage collection profile of the operations. However, the overall relative behaviour of the differences still holds.