
AlBlue’s Blog

Macs, Modularity and More

Understanding CPU Microarchitecture for performance


I recently gave a talk at QCon London entitled “Understanding CPU Microarchitecture for Performance” on the details of CPU internals and how they affect the speed of programs that run on them.

I’ve given this talk twice recently: once at QCon London, and again at a virtual event for the London Java Community (LJC). Although both presentations cover similar content, I updated the slides slightly for the LJC event to mention a new release of one of the tools I recommended, and a project that had not been open-sourced at the time of the first talk.

The QCon London presentation has the advantage of a transcript and synchronised slides, so choose whichever form you find more useful. Here are the links:

The abstract for both is the same:

Microprocessors have evolved over decades to eke out performance from existing code. But the microarchitecture of the CPU leaks into the assumptions of a flat memory model, with the result that equivalent code can run significantly faster by working with, rather than fighting against, the microarchitecture of the CPU.

This talk, given for the (QCon London| London Java Community) in 2020, presents the microarchitecture of modern CPUs, showing how misaligned data can cause cache line false sharing, how branch prediction works and when it fails, how to read CPU specific performance monitoring counters and use that in conjunction with tools like perf and toplev to discover where bottlenecks in CPU heavy code live. We’ll use these facts to revisit performance advice on general code patterns and the things to look out for in executing systems. The talk will be language agnostic, although it will be based on the Linux/x86-64 architecture.
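As a small, illustrative Java sketch of the cache line false sharing mentioned in the abstract (the class layout and iteration counts here are my own, not taken from the talk):

```java
// Sketch of cache-line false sharing: two threads each increment their
// own counter. In Shared the counters are adjacent fields and likely sit
// on the same 64-byte cache line, so each write invalidates the other
// core's copy of the line; in Padded, filler fields separate them.
class FalseSharing {
    static class Shared { volatile long a, b; }
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // spacer: pushes b onto another line
        volatile long b;
    }

    // Run two threads to completion and return elapsed milliseconds.
    static long race(Runnable r1, Runnable r2) {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final int N = 10_000_000;
        Shared s = new Shared();
        Padded p = new Padded();
        long sharedMs = race(() -> { for (int i = 0; i < N; i++) s.a++; },
                             () -> { for (int i = 0; i < N; i++) s.b++; });
        long paddedMs = race(() -> { for (int i = 0; i < N; i++) p.a++; },
                             () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.println("adjacent: " + sharedMs + " ms, padded: " + paddedMs + " ms");
    }
}
```

On most x86-64 machines the padded version runs noticeably faster, though the exact ratio depends on the CPU and on how the JVM chooses to lay out the fields.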

If you have any comments or questions, feel free to reach out to me via Twitter, e-mail or any other means you have at your disposal.

QCon London 2020 Day 3

2020 qcon conference

This morning was kicked off by Katie Gamanji, of American Express and part of the Cloud Native Computing Foundation. The talk was on Kubernetes; there were 23,000 attendees of KubeCon conferences in 2019, and over 2,000 contributors to the project. She introduced the Container Network Interface (CNI), which is used to ensure that a pod has its own IP which can be routed to via various mechanisms, and the Container Runtime Interface (CRI), which abstracts away Docker in favour of other engines such as gVisor, containerd etc. Mostly it seemed to be an overview of what Kubernetes does, rather than anything new; for those attending in the hope of learning more than just the basics, I’m not sure what the benefits were.

The rest of the day I spent hosting the Java track, on behalf of Martijn Verburg, who was unable to assume hosting duties due to other commitments. I’m very glad that he had those commitments, because I had a very enjoyable day, listening to experts in their field as well as playing compère to a captive audience. I’m very much looking forward to being invited back again!

Ben Evans from New Relic was the first speaker in my track, talking about record and sealed types in Java, which are in a preview phase. Records are data types on steroids: essentially, a data type which has an automatically generated equals, hashCode and toString over the fields of a final data structure. Together with sealed types, which restrict which classes may extend or implement a type, it looks like the various experimental Java projects are really delivering the goods.
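As a sketch of what these look like (the type names here are illustrative, using the syntax as it eventually shipped rather than the preview form shown in the talk):

```java
// A record is a transparent data carrier: the compiler generates the
// constructor, accessors, equals, hashCode and toString from the header.
record Point(int x, int y) { }

// A sealed interface restricts implementations to the types named in
// the permits clause (records are implicitly final, so the hierarchy
// is closed).
sealed interface Shape permits Circle, Square { }
record Circle(Point centre, double radius) implements Shape { }
record Square(Point corner, double side) implements Shape { }

class RecordDemo {
    public static void main(String[] args) {
        Point p = new Point(1, 2);
        System.out.println(p);                         // Point[x=1, y=2]
        System.out.println(p.equals(new Point(1, 2))); // true
        Shape s = new Circle(p, 3.0);
        System.out.println(s instanceof Circle);       // true
    }
}
```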

Andrzej Grzesik, better known as “ags”, talked about the path that Revolut took in migrating to Java 11 over the last year. They moved to Java 11 as both the compile engine and the runtime engine for all of their apps over the year, filing various bugs against OpenJDK as they went, and described some of the pitfalls of their experience. The main one seems to be updating all of the dependent libraries and build tools; for example, moving from Gradle v3 to Gradle v4 and v5, a step at a time. They are keen to try to keep up with the latest JDK releases, although their use of Gradle, and more specifically Groovy, is holding them back from migrating at the moment. Making sure that all of the dependencies worked seemed to be the biggest challenge, though most (non-abandoned) Java libraries work at this stage.

David Delabassee gave an overview from Oracle about how best to run Java applications inside Docker containers. His examples and slides (to be uploaded later) showed how to use a multi-stage Docker build along with jlink, the --no-man-pages and --no-header-files flags, and compression to build a custom base image around 20% of its original size. He also highlighted using Alpine as a distribution and musl as a replacement for glibc, but noted this wasn’t officially supported; Project Portola aims to provide a means to do this in the future. By using AppCDS he was able to shrink the launch time down further, and talked about the evolution of Docker and rootless Docker, along with a podman blog post talking more about enabling this functionality.
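As an illustrative sketch of the kind of multi-stage jlink build described (the base images, module list and jar name here are assumptions of mine, not taken from the slides):

```dockerfile
# Stage 1: assemble a trimmed Java runtime with jlink, dropping header
# files and man pages and compressing the modules.
# (Base image and module list are illustrative assumptions.)
FROM eclipse-temurin:17 AS build
RUN jlink --add-modules java.base,java.logging \
    --no-header-files --no-man-pages --compress=2 \
    --output /opt/minimal-jre

# Stage 2: copy only the trimmed runtime and the application jar into a
# small base image, leaving the full JDK behind.
FROM debian:stable-slim
COPY --from=build /opt/minimal-jre /opt/jre
COPY app.jar /app/app.jar
ENTRYPOINT ["/opt/jre/bin/java", "-jar", "/app/app.jar"]
```

The size saving comes from shipping only the modules the application actually uses, rather than the whole JDK.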

Emily Jiang gave a talk and live demo of how to build a 12-factor app with Open Liberty and MicroProfile. The demo code is available on GitHub, and since it was a live demo, the screencast from InfoQ will have a lot more detail. In essence, Emily demonstrated two services, connecting via localhost, running inside different Kubernetes pods with redundant instances, and used asynchronous calls to route around delays or failures in the underlying service implementations. One to watch carefully on the replays, I think.
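MicroProfile Fault Tolerance expresses this declaratively with annotations such as @Asynchronous, @Timeout and @Fallback; as a plain-JDK sketch of the same routing-around-failure idea (the names and timings here are my own illustration, not from Emily’s demo):

```java
import java.util.concurrent.*;

// Sketch of routing around a slow or failing downstream service:
// call it asynchronously, and fall back to a default answer if it
// does not respond within the deadline.
class FallbackDemo {
    // Stand-in for a downstream service that is too slow to answer.
    static String slowService() {
        try { TimeUnit.SECONDS.sleep(5); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "real answer";
    }

    static String callWithFallback() {
        CompletableFuture<String> call =
            CompletableFuture.supplyAsync(FallbackDemo::slowService);
        try {
            return call.get(200, TimeUnit.MILLISECONDS); // deadline
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            call.cancel(true);
            return "fallback answer";                    // degrade gracefully
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithFallback()); // prints "fallback answer"
    }
}
```

The declarative MicroProfile version keeps the same shape but moves the deadline and the fallback method into annotations on the service client.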

The day was rounded off nicely by James (Jim) Gough, whom I’ve had the pleasure of working with commercially before. He gave a talk on how GraalVM executes code, and showed a number of demonstrations of how Graal can execute code that has been compiled using the Java compiler, by debugging through how the JDK’s JIT (in the form of Graal) works. He also demonstrated the jaotc tool for compiling Java code ahead of time for faster startup and a lower memory footprint. I had been a prior audience for this talk, and so almost everything went to plan in this presentation – with the exception of the QCon sign falling off the podium by the end. Oh well, it had been a long day …

This brought my 12th(?) QCon London to a close. I don’t think I’ve been to all 14, but I have certainly been to most of them over the years, and all in all, it’s just as great as ever. Obviously this year was a little ‘special’ due to all the changes in place, but I thought that the staff (both at QCon and at the QEII conference centre itself) handled everything marvellously. I can’t wait to be back again this time next year, whether as an attendee, speaker or track host!

QCon London 2020 Day 2

2020 qcon conference

The second day of QCon London had an opening keynote by @anjuan, talking about the Underground Railroad network that helped free slaves in the United States. He likened various players in the scheme to the structure of management and developers. The talk was delivered in an entertaining enough fashion, but the analogy drew on American rather than UK or European history, and so felt a little disconnected from the current reality – especially given the dual challenges to the UK of Brexshit and the Coronavirus. Ultimately I think this keynote may have worked better for an American audience.

The first real talk of the day covered Quarkus, a Java framework for small, quick-starting applications. One of the trade-offs of a JIT-enabled runtime is the warm-up phase before the application reaches peak performance; not an issue for long-running applications, but a concern if you’re following continuous delivery and redeploying multiple times a day. If you’re deploying every 10 minutes, with every commit, you may find that the Java application never reaches a steady state before being shut down for a new version of the service. Quarkus aims for supersonic, fast boot. It is designed for GraalVM by default, and uses a Maven/Gradle plugin to produce an optimised JAR from the build file and, via the ahead-of-time compiler, a native ELF executable with GraalVM. It also produces containers with a small on-disk footprint. The hot reload of code makes for a fast turnaround time, and Quarkus is opinionated about how it starts Java – for example, creating a debug listener on the port at the same time.

Sergey Kuksenko (@kuksenk0) has been a JVM engineer since 2005, and has worked on performance for the last decade. He kicked off with a demo of two Java Mandelbrot generators using a Complex class, where the only difference in the faster demo (12fps vs 5fps) was the addition of the “inline” keyword. Valhalla provides a denser memory layout for inline classes (aka value types), with specialised generics planned to follow. The name ‘inline class’ rather than ‘value class’ was chosen because it allows the Java Language Specification to be updated with fewer changes. Inline classes don’t have identity, which means they can’t be compared with == and so avoid surprises like the result of Integer.valueOf(42) == Integer.valueOf(42).

An inline class has no identity, and is immutable, not nullable and not synchronizable. The JVM decides whether to allocate an inline class on the heap, on the stack, or inlined into its containing class. The phrase “Code like a class, work like an int” summarises the goal of Valhalla. Performance is about 50% faster and, importantly, shows far fewer L1 cache misses. Benchmarking of the current implementation is in progress, and seems to show under 2% regression at the moment. For arithmetic types, Complex inline classes gave an order-of-magnitude speedup in some cases. The Optional class will become a value class in the future, and will serve as a proof of concept for a migration path. Work is ongoing to reduce the performance overheads; for the time being, the JDK 14 release in the next couple of weeks will have a version available for experimentation.
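The identity pitfall that inline classes sidestep is visible with today’s boxed integers, which only cache a small range of values (a plain-JDK illustration of the point, not code from the talk):

```java
// With default JVM settings, Integer.valueOf caches the values -128..127,
// so small boxed integers compare equal by reference (==) while larger
// ones do not. Inline classes have no identity at all, removing this trap.
class IdentityPitfall {
    public static void main(String[] args) {
        System.out.println(Integer.valueOf(42) == Integer.valueOf(42));     // true: both from the cache
        System.out.println(Integer.valueOf(1000) == Integer.valueOf(1000)); // false: two distinct objects
        System.out.println(Integer.valueOf(1000).equals(Integer.valueOf(1000))); // true: value comparison
    }
}
```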

A wildcard session followed, due to the last-minute cancellation of a speaker. Instead, I went to a talk on TornadoVM, which provides a way of compiling and running Java code in parallel on a variety of FPGA and GPU devices. By translating Java bytecode into Tornado bytecode, and then having different translators which rewrite those kernels to GPU-specific instruction sets, it’s possible to get speed-ups of many thousands of times on numerical calculations. A demo showed capturing depth information from video recorded with a Microsoft Kinect, and re-rendering it as a three-dimensional representation afterwards. Importantly, they have a Docker image (beehivelab/tornado-gpu:latest) which can be used for testing, as covered in the project’s README.

The next session I attended was by Alina Yurenko from Oracle on “Maximising application performance with GraalVM”, which covered using Graal as an ahead-of-time compiler for generating native images. Not only does the application start much faster, it also uses a lot less memory than the equivalent application running under the JVM. Partially this is because the C2 compiler doesn’t need to compile the underlying JVM classes, and partially because the runtime of the application has a far lower total memory footprint. Of course, creating an accurate execution profile requires running the application under an expected (simulated) load, so that the correct hot code paths can be identified and translated appropriately. Graal uses information gained from the initial execution to prepare appropriate code for the expected types; if those assumptions are incorrect, the generated binary will behave differently. There was also a sales pitch for using GraalVM to host multiple languages, along with an evolution of the Nashorn JavaScript engine. It’s unclear whether people will really want to use a JVM for running multiple languages, but then again, those people never really saw JavaScript as anything other than a toy language, so what do they know? :)

The next session I attended was my own. I thought it would be a little rude not to turn up :) Fortunately the talk went OK – after all, I completed writing it with at least an hour to go – and as for timing, I finished 10s early. My talk was on CPU microarchitecture for maximum performance, looking at the nitty-gritty details of how CPUs execute code. I didn’t go down to the electron or transistor level, but rather talked about the general architectural details of the processor. My slides have been uploaded to my SpeakerDeck profile and, apart from a quibble about a bullet on page 25, it seems that most people enjoyed it; after all, I did. The video was recorded and will be available on InfoQ at some point in the future; I’ll update this post with the link when I have a public one.

Day 2 ended with a get-together of the speakers in the usual location, and I made several contacts with people who had been speaking at QCon; some of whom I knew, some of whom I did not. One of the pleasures of QCon is meeting and talking to people; I enjoy meeting up with the attendees during the breaks, but it is also excellent to be able to talk to the movers and shakers of the conference.

Tomorrow I’m leading the Java track, which I’m looking forward to; stay tuned!