AlBlue’s Blog

Macs, Modularity and More

Bite-sized bytecode and class loaders

2020 Java

Today I gave a talk at the London Java Community on bytecode and classloaders. The presentation is available at SpeakerDeck, and a recording is on the London Java Community channel.

For the presentation, I wrote a JVM emulator that allows stepping through bytecode and seeing the state of the locals and the stack as you go. It’s not a complete implementation (the deficiencies are listed in the README) but it’s something you could look through to get a feel for how the JVM works when interpreting code.

The JVMulator is available at https://github.com/alblue/jvmulator and you can build it with Maven or your favourite IDE. There’s a GUI which is set up to run as the main class, so once built, you can run it with java -jar or even mvn exec:java to launch it.

Bytecode

The JVM runs on bytecode; it’s a compact encoding of instructions where most instructions take up a single byte. There’s a good description of it on Wikipedia, and there’s also a useful table of bytecodes.

The majority of bytecodes take no operands, but deal with values being pushed to or popped from the stack. There are also a number of local variable slots which are specific to the frame being executed; these typically hold things like the counter in a loop or other local variables. Methods can have zero or more locals and require zero or more stack depth; both figures are recorded in the method’s Code attribute (as max_locals and max_stack), so that when the JVM runs it can reserve the required amount of space for the method to execute.

Each argument passed in to the method takes up one local slot (two for a long or double), though these slots can be re-used throughout a method’s execution if the argument is no longer required after first use. For instance methods specifically, there’s a hidden first argument which contains the this pointer, so if you have an instance method with 2 arguments, it’s always going to reserve at least 3 slots for local variables.
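To make the slot counting concrete, here is a minimal sketch that uses reflection to work out how many local slots a method’s arguments need at minimum. The class and method names (ArgumentSlots, argumentSlots) are made up for this illustration; the real max_locals figure is computed by the compiler and may be larger, since it also covers ordinary local variables.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class ArgumentSlots {
    /** Minimum number of local variable slots occupied by the method's arguments. */
    static int argumentSlots(Method method) {
        // Instance methods carry the hidden 'this' reference in slot 0.
        int slots = Modifier.isStatic(method.getModifiers()) ? 0 : 1;
        for (Class<?> type : method.getParameterTypes()) {
            // long and double take two slots; everything else takes one.
            slots += (type == long.class || type == double.class) ? 2 : 1;
        }
        return slots;
    }

    /** Convenience lookup of a public method; wraps the checked exception for brevity. */
    static int argumentSlots(Class<?> owner, String name, Class<?>... parameterTypes) {
        try {
            return argumentSlots(owner.getMethod(name, parameterTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, ArgumentSlots.argumentSlots(String.class, "substring", int.class, int.class) gives 3: one slot for this and one for each int.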

Some bytecodes take operands in the instruction stream, so not all bytes in the stream represent valid instructions. For example, when pushing a constant to the stack, bipush will push the next byte onto the stack, and sipush will push the next two bytes as a short onto the stack. Although many such opcodes take only one or two bytes of operands, there is a special wide prefix which means that the next instruction takes an operand of double the normal width. This is primarily used when dealing with local variables; the first 256 local variables can be accessed with a single-byte index, but if you have more than 256 local variables (why‽) then you’d use the wide form of the iload bytecode for the rest.
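As a rough illustration of operands living in the instruction stream, the following sketch decodes a byte array containing only bipush, sipush, iload and wide iload. The opcode values are the real ones; the OperandDecoder class and its output format are invented for this example, and anything outside those four instructions is rejected.

```java
import java.util.ArrayList;
import java.util.List;

class OperandDecoder {
    static final int BIPUSH = 0x10, SIPUSH = 0x11, ILOAD = 0x15, WIDE = 0xc4;

    /** Disassembles a stream that contains only bipush, sipush, iload and wide iload. */
    static List<String> decode(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc++] & 0xff;
            switch (opcode) {
                case BIPUSH: // one signed operand byte
                    out.add("bipush " + code[pc++]);
                    break;
                case SIPUSH: // two operand bytes, big-endian, signed
                    out.add("sipush " + (short) ((code[pc] << 8) | (code[pc + 1] & 0xff)));
                    pc += 2;
                    break;
                case ILOAD: // one unsigned byte: local index 0..255
                    out.add("iload " + (code[pc++] & 0xff));
                    break;
                case WIDE: // prefix: the following iload takes a two-byte index
                    if ((code[pc++] & 0xff) != ILOAD) {
                        throw new IllegalArgumentException("only wide iload is handled here");
                    }
                    out.add("wide iload " + (((code[pc] & 0xff) << 8) | (code[pc + 1] & 0xff)));
                    pc += 2;
                    break;
                default:
                    throw new IllegalArgumentException("unhandled opcode 0x" + Integer.toHexString(opcode));
            }
        }
        return out;
    }
}
```

Feeding it the bytes 10 2a 11 01 00 15 03 c4 15 01 2c yields bipush 42, sipush 256, iload 3 and wide iload 300; eleven bytes, but only four instructions.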

Bytecode is stored in the Code attribute of a method, so every Java class file that has code associated with it (i.e. everything that’s not purely an interface) will have the string Code inside the file somewhere. Abstract methods, including those of a pure interface, have no Code attribute, though a class will typically have a default constructor injected by the javac compiler.

Stack

The stack forms a key part of Java bytecode. Operands are consumed from the stack, and results are pushed onto the stack. When a method returns a value, the value on top of the stack is the return result. Simple math operations (e.g. iadd, fmul) consume two stack elements and then push the result back; some, like ineg, pop and push a single value.
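A toy evaluator shows the push and pop behaviour, and also where a method’s required stack depth comes from. This is a sketch written for this post rather than code from the JVMulator; it understands only iconst_0 to iconst_5, bipush, iadd, imul, ineg and ireturn, and the StackSketch name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackSketch {
    static final int ICONST_0 = 0x03, BIPUSH = 0x10, IADD = 0x60, IMUL = 0x68, INEG = 0x74, IRETURN = 0xac;

    /** Runs straight-line int bytecode; returns { result, maximum stack depth reached }. */
    static int[] run(byte[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int maxDepth = 0;
        int pc = 0;
        while (true) {
            int opcode = code[pc++] & 0xff;
            if (opcode >= ICONST_0 && opcode <= ICONST_0 + 5) {
                stack.push(opcode - ICONST_0);          // iconst_0 .. iconst_5
            } else if (opcode == BIPUSH) {
                stack.push((int) code[pc++]);           // operand byte follows the opcode
            } else if (opcode == IADD) {
                stack.push(stack.pop() + stack.pop());  // pops two, pushes one
            } else if (opcode == IMUL) {
                stack.push(stack.pop() * stack.pop());
            } else if (opcode == INEG) {
                stack.push(-stack.pop());               // pops one, pushes one
            } else if (opcode == IRETURN) {
                return new int[] { stack.pop(), maxDepth }; // top of stack is the result
            } else {
                throw new IllegalArgumentException("unhandled opcode 0x" + Integer.toHexString(opcode));
            }
            maxDepth = Math.max(maxDepth, stack.size());
        }
    }
}
```

The sequence iconst_3, iconst_4, iadd, ineg, iconst_2, imul, ireturn (bytes 06 07 60 74 05 68 ac) computes -(3 + 4) * 2 and never needs more than two stack entries.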

One quirk of the JVM is that long and double values occupy two slots on the stack; that is, there’s a missing stack element value which can’t be accessed each time you push or pop one of these values. This was an implementation workaround when JVMs were 32-bit; it’s unnecessary today, but kept for backwards compatibility and to avoid having to recompile Java code.

There are some operators with a 2 suffix (like dup2) that deal with two slots at a time; dup2 exists so that a long or double value can be duplicated in one go, and otherwise dup is used.

Locals

Locals are accessed with various iload or aload operators, which push the values of local variables onto the stack. You’ll typically see programs starting with aload_0, which loads the first local variable; for instance methods, this is the this parameter. The aload instructions deal with object references (address load), including references to arrays; there’s a separate aaload for accessing an element of an object array (such as the one you’d process with main(String args[])).

Since bytecodes operate on the stack, if a variable is to be used, it needs to be loaded there first. The only time this isn’t needed is for incrementing (or decrementing) an integer local – there’s a special instruction, iinc, which does that in place – and that’s typically used for loops where loading and storing the loop counter each time would be unproductive.
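Here is a small sketch of how locals, the stack and iinc fit together in a loop. The interpreter handles only a handful of int opcodes, and the SUM_BELOW program is hand-assembled for this example (it sums the integers below its argument), so treat it as an illustration rather than a dump of real compiler output. The LoopSketch name is hypothetical.

```java
class LoopSketch {
    /** Hand-assembled: sum = 0; for (i = 0; i < n; i++) sum += i; return sum; with n in local 0. */
    static final byte[] SUM_BELOW = {
        0x03, 0x3c,                      //  0: iconst_0, istore_1      (sum = 0)
        0x03, 0x3d,                      //  2: iconst_0, istore_2      (i = 0)
        0x1c, 0x1a,                      //  4: iload_2, iload_0
        (byte) 0xa2, 0x00, 0x0d,         //  6: if_icmpge +13           (exit to 19 when i >= n)
        0x1b, 0x1c, 0x60, 0x3c,          //  9: iload_1, iload_2, iadd, istore_1
        (byte) 0x84, 0x02, 0x01,         // 13: iinc 2, 1               (i++ without touching the stack)
        (byte) 0xa7, (byte) 0xff, (byte) 0xf4, // 16: goto -12          (back to 4)
        0x1b, (byte) 0xac                // 19: iload_1, ireturn
    };

    /** Interprets a tiny subset of int bytecode with local variables and branches. */
    static int run(byte[] code, int maxLocals, int maxStack, int... args) {
        int[] locals = new int[maxLocals];
        System.arraycopy(args, 0, locals, 0, args.length); // arguments occupy the first locals
        int[] stack = new int[maxStack];
        int sp = 0, pc = 0;
        while (true) {
            int start = pc;                                // branch offsets are relative to the opcode
            int op = code[pc++] & 0xff;
            if (op >= 0x03 && op <= 0x08) {
                stack[sp++] = op - 0x03;                   // iconst_0..5
            } else if (op >= 0x1a && op <= 0x1d) {
                stack[sp++] = locals[op - 0x1a];           // iload_0..3
            } else if (op >= 0x3b && op <= 0x3e) {
                locals[op - 0x3b] = stack[--sp];           // istore_0..3
            } else if (op == 0x60) {                       // iadd
                int b = stack[--sp], a = stack[--sp];
                stack[sp++] = a + b;
            } else if (op == 0x84) {                       // iinc index, signed constant
                locals[code[pc] & 0xff] += code[pc + 1];
                pc += 2;
            } else if (op == 0xa2) {                       // if_icmpge
                int b = stack[--sp], a = stack[--sp];
                int offset = (short) ((code[pc] << 8) | (code[pc + 1] & 0xff));
                pc = (a >= b) ? start + offset : pc + 2;
            } else if (op == 0xa7) {                       // goto
                pc = start + (short) ((code[pc] << 8) | (code[pc + 1] & 0xff));
            } else if (op == 0xac) {                       // ireturn
                return stack[--sp];
            } else {
                throw new IllegalArgumentException("unhandled opcode 0x" + Integer.toHexString(op));
            }
        }
    }
}
```

Calling LoopSketch.run(LoopSketch.SUM_BELOW, 3, 2, 5) runs the loop with three locals and a stack depth of two. The counter in local 2 is only ever loaded for the comparison and the addition; the increment itself happens in place.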

Classfiles

A classfile is a tightly packed mechanism for representing Java classes (and interfaces, and some special containers like package-info and module-info). It contains several variable-length sections, so it can’t be randomly accessed directly when loading; it has to be parsed to be understood.

The constant pool is a key component of a class file. It contains a list of typed data values (UTF-8 strings, long values, double values etc.) that are used in the methods’ code or as field initialisers. There are some instructions which encode specific values – for example, iconst_5 will push 5 onto the stack – but if you are out of luck with your value, you can encode it in the constant pool.

As well as UTF-8 strings and numeric values (for int/long/float/double – char/short/byte/boolean are figments of the JVM’s imagination) there are entries which define what it means to be a Class, what a FieldRef or MethodRef is, and a pairing called NameAndType which is essentially used to bind together a method name like equals with its descriptor type (Ljava/lang/Object;)Z – or as programmers know it, boolean equals(Object). Java decomposes its methods this way because if there are any other methods that have the same signature of boolean something(Object) then they can use the same type descriptor in the file and simply pair it with a different name.
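If you want to see the descriptor for a given signature without reading a class file, the JDK can produce one for you. This is a small sketch using java.lang.invoke.MethodType; the DescriptorSketch wrapper is just a name chosen for this example.

```java
import java.lang.invoke.MethodType;

class DescriptorSketch {
    /** Builds the JVM descriptor string for a method with the given return and parameter types. */
    static String descriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }
}
```

DescriptorSketch.descriptor(boolean.class, Object.class) returns (Ljava/lang/Object;)Z, and the same string would be shared by any other method taking an Object and returning a boolean, whatever it is called.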

All of the constant pool references are cross-referenced by index number, which starts at 1 – the special slot 0 is only used to encode no parent for the java.lang.Object class as far as I can tell. The pool encodes a tree-like structure through the power of indices; the this class and super class are each merely a short value pointing into the pool, so to understand what a class file is from its bytes you have to parse the full constant pool first of all.
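The need to walk the whole pool before reaching this class and super class can be sketched in a few dozen lines. The reader below keeps only the Utf8 and Class entries and skips over everything else by size; the tag numbers and sizes follow the class file format, while the ConstantPoolSketch class itself is written for this post and does no validation beyond the magic number.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ConstantPoolSketch {
    /** Returns { this class name, super class name or null } for a class file stream. */
    static String[] classNames(InputStream classFile) {
        try (DataInputStream in = new DataInputStream(classFile)) {
            if (in.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
            in.readUnsignedShort();                   // minor_version
            in.readUnsignedShort();                   // major_version
            int count = in.readUnsignedShort();       // constant_pool_count
            Object[] pool = new Object[count];        // index 0 is never populated
            for (int i = 1; i < count; i++) {
                int tag = in.readUnsignedByte();
                switch (tag) {
                    case 1: pool[i] = in.readUTF(); break;            // Utf8: u2 length + bytes
                    case 7: pool[i] = in.readUnsignedShort(); break;  // Class: index of its Utf8 name
                    case 5: case 6:                                   // Long, Double: take two pool entries
                        in.readFully(new byte[8]); i++; break;
                    case 3: case 4:                                   // Integer, Float
                        in.readFully(new byte[4]); break;
                    case 8: case 16: case 19: case 20:                // String, MethodType, Module, Package
                        in.readFully(new byte[2]); break;
                    case 15:                                          // MethodHandle
                        in.readFully(new byte[3]); break;
                    case 9: case 10: case 11: case 12: case 17: case 18: // refs, NameAndType, (Invoke)Dynamic
                        in.readFully(new byte[4]); break;
                    default: throw new IOException("unknown constant pool tag " + tag);
                }
            }
            in.readUnsignedShort();                   // access_flags
            int thisClass = in.readUnsignedShort();
            int superClass = in.readUnsignedShort();  // 0 means "no super class"
            return new String[] {
                className(pool, thisClass),
                superClass == 0 ? null : className(pool, superClass)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Follows a Class entry to the Utf8 entry that holds its name. */
    private static String className(Object[] pool, int classIndex) {
        int nameIndex = (Integer) pool[classIndex];
        return (String) pool[nameIndex];
    }
}
```

Passing it Object.class.getResourceAsStream("/java/lang/Object.class") should give java/lang/Object with no super class, while the same call for String gives java/lang/String and java/lang/Object.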

I put together this infographic showing how the class file looks in my presentation referenced at the top of this post, which hopefully paints a picture.

Class file format infographic

Other tools for introspecting bytecode are available; I’d recommend starting off with javap and using the -c and -v options to give you a bunch of information on the class. If you want to see stepping through real bytecode on a real JVM then I recommend looking at Chris Newland’s JITWatch which shows you the bytecode as it executes and how it maps back onto the source files, using the LineNumberTable attributes encoded in the bytecode.

Compiling classes in memory

Bytecode can be read in from previously generated .class files, but you can also generate it on the fly. Many JVM languages have the ability to generate .class files, but if you want to stick with Java you can use the built-in javac compiler to generate code:

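import java.io.File;
import javax.tools.ToolProvider;

var javac = ToolProvider.getSystemJavaCompiler();
var fileManager = javac.getStandardFileManager(null, null, null);
var sources = fileManager.getJavaFileObjects(new File(...));
// getTask() returns a CompilationTask; call() runs it and returns true on success
javac.getTask(null, fileManager, null, null, null, sources).call();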

You can obtain a standard file manager from the tool, but you can provide your own as well. I’ve written an InMemoryFileManager which allows you to compile Java source from a String and then obtain the resulting .class bytes as a byte array, or even load them dynamically as a class with a classloader. The example fits on a slide if you’re interested.
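For a flavour of what compiling from a String involves, here is one possible sketch built only on the javax.tools API. It is not the InMemoryFileManager from the presentation; the InMemoryCompileSketch class, its method names and the mem:/// URI scheme are all assumptions made for this example. It needs a JDK, since ToolProvider.getSystemJavaCompiler() returns null when no compiler is available.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

class InMemoryCompileSketch {
    /** Compiles one source string and returns a map of class name to class file bytes. */
    static Map<String, byte[]> compile(String className, final String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        final Map<String, ByteArrayOutputStream> outputs = new HashMap<>();
        StandardJavaFileManager standard = javac.getStandardFileManager(null, null, null);
        // Forward everything to the standard manager except class file output, which stays in memory.
        JavaFileManager manager = new ForwardingJavaFileManager<StandardJavaFileManager>(standard) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                outputs.put(name, bytes);
                URI uri = URI.create("mem:///" + name.replace('.', '/') + kind.extension);
                return new SimpleJavaFileObject(uri, kind) {
                    @Override public OutputStream openOutputStream() { return bytes; }
                };
            }
        };
        // The source is served from the String rather than from a file on disk.
        URI sourceUri = URI.create("mem:///" + className.replace('.', '/') + ".java");
        JavaFileObject sourceFile = new SimpleJavaFileObject(sourceUri, JavaFileObject.Kind.SOURCE) {
            @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return source; }
        };
        boolean ok = javac.getTask(null, manager, null, null, null,
                Collections.singletonList(sourceFile)).call();
        if (!ok) throw new IllegalStateException("compilation failed");
        Map<String, byte[]> classes = new HashMap<>();
        for (Map.Entry<String, ByteArrayOutputStream> e : outputs.entrySet()) {
            classes.put(e.getKey(), e.getValue().toByteArray());
        }
        return classes;
    }

    /** Defines the compiled classes in a throwaway class loader and returns the named one. */
    static Class<?> load(String className, final Map<String, byte[]> classes) {
        ClassLoader loader = new ClassLoader(InMemoryCompileSketch.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) throws ClassNotFoundException {
                byte[] bytes = classes.get(name);
                if (bytes == null) throw new ClassNotFoundException(name);
                return defineClass(name, bytes, 0, bytes.length);
            }
        };
        try {
            return loader.loadClass(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Invokes a public static no-argument method reflectively, wrapping checked exceptions. */
    static Object callStatic(Class<?> type, String method) {
        try {
            return type.getMethod(method).invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Compiling a source string such as public class Hello { public static int answer() { return 42; } } under the name Hello, loading it with load and calling answer through callStatic runs code that never existed as a file on disk.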

Bytecode can also be created on the fly by tools like Mockito, or through generation libraries like the higher-level ByteBuddy or the lower-level ASM. Many of these tools provide simple transformation operations on existing classes, like inserting instrumentation; generating methods from scratch comes with constraints (including the need to generate accurate stack maps for the verifier) which can be challenging.

Summary

The bytecode format used by classfiles is remarkably compact yet extensible. Of the constant pool types, very few new entries have arrived, and there has been only one removal, since the JVM was created; the majority of new features have been added through attributes, either on the class as a whole or on the individual methods.

Bytecode has remained very similar as well; much of the innovation has come from higher up the stack in the Java compiler. The only significant changes were the introduction of the invokedynamic instruction and its CONSTANT_InvokeDynamic constant pool type in Java 7 (which laid the groundwork for lambdas arriving later) and, building on top of that, the CONSTANT_Dynamic constant pool type in Java 11.

There was a political decision to increment the class file version number upon each major release since Java 8, although the bytecode hasn’t changed that much. One argument for doing this is that you know when you have a class file that requires Java 11 runtime features, even if the bytecode could run on a Java 8 VM. Since it’s also possible to get a Java compiler to output bytecode at a lower level, it doesn’t make much of a difference, and it also allows you to use javap to find out what version of Java is required to run a particular class.
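The version number is easy to read without javap, since it sits in the first eight bytes of the file. A minimal sketch, with a hypothetical ClassVersionSketch class, relying on the fact that major version 52 corresponds to Java 8 and 55 to Java 11:

```java
class ClassVersionSketch {
    /** Reads major_version from the first eight bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        // Layout: u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version, all big-endian.
        int magic = ((classBytes[0] & 0xff) << 24) | ((classBytes[1] & 0xff) << 16)
                | ((classBytes[2] & 0xff) << 8) | (classBytes[3] & 0xff);
        if (magic != 0xCAFEBABE) throw new IllegalArgumentException("not a class file");
        return ((classBytes[6] & 0xff) << 8) | (classBytes[7] & 0xff);
    }

    /** Maps a major version to the Java release that introduced it: 52 is Java 8, 55 is Java 11. */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }
}
```

A file starting CA FE BA BE 00 00 00 37 therefore has major version 55 and needs at least a Java 11 runtime.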

Getting started with understanding bytecode is easy; just run javap -c -v java.lang.Object or javap -c -v java.lang.String and see if you can understand what it tells you. Then try stepping through some compiled bytecode with JITWatch or the JVMulator. Finally, use the code snippets above or in the presentation to compile some Java code on the fly and then execute it. Once you’ve done that, you’ll have a much greater appreciation of what the JVM does for you every day.