During my free time, I am developing javaimports, a goimports-like Java command line tool that auto-imports Java classes without relying on an IDE, an LSP or any kind of cache. I use it daily to write Java, pairing it with an editor hook to automatically format my code and add missing imports on save. Obviously, I need javaimports to run fast, which pushes me to look for ways to make my code, and Java command line applications in general, faster.
My first move was to add multithreading support. But when I ran a poor man’s benchmark, some numbers caught my attention.
| project | number of files | dependencies | v1.1 (single thread) time | v1.2 (multi thread) time |
| --- | --- | --- | --- | --- |
Unsurprisingly, multithreading does not help in project C. But 0.6 seconds to parse 4 files and scan 0 dependencies sounds like an awful lot of time to me! A quick analysis found the bottleneck: despite all files being roughly the same size, parsing the first one took close to a hundred times longer than the others (200ms as opposed to 2-3ms).
I first suspected there might be some kind of caching involved, but the library I used did not contain any. What else could explain why the first call is abnormally slow? My suspicions turned to class loading. Running with -Xlog:class+load did show something interesting around the first call to Parser:

```
[0.108s][info][class,load] com.nikodoko.javaimports.parser.Parser source: file:/Users/nicolas.couvrat/javaimports-1.2-SNAPSHOT-all-deps.jar
... ~1100 classes!! ...
[0.313s][info][class,load] com.nikodoko.javaimports.parser.Parser$$Lambda$73/0x0000000800c47040 source: com.nikodoko.javaimports.parser.Parser
```
The large number of classes involved suggested that improving loading time could lead to serious performance gains. I decided to look at what happens when the Java Virtual Machine (JVM) loads a class.
What Happens When the JVM Loads a Class?
As you probably know, Java is a hybrid language, halfway between compiled languages (like C) and interpreted languages (like Python). Java code is first compiled to bytecode, which is the machine language of a virtual machine called the JVM. The JVM then interprets it[1], which means that bytecode is loaded at runtime.
Most of the details of class loading are out of the scope of this article, so I will stick to a broad overview. If you wish to learn more, I recommend you check Oracle’s very thorough documentation.
Loading Bytecode And Creating an Internal Representation
When the JVM first encounters a class name, it will try to load it using an available class loader. A
class loader can be viewed as a bytecode provider: most of the time, that bytecode will come from a
.jar archive stored locally, but it could occasionally be downloaded over the network, generated
on the fly, etc.
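Viewed this way, the raw bytecode of any class can be located exactly as a class loader would locate it. As a minimal, self-contained sketch (the FindBytecode class name is mine), we can fetch the bytes of java.lang.String and check the class file magic number:

```java
import java.io.InputStream;

public class FindBytecode {
    public static void main(String[] args) throws Exception {
        // Resolve the .class resource the same way a class loader would.
        try (InputStream in = String.class.getResourceAsStream("String.class")) {
            byte[] bytes = in.readAllBytes();
            // Every valid class file starts with the 0xCAFEBABE magic number.
            System.out.printf("java.lang.String: %d bytes, magic: %02X%02X%02X%02X%n",
                    bytes.length,
                    bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF, bytes[3] & 0xFF);
        }
    }
}
```

On a modern JDK, the source here is the java.base module image rather than a local .jar, but the principle is the same: some provider hands the JVM a stream of bytes that the JVM must then parse and validate.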
The JVM will parse that binary data and, provided it represents a valid class file, derive an internal representation from it. That representation will eventually be stored in the method area of the JVM (a section of the heap shared by all JVM threads), and used every time the corresponding class is initialized or called by the rest of the code.
Before using that internal representation, however, the JVM goes through a number of additional steps, together referred to as “linking”.
- it verifies that the bytecode satisfies a number of constraints,
- it rewrites parts of it, optimizing what it can,
- it resolves it by loading and creating the other classes it uses (also checking access permissions),
- and so on.
Finally, static fields and static blocks are initialized: the class is now ready to be used.
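The laziness of this whole sequence is easy to demonstrate: a class’s static initializer only runs when the class is first used, not when the program starts. A small sketch (class names are made up):

```java
public class LazyInit {
    static class Heavy {
        // Runs exactly once, when Heavy is first used.
        static {
            System.out.println("Heavy initialized");
        }
        static int answer() { return 42; }
    }

    public static void main(String[] args) {
        System.out.println("main started");  // Heavy is not loaded yet
        System.out.println(Heavy.answer());  // first use: load, link, initialize
        System.out.println(Heavy.answer());  // already loaded: no extra work
    }
}
```

Running it prints "main started" before "Heavy initialized", confirming that loading and initialization are deferred to the first use.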
As you can see, that’s a lot of work, especially given that it’s done recursively, and for all the core classes too. And because it is only performed when a class is first encountered, most of the price is paid at startup. While this is acceptable for a long-running, server-side application, it is very painful for a short-lived command line tool that has to go through this loading step every single time.
Reducing Startup Time With AppCDS
Maintainers of the Java language have long been aware of this issue, and have introduced Class Data Sharing (CDS) as a way to mitigate it. A first version was included with Java SE 5.0, limited to a small set of core classes. It was improved and extended several times, with AppCDS (which also works for user-defined classes) finally released in OpenJDK 10[2].
The idea is simple. The internal representation of loaded classes is dumped into an archive that can be directly memory-mapped the next time the application runs, removing the need for expensive parsing and verification. This reduces the memory footprint (part of the class archive can be shared between JVM processes, and fewer allocations are made), but it also obviously helps with speed, which is what I am interested in here.
The original iteration of AppCDS used to require 3 steps to generate and use the archive file[3], but it has been simplified with OpenJDK 13. It is now a matter of running the following 2 commands:
```sh
# Run the program once to generate a shared archive file
java -XX:ArchiveClassesAtExit=cds.jsa -jar ...

# Then use it every other time and profit!
java -XX:SharedArchiveFile=cds.jsa -jar ...
```
How efficient is it at decreasing loading time? Using -Xlog:class+load again shows the following:

```
[0.104s][info][class,load] com.nikodoko.javaimports.parser.Parser source: shared objects file (top)
... still ~1000 classes, but a lot of them now come from the shared objects file ...
[0.222s][info][class,load] com.nikodoko.javaimports.parser.Parser$$Lambda$74/0x000000080160cc40 source: com.nikodoko.javaimports.parser.Parser
```
With AppCDS, the first call to Parser takes 100ms less! Overall, running javaimports on the small project C now takes 400ms instead of 600ms, a 33% reduction in run time. Of course, the gain is flat: it only affects class loading, and the number of classes loaded does not change depending on how long the program runs. In my case, on the bigger project A, the same 200ms difference can be observed, going from 1.6s to 1.4s: less impressive, but still a solid 13% improvement.
So, do you want to use AppCDS in your project? For a short-lived Java application that does not spend too much time waiting on I/O, the answer is probably yes. It’s hard to predict exactly how much it will benefit you, but hoping for a 30% improvement in startup time for a small command line utility is not unreasonable. As for me, I’m not yet entirely satisfied. It’s not bad, but can’t it go even faster? (Spoiler: the answer is yes, and I will cover it in a future article.)
As always, shoot me a message or tweet @nicol4s_c if you want to chat about any of this, if you spotted any mistakes or typos, or if you’d like me to cover anything else! Have a great day :)
- Of course, modern day JVMs pack plenty of features and optimizations like just-in-time compilation, but these are mostly irrelevant to code that is executed a low number of times like it is the case in command line applications. [return]
- See JEP 310 for more details. [return]
- See JEP 310, or alternatively this short blog post. [return]