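Performance Tuning with JMH, Essentials Part 3: Writing Correct Microbenchmarks

JMH Essentials series (continuously updated)

Performance Tuning with JMH, Essentials Part 1: What Is JMH
Performance Tuning with JMH, Essentials Part 2: Basic Usage of JMH
Performance Tuning with JMH, Essentials Part 4: Advanced Usage of JMH
Performance Tuning with JMH, Essentials Part 5: JMH Profilers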

I. Preface

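The previous two articles introduced what JMH is and its basic usage. This article explains how to write correct JMH microbenchmarks.

[Unit conversion: 1 second (s) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)]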
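The conversion matters later in this article, because some result tables are reported in us/op and others in ns/op. The following is a minimal sketch using the JDK's java.util.concurrent.TimeUnit; the class name UnitConversionSketch and the helper usPerOpToNsPerOp are illustrative assumptions, not part of JMH.

```java
import java.util.concurrent.TimeUnit;

// Illustrative helper (hypothetical name), not part of JMH.
class UnitConversionSketch {

    // Converts a score in microseconds per operation to nanoseconds per operation.
    static double usPerOpToNsPerOp(double usPerOp) {
        return usPerOp * 1_000.0;
    }

    public static void main(String[] args) {
        // Number of microseconds and nanoseconds in one second.
        System.out.println(TimeUnit.SECONDS.toMicros(1));
        System.out.println(TimeUnit.SECONDS.toNanos(1));
        // A score of 0.002 us/op expressed in ns/op.
        System.out.println(usPerOpToNsPerOp(0.002));
    }
}
```

Read this way, a score of 0.002 us/op and a score of about 2 ns/op describe the same cost.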

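The official JMH source code (including the samples, in the jmh-samples module) can be downloaded from https://github.com/openjdk/jmh/tags.

The official JMH samples can be browsed online at http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/.

This article draws on the book 《Java高并发编程详解：深入理解并发核心库》 by 汪文君 (Wang Wenjun); readers who need it are encouraged to buy a legitimate copy.

This article is an original work by @大白有点菜. Please do not plagiarize it, and credit the source when reposting. If you find the article useful, a like and a follow are appreciated.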

II. Writing Correct Microbenchmarks

1. Add the JMH dependencies

Search the Maven repository for the jmh-core and jmh-generator-annprocess dependencies, version 1.36. The <scope>test</scope> element of jmh-generator-annprocess must be commented out; otherwise the project reports an error when it is run.

<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.36</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.36</version>
    <!-- <scope>test</scope> -->
</dependency>

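2. Avoid DCE (Dead Code Elimination)

Dead Code Elimination means that the JVM removes code that has no effect on the surrounding context, including code whose result, once computed, is provably never used. Consider the following code block.

public void test() {
    int x = 10;
    int y = 10;
    int z = x + y;
}

The test method defines x and y and adds them to produce z, but nothing later in the method uses z: it is not returned, it is not used again, and it is not even a field. The JVM may well treat test() as an empty method, removing the definitions of x and y along with the code that computes z.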
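To make the contrast concrete, here is a plain-Java sketch of the two method shapes. The class name DeadCodeShapeSketch is a hypothetical name chosen for illustration, and the sketch does not measure anything; it only shows which value is observable to the caller.

```java
// Plain-Java sketch (hypothetical class name); no JMH required.
class DeadCodeShapeSketch {

    // z is computed but never leaves the method, so a JIT compiler is free
    // to treat the whole body as dead code.
    static void discardsResult() {
        int x = 10;
        int y = 10;
        int z = x + y;
    }

    // Returning z makes the value observable to the caller, so the result
    // can no longer be dropped as unused.
    static int returnsResult() {
        int x = 10;
        int y = 10;
        int z = x + y;
        return z;
    }

    public static void main(String[] args) {
        discardsResult();
        System.out.println(returnsResult());
    }
}
```

Note that the inputs in this sketch are still constants, so returning the value protects it from dead code elimination but not from constant folding, which is discussed later in this article.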

[Verifying whether the JVM eliminates context-independent code during execution - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (avoiding DCE, i.e. dead code elimination)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_DCE {

    @Benchmark
    public void baseline() {
        // Empty method
    }

    @Benchmark
    public void measureLog1() {
        // Performs a math operation, but the result stays local to the method
        Math.log(Math.PI);
    }

    @Benchmark
    public void measureLog2() {
        // result is computed and then used on the next line
        double result = Math.log(Math.PI);
        // The second log takes result as input, but its own result is
        // neither stored nor returned, and it is never used again
        Math.log(result);
    }

    @Benchmark
    public double measureLog3() {
        // Returns the result of the math operation
        return Math.log(Math.PI);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Verifying whether the JVM eliminates context-independent code during execution - run results]

Benchmark                                                    Mode  Cnt   Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.baseline      avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog1   avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog2   avgt    5  ≈ 10⁻⁴           us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_DCE.measureLog3   avgt    5   0.002 ±  0.001  us/op

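baseline is an empty method that serves as the reference measurement.
measureLog1 performs a log operation, but the result is neither used again nor returned.
measureLog2 also performs log operations. The first result is passed as the argument to the second call, but the second result is not used any further.
measureLog3 is similar to measureLog1, except that it returns the result of the operation.

The output shows that measureLog1 and measureLog2 perform almost identically to baseline, which strongly suggests that the code in both methods was eliminated. Such code is called dead code: code whose result is used nowhere else. measureLog3 is different. Because it returns its result, Math.log(Math.PI) is not treated as dead code, so the method takes a measurable amount of CPU time. (Its argument is still a constant, so the 0.002 us/op figure may mostly reflect the cost of returning a value rather than the logarithm itself; see the section on constant folding below.)

Conclusion: to write a sound microbenchmark method, do not leave dead code in it; ideally every benchmark method returns a value.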

[Official Dead Code sample (JMHSample_08_DeadCode) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Dead Code sample
 * @author 大白有点菜
 */
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class JmhTestApp14_DeadCode {

    /**
     * The downfall of many benchmarks is Dead-Code Elimination (DCE): compilers
     * are smart enough to deduce some computations are redundant and eliminate
     * them completely. If the eliminated part was our benchmarked code, we are
     * in trouble.
     *
     * Fortunately, JMH provides the essential infrastructure to fight this
     * where appropriate: returning the result of the computation will ask JMH
     * to deal with the result to limit dead-code elimination (returned results
     * are implicitly consumed by Blackholes, see JMHSample_09_Blackholes).
     */

    private double x = Math.PI;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    @Benchmark
    public void baseline() {
        // do nothing, this is a baseline
    }

    @Benchmark
    public void measureWrong() {
        // This is wrong: result is not used and the entire computation is optimized away.
        compute(x);
    }

    @Benchmark
    public double measureRight() {
        // This is correct: the result is being used.
        return compute(x);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_DeadCode.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Official Dead Code sample (JMHSample_08_DeadCode) - run results]

Benchmark                           Mode  Cnt   Score   Error  Units
JmhTestApp14_DeadCode.baseline      avgt    5   0.261 ± 0.027  ns/op
JmhTestApp14_DeadCode.measureRight  avgt    5  13.920 ± 0.591  ns/op
JmhTestApp14_DeadCode.measureWrong  avgt    5   0.266 ± 0.039  ns/op

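[Notes on the official Dead Code sample (JMHSample_08_DeadCode) - translated with the help of Google Translate and Baidu Translate]

Many benchmarks fail because of dead code elimination (DCE): compilers are smart enough to work out that some computations are redundant and remove them completely. If the removed part is the code being benchmarked, the measurement is meaningless.

Fortunately, JMH provides the infrastructure needed to deal with this where appropriate: returning the result of the computation asks JMH to handle that result, which limits dead code elimination (returned results are implicitly consumed by Blackholes, see JMHSample_09_Blackholes).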

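3. Use Blackhole

Suppose a benchmark method needs to return two computed results. What should we do? The first idea is probably to store the results in an array or a container and return that, but operating on the array or container interferes with the measurement, because writing to it also costs CPU time.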
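The two ways of returning more than one result can be sketched in plain Java as follows. MultiResultSketch and its method names are hypothetical and exist only for illustration; the sketch shows the extra work each approach adds, it does not time anything.

```java
// Plain-Java sketch (hypothetical names); illustrates the shapes only.
class MultiResultSketch {

    // Non-final fields, mirroring how benchmark inputs are usually held.
    static double x1 = Math.PI;
    static double x2 = Math.PI * 2;

    // Approach 1: pack both results into an array. The array allocation and
    // the two element stores become part of whatever is being timed.
    static double[] returnInArray() {
        return new double[] { Math.pow(x1, 2), Math.pow(x2, 2) };
    }

    // Approach 2: merge both results into one value. Only a single addition
    // is added on top of the two pow calls.
    static double returnMerged() {
        return Math.pow(x1, 2) + Math.pow(x2, 2);
    }

    public static void main(String[] args) {
        double[] both = returnInArray();
        System.out.println(both[0] + both[1]);
        System.out.println(returnMerged());
    }
}
```

Both approaches keep the two computations alive, but each adds its own overhead to the measured method, which is the motivation for a dedicated sink.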

JMH provides a Blackhole class that prevents dead code without returning anything, much like the null device /dev/null on Linux.

[Blackhole sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (using Blackhole)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole {

    double x1 = Math.PI;
    double x2 = Math.PI * 2;

    @Benchmark
    public double baseline() {
        // Not dead code, because the result is returned
        return Math.pow(x1, 2);
    }

    @Benchmark
    public double powButReturnOne() {
        // Dead code, will be eliminated
        Math.pow(x1, 2);
        // Not eliminated, because the result is returned
        return Math.pow(x2, 2);
    }

    @Benchmark
    public double powThenAdd() {
        // The two results are merged by addition, so both computations take effect
        return Math.pow(x1, 2) + Math.pow(x2, 2);
    }

    @Benchmark
    public void useBlackhole(Blackhole hole) {
        // The results are sunk into the Blackhole, so both pow operations take effect
        hole.consume(Math.pow(x1, 2));
        hole.consume(Math.pow(x2, 2));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Blackhole sample - run results]

Benchmark                                                             Mode  Cnt  Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.baseline         avgt    5  2.126 ± 0.163  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.powButReturnOne  avgt    5  2.065 ± 0.112  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.powThenAdd       avgt    5  2.181 ± 0.151  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Blackhole.useBlackhole     avgt    5  3.748 ± 0.342  ns/op

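baseline performs a pow operation on x1 and returns the result, so it is a perfectly reasonable benchmark method.
powButReturnOne cannot save its first pow operation from being treated as dead code, so it does not give us the time of two pow computations; the pow operation on x2 is returned, so that one is not dead code.
powThenAdd is smarter: it also has a return value and both pow operations are executed normally, but because the results are added together, the CPU time of the addition is counted along with the two pow operations.
useBlackhole executes both pow operations without returning them; instead it writes the results into the Blackhole.

The output shows that baseline and powButReturnOne perform almost identically. powThenAdd takes slightly more CPU time than the first two because both pow operations are executed, although at 2.181 ns/op versus 2.126 ns/op the difference is within the reported error. useBlackhole does not merge the two results in any way, but calling the Blackhole's consume method also costs some CPU time. Even though consume has a cost, if every benchmark method without a return value consistently passes its local results through the Blackhole, all of them run under the same conditions, just as boxers must compete in the same weight class.

Conclusion: Blackhole helps you avoid dead code in benchmark methods that have no return value.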

[Official Blackhole sample (JMHSample_09_Blackholes) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Blackhole sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Balckhole {

    /**
     * Should your benchmark require returning multiple results, you have to
     * consider two options (detailed below).
     *
     * NOTE: If you are only producing a single result, it is more readable to
     * use the implicit return, as in JMHSample_08_DeadCode. Do not make your benchmark
     * code less readable with explicit Blackholes!
     */

    double x1 = Math.PI;
    double x2 = Math.PI * 2;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    /**
     * Baseline measurement: how much a single compute() costs.
     */
    @Benchmark
    public double baseline() {
        return compute(x1);
    }

    /**
     * While the compute(x2) computation is intact, compute(x1)
     * is redundant and optimized out.
     */
    @Benchmark
    public double measureWrong() {
        compute(x1);
        return compute(x2);
    }

    /**
     * This demonstrates Option A:
     *
     * Merge multiple results into one and return it.
     * This is OK when the computation is relatively heavyweight, and merging
     * the results does not offset the results much.
     */
    @Benchmark
    public double measureRight_1() {
        return compute(x1) + compute(x2);
    }

    /**
     * This demonstrates Option B:
     *
     * Use explicit Blackhole objects, and sink the values there.
     * (Background: Blackhole is just another @State object, bundled with JMH).
     */
    @Benchmark
    public void measureRight_2(Blackhole bh) {
        bh.consume(compute(x1));
        bh.consume(compute(x2));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Balckhole.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Official Blackhole sample (JMHSample_09_Blackholes) - run results]

Benchmark                              Mode  Cnt   Score   Error  Units
JmhTestApp14_Balckhole.baseline        avgt    5  13.691 ± 0.661  ns/op
JmhTestApp14_Balckhole.measureRight_1  avgt    5  22.318 ± 1.664  ns/op
JmhTestApp14_Balckhole.measureRight_2  avgt    5  26.079 ± 3.079  ns/op
JmhTestApp14_Balckhole.measureWrong    avgt    5  13.276 ± 1.931  ns/op

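[Notes on the official Blackhole sample (JMHSample_09_Blackholes) - translated with the help of Google Translate and Baidu Translate]

See the comments in the code.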

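4. Avoid Constant Folding

Constant folding is an early optimization that the Java compiler performs at compile time. While compiling a source file, javac can recognize constant expressions that can be folded: the computed result is stored directly in the declaration, so the expression does not need to be evaluated again at run time. For example:

private final int x = 10;
private final int y = x * 20;

At compile time y is assigned the value 200 directly. This is what constant folding means.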
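Folding of an int expression cannot be observed from inside a program, so the sketch below uses strings as a stand-in: the Java Language Specification guarantees that String-valued constant expressions are interned, while a concatenation that is not a constant expression creates a new String at run time. ConstantFoldingSketch and its method names are hypothetical, chosen only for this illustration.

```java
// Plain-Java sketch (hypothetical names) that makes javac's constant folding observable.
class ConstantFoldingSketch {

    // Both fields are compile-time constants, so javac stores 200 for Y directly.
    static final int X = 10;
    static final int Y = X * 20;

    // A final String initialized with a literal is a constant variable, so
    // prefix + "b" is folded by javac into the literal "ab". Folded string
    // constants are interned, which makes the reference comparison true.
    static boolean foldedAtCompileTime() {
        final String prefix = "a";
        return (prefix + "b") == "ab";
    }

    // Without final, prefix is not a constant variable. The concatenation
    // happens at run time and creates a new String object, so the reference
    // comparison is false.
    static boolean concatenatedAtRunTime() {
        String prefix = "a";
        return (prefix + "b") == "ab";
    }

    public static void main(String[] args) {
        System.out.println(Y);
        System.out.println(foldedAtCompileTime());
        System.out.println(concatenatedAtRunTime());
    }
}
```

The only difference between the two methods is the final modifier, which is exactly the distinction the benchmark in this section relies on.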

[Constant Folding sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (avoiding constant folding)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding {

    /**
     * x1 and x2 are constants declared final
     */
    private final double x1 = 124.456;
    private final double x2 = 342.456;

    /**
     * y1 is an ordinary instance field
     */
    private double y1 = 124.456;

    /**
     * y2 is an ordinary instance field
     */
    private double y2 = 342.456;

    /**
     * Directly returns the result of 124.456 x 342.456; used as the baseline
     */
    @Benchmark
    public double returnDirect() {
        return 42_620.703936d;
    }

    /**
     * Multiplies two constants; verifies whether the compiler's early
     * optimization computes x1 * x2 directly
     */
    @Benchmark
    public double returnCalculate_1() {
        return x1 * x2;
    }

    /**
     * A more complex computation on two fields that are not final;
     * also used as a reference for comparison
     */
    @Benchmark
    public double returnCalculate_2() {
        return Math.log(y1) * Math.log(y2);
    }

    /**
     * A more complex computation on final constants; checks whether
     * constant folding happens during compiler optimization
     */
    @Benchmark
    public double returnCalculate_3() {
        return Math.log(x1) * Math.log(x2);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Constant Folding sample - run results]

Benchmark                                                                      Mode  Cnt   Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_1  avgt    5   1.873 ± 0.119  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_2  avgt    5  36.126 ± 2.372  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnCalculate_3  avgt    5   1.888 ± 0.169  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Constant_Folding.returnDirect       avgt    5   1.869 ± 0.115  ns/op

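As the results show, rows 1, 3, and 4 (returnCalculate_1, returnCalculate_3, and returnDirect) have almost the same scores, which means constant folding took place: these methods do not need to compute anything at run time and simply return the result. For x1 * x2, javac can fold the constant expression itself. Math.log(x1) * Math.log(x2) contains method calls, which javac does not fold; javac only substitutes the constant values of x1 and x2, and the JIT compiler can presumably then fold Math.log of a constant at run time. The second row, returnCalculate_2, is far slower because its inputs are read from non-final fields, so neither stage can fold the computation.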

[Official Constant Folding sample (JMHSample_10_ConstantFold) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Constant Folding sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_ConstantFolding {

    /**
     * The flip side of dead-code elimination is constant-folding.
     *
     * If JVM realizes the result of the computation is the same no matter what,
     * it can cleverly optimize it. In our case, that means we can move the
     * computation outside of the internal JMH loop.
     *
     * This can be prevented by always reading the inputs from non-final
     * instance fields of @State objects, computing the result based on those
     * values, and follow the rules to prevent DCE.
     */

    // IDEs will say "Oh, you can convert this field to local variable". Don't. Trust. Them.
    // (While this is normally fine advice, it does not work in the context of measuring correctly.)
    private double x = Math.PI;

    // IDEs will probably also say "Look, it could be final". Don't. Trust. Them. Either.
    // (While this is normally fine advice, it does not work in the context of measuring correctly.)
    private final double wrongX = Math.PI;

    private double compute(double d) {
        for (int c = 0; c < 10; c++) {
            d = d * d / Math.PI;
        }
        return d;
    }

    @Benchmark
    public double baseline() {
        // simply return the value, this is a baseline
        return Math.PI;
    }

    @Benchmark
    public double measureWrong_1() {
        // This is wrong: the source is predictable, and computation is foldable.
        return compute(Math.PI);
    }

    @Benchmark
    public double measureWrong_2() {
        // This is wrong: the source is predictable, and computation is foldable.
        return compute(wrongX);
    }

    @Benchmark
    public double measureRight() {
        // This is correct: the source is not predictable.
        return compute(x);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_ConstantFolding.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Official Constant Folding sample (JMHSample_10_ConstantFold) - run results]

Benchmark                                    Mode  Cnt   Score   Error  Units
JmhTestApp14_ConstantFolding.baseline        avgt    5   1.871 ± 0.077  ns/op
JmhTestApp14_ConstantFolding.measureRight    avgt    5  13.989 ± 0.909  ns/op
JmhTestApp14_ConstantFolding.measureWrong_1  avgt    5   1.846 ± 0.075  ns/op
JmhTestApp14_ConstantFolding.measureWrong_2  avgt    5   1.870 ± 0.090  ns/op

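[Notes on the official Constant Folding sample (JMHSample_10_ConstantFold) - translated with the help of Google Translate and Baidu Translate]

See the comments in the code.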

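5. Avoid Loop Unwinding

When writing JMH code, besides avoiding dead code and reducing references to constants, we should also avoid or minimize loops inside benchmark methods, because loops are very likely to be aggressively optimized at run time (the JVM's later-stage optimization). This optimization is called loop unwinding, more commonly known as loop unrolling. Let's look at what it is.

int sum = 0;
for (int i = 0; i < 100; i++) {
    sum += i;
}

In the example above, sum = sum + i is executed 100 times, which means the JVM issues this computation to the CPU 100 times. That looks harmless, but the JVM designers consider that the loop can be optimized into something like the following (possibly).

int sum = 0;
for (int i = 0; i < 100; i += 5) {
    sum += i;
    sum += i + 1;
    sum += i + 2;
    sum += i + 3;
    sum += i + 4;
}

After the optimization, the computations in the loop body are issued to the CPU in batches, which improves efficiency. Suppose an operation such as 1+2 takes 1 nanosecond of CPU time. For a loop of 10 iterations we would expect about 10 nanoseconds, but the actual time may well be less than that.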
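For a hand-unrolled loop to remain equivalent to the original, it has to cover the same range of i (0 to 99 here) while advancing in larger steps. The following plain-Java sketch checks that equivalence; LoopUnrollSketch is a hypothetical name, and the unrolling is written by hand only to illustrate what a JIT compiler might do internally.

```java
// Plain-Java sketch (hypothetical names) comparing a rolled and a hand-unrolled loop.
class LoopUnrollSketch {

    // Original form: 100 iterations, one addition each.
    static int rolled() {
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += i;
        }
        return sum;
    }

    // Unrolled by a factor of 5: 20 iterations, five additions each,
    // still covering i = 0..99.
    static int unrolledByFive() {
        int sum = 0;
        for (int i = 0; i < 100; i += 5) {
            sum += i;
            sum += i + 1;
            sum += i + 2;
            sum += i + 3;
            sum += i + 4;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(rolled());
        System.out.println(unrolledByFive());
    }
}
```

Both methods return the same sum; the unrolled version simply pays the loop-control overhead (compare and branch) one fifth as often.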

[Loop Unwinding sample - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (avoiding loop unwinding)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding {

    private int x = 1;
    private int y = 2;

    @Benchmark
    public int measure() {
        return (x + y);
    }

    private int loopCompute(int times) {
        int result = 0;
        for (int i = 0; i < times; i++) {
            result += (x + y);
        }
        return result;
    }

    // The default value of @OperationsPerInvocation is 1
    @OperationsPerInvocation
    @Benchmark
    public int measureLoop_1() {
        return loopCompute(1);
    }

    @OperationsPerInvocation(10)
    @Benchmark
    public int measureLoop_10() {
        return loopCompute(10);
    }

    @OperationsPerInvocation(100)
    @Benchmark
    public int measureLoop_100() {
        return loopCompute(100);
    }

    @OperationsPerInvocation(1000)
    @Benchmark
    public int measureLoop_1000() {
        return loopCompute(1000);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Loop Unwinding sample - run results]

Benchmark                                                                   Mode  Cnt  Score   Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measure           avgt    5  2.038 ± 0.167  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_1     avgt    5  2.112 ± 0.548  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_10    avgt    5  0.226 ± 0.013  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_100   avgt    5  0.026 ± 0.003  ns/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Loop_Unwinding.measureLoop_1000  avgt    5  0.023 ± 0.002  ns/op

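In the code above, measure() computes x+y. measureLoop_1() is almost equivalent to measure() and also computes x+y once. measureLoop_10() executes result += (x + y) 10 times, which in plain terms amounts to calling measure() or loopCompute(1) 10 times. We obviously cannot compare the CPU time of 10 operations directly with that of a single operation, so the @OperationsPerInvocation(10) annotation tells JMH to count each invocation of measureLoop_10() as 10 operations.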
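The arithmetic behind the per-operation score can be sketched as follows. PerOperationScoreSketch and its methods are hypothetical helpers written for this illustration; they model only the division by the declared operation count, not how JMH collects its timings.

```java
// Plain-Java sketch (hypothetical names) of the per-operation score arithmetic.
class PerOperationScoreSketch {

    // Average time of one invocation divided by the declared operations per invocation.
    static double perOpNanos(double invocationNanos, int operationsPerInvocation) {
        return invocationNanos / operationsPerInvocation;
    }

    // The inverse: reconstructs the time of a whole invocation from a reported score.
    static double invocationNanos(double perOpNanos, int operationsPerInvocation) {
        return perOpNanos * operationsPerInvocation;
    }

    public static void main(String[] args) {
        // measureLoop_10 reported 0.226 ns/op with 10 operations per invocation.
        System.out.println(invocationNanos(0.226, 10));
        // measureLoop_1000 reported 0.023 ns/op with 1000 operations per invocation.
        System.out.println(invocationNanos(0.023, 1000));
    }
}
```

Read this way, one whole invocation of measureLoop_10 took roughly as long as a single measure() call (about 2 ns), and 1000 loop iterations took only about 23 ns in total.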

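The JMH results make it clear that the more iterations the loop has, the more unrolling takes place and the better the per-operation score becomes, which shows that the JVM optimized our code at run time.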

[Official Loop Unwinding sample (JMHSample_11_Loops) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Loop Unwinding sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_LoopUnwinding {

    /**
     * It would be tempting for users to do loops within the benchmarked method.
     * (This is the bad thing Caliper taught everyone). These tests explain why
     * this is a bad idea.
     *
     * Looping is done in the hope of minimizing the overhead of calling the
     * test method, by doing the operations inside the loop instead of inside
     * the method call. Don't buy this argument; you will see there is more
     * magic happening when we allow optimizers to merge the loop iterations.
     */

    /**
     * Suppose we want to measure how much it takes to sum two integers:
     */
    int x = 1;
    int y = 2;

    /**
     * This is what you do with JMH.
     */
    @Benchmark
    public int measureRight() {
        return (x + y);
    }

    /**
     * The following tests emulate the naive looping.
     * This is the Caliper-style benchmark.
     */
    private int reps(int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);
        }
        return s;
    }

    /**
     * We would like to measure this with different repetitions count.
     * Special annotation is used to get the individual operation cost.
     */
    @Benchmark
    @OperationsPerInvocation(1)
    public int measureWrong_1() {
        return reps(1);
    }

    @Benchmark
    @OperationsPerInvocation(10)
    public int measureWrong_10() {
        return reps(10);
    }

    @Benchmark
    @OperationsPerInvocation(100)
    public int measureWrong_100() {
        return reps(100);
    }

    @Benchmark
    @OperationsPerInvocation(1_000)
    public int measureWrong_1000() {
        return reps(1_000);
    }

    @Benchmark
    @OperationsPerInvocation(10_000)
    public int measureWrong_10000() {
        return reps(10_000);
    }

    @Benchmark
    @OperationsPerInvocation(100_000)
    public int measureWrong_100000() {
        return reps(100_000);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_LoopUnwinding.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}

[Official Loop Unwinding sample (JMHSample_11_Loops) - run results]

Benchmark                                       Mode  Cnt  Score   Error  Units
JmhTestApp14_LoopUnwinding.measureRight         avgt    5  2.326 ± 0.089  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_1       avgt    5  2.052 ± 0.085  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_10      avgt    5  0.225 ± 0.006  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_100     avgt    5  0.026 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_1000    avgt    5  0.022 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_10000   avgt    5  0.019 ± 0.001  ns/op
JmhTestApp14_LoopUnwinding.measureWrong_100000  avgt    5  0.017 ± 0.002  ns/op

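[Notes on the official Loop Unwinding sample (JMHSample_11_Loops) - translated with the help of Google Translate and Baidu Translate]

See the comments in the code.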

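6. Use Fork to avoid profile-guided optimizations

What is Fork for? This section introduces the role of Fork and the JVM's profile-guided optimizations.

Before explaining Fork, think about how we usually run application performance tests. Suppose we want to measure how fast Redis responds when 50, 100, and 200 threads concurrently perform a total of 100 million writes. What would we normally do? First we clear the Redis database so that, as far as possible, every test case starts from the same line: the server's memory, disk, and CPU capacity should be essentially the same, because only then is the comparison meaningful. Then we run each test case, clean up the Redis server afterwards so that it returns to its pre-test state, and finally collect the results into a test report.

Fork was introduced with the same concern in mind. A Java program normally runs all of its code, on however many threads, inside a single JVM process. As a result, the same code executed at a different moment may pick up optimizations derived from profiles collected earlier in that process, and may even be mixed with profile data gathered for other code, which can easily make our microbenchmarks inaccurate. This may sound abstract, so it is easier to understand through an example.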

[Fork set to 0 - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (Fork used to avoid profile-guided optimizations)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
// Set Fork to 0
@Fork(0)
// Set Fork to 1
//@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Fork {

    // Inc1 and Inc2 have exactly the same implementation
    interface Inc {
        int inc();
    }

    public static class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    public static class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc) {
        int result = 0;
        for (int i = 0; i < 10; i++) {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1() {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2() {
        return this.measure(inc2);
    }

    @Benchmark
    public int measure_inc_3() {
        return this.measure(inc1);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

[Fork set to 0 - run results]

Benchmark                                                      Mode  Cnt  Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_1  avgt    5  0.002 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_2  avgt    5  0.012 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_3  avgt    5  0.012 ±  0.001  us/op

With Fork set to 0, every benchmark method runs in the same JVM process as JmhTestApp14_Coding_Correct_Benchmark_Case_Fork, so the benchmark methods may be affected by the profile data already accumulated in that process.

measure_inc_1 and measure_inc_2 are implemented almost identically, yet their performance differs considerably. measure_inc_1 and measure_inc_3 have exactly the same code and still report different numbers. This is caused by the JVM's profile-guided optimizations: because all benchmark methods share the JVM process of JmhTestApp14_Coding_Correct_Benchmark_Case_Fork, the profile of that process inevitably gets mixed in. With Fork set to 1, each benchmark is run in a brand-new JVM process, so the benchmarks no longer interfere with one another.
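One plausible explanation, assuming HotSpot-style receiver-type profiling, is that the call site inc.inc() sees only Inc1 while the first benchmark runs and can be compiled for that single type, whereas later benchmarks in the same process add Inc2 to the profile and force more general code, even when Inc1 is measured again. The sketch below is a toy model of that bookkeeping, not JVM internals; TypeProfileSketch and the seenReceiverTypes set are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model (hypothetical names): records which receiver classes reach a call site.
class TypeProfileSketch {

    interface Inc {
        int inc();
    }

    static class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    static class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    // Stand-in for the per-call-site type profile that lives as long as the process.
    private final Set<Class<?>> seenReceiverTypes = new LinkedHashSet<>();

    int measure(Inc inc) {
        seenReceiverTypes.add(inc.getClass());
        int result = 0;
        for (int i = 0; i < 10; i++) {
            result += inc.inc();
        }
        return result;
    }

    int distinctReceiverTypes() {
        return seenReceiverTypes.size();
    }

    public static void main(String[] args) {
        TypeProfileSketch sketch = new TypeProfileSketch();
        Inc inc1 = new Inc1();
        Inc inc2 = new Inc2();
        sketch.measure(inc1);
        System.out.println(sketch.distinctReceiverTypes()); // only Inc1 seen so far
        sketch.measure(inc2);
        sketch.measure(inc1);
        System.out.println(sketch.distinctReceiverTypes()); // profile still contains both types
    }
}
```

In this model the profile never shrinks within one process, which matches the observation that measure_inc_3 stays slow; starting a fresh process per benchmark resets it.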

[Fork set to 1 - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: writing correct microbenchmarks (Fork used to avoid profile-guided optimizations)
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
// Set Fork to 0
//@Fork(0)
// Set Fork to 1
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Coding_Correct_Benchmark_Case_Fork {

    // Inc1 and Inc2 have exactly the same implementation
    interface Inc {
        int inc();
    }

    public static class Inc1 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    public static class Inc2 implements Inc {
        private int i = 0;

        @Override
        public int inc() {
            return ++i;
        }
    }

    private Inc inc1 = new Inc1();
    private Inc inc2 = new Inc2();

    private int measure(Inc inc) {
        int result = 0;
        for (int i = 0; i < 10; i++) {
            result += inc.inc();
        }
        return result;
    }

    @Benchmark
    public int measure_inc_1() {
        return this.measure(inc1);
    }

    @Benchmark
    public int measure_inc_2() {
        return this.measure(inc2);
    }

    @Benchmark
    public int measure_inc_3() {
        return this.measure(inc1);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

[Fork set to 1 - run results]

Benchmark                                                      Mode  Cnt  Score    Error  Units
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_1  avgt    5  0.003 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_2  avgt    5  0.003 ±  0.001  us/op
JmhTestApp14_Coding_Correct_Benchmark_Case_Fork.measure_inc_3  avgt    5  0.003 ±  0.001  us/op

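The output above is the result with Fork set to 1, which looks much more reasonable. With Fork set to 0, the benchmarks share the process, and therefore the profile, of the class that runs them. With Fork set to 1, a new process is started for each benchmark method. You can also set Fork to a value greater than 1, in which case each benchmark is run that many times in separate processes, but in general setting Fork to 1 is enough.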

[Official Fork sample (JMHSample_12_Forking) - code]

package cn.zhuangyt.javabase.jmh;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/**
 * JMH test 14: official Forks sample
 * @author 大白有点菜
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JmhTestApp14_Fork {

    /**
     * JVMs are notoriously good at profile-guided optimizations. This is bad
     * for benchmarks, because different tests can mix their profiles together,
     * and then render the "uniformly bad" code for every test. Forking (running
     * in a separate process) each test can help to evade this issue.
     *
     * JMH will fork the tests by default.
     */

    /**
     * Suppose we have this simple counter interface, and two implementations.
     * Even though those are semantically the same, from the JVM standpoint,
     * those are distinct classes.
     */
    public interface Counter {
        int inc();
    }

    public static class Counter1 implements Counter {
        private int x;

        @Override
        public int inc() {
            return x++;
        }
    }

    public static class Counter2 implements Counter {
        private int x;

        @Override
        public int inc() {
            return x++;
        }
    }

    /**
     * And this is how we measure it.
     * Note this is susceptible for same issue with loops we mention in previous examples.
     */
    public int measure(Counter c) {
        int s = 0;
        for (int i = 0; i < 10; i++) {
            s += c.inc();
        }
        return s;
    }

    /**
     * These are two counters.
     */
    Counter c1 = new Counter1();
    Counter c2 = new Counter2();

    /**
     * We first measure the Counter1 alone...
     * Fork(0) helps to run in the same JVM.
     */
    @Benchmark
    @Fork(0)
    public int measure_1_c1() {
        return measure(c1);
    }

    /**
     * Then Counter2...
     */
    @Benchmark
    @Fork(0)
    public int measure_2_c2() {
        return measure(c2);
    }

    /**
     * Then Counter1 again...
     */
    @Benchmark
    @Fork(0)
    public int measure_3_c1_again() {
        return measure(c1);
    }

    /**
     * These two tests have explicit @Fork annotation.
     * JMH takes this annotation as the request to run the test in the forked JVM.
     * It's even simpler to force this behavior for all the tests via the command
     * line option "-f". The forking is default, but we still use the annotation
     * for the consistency.
     *
     * This is the test for Counter1.
     */
    @Benchmark
    @Fork(1)
    public int measure_4_forked_c1() {
        return measure(c1);
    }

    /**
     * ...and this is the test for Counter2.
     */
    @Benchmark
    @Fork(1)
    public int measure_5_forked_c2() {
        return measure(c2);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhTestApp14_Fork.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

[Official Fork sample (JMHSample_12_Forking) - run results]

Benchmark                              Mode  Cnt   Score   Error  Units
JmhTestApp14_Fork.measure_1_c1         avgt    5   2.162 ± 0.129  ns/op
JmhTestApp14_Fork.measure_2_c2         avgt    5  12.490 ± 0.304  ns/op
JmhTestApp14_Fork.measure_3_c1_again   avgt    5  12.182 ± 0.605  ns/op
JmhTestApp14_Fork.measure_4_forked_c1  avgt    5   3.138 ± 0.162  ns/op
JmhTestApp14_Fork.measure_5_forked_c2  avgt    5   3.179 ± 0.302  ns/op

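[Notes on the official Fork sample (JMHSample_12_Forking) - translated with the help of Google Translate and Baidu Translate]

See the comments in the code.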


Reposted from: https://blog.csdn.net/u014282578/article/details/128180516
Copyright belongs to the original author, 大白有点菜. In case of infringement, please contact us for removal.
