I have two sets of machines, set A and set B, with apparently the same hardware and software configuration but a significant difference in performance: machines in set B are up to 4x faster than machines in set A. However, if I reboot a machine in set A, it inexplicably starts performing as expected, like the machines in set B. I can't find an explanation for this behavior.
- Dual processor E5-2630v3, 8 physical cores, 2.4GHz base frequency, 3.2GHz Turbo frequency.
- 8x8GB RAM DDR4, 2133MHz, one module per channel.
- None of the machines has SEL events logged by the BMC that could point to a hardware issue.
- None of the machines triggers any Machine Check Exception during the whole duration of the benchmarks.
All the hardware components are identical in Model/Part Number.
Settings and software
BIOS version and BIOS settings are identical, e.g. Turbo Boost is enabled. See the link for details.
The machines are running the same 64-bit version of Red Hat 6 and the same kernel.
Machines in both sets are idle, but whatever load I try to run, I get very different results in terms of performance. For the sake of simplicity, I will run all the benchmarks on core 0. The result is reproducible on all cores (both processors).
    [root@SET_A ~]# uptime
     11:48:40 up 51 days, 19:34,  2 users,  load average: 0.00, 0.00, 0.00
    [root@SET_A ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
    real    0m43.751s
    user    0m43.742s
    sys     0m0.005s

    [root@SET_B ~]# uptime
     11:50:00 up 15 days, 19:43,  1 user,  load average: 0.00, 0.00, 0.00
    [root@SET_B ~]# taskset -c 0 sh -c 'time echo "scale=5000; a(1)*4" | bc -l > /dev/null'
    real    0m18.648s
    user    0m18.646s
    sys     0m0.004s
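The statement that the result reproduces on every core can be checked with a small loop around the same command. The following is only a minimal sketch, not the script used for the measurements above: `run_pinned` is a hypothetical helper name, and the CPU list is passed in explicitly because logical CPU numbering depends on the machine.

```shell
#!/bin/sh
# run_pinned "<cpu list>" <command...>
# Runs the command once per listed CPU, pinned with taskset, and prints
# the wall-clock seconds each run took (1 s resolution from date +%s).
run_pinned() {
    cpus=$1
    shift
    for cpu in $cpus; do
        start=$(date +%s)
        taskset -c "$cpu" "$@" > /dev/null
        end=$(date +%s)
        echo "cpu $cpu: $((end - start)) s"
    done
}

# Example (assumes bc is installed, as in the benchmark above):
# run_pinned "0 1 8 9" sh -c 'echo "scale=5000; a(1)*4" | bc -l'
```

One-second resolution is coarse, but it is enough to tell an 18 s run from a 43 s run.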
turbostat reports core 0 as being in the C0 consumption state and in the P0 performance state, with Turbo frequency enabled, for the whole duration of the benchmark.
    [root@SET_A ~]# turbostat -i 5
    pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W  RAM_W PKG_% RAM_%
                 3.15 3.18 2.39   0   3.26   0.00  93.59   0.00   40   46  49.77   0.00   0.00   0.00  46.45  22.41  0.00  0.00
     0   0   0  99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   40   46   0.00   0.00   0.00   0.00  29.29  12.75  0.00  0.00

    [root@SET_B ~]# turbostat -i 5
    pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6    %c7 CTMP PTMP   %pc2   %pc3   %pc6   %pc7  Pkg_W  RAM_W PKG_% RAM_%
                 3.14 3.18 2.39   0   3.27   0.00  93.59   0.00   38   40  49.81   0.00   0.00   0.00  46.12  21.49  0.00  0.00
     0   0   0  99.99 3.19 2.39   0   0.01   0.00   0.00   0.00   38   40   0.00   0.00   0.00   0.00  32.27  13.51  0.00  0.00
To simplify the benchmark as much as possible (no FP, as few memory accesses as possible), I wrote the following 32-bit code.
    .text
    .global _start

    _start:
            movl    $0x0, %ecx
    oloop:
            cmp     $0x2, %ecx
            je      end
            inc     %ecx
            movl    $0xFFFFFFFF, %eax
            movl    $0x0, %ebx
    loop:
            cmp     %eax, %ebx
            je      oloop
            inc     %ebx
            jmp     loop
    end:
            mov     $1, %eax
            int     $0x80

    .data
    value:
            .long   0
It simply increments a register from 0 to 0xFFFFFFFF twice, nothing else.
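As a rough baseline for how much work this program does, the instruction and branch counts can be estimated from the loop structure alone. The inner loop executes four instructions per iteration (cmp, je, inc, jmp), two of which are branches, and it runs about 2^32 times for each of the two outer passes. The sketch below ignores the handful of instructions in the outer loop and the exit path. The variable names are illustrative only.

```shell
#!/bin/sh
# Estimated dynamic instruction count for simple.out.
# Assumes a shell with 64-bit integer arithmetic (the usual case on 64-bit Linux).
inner_iterations=$((1 << 32))      # %ebx counts from 0 up to 0xFFFFFFFF
outer_passes=2                     # %ecx counts from 0 to 2
insns_per_iteration=4              # cmp, je, inc, jmp
branches_per_iteration=2           # je, jmp

total_insns=$((outer_passes * inner_iterations * insns_per_iteration))
total_branches=$((outer_passes * inner_iterations * branches_per_iteration))

echo "instructions: $total_insns"
echo "branches:     $total_branches"
```

That is roughly 34.4 billion instructions and 17.2 billion branches per run, which is the order of magnitude to expect from hardware counters for this binary.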
    [root@SET_A ~]# md5sum simple.out
    30fb3a645a8a0088ff303cf34cadea37  simple.out
    [root@SET_A ~]# time taskset -c 0 ./simple.out
    real    0m10.801s
    user    0m10.804s
    sys     0m0.001s

    [root@SET_B ~]# md5sum simple.out
    30fb3a645a8a0088ff303cf34cadea37  simple.out
    [root@SET_B ~]# time taskset -c 0 ./simple.out
    real    0m2.722s
    user    0m2.724s
    sys     0m0.000s
That is a 4x difference just to increment a register.
More observations with the simplified benchmark
During the benchmark, the number of interrupts is the same on both machines, ~1100 intr/s (reported with mpstat). These are mostly Local Timer Interrupts on CPU0, so there's basically no difference in the source of interruption.
    [root@SET_A ~]# mpstat -P ALL -I SUM 1
    01:00:35 PM  CPU    intr/s
    01:00:36 PM  all   1117.00

    [root@SET_B ~]# mpstat -P ALL -I SUM 1
    01:04:50 PM  CPU    intr/s
    01:04:51 PM  all   1112.00
For C-states, the same holds as above.
Set A:

    Performance counter stats for 'taskset -c 0 ./simple.out':

        41,383,515 instructions:k        #  0.00  insns per cycle       [71.42%]
    34,360,528,207 instructions:u        #  1.00  insns per cycle       [71.42%]
            63,675 cache-references                                     [71.42%]
             6,365 cache-misses          #  9.996 % of all cache refs   [71.43%]
    34,439,207,904 cycles                #  0.000 GHz                   [71.44%]
    34,400,748,829 instructions          #  1.00  insns per cycle       [71.44%]
    17,186,890,732 branches                                             [71.44%]
               143 page-faults
                 0 migrations
             1,117 context-switches

      10.905973410 seconds time elapsed

Set B:

    Performance counter stats for 'taskset -c 0 ./simple.out':

        11,112,919 instructions:k        #  0.00  insns per cycle       [71.41%]
    34,351,189,050 instructions:u        #  3.99  insns per cycle       [71.44%]
            32,765 cache-references                                     [71.46%]
             3,266 cache-misses          #  9.968 % of all cache refs   [71.47%]
     8,600,461,054 cycles                #  0.000 GHz                   [71.46%]
    34,378,806,261 instructions          #  4.00  insns per cycle       [71.41%]
    17,192,017,836 branches                                             [71.37%]
               143 faults
                 2 migrations
               281 context-switches

       2.740606064 seconds time elapsed
- The number of kernel-space instructions differs because of the control paths that lead to a reschedule. There are no system calls involved apart from a final sys_exit. Clearly the number of context switches is higher for Set A.
- There's also a very small difference in user-space instructions (~10M). Might this have the same cause as above, i.e. instructions that lead to a reschedule being accounted as user space? Or instructions executed in interrupt context?
- The total number of instructions for Set A is 0.06% higher, but the number of L3 cache references is double. Is this expected? A quick check of the cache configuration gives the same result on both sets. Caches are correctly enabled (CR0: 0x80050033, CD is 0) and the MTRR configuration is identical.
- Probably the most interesting value is the instructions per cycle: 1 instruction per cycle on Set A, 4 instructions per cycle on Set B.
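The instructions-per-cycle figures can be recomputed from the raw counters, together with the average clock implied by cycles divided by elapsed time. This is only a sketch: the helper names `ipc` and `ghz` are made up for illustration, the inputs are copied from the two perf outputs above, and because the counters were multiplexed (the bracketed percentages) the results are estimates.

```shell
#!/bin/sh
# ipc <instructions> <cycles>: instructions retired per cycle
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# ghz <cycles> <elapsed seconds>: average clock over the whole run, in GHz
ghz() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", c / t / 1e9 }'
}

echo "Set A: IPC $(ipc 34400748829 34439207904), $(ghz 34439207904 10.905973410) GHz"
echo "Set B: IPC $(ipc 34378806261 8600461054), $(ghz 8600461054 2.740606064) GHz"
```

Both sets come out at a little over 3.1 GHz, in line with the turbostat readings, which suggests that the gap in elapsed time comes from the difference in instructions per cycle rather than from a difference in clock speed.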
Is there an obvious reason that can explain this difference in performance?
Why are machines in Set A running at 1 instruction per cycle while machines in Set B are running at 4 instructions per cycle, given that the hardware and software configuration is identical?
Why does rebooting a machine seem to fix this behavior? As explained in the introduction, if I reboot a machine in Set A, it starts performing as expected.
The cause here is either so trivial that I missed it or so complex that it can't really be explained. I hope it's the former. Any help, hint, or suggestion is appreciated.