To measure DTLB performance, we collected sample data for four events: Retired Instructions, Data Cache Accesses, L1 DTLB Miss and L2 DTLB Hit, and L1 DTLB and L2 DTLB Miss. (See Section 6.5.1.)
Event               Classic function   Improved function
Abbreviation        multiply_matrices  multiply_matrices
------------------  -----------------  -----------------
Ret_instructions               68,180             88,150  samples
DC_accesses                   402,415            602,298  samples
DTLB_L1M_L2H                   59,532                 53  samples
DTLB_L1M_L2M                  157,529                175  samples
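In their paper titled "On Reducing TLB Misses in Matrix Multiplication," Kazushige Goto and Robert van de Geijn assert that translation lookaside buffer misses are the limiting factor in fast matrix multiplication. The event data supports their claim: the classic function recorded 217,061 DTLB miss samples (59,532 + 157,529), while the improved function recorded only 228 (53 + 175).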
Derived measurements were computed from the event data:
                    Classic function   Improved function
Measurement         multiply_matrices  multiply_matrices
------------------  -----------------  -----------------
Elapsed time                  13.2340             3.4370  seconds
L1 DTLB req rate               0.5902             0.6833
L1 DTLB miss rate              0.3184             0.0003
L1 DTLB miss ratio             0.5394             0.0004
L2 DTLB req rate               0.3184             0.0003
L2 DTLB miss rate              0.2310             0.0002
L2 DTLB miss ratio             0.7257             0.7675
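
The derived measurements can be reproduced from the sample counts in the event table. The following Python sketch shows one way to do so. The formulas and the names SAMPLES, PERIOD_RATIO, and derive are illustrative assumptions, not taken from the measurement tool. In particular, the per-instruction rates in the table only match the raw sample counts if retired instructions were sampled with a period ten times longer than the other events. That factor of 0.1 is inferred from the published numbers and is not stated in this section.

```python
# Sample counts copied from the event table above.
SAMPLES = {
    "classic": {
        "ret_instructions": 68_180,
        "dc_accesses": 402_415,
        "dtlb_l1m_l2h": 59_532,
        "dtlb_l1m_l2m": 157_529,
    },
    "improved": {
        "ret_instructions": 88_150,
        "dc_accesses": 602_298,
        "dtlb_l1m_l2h": 53,
        "dtlb_l1m_l2m": 175,
    },
}

# ASSUMPTION: ratio of the sampling period of the DC/DTLB events to the
# sampling period of the retired-instruction event. The value 0.1 is inferred
# from the published table, not taken from the profiler configuration.
PERIOD_RATIO = 0.1


def derive(s, period_ratio=PERIOD_RATIO):
    """Compute the six DTLB measurements from one column of sample counts."""
    instr = s["ret_instructions"]
    # Every L1 DTLB miss either hits or misses in the L2 DTLB.
    l1_misses = s["dtlb_l1m_l2h"] + s["dtlb_l1m_l2m"]
    l2_misses = s["dtlb_l1m_l2m"]
    return {
        # Rates are events per retired instruction.
        "L1 DTLB req rate": period_ratio * s["dc_accesses"] / instr,
        "L1 DTLB miss rate": period_ratio * l1_misses / instr,
        # Ratios are misses per request at the same DTLB level.
        "L1 DTLB miss ratio": l1_misses / s["dc_accesses"],
        # Each L1 DTLB miss becomes one L2 DTLB request.
        "L2 DTLB req rate": period_ratio * l1_misses / instr,
        "L2 DTLB miss rate": period_ratio * l2_misses / instr,
        "L2 DTLB miss ratio": l2_misses / l1_misses,
    }


if __name__ == "__main__":
    for version, counts in SAMPLES.items():
        print(version)
        for name, value in derive(counts).items():
            print(f"  {name:<20}{value:.4f}")
```

The two ratios divide one event count by another with the same sampling period, so they do not depend on the assumed period ratio. Only the four per-instruction rates do.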
The L1 DTLB request rate is higher for the improved version because it performs more memory access operations per instruction than the classic version. For the classic (textbook) version, an L1 DTLB miss occurs every 3.1 instructions and an L2 DTLB miss occurs every 4.3 instructions, which is clearly unacceptable. The improved matrix multiplication program executes at least 3,300 instructions per DTLB miss.
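
The instructions-per-miss figures are the reciprocals of the miss rates in the derived-measurements table. A minimal Python sketch of that arithmetic follows. The names MISS_RATES and instructions_per_miss are illustrative, and the inputs are the rounded table values, so the results for the improved version are only approximate.

```python
# Miss rates (misses per retired instruction), copied from the
# derived-measurements table. These are the rounded, published values.
MISS_RATES = {
    "classic": {"L1 DTLB": 0.3184, "L2 DTLB": 0.2310},
    "improved": {"L1 DTLB": 0.0003, "L2 DTLB": 0.0002},
}


def instructions_per_miss(miss_rate):
    """Average number of retired instructions between consecutive misses."""
    return 1.0 / miss_rate


if __name__ == "__main__":
    for version, rates in MISS_RATES.items():
        for level, rate in rates.items():
            print(f"{version:<9}{level}: one miss every "
                  f"{instructions_per_miss(rate):,.1f} instructions")
```

Because the improved version's rates are rounded to a single significant digit, the reciprocal of 0.0003 (about 3,333) is best read as a lower bound, which matches the "at least 3,300" wording above.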