I ran the program (compiled with g++-3.4.3, linked against glibc-2.3.5) on a 5475 board (266 MHz) running linux-2.4.26, and got a total time of 9.62 seconds, of which 7.05 seconds is in user space and the other 2.57 seconds is spent in the kernel.
The ColdFire MMU requires support code to handle page table walks, whereas the ARM walks the page tables in hardware to refill its TLB (translation lookaside buffer). This explains the 2+ seconds in the kernel, since there are 1887570 page misses (a page is 8K). Each miss takes on average 1.361 microseconds to handle. The program has *very* poor reference locality.
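For anyone who wants to check the per-miss figure, here is a minimal sketch of the arithmetic. It assumes that all of the kernel time goes to miss handling, which is only an approximation, and the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in this post.
const double kKernelSeconds = 2.57;        // time spent in the kernel
const std::uint64_t kMisses = 1887570;     // reported page misses
const double kCpuHz = 266e6;               // 266 MHz board clock

// Average handling time per miss in microseconds.
// Assumption: all kernel time is attributed to miss handling.
inline double avg_miss_microseconds(double kernel_seconds, std::uint64_t misses) {
    assert(misses > 0);
    return kernel_seconds / static_cast<double>(misses) * 1e6;
}

// The same average expressed in CPU clock cycles.
inline double avg_miss_cycles(double kernel_seconds, std::uint64_t misses, double cpu_hz) {
    assert(misses > 0);
    return kernel_seconds / static_cast<double>(misses) * cpu_hz;
}
```

Calling avg_miss_microseconds(kKernelSeconds, kMisses) gives roughly 1.36 microseconds, and avg_miss_cycles(kKernelSeconds, kMisses, kCpuHz) gives roughly 362 cycles per miss at 266 MHz.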
Part of the TLB code in the kernel has to flush the cache when handling a TLB miss, because when a page is removed from the hardware MMU, the cache lines for that page need to be flushed/invalidated. Currently I believe that the *entire* cache is flushed/invalidated instead of only the affected lines, since it's simpler to do. It is an area of the kernel that is ripe for performance review.
Which compiler and libstdc++ did your friend use to compile the test code for the ARM? Which ARM is it? Which kernel is it running?