Last Post 16 Mar 2018 11:32 AM by  Adam Ford
ltib custom toolchain
 50 Replies
Jared
New Member
New Member
Posts:19


--
21 Jan 2015 05:16 PM

    I am trying to use a toolchain other than the Codesourcery-2011.09-70 gcc-4.6.1 ARMv5te/glibc-2.13. Everything seems to build without errors and I can load it onto the Torpedo with 'run makeyaffsboot'. However, when I try to execute the image, it always seems to hang at the start of the kernel:

    //cut and paste- of terminal window

       Verifying Checksum ... OK
       Loading Kernel Image ... OK
    OK

    Starting kernel ...

    //end cut and paste - hangs at Starting Kernel ...

     

    I am having crosstool-NG use glibc-2.13 and I've tried different GCC versions without much luck. The u-boot portions compiled with the custom toolchain appear to be working: I can load previously compiled kernels / system builds with the new u-boot builds. However, none of the kernel files generated with the custom toolchain seem to load. Any suggestions?

    -J

     

    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    22 Jan 2015 06:30 AM
    We haven't tested or certified other tools, so my first question would be why do you want a different tool? If it's for size, there are some options to compile for size with the existing compiler.

    If you really feel it necessary to use a different compiler, there are kernel debugging options you can enable to help debug. I have been told there are some incompatibilities with other compilers and the TI DVSDK, so we generally recommend against using different compilers.


    adam
    Jared
    New Member
    New Member
    Posts:19


    --
    22 Jan 2015 08:38 AM
    Hi Adam,

    I had been testing some statically compiled DSP performance test programs on a very similar platform (OMAP 3530) with very similar PoP memory. The other platform was significantly faster (3x-8x) for certain tests. Since the binary is static, the hardware is very similar, and the CPU frequency was the same, I was thinking that something in the kernel or system calls was responsible for the difference. The other hardware platform is running a slightly newer kernel and was built with more recent toolchains, which I thought might explain the difference in performance. Additionally, I believe there have been some compiler optimization improvements for ARM, especially for NEON / SIMD, in the more modern compilers.

    My other thought was RAM timing settings. Are the Torpedo's RAM timings set to the fastest allowed between the memory and platform by default?

    Thank you for your assistance,

    -Jared
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    22 Jan 2015 08:59 AM

    Jared,

     

    I've spent some time this morning trying to compile with Code Sourcery 2014.05 without success. The kernel behaves the same way you describe. You say that the OMAP35 was 3-8x faster but was using a different kernel. Could you share what kernel it's using, what compiler you're using on the 3530, and how you are running the benchmarks?

     

    I'll do some digging into the memory interface and let you know what I find.

     

    adam

    Jared
    New Member
    New Member
    Posts:19


    --
    22 Jan 2015 09:24 AM
    Hi Adam,

    The other hardware platform is very similar to the original Beagleboard. In fact, it is so similar that it uses the same demo kernel as the Beagleboard platform. The uname -a output:
    Linux arm 3.18.1-armv7-x2 #1 SMP Wed Dec 17 14:22:02 UTC 2014 armv7l GNU/Linux
    http://elinux.org/BeagleBoardDebian

    The test program is a combination of calls to common DSP libraries and some homebrew algorithms. The DSP libraries are FFTW, and NE10, which both have options for NEON optimizations. Both libraries seem to be commonly used with ARM devices. The test program was originally used to compare FFTW and NE10, but also tested different versions of some homebrew algorithms. Thank you again for your assistance.

    Sincerely,

    Jared



    Jared
    New Member
    New Member
    Posts:19


    --
    22 Jan 2015 09:29 AM
    I forgot the compiler. It looks like the kernel was compiled with 4.7.2, but the test programs look like they were compiled with 4.8.2.

    -J
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    22 Jan 2015 10:15 AM
    Are you using a benchmark tool? If so, can you tell me which benchmarks? I'd like to try to replicate your findings.

    I noticed the kernel you are using is very new compared to the one we have. The 3.18 kernel might have a fair number of optimizations as well. I can't work on this full time, but I would suggest building the kernel with the stock compilers and then using your new compiler tools to build only the app, to see whether the speed is affected by how the app is compiled.

    At home, I've been working on a 3.17 kernel for this board compiled with gcc 4.8.3, but it's not endorsed by Logic. I'd like to see if I notice a difference in speed between our stock kernel and the 3.17 assuming I can get the benchmark tools working, but I'd like to know what and how you are testing it so it's comparable.

    adam
    Jared
    New Member
    New Member
    Posts:19


    --
    22 Jan 2015 10:51 AM
    Hi Adam,

    Both FFTW and NE10 have benchmarking programs as part of their libraries. When the libraries are built, they also compile benchmarking programs. In FFTW, the tests directory contains a program called bench, which allows performance testing. In NE10, the build/test directory contains a program called NE10_dsp_unit_test_performance, which allows performance testing.

    My test program uses these libraries. For comparison, here are some numbers I get when I take the average of multiple executions of these algorithms.

    FFTW with 1024pt rfft - 290 us vs 95 us
    NE10 with 1024pt rfft (using NEON) - 44 us vs 402 us

    -Jared


    Jared
    New Member
    New Member
    Posts:19


    --
    22 Jan 2015 10:52 AM
    In the example above, the slower times are on the Torpedo.
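For readers who want the ratios rather than the raw times, here is a minimal Python sketch that turns the two timing pairs quoted above into slowdown factors. The dictionary keys and the helper name `slowdown` are made up for illustration; the only inputs are the four numbers from the post, and the slower figure in each pair is taken to be the Torpedo, as stated.

```python
# Mean 1024-point real FFT times in microseconds, copied from the post above.
# The slower figure in each pair is the Torpedo, the faster one the other board.
timings_us = {
    "FFTW": (290.0, 95.0),
    "NE10 (NEON)": (44.0, 402.0),
}


def slowdown(pair):
    """Return how many times slower the slow platform is than the fast one."""
    slow, fast = max(pair), min(pair)
    return slow / fast


if __name__ == "__main__":
    for name, pair in timings_us.items():
        print(f"{name}: Torpedo is about {slowdown(pair):.1f}x slower")
```

The two factors come out at roughly 3x for FFTW and roughly 9x for NE10, which is in line with the "3x-8x" range mentioned earlier in the thread.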
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    23 Jan 2015 10:17 AM
    I found this website: http://www.vesperix.com/a...cc-a8-fma/index.html that shows some benchmarks, and I agree with you that our module is running significantly slower than these benchmarks listed.

    I'm going to contact our software engineer who is in charge of the Linux kernel to see if he has any opinions.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 09:36 AM
    I spent a few hours over the weekend running some tests on our stock BSP 2.4-3.

    I ran ./ltib -c and changed the kernel default preconfig from (Normal) to performance. I also changed the compiler from 2009 to 2011, which increased the GCC version from 4.3.3 to 4.6.1.

    Can you tell me what compiler flags you are using to compile FFTW? http://www.fftw.org/doc/I...llation-on-Unix.html has some suggestions.

    Either way, I am still not getting the highest speeds, but in my testing these changes seemed to double the performance. I also experimented with some other compiler flags based on feedback from TI http://processors.wiki.ti.../index.php/Cortex-A8 In there they recommended using "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp" as GCC flags.

    After applying all of the above suggestions, I was able to get MFLOPS ratings about 3x higher than before, but I wasn't yet able to achieve the 8x you stated.

    I am still waiting to hear back from our Linux developer to see what he says.

    adam

    jduran.gm
    New Member
    New Member
    Posts:79


    --
    26 Jan 2015 09:41 AM
    Adam,

    Taking a look at the FFTW website (http://www.fftw.org/doc/I...ation-on-Unix.html), you should enable NEON:

    --enable-sse, --enable-sse2, --enable-avx, --enable-altivec, --enable-neon: Enable the compilation of SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM processors). SSE, AltiVec, and NEON only work with --enable-float (above). SSE2 works in both single and double precision (and is simply SSE in single precision). The resulting code will still work on earlier CPUs lacking the SIMD extensions (SIMD is automatically disabled, although the FFTW library is still larger).

    Joaquim Duran
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 09:46 AM
    For configuring FFTW I did the following:

    ./configure --prefix=/home/aford/1026167_LogicPD_Linux_BSP_2.4-3/rootfs --enable-single --enable-neon --host=arm-none-linux-gnueabi "CC=arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mfloat-abi=softfp -mfpu=neon -ffast-math"

    This was both based on the link you sent me as well as the TI one I sent you.
    Jared
    New Member
    New Member
    Posts:19


    --
    26 Jan 2015 10:49 AM
    Hi Adam,

    I am already using the 2011 version of CodeSourcery and I believe I had the NEON instructions already enabled. I remember I tried the performance kernel option in the past and did not remember a noticeable difference.

    my configure was
    ./configure --prefix=/home/logic/logic/Logic_BSPs/Linux_3.0/REL-ltib-DM3730-2.3-2/rootfs/usr --with-slow-timer --host=arm-linux-gnueabi --enable-single --enable-neon
    and my CFLAGS were
    "-O2 -fsigned-char -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -ffast-math -funsafe-math-optimizations"
    Would it be possible for you to post your bench results with the CPU locked at 600MHz for:
    ./bench orf1024
    ./bench orf2048
    ./bench orf4096

    Thank you again for your help and support.

    -J

    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 10:59 AM
    I have to split my time amongst multiple people, but I should be able to work on this again later tonight and have some results for you tomorrow morning.

    I know our head Linux guy is investigating because I have seen some e-mails going back and forth internally. I'll let you know when I hear something useful to you.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 11:27 AM
    I mentioned the kernel issue with newer build tools to the Linux developer here, and I'm waiting to hear back from him. He is aware of both the performance issue and the problem building the kernel with the newer toolchain.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 01:47 PM
    Our head Linux developer confirmed that the kernel hang is caused by a kernel Oops in the first call to clkdev_add(). Specifically, r4 changes during the call to mutex_lock_nested, which is a definite no-no, as r4 should be a callee-saved/restored register.

    I won't bore you with the disassembly. He is not sure if the kernel Oops is a bug in the tools or the kernel source. Either way, we're looking for a simple solution to solve the performance concern without creating the Kernel Oops.

    I'll keep you posted as I learn more.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 11:31 PM
    Jared,

    To answer your question, at 600MHz I get the following results with the 2011 compiler:


    Problem: orf1024, setup: 22.58 s, time: 133.04 us, ``mflops'': 192.42
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 29.10 s, time: 206.95 us, ``mflops'': 272.14
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.84 s, time: 467.28 us, ``mflops'': 262.97
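As a sanity check on how the "mflops" column relates to the "time" column, here is a minimal Python sketch. It assumes the usual FFTW benchmark convention, in which the figure is a nominal operation count of 5 N log2(N) for a complex transform (half that for a real one) divided by the time in microseconds, not a measured count of floating-point instructions. The function name `fftw_bench_mflops` is made up for illustration.

```python
import math


def fftw_bench_mflops(n, time_us, real=True):
    """Nominal FFTW-style "mflops" for an n-point transform.

    Assumed convention: 5 * n * log2(n) nominal operations for a complex
    transform, half that for a real transform, divided by microseconds.
    """
    nominal_ops = 5.0 * n * math.log2(n)
    if real:
        nominal_ops /= 2.0
    return nominal_ops / time_us


if __name__ == "__main__":
    # Times (us) from the 600MHz run above; "orf" problems are real transforms.
    for n, time_us in [(1024, 133.04), (2048, 206.95), (4096, 467.28)]:
        print(f"orf{n}: {fftw_bench_mflops(n, time_us):.2f} mflops")
```

Under that assumption the computed values agree with the reported 192.42, 272.14 and 262.97 to within rounding, so comparing either the time or the mflops column between runs gives the same picture.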
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Jan 2015 11:43 PM
    At 600 MHz, the 2014.05 compiler returns:

    Problem: orf1024, setup: 22.50 s, time: 85.11 us, ``mflops'': 300.79
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 29.00 s, time: 206.95 us, ``mflops'': 272.14
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.87 s, time: 472.06 us, ``mflops'': 260.3
    DM-37x#

    Jared
    New Member
    New Member
    Posts:19


    --
    02 Feb 2015 04:14 PM
    Hi Adam,
    I am able to build my system with GCC 4.7.4 using the ltib custom toolchain option. When I try to build with GCC versions later than that, it seems to hang at "Starting kernel ...". With 4.7.4 the kernel appears to run, but I am still not seeing the performance I expect out of the hardware. I was wondering whether the memory timing settings were ever confirmed to be the fastest for the platform? My other guess was an alignment or alignment-trap issue. I noticed that on the other systems I was running the static binaries on, the A bit in the ARM control register is not set, while on the Torpedo it is.

    -Jared
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    03 Feb 2015 08:41 AM
    Jared,

    To answer your question about the memory timings, I confirmed with another engineer that the timings should be set to their optimal settings for this hardware. It is possible that other hardware you are evaluating has different memory with differing timings.

    As far as unaligned access and the A bit that you mentioned go, I had to do a little digging. I found this article interesting: http://jsolano.net/2012/0...ting-point-in-linux/

    We do set the CONFIG_ALIGNMENT_TRAP flag in our kernel by default as recommended by this article.

    I also ran some benchmarks against a hacked 3.14 kernel that I cobbled together compiled for our DM3730 and the benchmarks were consistent with the 3.0 kernel.

    I also modified the SPEC file to load the 3.0.101 kernel instead of the 3.0 kernel. With some small tweaks, it compiled and executed with a much newer Code Sourcery arm2014.05 toolchain as well as a few different incarnations of crosstool-NG. Each time, I got fairly consistent NEON floating-point values.

    Having said that, since my job is to help people get the board functional, and some of these details go beyond my expertise, we do have some people who could help you investigate this further, but it would require a service contract at some expense. I realize that isn't what you wanted to hear, but if you are interested in setting up a service contract, I can certainly have someone in our sales group contact you.

    adam

    Mike
    New Member
    New Member
    Posts:11


    --
    18 Feb 2015 10:03 AM
    Hi Adam,

    I work with Jared and we have some new developments on this subject.

    First, Jared pulled memory timing information from the manufacturer datasheet and applied them in x-loader overriding your memory timing settings, resulting in a notable performance improvement.

    Second (and more significant), we discovered that after initial boot, cache is not enabled on the platform.  What is interesting is that if you put the board in sleep state and then wake it up cache is then enabled.  The easiest way to verify this is to boot the system and run a benchmark program of your choice (neon or not, it affects all system performance), then put the system into sleep state and wake it back up, then run the benchmark program again.  The difference, as you might expect, is astounding.  What we haven't yet figured out is the right place in the kernel to enable cache so that it is in fact enabled after initial boot.  Any insight you might have on that would be appreciated.

    We are also trying to figure out how to properly enable L1NEON in the Auxiliary Control Register of CP15.  This should allow L1 caching for neon which should further improve neon performance.  Any thoughts on this would also be appreciated.

    Thanks,

    -Mike
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    18 Feb 2015 10:05 AM
    You have a very interesting find. Your findings are not what I expected. I'm going to try to escalate this to our main Linux developer, to get some feedback.

    stay tuned...

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    18 Feb 2015 02:02 PM
    There is some concern about stability over temperature. Because some of our SOMs are setup for high temperature, we have to adjust the timings to accommodate that since we have one BSP for all temperature ranges.

    Having said that, if you are willing to share with us what timings you're using, I can run some tests at temperature to see if we experience any failures. It won't be an exhaustive test, but it would make us feel better, because we normally recommend against doing that.

    I also have been given some tasks by our Linux developer to run some tests and based on the outcome of those tests, we can probably provide some patches based on the findings.

    I hope to have an update for you soon.


    Can you post the timings you have? If not, would you be willing to share them over e-mail?

    adam



    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    18 Feb 2015 02:37 PM
    I am doing some testing of the L1NEON bit in the Auxiliary Control Register of CP15.

    I will post some results when I am finished along with some recommended changes.

    adam
    Jared
    New Member
    New Member
    Posts:19


    --
    18 Feb 2015 02:37 PM
    Hi Adam,

    The register values come from the datasheet and the Excel spreadsheet provided by TI for calculating the register values for the processor family. These values should work for both the commercial and the industrial versions. The commercial version can probably be adjusted even more aggressively, but these settings will work for both. I have used both the commercial and industrial units, but have not tested them each over the full temperature range.

    ACTIMA = 0x7AE1B4C6
    ACTIMB = 0x00021217
    RFR_CTRL = 0x0005E601

    Also consider using:
    mcfg = 0x3588099

    Hope that helps.

    -Jared
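To make a packed timing word such as ACTIMA easier to review, here is a minimal Python sketch that splits it into named fields. The field layout is an assumption: it follows the ACTIM_CTRLA helper macro as I recall it from U-Boot's OMAP3 memory header (TRFC, TRC, TRAS, TRP, TRCD, TRRD, TDPL, TDAL, each counted in SDRC clock cycles) and should be checked against the TI reference manual before relying on it. The names `ACTIM_CTRLA_FIELDS`, `decode_actim_ctrla` and `encode_actim_ctrla` are made up for illustration.

```python
# (name, shift, width) for each field. ASSUMED layout, recalled from U-Boot's
# ACTIM_CTRLA macro for OMAP3; verify against the TI TRM before relying on it.
ACTIM_CTRLA_FIELDS = [
    ("TRFC", 27, 5),
    ("TRC", 22, 5),
    ("TRAS", 18, 4),
    ("TRP", 15, 3),
    ("TRCD", 12, 3),
    ("TRRD", 9, 3),
    ("TDPL", 6, 3),
    ("TDAL", 0, 5),
]


def decode_actim_ctrla(value):
    """Split a 32-bit ACTIM_CTRLA word into a dict of field values."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, shift, width in ACTIM_CTRLA_FIELDS
    }


def encode_actim_ctrla(fields):
    """Pack a dict of field values back into a 32-bit word."""
    value = 0
    for name, shift, width in ACTIM_CTRLA_FIELDS:
        value |= (fields[name] & ((1 << width) - 1)) << shift
    return value


if __name__ == "__main__":
    for name, cycles in decode_actim_ctrla(0x7AE1B4C6).items():
        print(f"{name:5s} = {cycles} cycles")
```

If the assumed layout is right, the posted value decodes to TDPL = 3, TRP = 3 and TDAL = 6, i.e. TDAL equals TDPL + TRP, which is at least internally consistent.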


    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    18 Feb 2015 02:53 PM

    I did some testing of the L1NEON bit, and the performance was actually worse after coming out of sleep with cache turned back on, so I'm a little reluctant to send you any changes for that yet. I've been sending my data to our Linux developer and he's giving me things to try. I'll keep you posted as we make progress.


    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 08:07 AM
    Jared & Mike,

    I tested this with mixed results, but this is what I received from our Linux developer. It's not considered fully validated, but it should help you modify u-boot to change the auxcr. Please use it at your own risk until we can test further.

    As for L1NEON, assuming they are using u-boot-2011.06 (i.e. BSP 2.4-3), then to enable L1NEON, I'd try modifying setup_auxcr in arch/arm/cpu/armv7/omap3/cache.S, around line 171, from:

    mov r12, #0x3
    mrc p15, 0, r0, c1, c0, 1
    orr r0, r0, #0x10 @ Enable ASA
    @ Enable L1NEON on pre-r2p1 (erratum 621766 workaround)
    cmp r1, #0x21
    orrlt r0, r0, #1 << 5
    .word 0xE1600070 @ SMC
    to:

    mov r12, #0x3
    mrc p15, 0, r0, c1, c0, 1
    orr r0, r0, #0x10 @ Enable ASA
    @ Enable L1NEON on all variants (may violate an errata, don't know)
    orr r0, r0, #1 << 5
    .word 0xE1600070 @ SMC
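The patch above boils down to OR-ing two bits into the Auxiliary Control Register value held in r0. Here is a minimal Python sketch of the same bit arithmetic, handy for checking a register dump by hand. The ASA and L1NEON masks come straight from the assembly (#0x10 and #1 << 5); the L2EN position (bit 1) is not in the snippet and is my assumption from the Cortex-A8 documentation. The function names are made up for illustration.

```python
ASA = 1 << 4     # from the snippet: orr r0, r0, #0x10
L1NEON = 1 << 5  # from the snippet: orr r0, r0, #1 << 5
L2EN = 1 << 1    # ASSUMPTION: L2 enable bit position, not shown in the snippet


def patched_auxcr(value):
    """Mirror the patched setup_auxcr: always set ASA and L1NEON."""
    return value | ASA | L1NEON


def describe_auxcr(value):
    """Report which of the bits discussed in this thread are set."""
    return {
        "L2EN": bool(value & L2EN),
        "ASA": bool(value & ASA),
        "L1NEON": bool(value & L1NEON),
    }


if __name__ == "__main__":
    before = 0x00000002  # hypothetical dump with only L2EN set
    after = patched_auxcr(before)
    print(hex(after), describe_auxcr(after))
```

Note that the unpatched code only set L1NEON when the CPU revision in r1 compared below 0x21, while the patched version sets it unconditionally.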


    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 11:01 AM
    With the above patch and using the 2011 compiler supplied in the BSP, I was able to compile FFTW and get the following benchmark results at 600MHz:

    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.49 s, time: 60.08 us, ``mflops'': 426.08
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 29.00 s, time: 134.23 us, ``mflops'': 419.59
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.96 s, time: 556.94 us, ``mflops'': 220.64
    DM-37x#
    Mike
    New Member
    New Member
    Posts:11


    --
    19 Feb 2015 11:21 AM

    Thanks Adam.

    Just to clarify, the patch you are referring to only enables L1NEON for all platforms (no other change), correct?  It also appears (assuming your patch only enables L1NEON) that the 1024 and 2048 yielded improvement but the 4096 performed worse.  Is that your assessment as well?

    We are very curious about the cache not being enabled on initial boot, only being enabled after a sleep/wake cycle.  This seems to have the most significant impact on performance that we have observed.  Have you confirmed this behavior?  We are wondering how/where to patch to resolve this particular issue.

     

    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 11:43 AM
    Mike,

    I was able to confirm that, in the stock BSP, NEON performance is 3x slower on startup than after returning from sleep.

    The patch I posted above is to only enable L1NEON. This seemed to eliminate the performance gap between startup and waking after sleep.

    To determine if the issue is system-wide or NEON specific, I ran some of the optional benchmark tests that we have available in our BSP. The results of those non-NEON benchmarks did not change before and after sleep, which leads us to believe that the issue is specific to NEON rather than affecting the system as a whole. Are you seeing different results? If so, do you have any register dumps that we can compare?


    As far as the FFTW results go, I am not sure how to explain why it drops at 4096, but the 1024 and 2048 results are much improved.

    We have a built-in benchmark tool called speedtest-neon which has been my primary focus on verifying performance.

    adam



    Jared
    New Member
    New Member
    Posts:19


    --
    19 Feb 2015 12:05 PM
    Hi Adam,

    JTAG probing of the microprocessor registers before and after sleep shows a difference in the L2EN bit in the auxiliary control register. This bit is not NEON specific and should affect the whole system. In the NE10 library, there are non-NEON equivalents to all of the NEON functions. These functions have the same name with the suffix '_c'. Benchmarks with these non-NEON routines show a speed increase as well.

    example:

    ne10_fft_r2c_1d_float32_neon (fout_c, fin_r, cfg); //force neon usage

    ne10_fft_r2c_1d_float32_c (fout_c, fin_r, cfg); //force non-neon version,


    -Jared
    Jared
    New Member
    New Member
    Posts:19


    --
    19 Feb 2015 03:15 PM
    Hi Adam,

    Just a quick note. If L2EN is off, then setting L1NEON will probably help performance. However, enabling L1NEON while L2EN is also enabled will probably lead to slower performance than having L2EN enabled on its own. I've been using the example you posted above to set L2EN on boot-up. It is still running through tests; I'll keep you updated as I get the results.

    -Jared
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 03:30 PM
    That's great. I'm still waiting for our main Linux developer to comment on it. If you have something you'd like me to try, I'm willing to do that. I will be out of the office tomorrow afternoon, and I have a few meetings in the morning, so I might not get to it until Monday, but I'm happy to help as I can.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 04:04 PM
    It looks like u-boot intentionally disables L2 cache.

    Inside: rpm/BUILD/u-boot-2011.06/include/configs/omap3logic.h


    /* L2 was disabled since observed that large displays (720p) weren't working
    * Verify can boot kernel using either XGA or 720p HDMI displays settings */
    #define CONFIG_L2_OFF /* Keep L2 Cache Disabled */

    I am going to run some tests to see if removing this define, so that L2 stays enabled, helps anything. I also put in a note to the developer to ask for clarification on it.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    19 Feb 2015 04:23 PM
    I agree with your assessment that L2 makes a difference, but I am still waiting to hear back on the issues with it being disabled for the purposes of high resolution displays.

    With L2 Cache enabled and ondemand governor on a 1GHz module, I get the following results:

    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.10 s, time: 34.75 us, ``mflops'': 736.69
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 28.58 s, time: 73.20 us, ``mflops'': 769.45
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.43 s, time: 155.45 us, ``mflops'': 790.46

    DM-37x# ./fftw.sh
    c i 1-D powers of two
    Problem: ic2, setup: 80.47 ms, time: 143.89 ns, ``mflops'': 69.497
    Problem: ic4, setup: 92.41 ms, time: 206.76 ns, ``mflops'': 193.46
    Problem: ic8, setup: 80.51 ms, time: 425.60 ns, ``mflops'': 281.96
    Problem: ic16, setup: 72.30 ms, time: 1.23 us, ``mflops'': 260.3
    Problem: ic32, setup: 2.48 s, time: 1.18 us, ``mflops'': 676.4
    Problem: ic64, setup: 4.20 s, time: 2.30 us, ``mflops'': 835.34
    Problem: ic128, setup: 8.10 s, time: 6.21 us, ``mflops'': 720.97
    Problem: ic256, setup: 11.37 s, time: 13.20 us, ``mflops'': 775.63
    Problem: ic512, setup: 16.66 s, time: 24.91 us, ``mflops'': 924.78
    Problem: ic1024, setup: 22.04 s, time: 59.12 us, ``mflops'': 865.96
    Problem: ic2048, setup: 28.03 s, time: 125.41 us, ``mflops'': 898.2
    Mike
    New Member
    New Member
    Posts:11


    --
    19 Feb 2015 04:25 PM

    Hi Adam,

    Good find.  Since we don't need the display we're going to enable L2 cache in u-boot.

    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    20 Feb 2015 07:56 AM
    The results above were with u-boot returned to normal (without the L1NEON patch) but with L2 cache enabled, using the 2011 compiler.

    With the L1NEON enabled AND L2 Cache enabled I got the following results using the same 1GHz processor with the ondemand governor:

    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.05 s, time: 33.26 us, ``mflops'': 769.7
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 28.60 s, time: 69.86 us, ``mflops'': 806.24
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.44 s, time: 159.27 us, ``mflops'': 771.54
    DM-37x#

    DM-37x# ./fftw.sh
    c i 1-D powers of two
    Problem: ic2, setup: 93.41 ms, time: 410.71 ns, ``mflops'': 24.348
    Problem: ic4, setup: 100.19 ms, time: 206.74 ns, ``mflops'': 193.48
    Problem: ic8, setup: 80.02 ms, time: 421.88 ns, ``mflops'': 284.44
    Problem: ic16, setup: 71.17 ms, time: 1.23 us, ``mflops'': 261.1
    Problem: ic32, setup: 2.48 s, time: 1.15 us, ``mflops'': 694.97
    Problem: ic64, setup: 4.18 s, time: 2.25 us, ``mflops'': 851.9
    Problem: ic128, setup: 8.08 s, time: 6.54 us, ``mflops'': 684.86
    Problem: ic256, setup: 11.47 s, time: 11.62 us, ``mflops'': 881.01
    Problem: ic512, setup: 16.76 s, time: 23.84 us, ``mflops'': 966.37
    Problem: ic1024, setup: 22.04 s, time: 57.34 us, ``mflops'': 892.92
    Problem: ic2048, setup: 28.13 s, time: 127.79 us, ``mflops'': 881.45

    speedtest-neon:
    It took 3.188 seconds to perform 100.000 million additions.
    That corresponds to 31.373 mflops.

    It took 3.180 seconds to perform 100.000 million multiplications.
    That corresponds to 31.450 mflops.

    It took 5.734 seconds to perform 100.000 million divisions.
    That corresponds to 17.439 mflops.


    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    20 Feb 2015 08:27 AM
    As a comparison from before running at 600MHz, with both L1NEON and L2 Cache turned on using the 2011 compiler:

    speedtest-neon:
    It took 3.164 seconds to perform 100.000 million additions.
    That corresponds to 31.605 mflops.

    It took 3.242 seconds to perform 100.000 million multiplications.
    That corresponds to 30.843 mflops.

    It took 8.008 seconds to perform 100.000 million divisions.
    That corresponds to 12.488 mflops.

    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.13 s, time: 54.71 us, ``mflops'': 467.88
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 28.46 s, time: 116.11 us, ``mflops'': 485.06
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.30 s, time: 267.02 us, ``mflops'': 460.2
    DM-37x#

    DM-37x# ./fftw.sh
    c i 1-D powers of two
    Problem: ic2, setup: 90.79 ms, time: 205.35 ns, ``mflops'': 48.697
    Problem: ic4, setup: 84.99 ms, time: 345.52 ns, ``mflops'': 115.77
    Problem: ic8, setup: 86.52 ms, time: 704.10 ns, ``mflops'': 170.43
    Problem: ic16, setup: 90.36 ms, time: 2.04 us, ``mflops'': 156.76
    Problem: ic32, setup: 2.53 s, time: 1.92 us, ``mflops'': 416.18
    Problem: ic64, setup: 4.30 s, time: 3.76 us, ``mflops'': 511.3
    Problem: ic128, setup: 8.20 s, time: 10.07 us, ``mflops'': 444.79
    Problem: ic256, setup: 11.72 s, time: 20.15 us, ``mflops'': 508.28
    Problem: ic512, setup: 16.88 s, time: 39.70 us, ``mflops'': 580.42
    Problem: ic1024, setup: 22.50 s, time: 89.41 us, ``mflops'': 572.67
    Problem: ic2048, setup: 28.42 s, time: 212.67 us, ``mflops'': 529.64
    Problem: ic4096, setup: 35.97 s, time: 559.81 us, ``mflops'': 439
    Problem: ic8192, setup: 43.00 s, time: 1.27 ms, ``mflops'': 420.43
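To put the three 600MHz `./bench` runs from this thread side by side, here is a minimal Python sketch that computes the speedup of each later configuration over the stock run. All times are copied from the posts (stock 2011 compiler, L1NEON patch only, and L1NEON plus L2 cache); the configuration labels and the helper name `speedup` are made up for illustration.

```python
# FFTW bench times in microseconds at 600MHz, copied from the posts in this
# thread. Keys are transform sizes for the orf1024 / orf2048 / orf4096 problems.
runs_us = {
    "stock": {1024: 133.04, 2048: 206.95, 4096: 467.28},
    "L1NEON only": {1024: 60.08, 2048: 134.23, 4096: 556.94},
    "L1NEON + L2": {1024: 54.71, 2048: 116.11, 4096: 267.02},
}


def speedup(baseline_us, new_us):
    """Return how many times faster the new run is than the baseline."""
    return baseline_us / new_us


if __name__ == "__main__":
    for label in ("L1NEON only", "L1NEON + L2"):
        for n, new_us in runs_us[label].items():
            factor = speedup(runs_us["stock"][n], new_us)
            print(f"{label:12s} orf{n}: {factor:.2f}x vs stock")
```

A factor below 1.0 means the configuration is slower than stock, which is what the orf4096 case shows for the L1NEON-only run.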
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    20 Feb 2015 08:46 AM
    Lastly, I ran ltib -c to set the system to build for performance. This was still with the 2011 compiler, and the results are comparable to the ones above using the ondemand governor.

    Let me know if you need anything further on this. I am still waiting to hear back on the memory timings before I can do the temperature testing.

    adam



    Jared
    New Member
    New Member
    Posts:19


    --
    20 Feb 2015 10:24 AM
    Hi Adam,

    Thank you for all your help and support. Your benchmark numbers look good. The memory timings (if deemed valid over the full temperature range) will improve the results even more. In my tests with L2EN enabled, I found that for very small array sizes L1NEON helps marginally. However, for large array sizes L1NEON seems to significantly slow performance. This particular setting may be application specific, but I would recommend not setting L1NEON with L2 cache enabled.

    Sincerely,

    -Jared
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    20 Feb 2015 10:53 AM
    I am glad it worked out. I had run into a wall before, so once you guys noticed it was L2 cache related, that gave me some different options and things to test.

    I have created a ticket in our bug tracking system and this whole conversation has spawned internal conversations, so I'll come back with updates as I get them.

    If you find other issues like that, let me know and I will work with you as best as I can.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    24 Feb 2015 04:03 PM
    Do you guys have a link to where you got the spreadsheet you used for calculating the timings?

    I'd like to review it if you don't mind. I went looking for such a document and I didn't see one on TI's site. A colleague of mine also looked around for a bit, but he couldn't find anything either.

    I am running into resistance to changing the timings, so I wanted to independently review it all.

    Thanks

    adam
    Jared
    New Member
    New Member
    Posts:19


    --
    24 Feb 2015 04:17 PM
    Hi Adam,

    I am having trouble finding the link, but I still have the TI spreadsheet. My email address is in my profile. If you send me an email, I'll reply with it as an attachment.

    -Jared
    Jared
    New Member
    New Member
    Posts:19


    --
    24 Feb 2015 04:22 PM
    Hi Adam,

    I found it. It is under AM37XX, but it is the same for OMAP35XX and DM37XX. The link is:
    http://processors.wiki.ti...AM37x_SDRC_registers

    halfway down the page there is the file:

    OMAP35x/AM/DM37x DDR register calc tool

    -Jared
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    25 Feb 2015 06:47 AM
    Thank you very much.

    adam
    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    26 Feb 2015 08:16 AM

    I went through the spreadsheet, and I concur with your findings. I loaded them into my board and the 1GHz board showed a 1-6% improvement in performance with the L2 Cache enabled. I will run some temperature testing to look at stability over the range of temperatures. 

     

    I am going to be out of the office for a few days, but when I return, I'll try to get some testing in and I'll let you know my findings. I have passed this information along to the developers for further review. 


    adam
    Sergey Brandis
    New Member
    New Member
    Posts:79


    --
    16 Mar 2018 03:27 AM

    Could you please describe the modifications you made to build kernel 3.0.101 with CodeSourcery 2014.05? I'm now facing an issue when trying to build the kernel that way: it can't boot.

    Everything stops here:
     

    bootm 0x81000000

    ## Booting kernel from Legacy Image at 81000000 ...
       Image Name:   Linux-3.0.101-BSP-dm37x-2.4-4
       Image Type:   ARM Linux Kernel Image (uncompressed)
       Data Size:    3756164 Bytes = 3.6 MiB
       Load Address: 80008000
       Entry Point:  80008000
       Verifying Checksum ... OK
       Loading Kernel Image ... OK
    OK
    setup_product_id_tag: Huh? Can't find product ID data

    Starting kernel ...

    My bootargs are (I'm booting from nand after "run makeyaffsboot"):
    == Kernel bootargs ==
    nand-ecc=chip console=ttyO0,115200n8 display=28 ignore_loglevel early_printk no_console_suspend mtdparts=omap2-nand.0:512k(x-loader),1664k(u-boot),384k(u-boot-env),5m(kernel),20m(ramdisk),-(fs) root=/dev/mtdblock5 rw rootfstype=yaffs2
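Since the bootargs pair an mtdparts string with root=/dev/mtdblock5, it can help to see which partition index 5 actually is. Here is a minimal Python sketch that parses the mtdparts value from the bootargs above into an indexed table with byte offsets. It only handles the simple size(name) and -(name) forms used in this post, not the full kernel syntax, and the function name `parse_mtdparts` is made up for illustration.

```python
import re

# Size suffixes as used in the bootargs above (binary multiples).
UNITS = {"k": 1024, "m": 1024 * 1024, "g": 1024 ** 3}


def parse_mtdparts(spec):
    """Parse 'device:size(name),...' into (device, [(index, name, offset, size)]).

    A size of '-' means "rest of the device" and is reported as None.
    Only the simple forms used in this thread are supported.
    """
    device, _, parts = spec.partition(":")
    table = []
    offset = 0
    for index, part in enumerate(parts.split(",")):
        match = re.fullmatch(r"(-|\d+)([kmg]?)\((.+)\)", part.strip())
        if match is None:
            raise ValueError(f"unsupported partition spec: {part!r}")
        size_text, unit, name = match.groups()
        size = None if size_text == "-" else int(size_text) * UNITS.get(unit, 1)
        table.append((index, name, offset, size))
        if size is not None:
            offset += size
    return device, table


if __name__ == "__main__":
    spec = ("omap2-nand.0:512k(x-loader),1664k(u-boot),384k(u-boot-env),"
            "5m(kernel),20m(ramdisk),-(fs)")
    device, table = parse_mtdparts(spec)
    for index, name, offset, size in table:
        size_text = "rest" if size is None else f"{size} bytes"
        print(f"mtdblock{index}: {name:10s} offset=0x{offset:08x} size={size_text}")
```

With these bootargs, index 5 is the "fs" partition, which matches root=/dev/mtdblock5, and the 3756164-byte kernel image reported by bootm fits inside the 5m "kernel" partition.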

    Sergey Brandis
    New Member
    New Member
    Posts:79


    --
    16 Mar 2018 03:46 AM

    I forgot to mention my compiler flags:

    -O2 -fsigned-char -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp

    Sergey Brandis
    New Member
    New Member
    Posts:79


    --
    16 Mar 2018 11:11 AM

    Sorry, I hurried a little bit. The issue was in the combination of the Linux 3.0.x kernel and GCC 4.8+: GCC too aggressively optimises some parts of the kernel. I changed the file arch/arm/lib/memset.S according to the latest commits for kernel 3.10.x and everything booted up successfully. Anyway, thanks.
    Commits:
    1) https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/arm/lib/memset.S?id=455bd4c430b0c0a361f38e8658a0d6cb469942b5
    2) https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/arch/arm/lib/memset.S?id=418df63adac56841ef6b0f1fcf435bc64d4ed177

    Adam Ford
    Advanced Member
    Advanced Member
    Posts:794


    --
    16 Mar 2018 11:32 AM
    That is good news. I was going to point out that we haven't tested the newer compilers and have found that some of them don't work. I am glad you were able to work through it (you beat me to the punch).

    Thanks for posting a fix, so we at least have it documented for future reference.

    adam


    ---