
Technical Discussion Group Forum

This forum is provided for user discussion. While Beacon EmbeddedWorks support staff and engineers participate, Beacon EmbeddedWorks does not guarantee the accuracy of all information within the Technical Discussion Group (TDG).

The "Articles" forums provide brief Articles written by Beacon EmbeddedWorks engineers that address the most frequently asked technical questions.

To receive email notifications when updates are posted for a Beacon EmbeddedWorks product download, please subscribe to the TDG Forum of interest.

TDG Forum

Last Post 16 Mar 2018 11:32 AM by Adam Ford
ltib custom toolchain
 50 Replies
Jared (New Member, Posts: 19)
21 Jan 2015 05:16 PM

    I am trying to use a toolchain other than the CodeSourcery 2011.09-70 gcc-4.6.1 ARMv5te/glibc-2.13. Everything seems to build without errors, and I can load it onto the Torpedo with 'run makeyaffsboot'. However, when I try to execute the image, it always seems to hang at the start of the kernel:

    //cut and paste- of terminal window

       Verifying Checksum ... OK
       Loading Kernel Image ... OK
    OK

    Starting kernel ...

    //end cut and paste - hangs at Starting Kernel ...

     

    I am having crosstool-NG use glibc-2.13, and I've tried different GCC versions without much luck. The u-boot portions compiled with the custom toolchain appear to be working: I can load previously compiled kernels / system builds with the new u-boot builds. However, none of the kernel images generated with the custom toolchain seem to load. Any suggestions?
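    For reference, a minimal sketch of how such a toolchain is typically put together with crosstool-NG (the sample name and menu choices below are assumptions, not necessarily what was used here):

        ct-ng arm-unknown-linux-gnueabi   # start from the closest bundled sample
        ct-ng menuconfig                  # select the GCC version, glibc 2.13, and the ARM tuning
        ct-ng build                       # installs into ~/x-tools/arm-unknown-linux-gnueabi/ by default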

    -J

     

    Adam Ford (Advanced Member, Posts: 794)
    22 Jan 2015 06:30 AM
    We haven't tested or certified other toolchains, so my first question would be: why do you want a different toolchain? If it's for size, there are options to compile for size with the existing compiler.

    If you really feel it necessary to use a different compiler, there are kernel debugging options you can enable to help track it down. I have been told there are some incompatibilities between other compilers and the TI DVSDK, so we generally recommend against switching compilers.
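    One way to get more visibility into where the boot stops (a sketch, not a certified procedure; the exact option names vary by kernel version):

        # In the kernel configuration, enable low-level debugging and early printk:
        #   CONFIG_DEBUG_LL=y        (and select the OMAP UART used for the console)
        #   CONFIG_EARLY_PRINTK=y
        # Then append "earlyprintk" to the kernel command line from U-Boot:
        setenv bootargs ${bootargs} earlyprintk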


    adam
    Jared (New Member, Posts: 19)
    22 Jan 2015 08:38 AM
    Hi Adam,

    I had been testing some statically compiled DSP performance test programs from a very similar platform, an OMAP3530 with very similar PoP memory. The other platform was significantly faster (3x-8x) on certain tests. Since the binary is static, the hardware is very similar, and the CPU frequency was the same, I was thinking something in the kernel or system calls was responsible for the difference. The other hardware platform is running a slightly newer kernel and was built with more recent toolchains, and I was thinking that might explain the difference in performance. Additionally, I believe there have been compiler optimization improvements for ARM, especially for NEON / SIMD, in more modern compilers.
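    As a quick sanity check on that theory, the build attributes of the static binaries show whether NEON code was actually generated (dsp_bench_static is a hypothetical file name):

        readelf -A dsp_bench_static | grep -i simd        # look for Tag_Advanced_SIMD_arch
        arm-none-linux-gnueabi-objdump -d dsp_bench_static | grep -cE 'vmul|vmla|vld1'   # rough count of NEON mnemonics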

    My other thought was RAM timing settings. Are the Torpedo's RAM timings set by default to the fastest allowed by the memory and the platform?

    Thank you for your assistance,

    -Jared
    Adam Ford (Advanced Member, Posts: 794)
    22 Jan 2015 08:59 AM

    Jared,

     

    I've spent some time this morning trying to compile with CodeSourcery 2014.05 without success. The kernel behaves the same way you describe. You say the OMAP3530 was 3-8x faster, but it was using a different kernel. Could you share what kernel it's running, what compiler you're using on the 3530, and how you are running the benchmarks?

     

    I'll do some digging into the memory interface and let you know what I find.

     

    adam

    Jared (New Member, Posts: 19)
    22 Jan 2015 09:24 AM
    Hi Adam,

    The other hardware platform is very similar to the original Beagleboard. In fact, it is so similar that it uses the same demo kernel as the Beagleboard platform. The output of uname -a:
    Linux arm 3.18.1-armv7-x2 #1 SMP Wed Dec 17 14:22:02 UTC 2014 armv7l GNU/Linux
    http://elinux.org/BeagleBoardDebian

    The test program is a combination of calls to common DSP libraries and some homebrew algorithms. The DSP libraries are FFTW and NE10, which both have options for NEON optimizations. Both libraries seem to be commonly used with ARM devices. The test program was originally used to compare FFTW and NE10, but it also tested different versions of some homebrew algorithms. Thank you again for your assistance.

    Sincerely,

    Jared



    Jared (New Member, Posts: 19)
    22 Jan 2015 09:29 AM
    I forgot the compiler. It looks like the kernel was compiled with GCC 4.7.2, but the test programs look like they were compiled with 4.8.2.

    -J
    Adam Ford (Advanced Member, Posts: 794)
    22 Jan 2015 10:15 AM
    Are you using a benchmark tool? If so, can you tell me which benchmarks? I'd like to try to replicate your findings.

    I noticed the kernel you are using is very new compared to the one we have. The 3.18 kernel might have a fair amount of optimizations as well. I can't work on this full time, but I would suggest building the kernel with the stock compilers and trying your new compiler tools only on the app, to see whether the speed changes with how the app is compiled (see the sketch below).
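    A sketch of that split (the compiler prefixes, defconfig step, and file names are assumptions):

        # Kernel built with the stock CodeSourcery toolchain from the BSP
        # (after configuring the kernel with the BSP's defconfig, mkimage in PATH):
        make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- uImage
        # Benchmark application built with the newer toolchain only:
        arm-unknown-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -o bench_app bench_app.c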

    At home, I've been working on a 3.17 kernel for this board compiled with gcc 4.8.3, but it's not endorsed by Logic. I'd like to see whether I notice a difference in speed between our stock kernel and the 3.17, assuming I can get the benchmark tools working, but I'd like to know what you are testing and how, so the results are comparable.

    adam
    Jared (New Member, Posts: 19)
    22 Jan 2015 10:51 AM
    Hi Adam,

    Both FFTW and NE10 include benchmarking programs as part of their libraries; when the libraries are built, the benchmarks are built as well. In FFTW, the tests directory contains a program called bench, and in NE10, the build/test directory contains a program called NE10_dsp_unit_test_performance, both of which allow performance testing.

    My test program uses these libraries. For comparison, here are some numbers I get when I average multiple executions of these algorithms:

    FFTW with 1024pt rfft - 290 us vs 95us
    NE10 with 1024pt rfft (using NEON) 44 us vs 402us

    -Jared


    Jared (New Member, Posts: 19)
    22 Jan 2015 10:52 AM
    In the previous post, the slower times are on the Torpedo.
    Adam Ford (Advanced Member, Posts: 794)
    23 Jan 2015 10:17 AM
    I found this website: http://www.vesperix.com/a...cc-a8-fma/index.html which shows some benchmarks, and I agree with you that our module is running significantly slower than the benchmarks listed there.

    I'm going to contact our software engineer who is in charge of the Linux kernel to see if he has any opinions.

    adam
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 09:36 AM
    I spent a few hours over the weekend running some tests on our stock BSP 2.4-3.

    I ran ./ltib -c, changed the kernel default preconfig from Normal to Performance, and also changed the compiler from the 2009 to the 2011 release, which bumped the GCC version from 4.3.3 to 4.6.1.

    Can you tell me what compiler flags you are using to compile FFTW? http://www.fftw.org/doc/I...llation-on-Unix.html has some suggestions.

    Either way, I am still not getting the highest speeds, but in my testing those changes seemed to double the performance. I also experimented with some other compiler flags based on feedback from TI (http://processors.wiki.ti.../index.php/Cortex-A8). There they recommended using "-march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp" as GCC flags.
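    A minimal sketch of applying those flags to a standalone test and confirming that NEON code was emitted (fft_test.c is a hypothetical source file; the objdump check only confirms NEON instructions are present, not that they are optimal):

        arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon \
            -mfloat-abi=softfp -ftree-vectorize -ffast-math -o fft_test fft_test.c
        arm-none-linux-gnueabi-objdump -d fft_test | grep -cE 'vadd\.f32|vmul\.f32'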

    After doing all of the above, I was able to get MFLOPS ratings about 3x higher than before, but I wasn't yet able to achieve the 8x you stated.

    I am still waiting to hear back from our Linux developer to see what he says.

    adam

    jduran.gm (New Member, Posts: 79)
    26 Jan 2015 09:41 AM
    Adam,

    Taking a look at the FFTW documentation (http://www.fftw.org/doc/I...ation-on-Unix.html), you should enable NEON:

    --enable-sse, --enable-sse2, --enable-avx, --enable-altivec, --enable-neon: Enable the compilation of SIMD code for SSE (Pentium III+), SSE2 (Pentium IV+), AVX (Sandy Bridge, Interlagos), AltiVec (PowerPC G4+), NEON (some ARM processors). SSE, AltiVec, and NEON only work with --enable-float (above). SSE2 works in both single and double precision (and is simply SSE in single precision). The resulting code will still work on earlier CPUs lacking the SIMD extensions (SIMD is automatically disabled, although the FFTW library is still larger).

    Joaquim Duran
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 09:46 AM
    For configuring FFTW I did the following:

    ./configure --prefix=/home/aford/1026167_LogicPD_Linux_BSP_2.4-3/rootfs --enable-single --enable-neon --host=arm-none-linux-gnueabi "CC=arm-none-linux-gnueabi-gcc -O3 -march=armv7-a -mfloat-abi=softfp -mfpu=neon -ffast-math"

    This was based both on the link you sent me and on the TI one I sent you.
    Jared (New Member, Posts: 19)
    26 Jan 2015 10:49 AM
    Hi Adam,

    I am already using the 2011 version of CodeSourcery, and I believe I already had the NEON instructions enabled. I tried the Performance kernel option in the past and don't remember a noticeable difference.

    My configure was:
    ./configure --prefix=/home/logic/logic/Logic_BSPs/Linux_3.0/REL-ltib-DM3730-2.3-2/rootfs/usr --with-slow-timer --host=arm-linux-gnueabi --enable-single --enable-neon
    and my CFLAGS were
    "-O2 -fsigned-char -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8 -ftree-vectorize -ffast-math -funsafe-math-optimizations"
    Would it be possible for you to post your bench results with the CPU locked at 600 MHz for:
    ./bench orf1024
    ./bench orf2048
    ./bench orf4096
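    For anyone reproducing this, a sketch of pinning the CPU at 600 MHz through the cpufreq sysfs interface (assumes the userspace governor is available in the BSP kernel):

        echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
        echo 600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
        cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # should report 600000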

    Thank you again for your help and support.

    -J

    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 10:59 AM
    I have to split my time amongst multiple people, but I should be able to work on this again later tonight and have some results for you tomorrow morning.

    I know our head Linux guy is investigating because I have seen some e-mails going back and forth internally. I'll let you know when I hear something useful to you.

    adam
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 11:27 AM
    I mentioned the kernel issue with the newer build tools to the Linux developer here, and I'm waiting to hear back from him, so he is aware of both the performance issue and the kernel not booting when built with the newer toolchain.

    adam
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 01:47 PM
    Our head Linux developer confirmed that the cause of the kernel hang is a kernel Oops in the first call to clkdev_add(): while calling mutex_lock_nested(), r4 changes (which is a definite no-no, as r4 should be a callee-saved/restored register).

    I won't bore you with the disassembly. He is not sure if the kernel Oops is a bug in the tools or the kernel source. Either way, we're looking for a simple solution to solve the performance concern without creating the Kernel Oops.
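    For anyone who wants to inspect the generated code themselves, a sketch (assumes a vmlinux with symbols and the cross objdump from the same toolchain):

        arm-none-linux-gnueabi-objdump -d vmlinux > vmlinux.dis
        grep -n -A 40 '<clkdev_add>:' vmlinux.dis   # check whether r4 is pushed/popped around the mutex_lock_nested call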

    I'll keep you posted as I learn more.

    adam
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 11:31 PM
    Jared,

    To answer your questions, at 600 MHz I get the following results with the 2011 compiler:


    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.58 s, time: 133.04 us, ``mflops'': 192.42
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 29.10 s, time: 206.95 us, ``mflops'': 272.14
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.84 s, time: 467.28 us, ``mflops'': 262.97
    Adam Ford (Advanced Member, Posts: 794)
    26 Jan 2015 11:43 PM
    At 600 MHz, the 2014.05 compiler returns:

    DM-37x# ./bench orf1024
    Problem: orf1024, setup: 22.50 s, time: 85.11 us, ``mflops'': 300.79
    DM-37x# ./bench orf2048
    Problem: orf2048, setup: 29.00 s, time: 206.95 us, ``mflops'': 272.14
    DM-37x# ./bench orf4096
    Problem: orf4096, setup: 35.87 s, time: 472.06 us, ``mflops'': 260.3
    DM-37x#

    Jared (New Member, Posts: 19)
    02 Feb 2015 04:14 PM
    Hi Adam,
    I am able to build my system with GCC 4.7.4 using the ltib custom toolchain option. When I try to build with GCC versions later than that, it still hangs at "Starting kernel ...". With 4.7.4 the kernel runs, but I am still not seeing the performance I expect out of the hardware. Was it ever confirmed that the memory timing settings are set to the fastest for the platform? My other guess was an alignment / alignment-trap issue: I noticed that on the other systems where I was running the static binaries, the A bit in the ARM control register is not set, while on the Torpedo it is.

    -Jared
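    One way to compare alignment-fault handling between the two systems from userspace is the /proc/cpu/alignment interface (a sketch; it is only available when CONFIG_ALIGNMENT_TRAP is enabled, and reading the SCTLR A bit directly requires privileged code, so this is an indirect check):

        cat /proc/cpu/alignment          # shows fault counters and the current handling mode
        echo 2 > /proc/cpu/alignment     # ask the kernel to silently fix up userspace alignment faults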