Technical Discussion Group Forum

This forum is provided for user discussion. While Beacon EmbeddedWorks support staff and engineers participate, Beacon EmbeddedWorks does not guarantee the accuracy of all information within the Technical Discussion Group (TDG).

The "Articles" forums provide brief Articles written by Beacon EmbeddedWorks engineers that address the most frequently asked technical questions.

To receive email notifications when updates are posted for a Beacon EmbeddedWorks product download, please subscribe to the TDG Forum of interest.

Last Post 16 Mar 2018 11:32 AM by  Adam Ford
ltib custom toolchain
 50 Replies
Page 2 of 3
Author Messages
Adam Ford
Advanced Member
Posts: 794
03 Feb 2015 08:41 AM
Jared,

To answer your question about the memory timings, I confirmed with another engineer that the timings should be set to their optimal settings for this hardware. It is possible that other hardware you are evaluating has different memory with differing timings.

As for the unaligned access and the A register that you mentioned, I had to do a little digging. I found this article interesting: http://jsolano.net/2012/0...ting-point-in-linux/

We do set the CONFIG_ALIGNMENT_TRAP flag in our kernel by default as recommended by this article.

I also ran some benchmarks against a hacked 3.14 kernel that I cobbled together and compiled for our DM3730, and the benchmarks were consistent with the 3.0 kernel.

I also modified the SPEC file to load the 3.0.101 kernel instead of the 3.0 kernel. With some small tweaks, it compiled and ran with a much newer Code Sourcery arm2014.05 toolchain as well as a few different incarnations of crosstool-ng. Each time, I got fairly consistent NEON floating-point results.

Having said that, since my job is to help people get the board functional, and some of these details go beyond my expertise, we do have some people who could help you investigate this further, but it would require a service contract at some expense. I realize that isn't what you wanted to hear, but if you are interested in setting up a service contract, I can certainly have someone in our sales group contact you.

adam

Mike
New Member
Posts: 11
18 Feb 2015 10:03 AM
Hi Adam,

I work with Jared and we have some new developments on this subject.

First, Jared pulled the memory timing information from the manufacturer's datasheet and applied it in x-loader, overriding your memory timing settings. This resulted in a notable performance improvement.

Second (and more significant), we discovered that after initial boot, cache is not enabled on the platform.  What is interesting is that if you put the board in sleep state and then wake it up cache is then enabled.  The easiest way to verify this is to boot the system and run a benchmark program of your choice (neon or not, it affects all system performance), then put the system into sleep state and wake it back up, then run the benchmark program again.  The difference, as you might expect, is astounding.  What we haven't yet figured out is the right place in the kernel to enable cache so that it is in fact enabled after initial boot.  Any insight you might have on that would be appreciated.

We are also trying to figure out how to properly enable L1NEON in the Auxiliary Control Register of CP15.  This should allow L1 caching for neon which should further improve neon performance.  Any thoughts on this would also be appreciated.

Thanks,

-Mike
Adam Ford
Advanced Member
Posts: 794
18 Feb 2015 10:05 AM
You have a very interesting find. Your findings are not what I expected. I'm going to try to escalate this to our main Linux developer, to get some feedback.

stay tuned...

adam
Adam Ford
Advanced Member
Posts: 794
18 Feb 2015 02:02 PM
There is some concern about stability over temperature. Because some of our SOMs are set up for high temperature and we have one BSP for all temperature ranges, we have to adjust the timings to accommodate that.

Having said that, if you are willing to share with us what timings you're using, I can run some tests at temperature to see if we experience any failures. It won't be an exhaustive test, but it would make us feel better, because we normally recommend against doing that.

I also have been given some tasks by our Linux developer to run some tests and based on the outcome of those tests, we can probably provide some patches based on the findings.

I hope to have an update for you soon.


Can you post the timings you have? If not, would you be willing to share them over e-mail?

adam



Adam Ford
Advanced Member
Posts: 794
18 Feb 2015 02:37 PM
I am doing some testing of the L1NEON bit in the Auxiliary Control Register of CP15.

I will post some results when I am finished along with some recommended changes.

adam
Jared
New Member
Posts: 19
18 Feb 2015 02:37 PM
Hi Adam,

The register values come from the datasheet and the Excel spreadsheet provided by TI for calculating the register values for the processor family. These values should work for both the commercial and the industrial versions. The commercial version can probably be adjusted even more aggressively, but these settings will work for both. I have used both the commercial and industrial units, but have not tested each of them over the full temperature range.

ACTIMA = 0x7AE1B4C6
ACTIMB = 0x00021217
RFR_CTRL = 0x0005E601

Also consider using:
mcfg = 0x3588099
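
As a sanity check on where a value like RFR_CTRL = 0x0005E601 could come from, here is a minimal Python sketch. None of the following is stated in the thread, so treat it all as assumption: a 200 MHz SDRC clock, a 7.8 us average refresh interval, the auto-refresh counter (ARCV) in bits 23:8, the refresh-enable field (ARE) in bits 1:0, and a margin of 50 cycles subtracted from the counter, as in TI's OMAP3 examples. The function name rfr_ctrl is made up for illustration. Check the field layout against the TRM and the TI spreadsheet before relying on it.

```python
# Sketch only: rebuild an SDRC refresh-control value from a refresh interval.
# ASSUMPTIONS (not from the thread): 200 MHz SDRC clock, 7.8 us refresh
# interval, ARCV in bits [23:8], ARE in bits [1:0], 50-cycle safety margin.

def rfr_ctrl(trefi_ns: int, sdrc_mhz: int, are: int = 1, margin: int = 50) -> int:
    """Return a refresh-control register value (hypothetical helper)."""
    cycles = trefi_ns * sdrc_mhz // 1000  # refresh interval in SDRC clock cycles
    arcv = cycles - margin                # leave headroom before the deadline
    return (arcv << 8) | are


value = rfr_ctrl(trefi_ns=7800, sdrc_mhz=200)
print(f"RFR_CTRL = 0x{value:08X}")
```

Under those assumptions the sketch reproduces the RFR_CTRL value Jared posted, which suggests the value targets a 200 MHz memory clock. The ACTIMA and ACTIMB fields are not decoded here because their layout is better taken directly from the TRM.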

Hope that helps.

-Jared


Adam Ford
Advanced Member
Posts: 794
18 Feb 2015 02:53 PM

I did some testing of the L1NEON and the performance actually was worse after coming out of sleep with cache turned back on, so I'm a little reluctant to send you any changes to that yet. I've been sending my data to our Linux developer and he's giving me stuff to try. I'll keep you posted as we make progress.


adam
Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 08:07 AM
Jared & Mike,

I tested this with mixed results, but this is what I received from our Linux developer. It's not considered fully validated, but it should help you modify u-boot to change the auxcr. Please use at your own risk until we can test further.

As for L1NEON, assuming they are using u-boot-2011.06 (i.e. BSP 2.4-3), then to enable L1NEON, I'd think modifying setup_auxcr in arch/arm/cpu/armv7/omap3/cache.S, around line 171 from:

mov r12, #0x3
mrc p15, 0, r0, c1, c0, 1
orr r0, r0, #0x10 @ Enable ASA
@ Enable L1NEON on pre-r2p1 (erratum 621766 workaround)
cmp r1, #0x21
orrlt r0, r0, #1 << 5
.word 0xE1600070 @ SMC
to:

mov r12, #0x3
mrc p15, 0, r0, c1, c0, 1
orr r0, r0, #0x10 @ Enable ASA
@ Enable L1NEON on all variants (may violate an errata, don't know)
orr r0, r0, #1 << 5
.word 0xE1600070 @ SMC
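
To make the difference between the two assembly versions explicit, here is a small Python model of the bit logic only. It is a sketch, not the u-boot code: the ASA and L1NEON bit positions are read off the `orr` instructions above, and the assumption that r1 holds an encoded core revision (with 0x21 meaning r2p1) comes from the comment in the original code. The revision values used in the example (0x10 and 0x32) are illustrative and not taken from the thread.

```python
# Model of the setup_auxcr bit logic before and after the patch (sketch only).
ASA = 1 << 4     # set by "orr r0, r0, #0x10"
L1NEON = 1 << 5  # set by "orr r0, r0, #1 << 5"


def auxcr_original(auxcr: int, rev: int) -> int:
    """Original code: L1NEON is set only when rev < 0x21 (cmp / orrlt)."""
    auxcr |= ASA
    if rev < 0x21:
        auxcr |= L1NEON
    return auxcr


def auxcr_patched(auxcr: int, rev: int) -> int:
    """Patched code: L1NEON is set on every revision; rev is ignored."""
    return auxcr | ASA | L1NEON


for rev in (0x10, 0x32):  # illustrative revision encodings (assumed)
    print(f"rev=0x{rev:02X} original=0x{auxcr_original(0, rev):02X} "
          f"patched=0x{auxcr_patched(0, rev):02X}")
```

On a core whose revision compares greater than or equal to 0x21, the original code leaves L1NEON clear and the patched code sets it, which is the only behavioral change.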


Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 11:01 AM
With the above patch and using the 2011 compiler supplied in the BSP, I was able to compile FFTW and get the following benchmark results at 600MHz:

DM-37x# ./bench orf1024
Problem: orf1024, setup: 22.49 s, time: 60.08 us, ``mflops'': 426.08
DM-37x# ./bench orf2048
Problem: orf2048, setup: 29.00 s, time: 134.23 us, ``mflops'': 419.59
DM-37x# ./bench orf4096
Problem: orf4096, setup: 35.96 s, time: 556.94 us, ``mflops'': 220.64
DM-37x#
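
For readers comparing these numbers: FFTW's bench program reports a nominal "mflops" figure derived from the transform size and the measured time, not a count of actual floating-point operations. As far as I know, the convention is 5 N log2(N) / (time in microseconds) for complex transforms and half that for real transforms. The Python sketch below applies that formula to the times above; the function name fftw_mflops is made up for illustration.

```python
import math

# Sketch: nominal FFTW "mflops" from transform size and time per transform.
# Convention assumed: 5*N*log2(N) flops for complex, half of that for real.

def fftw_mflops(n: int, time_us: float, real: bool = False) -> float:
    flops = 5 * n * math.log2(n)
    if real:
        flops /= 2
    return flops / time_us  # flops per microsecond equals mflops


# Times taken from the 600MHz results above ("orf" = real, forward transform).
for n, time_us in ((1024, 60.08), (2048, 134.23), (4096, 556.94)):
    print(f"orf{n}: {fftw_mflops(n, time_us, real=True):.2f} mflops")
```

The computed values agree with the reported ones to within the rounding of the printed times, so the drop at 4096 reflects a genuinely longer time per transform rather than a change in how the figure is calculated.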
Mike
New Member
Posts: 11
19 Feb 2015 11:21 AM

Thanks Adam.

Just to clarify, the patch you are referring to only enables L1NEON for all platforms (no other change), correct?  It also appears (assuming your patch only enables L1NEON) that the 1024 and 2048 yielded improvement but the 4096 performed worse.  Is that your assessment as well?

We are very curious about the cache not being enabled on initial boot, only being enabled after a sleep/wake cycle.  This seems to have the most significant impact on performance that we have observed.  Have you confirmed this behavior?  We are wondering how/where to patch to resolve this particular issue.

 

Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 11:43 AM
Mike,

I was able to confirm that, in the stock BSP, NEON performance was 3x slower at startup than after returning from sleep.

The patch I posted above is to only enable L1NEON. This seemed to eliminate the performance gap between startup and waking after sleep.

To determine whether the issue is system-wide or NEON-specific, I ran some of the optional benchmark tests that we have available in our BSP. The results of those non-NEON benchmarks did not change before and after sleep, which leads us to believe that the issue is specific to NEON rather than the system as a whole. Are you seeing different results? If so, do you have any register dumps that we can compare?


As far as the FFTW results go, I am not sure how to explain why it drops at 4096, but the 1024 and 2048 results are much improved.

We have a built-in benchmark tool called speedtest-neon which has been my primary focus on verifying performance.

adam



Jared
New Member
Posts: 19
19 Feb 2015 12:05 PM
Hi Adam,

JTAG probing of the microprocessor registers before and after sleep shows a difference in the L2EN bit in the auxiliary control register. This bit is not particularly NEON-specific and should affect the whole system. In the NE10 library, there are non-NEON equivalents to all of the NEON functions. These functions have the same name with the suffix '_c'. Benchmarks with these non-NEON routines show a speed increase as well.

example:

ne10_fft_r2c_1d_float32_neon (fout_c, fin_r, cfg); //force neon usage

ne10_fft_r2c_1d_float32_c (fout_c, fin_r, cfg); //force non-neon version,
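
When comparing register dumps taken before and after sleep, a small decoder can make the changed bits obvious. The Python sketch below is illustrative only. The ASA (bit 4) and L1NEON (bit 5) positions match the u-boot patch posted earlier in this thread; placing L2EN at bit 1 is an assumption that should be checked against the Cortex-A8 TRM. The two register values are made up for the example and are not real dumps from this board.

```python
# Sketch: decode a few Auxiliary Control Register bits and diff two dumps.
# Bit positions: ASA and L1NEON follow the patch in this thread; L2EN = bit 1
# is an ASSUMPTION to verify against the Cortex-A8 TRM.
AUXCR_BITS = {"L2EN": 1, "ASA": 4, "L1NEON": 5}


def decode_auxcr(value: int) -> dict:
    """Map each known bit name to True/False for the given register value."""
    return {name: bool((value >> bit) & 1) for name, bit in AUXCR_BITS.items()}


def changed_bits(before: int, after: int) -> list:
    """Return the names of known bits that differ between two dumps."""
    diff = before ^ after
    return [name for name, bit in AUXCR_BITS.items() if (diff >> bit) & 1]


before_sleep = 0x10  # hypothetical dump: ASA set, L2EN clear
after_sleep = 0x12   # hypothetical dump: ASA and L2EN set
print(decode_auxcr(before_sleep))
print(decode_auxcr(after_sleep))
print("changed:", changed_bits(before_sleep, after_sleep))
```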


-Jared
Jared
New Member
Posts: 19
19 Feb 2015 03:15 PM
Hi Adam,

Just a quick note. If L2EN is off, then setting L1NEON will probably help performance. However, enabling L1NEON while L2EN is also enabled will probably lead to slower performance than just having L2EN enabled on its own. I've been using the example you posted above to set L2EN on boot-up. It is still running through tests; I'll keep you updated as I get the results.

-Jared
Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 03:30 PM
That's great. I'm still waiting to hear back from our main Linux developer to comment on it. If you have something you'd like me to try, I'm willing to do that. I will be out of the office tomorrow afternoon, and I have a few meetings in the morning, so I might not get to it until Monday, but I'm happy to help as I can.

adam
Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 04:04 PM
It looks like u-boot intentionally disables L2 cache.

Inside: rpm/BUILD/u-boot-2011.06/include/configs/omap3logic.h


/* L2 was disabled since observed that large displays (720p) weren't working
* Verify can boot kernel using either XGA or 720p HDMI displays settings */
#define CONFIG_L2_OFF /* Keep L2 Cache Disabled */

I am going to run some tests to see whether enabling L2 cache (by removing this define) helps. I also put in a note to the developer to ask for clarification on it.

adam
Adam Ford
Advanced Member
Posts: 794
19 Feb 2015 04:23 PM
I agree with your assessment that L2 makes a difference, but I am still waiting to hear back on the issues with it being disabled for the purposes of high resolution displays.

With L2 Cache enabled and ondemand governor on a 1GHz module, I get the following results:

DM-37x# ./bench orf1024
Problem: orf1024, setup: 22.10 s, time: 34.75 us, ``mflops'': 736.69
DM-37x# ./bench orf2048
Problem: orf2048, setup: 28.58 s, time: 73.20 us, ``mflops'': 769.45
DM-37x# ./bench orf4096
Problem: orf4096, setup: 35.43 s, time: 155.45 us, ``mflops'': 790.46

DM-37x# ./fftw.sh
c i 1-D powers of two
Problem: ic2, setup: 80.47 ms, time: 143.89 ns, ``mflops'': 69.497
Problem: ic4, setup: 92.41 ms, time: 206.76 ns, ``mflops'': 193.46
Problem: ic8, setup: 80.51 ms, time: 425.60 ns, ``mflops'': 281.96
Problem: ic16, setup: 72.30 ms, time: 1.23 us, ``mflops'': 260.3
Problem: ic32, setup: 2.48 s, time: 1.18 us, ``mflops'': 676.4
Problem: ic64, setup: 4.20 s, time: 2.30 us, ``mflops'': 835.34
Problem: ic128, setup: 8.10 s, time: 6.21 us, ``mflops'': 720.97
Problem: ic256, setup: 11.37 s, time: 13.20 us, ``mflops'': 775.63
Problem: ic512, setup: 16.66 s, time: 24.91 us, ``mflops'': 924.78
Problem: ic1024, setup: 22.04 s, time: 59.12 us, ``mflops'': 865.96
Problem: ic2048, setup: 28.03 s, time: 125.41 us, ``mflops'': 898.2
Mike
New Member
Posts: 11
19 Feb 2015 04:25 PM

Hi Adam,

Good find.  Since we don't need the display we're going to enable L2 cache in u-boot.

Adam Ford
Advanced Member
Posts: 794
20 Feb 2015 07:56 AM
The results above were obtained with u-boot returned to normal and L2 cache enabled, using the 2011 compiler.

With the L1NEON enabled AND L2 Cache enabled I got the following results using the same 1GHz processor with the ondemand governor:

DM-37x# ./bench orf1024
Problem: orf1024, setup: 22.05 s, time: 33.26 us, ``mflops'': 769.7
DM-37x# ./bench orf2048
Problem: orf2048, setup: 28.60 s, time: 69.86 us, ``mflops'': 806.24
DM-37x# ./bench orf4096
Problem: orf4096, setup: 35.44 s, time: 159.27 us, ``mflops'': 771.54
DM-37x#

DM-37x# ./fftw.sh
c i 1-D powers of two
Problem: ic2, setup: 93.41 ms, time: 410.71 ns, ``mflops'': 24.348
Problem: ic4, setup: 100.19 ms, time: 206.74 ns, ``mflops'': 193.48
Problem: ic8, setup: 80.02 ms, time: 421.88 ns, ``mflops'': 284.44
Problem: ic16, setup: 71.17 ms, time: 1.23 us, ``mflops'': 261.1
Problem: ic32, setup: 2.48 s, time: 1.15 us, ``mflops'': 694.97
Problem: ic64, setup: 4.18 s, time: 2.25 us, ``mflops'': 851.9
Problem: ic128, setup: 8.08 s, time: 6.54 us, ``mflops'': 684.86
Problem: ic256, setup: 11.47 s, time: 11.62 us, ``mflops'': 881.01
Problem: ic512, setup: 16.76 s, time: 23.84 us, ``mflops'': 966.37
Problem: ic1024, setup: 22.04 s, time: 57.34 us, ``mflops'': 892.92
Problem: ic2048, setup: 28.13 s, time: 127.79 us, ``mflops'': 881.45

speedtest-neon:
It took 3.188 seconds to perform 100.000 million additions.
That corresponds to 31.373 mflops.

It took 3.180 seconds to perform 100.000 million multiplications.
That corresponds to 31.450 mflops.

It took 5.734 seconds to perform 100.000 million divisions.
That corresponds to 17.439 mflops.


Adam Ford
Advanced Member
Posts: 794
20 Feb 2015 08:27 AM
For comparison with the earlier 600MHz run, these results are at 600MHz with both L1NEON and L2 cache turned on, using the 2011 compiler:

speedtest-neon:
It took 3.164 seconds to perform 100.000 million additions.
That corresponds to 31.605 mflops.

It took 3.242 seconds to perform 100.000 million multiplications.
That corresponds to 30.843 mflops.

It took 8.008 seconds to perform 100.000 million divisions.
That corresponds to 12.488 mflops.

DM-37x# ./bench orf1024
Problem: orf1024, setup: 22.13 s, time: 54.71 us, ``mflops'': 467.88
DM-37x# ./bench orf2048
Problem: orf2048, setup: 28.46 s, time: 116.11 us, ``mflops'': 485.06
DM-37x# ./bench orf4096
Problem: orf4096, setup: 35.30 s, time: 267.02 us, ``mflops'': 460.2
DM-37x#

DM-37x# ./fftw.sh
c i 1-D powers of two
Problem: ic2, setup: 90.79 ms, time: 205.35 ns, ``mflops'': 48.697
Problem: ic4, setup: 84.99 ms, time: 345.52 ns, ``mflops'': 115.77
Problem: ic8, setup: 86.52 ms, time: 704.10 ns, ``mflops'': 170.43
Problem: ic16, setup: 90.36 ms, time: 2.04 us, ``mflops'': 156.76
Problem: ic32, setup: 2.53 s, time: 1.92 us, ``mflops'': 416.18
Problem: ic64, setup: 4.30 s, time: 3.76 us, ``mflops'': 511.3
Problem: ic128, setup: 8.20 s, time: 10.07 us, ``mflops'': 444.79
Problem: ic256, setup: 11.72 s, time: 20.15 us, ``mflops'': 508.28
Problem: ic512, setup: 16.88 s, time: 39.70 us, ``mflops'': 580.42
Problem: ic1024, setup: 22.50 s, time: 89.41 us, ``mflops'': 572.67
Problem: ic2048, setup: 28.42 s, time: 212.67 us, ``mflops'': 529.64
Problem: ic4096, setup: 35.97 s, time: 559.81 us, ``mflops'': 439
Problem: ic8192, setup: 43.00 s, time: 1.27 ms, ``mflops'': 420.43
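
One way to read the 600MHz and 1GHz results side by side is to compare the measured speedup with the clock ratio. The Python sketch below does that for the three orf benchmarks, using only figures posted in this thread for the configuration with both L1NEON and L2 cache enabled. It assumes the ondemand governor was actually running at the full 1GHz during the faster run, which the thread does not confirm.

```python
# Sketch: compare measured FFTW speedup with the CPU clock ratio.
# Figures are the posted "mflops" values with L1NEON and L2 cache enabled.
CLOCK_RATIO = 1000 / 600  # 1GHz module versus 600MHz run

mflops_600 = {"orf1024": 467.88, "orf2048": 485.06, "orf4096": 460.2}
mflops_1000 = {"orf1024": 769.7, "orf2048": 806.24, "orf4096": 771.54}

speedups = {name: mflops_1000[name] / mflops_600[name] for name in mflops_600}

for name, ratio in speedups.items():
    print(f"{name}: speedup {ratio:.2f}x (clock ratio {CLOCK_RATIO:.2f}x)")
```

A speedup close to the clock ratio would suggest that, with L2 enabled, these transforms are limited mostly by the CPU clock and not by memory access.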
Adam Ford
Advanced Member
Posts: 794
20 Feb 2015 08:46 AM
Lastly, I ran ltib -c to configure the system to build for performance. I was still using the 2011 compiler, and the results are comparable to the ones above that used the ondemand governor.

Let me know if you need anything further on this. I am still waiting to hear back on the memory timings before I can do the temperature testing.

adam


