Help: Inline assembly (SSE) slowdown

  • Thread starter Dissident Dan
  • Tags
    Assembly
In summary, the inline assembly is not using the addresses that I explicitly declare in my C++ code. I am having a hard time getting the assembly to accept addresses from my C++ code from anything other than explicitly declared pointers (not even arrays work!). I end up putting extra variables on the stack so I can get pointers to my data, and I believe that this is causing a performance penalty (CPU time). Is there any way that I can get the inline assembly to use the needed addresses without the extra Vector3 pointer variables? (And if you have any other optimization suggestions, I'll be glad to hear them. :biggrin: )
  • #1
Dissident Dan
I'm writing a program in C++ in MS Visual Studio.NET, and I am using inline assembly to do some SSE instructions.

The trouble is that I am having a hard time getting the assembly to accept addresses from my C++ code from anything other than explicitly declared pointers (not even arrays work!). I end up putting extra variables on the stack so I can get pointers to my data, and I believe that this is causing a performance penalty (CPU time).

My code is this:
Code:
	inline Vector3 operator +(Vector3 &v) const {
		float result[4];
		Vector3 *vout = (Vector3 *)result;
		Vector3 *vin = &v;
		_asm {
			mov esi, vin;
			mov edi, vout;
			mov eax, this;
			movups xmm0, [eax];
			movups xmm1, [esi];
			addps xmm0, xmm1;
			movups [edi], xmm0;
		}
		return *(Vector3 *)result;
	}

Is there any way that I can get the inline assembly to use the needed addresses without the extra Vector3 pointer variables? (And if you have any other optimization suggestions, I'll be glad to hear them. :biggrin: )
 
  • #2
GCC has a compiler option for optimizing for SSE. Perhaps VS.NET has the same thing.

The base address of the floating-point array should be loaded into a 32-bit register like eax. To load the first four floats (array[0] through array[3]) into an SSE register:

movups xmm0, [eax];

To load the next four floats into another SSE register you would do something like this:

movups xmm1,[eax+0x10];

The 0x10 is the byte offset in hex: 16 bytes, or four 4-byte floats. The offset depends on how many bytes each array element takes up. You can find out by using the sizeof operator.

If you don't want the temp variable you need to do something like this:

mov eax, v->array;

You need the base address of the floating point array, not the base address of the Vector3 structure variable.
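The offset arithmetic above can be sketched in C++ as follows. This is only an illustration, assuming 4-byte floats; the names byteOffset and loadSecondGroup are hypothetical and not from the thread. It uses the standard SSE intrinsics from xmmintrin.h in place of inline assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_storeu_ps

// Byte offset of element `index` in a float array: index * sizeof(float).
constexpr std::size_t byteOffset(std::size_t index) {
    return index * sizeof(float);
}

// Assumes 4-byte floats: 0x10 is 16 bytes, i.e. four floats past the base.
static_assert(byteOffset(4) == 0x10, "0x10 skips four 4-byte floats");
static_assert(byteOffset(1) == 0x04, "array[1] starts 4 bytes past the base");

// Copies floats [4..7] of `base` into out[0..3]. The load reads the same
// address as `movups xmm1, [eax+0x10]` would if eax held `base`.
inline void loadSecondGroup(const float* base, float* out) {
    assert(base != nullptr && out != nullptr);
    __m128 group = _mm_loadu_ps(base + 4);  // pointer arithmetic scales by sizeof(float)
    _mm_storeu_ps(out, group);
}
```

Calling loadSecondGroup on an eight-element array copies elements 4 through 7, which shows that an offset of 0x10 skips a whole group of four floats rather than a single element.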
 
  • #3
Are you actually looking at the compiled output? Or just guessing there's a performance penalty? Honestly, I strongly doubt that pushing one address on the stack is a big deal compared to the SSE instructions themselves. You don't really need to move the addresses into registers, either, I don't think. Turn on optimizations and check the compiler's output.
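One way to leave the address handling to the compiler, as suggested above, is to use SSE intrinsics instead of inline assembly. The sketch below is not the original poster's class: it assumes a hypothetical Vector3 padded to four floats so that a 16-byte load or store never runs past the end of the object. The thread does not show the real Vector3 layout.

```cpp
#include <xmmintrin.h>  // _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical layout (an assumption): x, y, z plus one padding float,
// so that 16-byte SSE loads and stores stay inside the object.
struct Vector3 {
    float f[4];

    Vector3 operator+(const Vector3& v) const {
        // Unaligned loads, matching the movups instructions in the original code.
        __m128 a = _mm_loadu_ps(f);
        __m128 b = _mm_loadu_ps(v.f);
        Vector3 result;
        _mm_storeu_ps(result.f, _mm_add_ps(a, b));  // addps, then movups
        return result;
    }
};

static_assert(sizeof(Vector3) == 16, "padded to exactly four floats");
```

With this form the compiler picks the registers and addressing modes itself, so no extra pointer variables are needed in the C++ code. Whether it is actually faster should still be checked in the compiler's output.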

- Warren
 
  • #4
I would definitely turn on SSE optimizations and check the assembly produced by the compiler to see if it is actually using SSE instructions.

I do believe you need to put the address in a cpu register. You can't just do:

movups xmm0,[v];

And you couldn't do the following because the compiler doesn't know the value of v at compile time.

movups xmm0, *v;
 
  • #5
Normally, you get the code working totally correctly first.
Then you worry about optimizations.

Next, if and only if it's running too slowly, use a profiler to confirm or dismiss your optimization fears. I don't know if MS provides one for the .NET environment or not.
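If no profiler is at hand, a rough timing harness can at least show whether a change makes a measurable difference. The following is a minimal sketch using std::chrono from the standard library; timeMilliseconds is a hypothetical helper name, and a real profiler will give far more detail.

```cpp
#include <chrono>
#include <cstddef>

// Runs `work` `iterations` times and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Work>
double timeMilliseconds(Work work, std::size_t iterations) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

The work being timed should write its result somewhere the optimizer cannot discard (a volatile variable, for instance), otherwise an optimized build may remove the loop body entirely and the measurement means nothing.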

There is nothing worse than misguided optimization to obfuscate and break code.
 
  • #6
dduardo said:
I do believe you need to put the address in a cpu register. You can't just do:

movups xmm0,[v];

That is correct, I got errors when I tried it without putting it in a register first.

I read something somewhere that you could use the address of your C/C++ variable in the assembly by putting an underscore before the C/C++ variable name, but this doesn't work in VS.NET (It tells me that it's an undeclared symbol). Is there any way you know of to tell the assembly to use the address of the variable without making a pointer to the variable in the C/C++ code?
 
  • #7
I think the main problem is that the compiler doesn't (can't) know where 'result' is at compile time, because it lives in stack memory. Where it actually lands is therefore calculated by the two moves using fancy addressing-mode tricks (like addition). Try declaring 'result' as static (assuming you don't need multithreading, blah blah blah). The additional benefit is that you won't have to return result by copy either. As for vin, I'll bet that the first move is only there for typing, and if you check the optimized assembly output of the compiler it will be gone.

"Premature optimization is the root of all evil" - Donald Knuth
but playing with optimized assembly is fun...
 

FAQ: Help: Inline assembly (SSE) slowdown

1. How can I use inline assembly with SSE instructions to improve performance?

Using inline assembly with SSE instructions allows for efficient parallel processing of data, which can greatly improve performance. This is especially useful for tasks involving large amounts of data, such as image or video processing.

2. What is the difference between SSE and traditional assembly instructions?

SSE (Streaming SIMD Extensions) instructions are specifically designed for parallel processing and can operate on multiple data elements at once (for example, four single-precision floats in one 128-bit register), whereas traditional scalar instructions operate on one data element at a time. This makes SSE instructions more efficient for tasks involving large data sets.

3. Do I need to have extensive knowledge of assembly to use SSE instructions?

While some knowledge of assembly language is helpful, it is not necessary to have extensive knowledge to use SSE instructions. Most modern compilers have built-in support for SSE instructions, and languages such as C and C++ expose them through compiler intrinsics, so you can use them without writing assembly code directly.

4. Are there any drawbacks to using SSE instructions?

One potential drawback of using SSE instructions is that they are not portable across different processor architectures. This means that code written using SSE instructions may not work on all types of processors. Additionally, SSE instructions are not suitable for all types of tasks and may not always result in a noticeable performance improvement.
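A common way to limit the portability problem is to guard the SSE path with a compile-time check and keep a plain scalar fallback. The sketch below is only an assumption about how that might look: EXAMPLE_HAVE_SSE and add4 are hypothetical names, while __SSE__, _M_X64 and _M_IX86_FP are the predefined macros that GCC/Clang and MSVC use to indicate SSE availability.

```cpp
#include <cstddef>

// Detect SSE at compile time: __SSE__ on GCC/Clang, _M_X64 or
// _M_IX86_FP >= 1 on MSVC. EXAMPLE_HAVE_SSE is a hypothetical macro name.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1
#else
#define EXAMPLE_HAVE_SSE 0
#endif

// Adds four floats element-wise: out[i] = a[i] + b[i] for i in 0..3.
inline void add4(const float* a, const float* b, float* out) {
#if EXAMPLE_HAVE_SSE
    // SSE path: one unaligned load per input, one packed add, one store.
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

Both paths produce the same result for ordinary inputs, so callers do not need to know which one was compiled in.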

5. How can I ensure that my code using SSE instructions is optimized for performance?

To ensure optimal performance when using SSE instructions, it is important to carefully consider the design of your algorithm and the data structures you are using. It is also important to regularly test and benchmark your code to identify any potential bottlenecks or areas for improvement.
