# 12-04-2008

Just figured out what was wrong with my dot product calculation. I was multiplying two 16-bit integers together and the result was rolling over. A simple cast to int32 solved the problem. Onward!
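For illustration, here is a minimal sketch of that kind of rollover, assuming the product was being kept in a 16-bit value. The helper names `mul_narrow` and `mul_wide` are hypothetical and not from the original code.

```c
#include <stdint.h>

/* Hypothetical helper: the product is squeezed back into 16 bits,
   so anything above 32767 loses its upper bits. Going through
   uint16_t makes the truncation explicit (modulo 65536). */
static int16_t mul_narrow(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((int32_t)a * (int32_t)b);
}

/* Hypothetical helper: widen both operands first, so the full
   32-bit product is kept. */
static int32_t mul_wide(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

With `a = b = 300` the true product is 90000, which does not fit in 16 bits. `mul_wide` returns 90000, while `mul_narrow` returns 24464 (90000 modulo 65536).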

Update: 4:14 PM
Here’s an efficient implementation of the dot product that I wrote for the ARMv5E:

```
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
; Each 32-bit word holds one I/Q pair. Assuming little-endian data,
; I sits in the bottom halfword and Q in the top halfword.
dot_product
    STMDB   SP!, {R4-R11, LR}   ; Save registers to the stack
next_sample
    LDMIA   R1!, {R5, R6, R7}   ; Load three I/Q pairs from vector1
    LDMIA   R2!, {R8, R9, R10}  ; Load three I/Q pairs from vector2
    SMULTT  R11, R5, R8         ; temp = Q1*Q2 (top halfwords)
    SMLABB  R11, R5, R8, R11    ; temp += I1*I2 (bottom halfwords)
    SMULTT  R12, R6, R9         ; Repeat for the other two pairs
    SMLABB  R12, R6, R9, R12
    SMULTT  R14, R7, R10
    SMLABB  R14, R7, R10, R14
    STMIA   R0!, {R11, R12, R14} ; Save the three dot products
    SUBS    R3, R3, #3          ; Subtract 3 from N
    BGT     next_sample         ; Branch if N > 0
    LDMIA   SP!, {R4-R11, PC}   ; Restore registers and return
```

This code is functionally identical to the following C code, except that the assembly version requires N, the number of I/Q sample pairs, to be a multiple of 3.

```
void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
{
    int16 I1, Q1, I2, Q2;

    for(;N--;)
    {
        I1 = *vector1++;
        Q1 = *vector1++;
        I2 = *vector2++;
        Q2 = *vector2++;

        *product++ = I1*I2 + Q1*Q2;
    }
}
```
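The assembly gets its speed from treating each I/Q pair as one packed 32-bit word: `SMULTT` multiplies the two top halfwords and `SMLABB` multiplies the two bottom halfwords and adds the result to the accumulator. Below is a rough C model of that pair of instructions, assuming little-endian data (I in the bottom halfword, Q in the top). The names `sext16`, `pack_iq` and `packed_dot` are hypothetical helpers for this sketch only.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16(uint32_t x)
{
    x &= 0xFFFFu;
    return (x & 0x8000u) ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Pack one I/Q pair the way a little-endian word load would see it:
   I in bits 15:0, Q in bits 31:16 (an assumption of this sketch). */
static uint32_t pack_iq(int16_t i, int16_t q)
{
    return (uint32_t)(uint16_t)i | ((uint32_t)(uint16_t)q << 16);
}

/* Model of SMULTT followed by SMLABB on two packed words.
   The single corner case where all four halfwords are -32768
   overflows 32 bits and is not handled here. */
static int32_t packed_dot(uint32_t w1, uint32_t w2)
{
    int32_t acc = sext16(w1 >> 16) * sext16(w2 >> 16); /* top * top       */
    acc += sext16(w1) * sext16(w2);                    /* bottom * bottom */
    return acc;
}
```

For example, `packed_dot(pack_iq(3, 4), pack_iq(5, 6))` gives 3*5 + 4*6 = 39, the same value the C loop above computes for one sample pair.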

The assembly version is about 25% faster. If your number of samples is not a multiple of 3, either use an oversized buffer or use the following code for arbitrary N:

```
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
    STMDB   SP!, {R4-R6}        ; Save registers to the stack
next_sample
    LDR     R5, [R1], #4        ; Load one I/Q pair from vector1
    LDR     R6, [R2], #4        ; Load one I/Q pair from vector2
    SUBS    R3, R3, #1          ; Subtract 1 from N
    SMULTT  R12, R5, R6         ; temp = product of the top halfwords
    SMLABB  R12, R5, R6, R12    ; temp += product of the bottom halfwords
    STR     R12, [R0], #4       ; Save the dot product (result is in R12)
    BGT     next_sample         ; Branch if N > 0
    LDMIA   SP!, {R4-R6}        ; Restore registers
    BX      R14                 ; Return
```

This version is about 15% faster.
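For the oversized-buffer option with the unrolled routine, the buffers need room for N rounded up to the next multiple of 3. A small sketch of that sizing calculation follows; `round_up_to_3` is a hypothetical helper, not part of the original code.

```c
#include <stdint.h>

/* Hypothetical helper: round a sample-pair count up to the next
   multiple of 3. Assumes n is small enough that n + 2 does not
   wrap around a 32-bit unsigned value. */
static uint32_t round_up_to_3(uint32_t n)
{
    return ((n + 2u) / 3u) * 3u;
}
```

With a padded count P = `round_up_to_3(N)`, each input vector would need 2*P int16 entries and the product buffer P int32 entries. The extra outputs past N are simply ignored.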

Update: 5:59 PM
Well, my processing time is up to 2.8 ms with IQ demod and dot product computation. It’s going to be tight!