Just figured out what was wrong with my dot product calculation. I was multiplying two 16-bit integers together and the result was overflowing. A simple cast to int32 and the problem was solved. Onward!
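For anyone chasing the same kind of bug, here is a minimal sketch of it in C. The function names `mul_narrow` and `mul_wide` are made up for illustration, and the sketch assumes the usual compiler behavior (GCC, Clang) where narrowing an out-of-range value to 16 bits wraps modulo 2^16. It is not the actual code from this project.

```c
#include <stdint.h>

/* Buggy variant: the product is squeezed back into 16 bits, so any
 * result outside -32768..32767 wraps (implementation-defined in C,
 * modulo 2^16 on GCC and Clang). */
static int16_t mul_narrow(int16_t a, int16_t b)
{
    return (int16_t)(a * b);
}

/* Fixed variant: widen the operands first and keep the full 32-bit
 * product. The largest magnitude, (-32768) * (-32768) = 2^30, fits. */
static int32_t mul_wide(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

With inputs of 300 and 300, the wide version returns 90000, while the narrow version returns 24464 on such compilers, because 90000 does not fit in 16 bits.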

*Update: 4:14 PM*

Here’s an efficient implementation of the dot product that I wrote for the ARMv5E:

```
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
STMDB SP!, {R4-R11, LR} ; Save registers to the stack
next_sample
LDMIA R1!, {R5, R6, R7} ; Load three I/Q pairs from vector1
LDMIA R2!, {R8, R9, R10} ; Load three I/Q pairs from vector2
SMULTT R11, R5, R8 ; temp = I*I
SMLABB R11, R5, R8, R11 ; temp += Q*Q
SMULTT R12, R6, R9 ; Repeat for the other samples
SMLABB R12, R6, R9, R12
SMULTT R14, R7, R10
SMLABB R14, R7, R10, R14
STMIA R0!, {R11, R12, R14}; Save the dot products
SUBS R3, R3, #3 ; Subtract 3 from N
BGT next_sample ; Branch if N > 0
LDMIA SP!, {R4-R11, PC} ; Restore registers and return
```

This code is functionally identical to the following C code, except that the assembly version requires the number of vectors to be evenly divisible by 3.

```
void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
{
    int16 I1, Q1, I2, Q2;

    while (N--)
    {
        I1 = *vector1++;
        Q1 = *vector1++;
        I2 = *vector2++;
        Q2 = *vector2++;
        *product++ = I1*I2 + Q1*Q2;
    }
}
```
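To see how the `SMULTT`/`SMLABB` pair in the assembly lines up with the C loop, here is a rough C model of what happens to one packed 32-bit word from each input. The helper names (`sext16`, `smultt`, `smlabb`, `pack_iq`, `dot_packed`) are hypothetical, and `pack_iq` assumes a little-endian load, where the first 16-bit sample in memory lands in the bottom half of the register. Since both halves get multiplied and summed, the result does not depend on which half holds I and which holds Q.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16(uint32_t x)
{
    x &= 0xFFFFu;
    return (x & 0x8000u) ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Model of SMULTT: signed product of the top halves of a and b. */
static int32_t smultt(uint32_t a, uint32_t b)
{
    return sext16(a >> 16) * sext16(b >> 16);
}

/* Model of SMLABB: signed product of the bottom halves, plus acc.
 * (The overflow corner case, where the hardware sets the Q flag,
 * is ignored in this sketch.) */
static int32_t smlabb(uint32_t a, uint32_t b, int32_t acc)
{
    return sext16(a) * sext16(b) + acc;
}

/* Pack one I/Q pair the way a little-endian 32-bit load would see it:
 * I in the bottom half, Q in the top half. */
static uint32_t pack_iq(int16_t i, int16_t q)
{
    return ((uint32_t)(uint16_t)q << 16) | (uint32_t)(uint16_t)i;
}

/* One dot product, using the same two-instruction sequence as the loop. */
static int32_t dot_packed(uint32_t w1, uint32_t w2)
{
    int32_t temp = smultt(w1, w2);  /* top * top       */
    return smlabb(w1, w2, temp);    /* + bottom * bottom */
}
```

For example, `dot_packed(pack_iq(3, -4), pack_iq(5, 6))` works out to 3*5 + (-4)*6, matching what the C loop computes for the same samples.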

The assembly version is about 25% faster. If your number of samples is not evenly divisible by 3, either use an oversized buffer or use the following code for arbitrary N:

```
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
STMDB SP!, {R4-R6} ; Save registers to the stack
next_sample
LDR R5, [R1], #4 ; Load one I/Q pair from vector1
LDR R6, [R2], #4 ; Load one I/Q pair from vector2
SUBS R3, R3, #1 ; Subtract 1 from N
SMULTT R12, R5, R6 ; temp = I*I
SMLABB R12, R5, R6, R12 ; temp += Q*Q
STR R12, [R0], #4 ; Save the dot product
BGT next_sample ; Branch if N > 0
LDMIA SP!, {R4-R6} ; Restore registers
BX R14 ; Return
```

This version is about 15% faster.
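If you would rather keep the three-at-a-time version and go the oversized-buffer route, the buffer sizes have to be rounded up to a multiple of 3. A minimal sketch of that rounding, with `padded_count` as a made-up helper name:

```c
#include <stdint.h>

/* Round a vector count up to the next multiple of 3, so the
 * three-at-a-time loop never reads or writes past the buffers.
 * (Assumes n is small enough that n + 2 does not wrap.) */
static uint32_t padded_count(uint32_t n)
{
    return (n + 2u) / 3u * 3u;
}
```

Under this scheme the two input buffers would each hold `2 * padded_count(N)` int16 values and the output buffer `padded_count(N)` int32 values, with the padded count passed as N. Note that both assembly loops run their body once before testing the counter, so it is probably safest to skip the call entirely when N is 0.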

*Update: 5:59 PM*

Well, my processing time is up to 2.8 ms with IQ demod and dot product computation. It’s going to be tight!