Just figured out what was wrong with my dot product calculation. I was multiplying two 16-bit integers together and the result was rolling over. A simple cast to int32 and the problem was solved. Onward!
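A minimal sketch of this kind of rollover, assuming the product ended up narrowed to 16 bits somewhere along the way (the original buggy line isn't shown, and mul_narrow and mul_wide are made-up names for illustration):

```c
#include <stdint.h>

/* Hypothetical: the product is squeezed back into 16 bits, so large
 * results are lost. Converting an out-of-range value to int16_t is
 * implementation-defined; common compilers wrap modulo 65536. */
static int16_t mul_narrow(int16_t a, int16_t b)
{
    return (int16_t)((int32_t)a * (int32_t)b);
}

/* Hypothetical: widen first, then multiply and keep all 32 bits.
 * The product of two int16 values always fits in an int32. */
static int32_t mul_wide(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

With a = b = 300 the true product is 90000, which mul_wide returns intact; mul_narrow cannot represent it in 16 bits and typically comes back with 24464 (90000 mod 65536).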
Update: 4:14 PM
Here’s an efficient implementation of the dot product that I wrote for the ARMv5E:
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
STMDB SP!, {R4-R11, LR} ; Save registers to the stack
next_sample
LDMIA R1!, {R5, R6, R7} ; Load three vectors from vector 1
LDMIA R2!, {R8, R9, R10} ; Load three vectors from vector 2
SMULTT R11, R5, R8 ; temp = Q*Q (top halfwords, assuming little-endian)
SMLABB R11, R5, R8, R11 ; temp += I*I (bottom halfwords)
SMULTT R12, R6, R9 ; Repeat for the other samples
SMLABB R12, R6, R9, R12
SMULTT R14, R7, R10
SMLABB R14, R7, R10, R14
STMIA R0!, {R11, R12, R14}; Save the dot products
SUBS R3, R3, #3 ; Subtract 3 from N
BGT next_sample ; Branch if N > 0
LDMIA SP!, {R4-R11, PC} ; Restore registers and return
This code is functionally identical to the following C code, except that the assembly version expects the number of vectors (N) to be a nonzero multiple of 3.
void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
{
int16 I1, Q1, I2, Q2;
for(;N--;)
{
I1 = *vector1++;
Q1 = *vector1++;
I2 = *vector2++;
Q2 = *vector2++;
*product++ = I1*I2 + Q1*Q2;
}
}
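To make the SMULTT/SMLABB pair in the assembly more concrete, here is a rough C model of what the two instructions do to one packed 32-bit word. It assumes a little-endian layout (I in the bottom halfword, Q in the top), and sext16, pack_iq, smultt, smlabb and dot_packed are hypothetical names for illustration, not real intrinsics:

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
static int32_t sext16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Pack one I/Q pair the way a little-endian 32-bit load would see it:
 * I (first in memory) in the bottom halfword, Q in the top halfword. */
static uint32_t pack_iq(int16_t i, int16_t q)
{
    return (uint32_t)(uint16_t)i | ((uint32_t)(uint16_t)q << 16);
}

/* Model of SMULTT: top halfword of a times top halfword of b. */
static int32_t smultt(uint32_t a, uint32_t b)
{
    return sext16(a >> 16) * sext16(b >> 16);
}

/* Model of SMLABB: bottom halfword of a times bottom halfword of b,
 * added to acc. The one corner case where the sum overflows 32 bits
 * (all four inputs equal to -32768) is ignored in this sketch. */
static int32_t smlabb(uint32_t a, uint32_t b, int32_t acc)
{
    return sext16(a) * sext16(b) + acc;
}

/* One dot product, in the same order as the assembly loop body. */
static int32_t dot_packed(uint32_t w1, uint32_t w2)
{
    return smlabb(w1, w2, smultt(w1, w2));
}
```

For example, dot_packed(pack_iq(3, 4), pack_iq(5, 6)) works out to 3*5 + 4*6 = 39, the same value the C loop above would store for that sample.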
The assembly version is about 25% faster. Use an oversized buffer if your number of samples is not evenly divisible by 3, or use the following code for arbitrary N:
; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
STMDB SP!, {R4-R6} ; Save registers to the stack
next_sample
LDR R5, [R1], #4 ; Load a vector from vector 1
LDR R6, [R2], #4 ; Load a vector from vector 2
SUBS R3, R3, #1 ; Subtract 1 from N
SMULTT R12, R5, R6 ; temp = Q*Q (top halfwords, assuming little-endian)
SMLABB R12, R5, R6, R12 ; temp += I*I (bottom halfwords)
STR R12, [R0], #4 ; Save the dot product
BGT next_sample ; Branch if N > 0
LDMIA SP!, {R4-R6} ; Restore registers
BX R14 ; Return
This version is about 15% faster.
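For the three-at-a-time version, one way to size the oversized buffers is to round N up to the next multiple of 3. A small sketch in C, where round_up_to_3 and padded_bytes are hypothetical helper names and not part of the code above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest multiple of 3 that is >= n.
 * Assumes n is small enough that n + 2 does not wrap around. */
static uint32_t round_up_to_3(uint32_t n)
{
    return ((n + 2u) / 3u) * 3u;
}

/* Hypothetical helper: bytes needed for one padded buffer.
 * Each input sample is an I/Q pair of int16 values (4 bytes) and each
 * output is one int32 (also 4 bytes), so all three buffers are the
 * same size. */
static size_t padded_bytes(uint32_t n)
{
    return (size_t)round_up_to_3(n) * sizeof(int32_t);
}
```

The unrolled routine would then be called with round_up_to_3(N) instead of N; the one or two extra outputs at the end are scratch and can be ignored.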
Update: 5:59 PM
Well, my processing time is up to 2.8 ms with IQ demod and dot product computation. It’s going to be tight!