12-04-2008

Just figured out what was wrong with my dot product calculation. I was multiplying two 16-bit integers together and the result was rolling over. A simple cast to int32 and the problem was solved. Onward!
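For the record, here is a minimal sketch of that kind of rollover. The helper names mul_low16 and mul_wide are made up for illustration and are not from my actual code; the first mimics a product that only keeps its low 16 bits, and the second widens the operands before multiplying.

```c
#include <stdint.h>

/* Hypothetical helper: keeps only the low 16 bits of the product,
   mimicking a result that rolls over in a 16-bit variable. */
static uint16_t mul_low16(int16_t a, int16_t b)
{
    int32_t full = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)full & 0xFFFFu);
}

/* Hypothetical helper: widen to 32 bits first so the full product survives. */
static int32_t mul_wide(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

With both inputs at 300, the true product is 90000, which does not fit in 16 bits, so mul_low16 gives back 24464 while mul_wide gives the full 90000.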

Update: 4:14 PM
Here’s an efficient implementation of the dot product that I wrote for the ARMv5E:

; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
        STMDB   SP!, {R4-R11, LR}   ; Save registers to the stack
next_sample
        LDMIA   R1!, {R5, R6, R7}   ; Load three I/Q samples from vector 1
        LDMIA   R2!, {R8, R9, R10}  ; Load three I/Q samples from vector 2
        SMULTT  R11, R5, R8         ; temp = Q*Q (top halves, little-endian)
        SMLABB  R11, R5, R8, R11    ; temp += I*I (bottom halves)
        SMULTT  R12, R6, R9         ; Repeat for the other samples
        SMLABB  R12, R6, R9, R12
        SMULTT  R14, R7, R10
        SMLABB  R14, R7, R10, R14
        STMIA   R0!, {R11, R12, R14}; Save the dot products
        SUBS    R3, R3, #3          ; Subtract 3 from N
        BGT     next_sample         ; Branch if N > 0
        LDMIA   SP!, {R4-R11, PC}   ; Restore registers and return

This code is functionally equivalent to the following C code, except that the assembly version requires the number of samples to be evenly divisible by 3.

void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
{
    int16 I1, Q1, I2, Q2;

    for(;N--;)
    {
        I1 = *vector1++;
        Q1 = *vector1++;
        I2 = *vector2++;
        Q2 = *vector2++;

        *product++ = I1*I2 + Q1*Q2;
    }
}
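To see how the packed multiplies in the assembly line up with this loop, here is a rough C model of SMULTT and SMLABB. It assumes a little-endian layout, so the first int16 of each pair (I) sits in the bottom halfword and the second (Q) in the top. The names sext16, pack_iq and dot_packed are made up for illustration, and the model ignores the one corner case where the 32-bit sum would overflow.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v without relying on narrowing casts. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Model of SMULTT: signed product of the two top halfwords. */
static int32_t smultt(uint32_t x, uint32_t y)
{
    return sext16(x >> 16) * sext16(y >> 16);
}

/* Model of SMLABB: acc plus the signed product of the two bottom halfwords.
   Assumes the sum fits in 32 bits. */
static int32_t smlabb(uint32_t x, uint32_t y, int32_t acc)
{
    return acc + sext16(x) * sext16(y);
}

/* Hypothetical helper: pack one I/Q pair the way a little-endian
   32-bit load would see two consecutive int16 values. */
static uint32_t pack_iq(int16_t i, int16_t q)
{
    return ((uint32_t)(uint16_t)q << 16) | (uint32_t)(uint16_t)i;
}

/* One dot product, mirroring the SMULTT/SMLABB pair in the assembly. */
static int32_t dot_packed(uint32_t s1, uint32_t s2)
{
    int32_t temp = smultt(s1, s2);   /* top halves */
    return smlabb(s1, s2, temp);     /* plus bottom halves */
}
```

For example, dot_packed(pack_iq(3, 4), pack_iq(5, 6)) works out to 3*5 + 4*6 = 39, the same value the C loop writes to *product.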

The assembly version is about 25% faster. Use an oversized buffer if your number of samples is not evenly divisible by 3, or use the following code for arbitrary N:

; void dot_product(int32 *product, int16 *vector1, int16 *vector2, uint32 N)
dot_product
        STMDB   SP!, {R4-R6}        ; Save registers to the stack
next_sample
        LDR     R5, [R1], #4        ; Load one I/Q sample from vector 1
        LDR     R6, [R2], #4        ; Load one I/Q sample from vector 2
        SUBS    R3, R3, #1          ; Subtract 1 from N
        SMULTT  R12, R5, R6         ; temp = Q*Q (top halves, little-endian)
        SMLABB  R12, R5, R6, R12    ; temp += I*I (bottom halves)
        STR     R12, [R0], #4       ; Save the dot product
        BGT     next_sample         ; Branch if N > 0
        LDMIA   SP!, {R4-R6}        ; Restore registers
        BX      R14                 ; Return

This version is about 15% faster than the C code.
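If you go the oversized-buffer route with the three-at-a-time version instead, the buffers need room for N rounded up to the next multiple of 3. A minimal sketch of that rounding follows; the name round_up_to_3 is made up for illustration, and it assumes N is nowhere near the top of the uint32 range.

```c
#include <stdint.h>

/* Hypothetical helper: smallest multiple of 3 that is >= n.
   Assumes n + 2 does not overflow a uint32. */
static uint32_t round_up_to_3(uint32_t n)
{
    return ((n + 2u) / 3u) * 3u;
}
```

So for N = 100 samples, both input vectors and the product buffer would be sized for 102 samples, and the last two results are simply ignored.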

Update: 5:59 PM
Well, my processing time is up to 2.8ms with IQ demod and dot product computation. It’s going to be tight!