### Arduino FHT Library

The Fast Hartley Transform (FHT) is a function that converts time domain signals to the frequency domain. This is exactly what the more well known FFT does, but it is specifically designed for real data, whereas the FFT operates on complex data. As a result, the FHT uses half as much processing power, and half as much memory. You can have your cake and eat it too!

The output of the FHT is slightly different than the FFT. It is as follows:

- Re(k) = (x(k) + x(-k))/2
- Im(k) = (x(k) - x(-k))/2
Mag(k) = sqrt((x(k)

^{2}+ x(-k)^{2})/2)- Phase(k) = atan2((x(k) - x(-k))/(x(k) + x(-k)))

where *k* is the *k*th bin, and *-k* is the *N-k*th bin for an FHT of size N. The 0 bin is repeated twice. atan2 is the arctangent, but through a special function that takes care of the edge cases.

#### About the Arduino FHT Library

The Arduino FHT library is a fast implementation of a standard FHT algorithm which operates on only real data. It can give you up to 256 frequency bins at 16b depth, at a minimum of ~3ms update rate. It is adjustable from 16 to 256 bins, and has several output methods to suit various needs. It can be set to 16b linear, 8b linear, 8b logarithmic, or 8b octave output. All of these different modes are detailed in the read_me file (inside the FFT library folder). The noise floor is slightly better on the FHT in comparison to the FFT, and is around -78dB. When using the onboard ADC, the ADC's noise floor is slightly higher than the FHT's noise floor, giving somewhere between a 9b and 10b SNR (-55dB).

##### Speed characteristics

Function |
run |
reorder |
window |
lin |
lin8 |
log |
oct |

N |
(ms) |
(us) |
(us) |
(us) |
(us)* |
(us) |
(us) |

256 |
3.18 |
363 |
576 |
566 |
441 |
508 |
425 |

128 |
1.30 |
123 |
288 |
287 |
220 |
258 |
223 |

64 |
0.51 |
61 |
144 |
148 |
111 |
130 |
119 |

32 |
0.18 |
27 |
72 |
77 |
56 |
66 |
67 |

16 |
0.12 |
13 |
36 |
44 |
28 |
39 |
39 |

* Note: the lin8 values are approximate, as they vary a small amount due to SCALE factor. See #define section of the read_me for more detials.

##### Memory characteristics

Function |
run |
reorder |
window |
lin |
lin8 |
log |

N |
S/F(B) |
F(B) |
F(B) |
S/F(B) |
S/F(B) |
S/F(B) |

256 |
512/476 |
120 |
512 |
256/768 |
128/640 |
128/256 |

128 |
256/224 |
56 |
256 |
128/768 |
64/640 |
64/256 |

64 |
128/100 |
28 |
128 |
64/768 |
32/640 |
32/256 |

32 |
64/40 |
12 |
64 |
32/768 |
16/640 |
16/256 |

16 |
32/12 |
6 |
32 |
16/768 |
8/640 |
8/256 |

S = SRAM, F = Flash

#### How to Use the FHT Library

These are all the same as the FFT library, so they are just linked from here. Note that the FHT functions will be called *fht_function* rather than *fft_function*, and that you only need to fill 1 data bin for the FHT, i.e. *fht_input[i]* only goes from 0 to (FHT_N - 1).

#### Files

Libraries: Arduino1.0 or Arduino-0022 (should work with both)

#### Installing Libraries

The above files need to be placed in the *libraries* folder inside of your Arduino sketch directory. After you unzip ArduinoFFT.zip, take the FFT folder and place it in your *libraries* folder, restart Arduino and load one of the example programs to test out the library.

If you are not certain where the *libraries* folder is located on your computer, try the following:

##### PC

Open up the Arduino software, and go to **Sketch -> Add File...**, and a window will pop up that is your sketch folder. This is usually *C:\Documents and Settings\<your user name>\My Documents\Arduino*. If you see a *libraries* folder, put the AudioCodec library in there. If you don't already have one, create the *libraries* folder in that directory.

##### Mac

Open up your Arduino sketchbook folder. This is typically */Users/<your user name>/Documents/Arduino*, or */Users/<your user name>/Documents/Maple* if you are using Maple. If there is not a folder already named *libraries*, you should create one and place the unzipped AudioCodec library within it.

#### Implementation Details

This particular FHT uses all the same speed-up techniques as the FFT, so they are included below. But, there are a few additional tricks. The first 3 butterflies are done as a single process, so 2 sets of data fetches/stores are removed. Also, there is a redundant set of data in every other butterfly, so 2 adds and 2 multiplies are removed. Finally, The multiplies were converted to signed times unsigned, which saved a few clock cycles as well.

The speed improvements in this particular implementation are due to 2 things. First, in any FFT, you must multiply the input variables by fixed cosine and sine constants. This is what consumes the most time on the ATmega, as 16b x 16b multiplies take around 18 clock cycles. On the other hand, 16b + 16b adds only take 2 clock cycles. So, its better to add than it is to multiply. As it turns out, a lot of those sine and cosine constants used in the FFT are just 0 or 1, so you don't have to multiply, and can just add. For example, in a 256 point FFT, there are 1024 complex multiplies to be done, of which 382 are either 0 or 1. Thats almost half of them!

The ArduinoFFT checks for those 0 or 1 conditions, and simply does adds instead. as it turns out, those constants occur at regular intervals, and can be easily checked for. The benefits of this sort of approach are limited for larger FFTs. The total savings is (1.5*N - 2) for an N sized FFT, whereas the total number of multiplies is (N/2)*log2(N). This gives a savings ratio of 3/log2(N), which drops as N increases.

The second set of time savings in this implementation comes from using lookup tables to calculate the square roots of the magnitudes. The difficulty in this method is that the input mapping to the lookup table is much, much larger than the actual contents of the lookup table itself. So, to not waste memory space, a compression of the input values must be done. For example, taking the square root of a 16b value has 64k input values which must map down to 256 output values. To have an answer hard coded into memory space for all of those inputs is impossible on the Arduino (and a waste in general). So instead, a linear interpolation of the input space is used, with different slopes for different sections. For 8b linear output, this can be done with no loss of precision with either 3 or 4 linear sections. This means that the input value can be evaluated for which section it lies in, and then the square root fetched, in around 12 clock cycles. This is much less than the usual 150 clock cycles that a standard square root library would require.

The 32b input version is slightly more difficult, as the output mapping space is now 16b (64k), and the linear mapping technique can not compress it any more than that. In this case, a hybrid approach is implemented where the input value is converted to a floating point value with 16b of precision plus 8b of exponent. This can be done very quickly in base 2, and then the above 16b square root lookup table method can be used. If the input compression is done in right shifts of 2, the output value can be reconstructed with left shifts of 1. basically, the exponent is forced to be an even value upon creation, so the square root can return an integer value.

This 32b version is not as precise as a true square root library, but it only takes around 40 clock cycles, compared to 500 for a true square root. This lookup table version only gives an accurate first 8b on the return value, but for the purposes of this FFT, that is good enough. The total bit depth of the FFT is not much past 12b since it is implemented in fixed point (each value must be divided by 2 before adding to prevent overflow - this gives an eventual divide by 256 for a 256 point FFT). The relative accuracy is a function of output value size. For a return value of 8b, it is as close as you can get. For a 9b value, its lsb might be wrong. for a 10b value, 2 lsbs might be wrong, and so on. So the worst case scenario is a 16b return value where you get +/-0.5% accuracy.

#### References

Before settling on the FHT, i read up on a number of different FFT implementations. There is a great deal of academic work to optimize the FFT, and all of that points to the fact that an FFT specifically designed for real input values can be done in few operations than an FHT. But, after reading through the Sorenson paper, i decided that the extra structural complexity of the split-radix variants (which are the fastest ones) out-weighed the reduction in multiplies. There is a pretty hefty processing cost for data fetches in the ATmega, and checking for state and branching takes up time as well. Also, after looking through Sorenson's code, it looked like it wasn't as great of a reduction as claimed. So the FHT won out in the end.

If you are interested in learning more about the FHT, here is a good resources that i used in writing my code.

Stanislav's FHT implementation: Stanislav has a great website with a lot of audio applications for AVR. He introduced me to the FHT, and got this whole mess started. There isn't too much info on the web about the FHT, and his webiste does a good job explaining how it works. A bit of background in the FFT is very helpful for understanding it. See the ArduinoFFT page for more information on FFTs and window functions.