Arduino FHT Library
Note: The serial output of the examples is in binary, not ASCII. This means it will not be human readable on the serial port. Change serial.write() to serial.print() to fix this. You may need to write a for() loop to manually output each frequency bin.
About the Arduino FHT Library
The Fast Hartley Transform (FHT) is a function that converts time domain signals to the frequency domain. This is exactly what the more well known FFT does, but it is specifically designed for real data, whereas the FFT operates on complex data. As a result, the FHT uses half as much processing power, and half as much memory. You can have your cake and eat it too!
The output of the FHT is slightly different than the FFT. It is as follows:
- Re(k) = (x(k) + x(-k))/2
- Im(k) = (x(k) - x(-k))/2
Mag(k) = sqrt((x(k)2 + x(-k)2)/2)
- Phase(k) = atan2((x(k) - x(-k)), (x(k) + x(-k)))
- Frequency(k) = (k)*(sample_rate)/(FHT_N)
where k is the kth bin, and -k is the N-kth bin for an FHT of size N. The 0 bin is repeated twice. atan2 is the arctangent, but through a special function that takes care of the edge cases.
About the Arduino FHT Library
The Arduino FHT library is a fast implementation of a standard FHT algorithm which operates on only real data. It can give you up to 256 frequency bins at 16b depth, at a minimum of ~3ms update rate. It is adjustable from 16 to 256 bins, and has several output methods to suit various needs. It can be set to 16b linear, 8b linear, 8b logarithmic, or 8b octave output. All of these different modes are detailed in the read_me file (inside the FFT library folder). The noise floor is slightly better on the FHT in comparison to the FFT, and is around -78dB. When using the onboard ADC, the ADC's noise floor is slightly higher than the FHT's noise floor, giving somewhere between a 9b and 10b SNR (-55dB).
Speed characteristics
Function |
run |
reorder |
window |
lin |
lin8 |
log |
oct |
N |
(ms) |
(us) |
(us) |
(us) |
(us)* |
(us) |
(us) |
256 |
3.18 |
363 |
576 |
566 |
441 |
508 |
425 |
128 |
1.30 |
123 |
288 |
287 |
220 |
258 |
223 |
64 |
0.51 |
61 |
144 |
148 |
111 |
130 |
119 |
32 |
0.18 |
27 |
72 |
77 |
56 |
66 |
67 |
16 |
0.12 |
13 |
36 |
44 |
28 |
39 |
39 |
* Note: the lin8 values are approximate, as they vary a small amount due to SCALE factor. See #define section of the read_me for more detials.
Memory characteristics
Function |
run |
reorder |
window |
lin |
lin8 |
log |
N |
S/F(B) |
F(B) |
F(B) |
S/F(B) |
S/F(B) |
S/F(B) |
256 |
512/476 |
120 |
512 |
256/768 |
128/640 |
128/256 |
128 |
256/224 |
56 |
256 |
128/768 |
64/640 |
64/256 |
64 |
128/100 |
28 |
128 |
64/768 |
32/640 |
32/256 |
32 |
64/40 |
12 |
64 |
32/768 |
16/640 |
16/256 |
16 |
32/12 |
6 |
32 |
16/768 |
8/640 |
8/256 |
S = SRAM, F = Flash
How to Use the FHT Library
These are all the same as the FFT library, so they are just linked from here. Note that the FHT functions will be called fht_function rather than fft_function, and that you only need to fill 1 data bin for the FHT, i.e. fht_input[i] only goes from 0 to (FHT_N - 1).
The library is setup to either sample from the onboard ADC, or from the Audio Codec Shield. If you are using the onboard ADC, you can adjust the sample rate by setting the ADC prescaler bits accordingly. Or, if you are triggering the ADC from a timer, by setting that timer's prescaler appropriately. There is a tradeoff with setting the sample rate: the faster you sample, the less likely you will have problems with Nyquist aliasing, but the less resolution you will have in your frequency bins. When using the onboard ADC, which doesn't have any anti-aliasing filters, it is usually good to sample at least twice as fast as you need to.
Files
Library(4.0): Arduino1.6.12 and higher (updated for PROGMEM problem, fixed STRINGIFY error, fixed clobber list). Should work with older versions of Arduino as well.
Library(2.0): Arduino1.0 (updated for PROGMEM problem - thanks to forum user kirill9617, and added serial example). Should work with older versions of Arduino as well.
Library(3.0): Arduino1.0 (experimental - saved registers removed from clobber list to eliminate "r28"/"r29" compiler error).
Library(1.0): Arduino1.0 or Arduino-0022 (should work with both)
Installing Libraries
The above files need to be placed in the libraries folder inside of your Arduino sketch directory. After you unzip ArduinoFFT.zip, take the FFT folder and place it in your libraries folder, restart Arduino and load one of the example programs to test out the library.
If you are not certain where the libraries folder is located on your computer, try the following:
PC
Open up the Arduino software, and go to Sketch -> Add File..., and a window will pop up that is your sketch folder. This is usually C:\Documents and Settings\<your user name>\My Documents\Arduino. If you see a libraries folder, put the AudioCodec library in there. If you don't already have one, create the libraries folder in that directory.
Mac
Open up your Arduino sketchbook folder. This is typically /Users/<your user name>/Documents/Arduino, or /Users/<your user name>/Documents/Maple if you are using Maple. If there is not a folder already named libraries, you should create one and place the unzipped AudioCodec library within it.
FHT Data Visualizer in Processing or Pure Data
Included in the FHT library folder, is a file arduinofft_display.pd, which is a visualizer for Pure Data for displaying the spectrum returned by the FHT. The user Tempo on the OML forums made a data visualizer in Processing. So, you can use which ever programming environment you are more comfortable with. The following files are required for the processing Application:
Download and unzip FHT_128_channel_analyser.zip and place the entire folder inside of your Processing folder. You should then be able to run the file from within Processing. It is a display for 128 frequency bins (256 point FHT).
Implementation Details
This particular FHT uses all the same speed-up techniques as the FFT, so they are included below. But, there are a few additional tricks. The first 3 butterflies are done as a single process, so 2 sets of data fetches/stores are removed. Also, there is a redundant set of data in every other butterfly, so 2 adds and 2 multiplies are removed. Finally, The multiplies were converted to signed times unsigned, which saved a few clock cycles as well.
The speed improvements in this particular implementation are due to 2 things. First, in any FFT, you must multiply the input variables by fixed cosine and sine constants. This is what consumes the most time on the ATmega, as 16b x 16b multiplies take around 18 clock cycles. On the other hand, 16b + 16b adds only take 2 clock cycles. So, its better to add than it is to multiply. As it turns out, a lot of those sine and cosine constants used in the FFT are just 0 or 1, so you don't have to multiply, and can just add. For example, in a 256 point FFT, there are 1024 complex multiplies to be done, of which 382 are either 0 or 1. Thats almost half of them!
The ArduinoFFT checks for those 0 or 1 conditions, and simply does adds instead. as it turns out, those constants occur at regular intervals, and can be easily checked for. The benefits of this sort of approach are limited for larger FFTs. The total savings is (1.5*N - 2) for an N sized FFT, whereas the total number of multiplies is (N/2)*log2(N). This gives a savings ratio of 3/log2(N), which drops as N increases.
The second set of time savings in this implementation comes from using lookup tables to calculate the square roots of the magnitudes. The difficulty in this method is that the input mapping to the lookup table is much, much larger than the actual contents of the lookup table itself. So, to not waste memory space, a compression of the input values must be done. For example, taking the square root of a 16b value has 64k input values which must map down to 256 output values. To have an answer hard coded into memory space for all of those inputs is impossible on the Arduino (and a waste in general). So instead, a linear interpolation of the input space is used, with different slopes for different sections. For 8b linear output, this can be done with no loss of precision with either 3 or 4 linear sections. This means that the input value can be evaluated for which section it lies in, and then the square root fetched, in around 12 clock cycles. This is much less than the usual 150 clock cycles that a standard square root library would require.
The 32b input version is slightly more difficult, as the output mapping space is now 16b (64k), and the linear mapping technique can not compress it any more than that. In this case, a hybrid approach is implemented where the input value is converted to a floating point value with 16b of precision plus 8b of exponent. This can be done very quickly in base 2, and then the above 16b square root lookup table method can be used. If the input compression is done in right shifts of 2, the output value can be reconstructed with left shifts of 1. basically, the exponent is forced to be an even value upon creation, so the square root can return an integer value.
This 32b version is not as precise as a true square root library, but it only takes around 40 clock cycles, compared to 500 for a true square root. This lookup table version only gives an accurate first 8b on the return value, but for the purposes of this FFT, that is good enough. The total bit depth of the FFT is not much past 12b since it is implemented in fixed point (each value must be divided by 2 before adding to prevent overflow - this gives an eventual divide by 256 for a 256 point FFT). The relative accuracy is a function of output value size. For a return value of 8b, it is as close as you can get. For a 9b value, its lsb might be wrong. for a 10b value, 2 lsbs might be wrong, and so on. So the worst case scenario is a 16b return value where you get +/-0.5% accuracy.
References
Before settling on the FHT, i read up on a number of different FFT implementations. There is a great deal of academic work to optimize the FFT, and all of that points to the fact that an FFT specifically designed for real input values can be done in few operations than an FHT. But, after reading through the Sorenson paper, i decided that the extra structural complexity of the split-radix variants (which are the fastest ones) out-weighed the reduction in multiplies. There is a pretty hefty processing cost for data fetches in the ATmega, and checking for state and branching takes up time as well. Also, after looking through Sorenson's code, it looked like it wasn't as great of a reduction as claimed. So the FHT won out in the end.
If you are interested in learning more about the FHT, here is a good resources that i used in writing my code.
Stanislav's FHT implementation: Stanislav has a great website with a lot of audio applications for AVR. He introduced me to the FHT, and got this whole mess started. There isn't too much info on the web about the FHT, and his webiste does a good job explaining how it works. A bit of background in the FFT is very helpful for understanding it. See the ArduinoFFT page for more information on FFTs and window functions.