Diff for "ArduinoFHT" - Open Music Labs Wiki

Differences between revisions 4 and 7 (spanning 3 versions)

Arduino FHT Library

The Fast Hartley Transform (FHT) is a function that converts time domain signals to the frequency domain. This is exactly what the more well known FFT does, but it is specifically designed for real data, whereas the FFT operates on complex data. As a result, the FHT uses half as much processing power, and half as much memory. You can have your cake and eat it too!

About the Arduino FHT Library

The Arduino FHT library is a fast implementation of a standard FHT algorithm which operates on only real data. It can give you up to 256 frequency bins at 16b depth, at a minimum of ~3ms update rate. It is adjustable from 16 to 256 bins, and has several output methods to suit various needs. It can be set to 16b linear, 8b linear, 8b logarithmic, or 8b octave output. All of these different modes are detailed in the read_me file (inside the FFT library folder). The noise floor is slightly better on the FHT in comparison to the FFT, and is around -78dB. When using the onboard ADC, the ADC's noise floor is slightly higher than the FHT's noise floor, giving somewhere between a 9b and 10b SNR (-55dB).

Speed characteristics

Function	run	reorder	window	lin	lin8	log	oct
N	(ms)	(us)	(us)	(us)	(us)*	(us)	(us)
256	3.18	363	576	566	441	508	425
128	1.30	123	288	287	220	258	223
64	0.51	61	144	148	111	130	119
32	0.18	27	72	77	56	66	67
16	0.12	13	36	44	28	39	39

* Note: the lin8 values are approximate, as they vary a small amount due to SCALE factor. See #define section of the read_me for more detials.

Memory characteristics

Function	run	reorder	window	lin	lin8	log
N	S/F(B)	F(B)	F(B)	S/F(B)	S/F(B)	S/F(B)
256	512/476	120	512	256/768	128/640	128/256
128	256/224	56	256	128/768	64/640	64/256
64	128/100	28	128	64/768	32/640	32/256
32	64/40	12	64	32/768	16/640	16/256
16	32/12	6	32	16/768	8/640	8/256

S = SRAM, F = Flash

How to Use the FHT Library

These are all the same as the FFT library, so they are just linked from here. Note that the FHT functions will be called fht_function rather than fft_function, and that you only need to fill 1 data bin for the FHT, i.e. fht_input[i] only goes from 0 to (FHT_N - 1).

Files

Libraries: Arduino1.0 or Arduino-0022 (should work with both)

Arduino FHT library

Installing Libraries

The above files need to be placed in the libraries folder inside of your Arduino sketch directory. After you unzip ArduinoFFT.zip, take the FFT folder and place it in your libraries folder, restart Arduino and load one of the example programs to test out the library.

If you are not certain where the libraries folder is located on your computer, try the following:

PC

Open up the Arduino software, and go to Sketch -> Add File..., and a window will pop up that is your sketch folder. This is usually C:\Documents and Settings\<your user name>\My Documents\Arduino. If you see a libraries folder, put the AudioCodec library in there. If you don't already have one, create the libraries folder in that directory.

Mac

Open up your Arduino sketchbook folder. This is typically /Users/<your user name>/Documents/Arduino, or /Users/<your user name>/Documents/Maple if you are using Maple. If there is not a folder already named libraries, you should create one and place the unzipped AudioCodec library within it.

Implementation Details

This particular FHT uses all the same speed-up techniques as the FFT, so they are included below. But, there are a few additional tricks. The first 3 butterflies are done as a single process, so 2 sets of data fetches/stores are removed. Also, there is a redundant set of data in every other butterfly, so 2 adds and 2 multiplies are removed. Finally, The multiplies were converted to signed times unsigned, which saved a few clock cycles as well.

The speed improvements in this particular implementation are due to 2 things. First, in any FFT, you must multiply the input variables by fixed cosine and sine constants. This is what consumes the most time on the ATmega, as 16b x 16b multiplies take around 18 clock cycles. On the other hand, 16b + 16b adds only take 2 clock cycles. So, its better to add than it is to multiply. As it turns out, a lot of those sine and cosine constants used in the FFT are just 0 or 1, so you don't have to multiply, and can just add. For example, in a 256 point FFT, there are 1024 complex multiplies to be done, of which 382 are either 0 or 1. Thats almost half of them!

The ArduinoFFT checks for those 0 or 1 conditions, and simply does adds instead. as it turns out, those constants occur at regular intervals, and can be easily checked for. The benefits of this sort of approach are limited for larger FFTs. The total savings is (1.5*N - 2) for an N sized FFT, whereas the total number of multiplies is (N/2)*log2(N). This gives a savings ratio of 3/log2(N), which drops as N increases.

The second set of time savings in this implementation comes from using lookup tables to calculate the square roots of the magnitudes. The difficulty in this method is that the input mapping to the lookup table is much, much larger than the actual contents of the lookup table itself. So, to not waste memory space, a compression of the input values must be done. For example, taking the square root of a 16b value has 64k input values which must map down to 256 output values. To have an answer hard coded into memory space for all of those inputs is impossible on the Arduino (and a waste in general). So instead, a linear interpolation of the input space is used, with different slopes for different sections. For 8b linear output, this can be done with no loss of precision with either 3 or 4 linear sections. This means that the input value can be evaluated for which section it lies in, and then the square root fetched, in around 12 clock cycles. This is much less than the usual 150 clock cycles that a standard square root library would require.

The 32b input version is slightly more difficult, as the output mapping space is now 16b (64k), and the linear mapping technique can not compress it any more than that. In this case, a hybrid approach is implemented where the input value is converted to a floating point value with 16b of precision plus 8b of exponent. This can be done very quickly in base 2, and then the above 16b square root lookup table method can be used. If the input compression is done in right shifts of 2, the output value can be reconstructed with left shifts of 1. basically, the exponent is forced to be an even value upon creation, so the square root can return an integer value.

This 32b version is not as precise as a true square root library, but it only takes around 40 clock cycles, compared to 500 for a true square root. This lookup table version only gives an accurate first 8b on the return value, but for the purposes of this FFT, that is good enough. The total bit depth of the FFT is not much past 12b since it is implemented in fixed point (each value must be divided by 2 before adding to prevent overflow - this gives an eventual divide by 256 for a 256 point FFT). The relative accuracy is a function of output value size. For a return value of 8b, it is as close as you can get. For a 9b value, its lsb might be wrong. for a 10b value, 2 lsbs might be wrong, and so on. So the worst case scenario is a 16b return value where you get +/-0.5% accuracy.

References

Before settling on the FHT, i read up on a number of different FFT implementations. There is a great deal of academic work to optimize the FFT, and all of that points to the fact that an FFT specifically designed for real input values can be done in few operations than an FHT. But, after reading through the Sorenson paper, i decided that the extra structural complexity of the split-radix variants (which are the fastest ones) out-weighed the reduction in multiplies. There is a pretty hefty processing cost for data fetches in the ATmega, and checking for state and branching takes up time as well. Also, after looking through Sorenson's code, it looked like it wasn't as great of a reduction as claimed. So the FHT won out in the end.

If you are interested in learning more about the FHT, here is a good resources that i used in writing my code.

Stanislav's FHT implementation: Stanislav has a great website with a lot of audio applications for AVR. He introduced me to the FHT, and got this whole mess started. There isn't too much info on the web about the FHT, and his webiste does a good job explaining how it works. A bit of background in the FFT is very helpful for understanding it. See the ArduinoFFT page for more information on FFTs and window functions.

-  ← Revision 4 as of 2012-09-06 00:59:01 →
  Size: 8573
  Editor: guest
  Comment:
+  ← Revision 7 as of 2012-09-06 03:15:22 →
  Size: 9522
  Editor: guest
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 36:
-These are all the same as the FFT library, so they are just linked from here.
+These are all the same as the FFT library, so they are just linked from here.  Note that the FHT functions will be called ''fht_function'' rather than ''fft_function'', and that you only need to fill 1 data bin for the FHT, i.e. ''fht_input[i]'' only goes from 0 to (FHT_N - 1).
 Line 38:
- * [[FFTunctions|FHT functions]]
+ * [[FFTFunctions|FHT functions]]
 Line 76:
+Before settling on the FHT, i read up on a number of different FFT implementations.  There is a great deal of academic work to optimize the FFT, and all of that points to the fact that an FFT specifically designed for real input values can be done in few operations than an FHT.  But, after reading through the Sorenson paper, i decided that the extra structural complexity of the split-radix variants (which are the fastest ones) out-weighed the reduction in multiplies.  There is a pretty hefty processing cost for data fetches in the ATmega, and checking for state and branching takes up time as well.  Also, after looking through Sorenson's code, it looked like it wasn't as great of a reduction as claimed.  So the FHT won out in the end.

Quick Links

Search Wiki

Page Tools