Tempo Detection
The algorithm used for tempo detection was developed by the MIT Media Lab [8], and the code implementing it was taken from a website [9]. Implementing tempo detection in the playlist generation algorithm was a stretch goal after successfully implementing key detection. Given the time constraints, we therefore chose to adapt code written by another group and instead focus on understanding the DSP the algorithm used. Our understanding of the algorithm was also mostly derived from the site we took the code from and from the original paper.

Figure 1: MIT Media Lab Tempo Detection Algorithm [8]

Understanding the Algorithm
The tempo detection algorithm is a four-step process, shown in Fig. 1.
The algorithm works on 2.2 seconds of audio taken from the middle of the song.
First, the song is split into several frequency bands to analyze. Doing this
makes it easier to see patterns within specific frequency ranges. To do this,
the FFT of the song was first computed. The FFT was then divided into six
frequency bands: 0-200 Hz, 200-400 Hz, 400-800 Hz, 800-1600 Hz, 1600-3200 Hz, and
3200 Hz up to the sampling frequency. This can be viewed as applying an ideal
band-pass filter to the song six times. Finally, the inverse FFT of each
frequency band was computed and passed to the next step.
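This band-splitting step can be sketched in a few lines of Python with NumPy. This is our own illustrative version of the idea, not the code from [9] (which is a separate implementation); the function name and band edges here are just the defaults described above.

```python
import numpy as np

def filterbank(signal, fs, band_edges=(0, 200, 400, 800, 1600, 3200)):
    """Split a signal into frequency bands using an ideal band-pass
    filter applied in the FFT domain (step 1 of the algorithm)."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # The last band runs from 3200 Hz up to the top of the spectrum.
    edges = list(band_edges) + [fs]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        band_spectrum = np.where(mask, spectrum, 0)
        # Inverse FFT gives the band-limited time-domain signal.
        bands.append(np.fft.irfft(band_spectrum, n))
    return bands
```

Because the bands partition the spectrum, summing all six band signals reconstructs the original signal, which is a convenient sanity check on the ideal band-pass interpretation.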
The following steps were performed on each individual frequency band. Step 2 was to smooth the time-domain signal by finding its envelope. This was achieved by full-wave rectifying the audio signal and then low-pass filtering it with a Hanning window. Full-wave rectification makes the envelope non-negative, which simplifies the analysis. The low-pass filtering was performed by multiplying the signal's spectrum by the spectrum of the Hanning window, since filtering (convolution) in the time domain is multiplication in the frequency domain. The filtered signal was then transformed back to the time domain.
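A minimal sketch of this envelope step, again our own Python illustration rather than the code from [9]; the 0.4 s window length is an arbitrary choice for the example, not a value from the paper:

```python
import numpy as np

def smooth_envelope(band, fs, win_len=0.4):
    """Step 2: full-wave rectify the band, then low-pass filter it with
    a Hanning window, doing the filtering as FFT-domain multiplication."""
    rectified = np.abs(band)                 # full-wave rectification
    n = len(band)
    hann = np.hanning(int(win_len * fs))
    # Zero-pad the window to the signal length so the spectra align.
    hann_padded = np.zeros(n)
    hann_padded[:len(hann)] = hann
    # Convolution in time == multiplication in frequency.
    env = np.fft.irfft(np.fft.rfft(rectified) * np.fft.rfft(hann_padded), n)
    return env
```

The result is a non-negative, slowly varying signal that tracks the loudness of the band while suppressing the audio-rate oscillations.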
The third step is to differentiate the envelope signal, which accentuates changes in sound amplitude. Each band was then half-wave rectified so that only increases in sound are kept; half-wave rectification here means that all negative amplitude values are set to zero. In the code, this is done by taking the difference from one sample
to the next and retaining it only if the difference is positive.
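This step is short enough to sketch directly (our own illustrative helper, not the code from [9]):

```python
import numpy as np

def diff_rect(envelope):
    """Step 3: differentiate the envelope and half-wave rectify,
    keeping only increases in amplitude (candidate note onsets)."""
    d = np.diff(envelope)       # sample-to-sample difference
    return np.maximum(d, 0)     # negative differences become zero
```

For example, `diff_rect` applied to an envelope `[0, 1, 0.5, 2, 2]` yields `[1, 0, 1.5, 0]`: only the rising edges survive.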
The final step was to convolve the differentiated, half-wave-rectified envelope signals with a bank of comb filters. Each comb filter was made up of periodic impulses spaced according to the tempo it represented, with a different comb filter for each tempo tested. To convolve the signal with each comb filter, the FFT of both signals was computed and the two spectra were multiplied together, since convolution in the time domain is multiplication in the frequency domain. The comb filter works by producing higher peaks in the envelope signal wherever both the signal and the comb filter have an impulse, meaning both are experiencing a beat. The energy of the envelope signal after convolution with each comb filter was computed, and the energies from all six filter bands were summed. Finally, the summed energies were compared, and the tempo of the comb filter with the maximum energy was taken as the tempo of the song.
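The comb-filter search can be sketched as follows. This is a simplified single-band illustration in Python written by us, not the code from [9]; the candidate bpm range is an assumption for the example, and a full version would sum the energies across all bands before taking the maximum.

```python
import numpy as np

def detect_tempo(onset_signal, fs, bpm_range=range(60, 181)):
    """Step 4: convolve the onset signal with an impulse-train comb
    filter for each candidate tempo (via FFT multiplication) and pick
    the tempo whose filter output has the most energy."""
    n = len(onset_signal)
    spectrum = np.fft.rfft(onset_signal)
    best_bpm, best_energy = None, -1.0
    for bpm in bpm_range:
        period = int(round(fs * 60.0 / bpm))   # samples per beat
        comb = np.zeros(n)
        comb[::period] = 1.0                   # periodic impulses
        # Convolution in time == multiplication in frequency.
        out = np.fft.irfft(spectrum * np.fft.rfft(comb), n)
        energy = np.sum(out ** 2)
        if energy > best_energy:
            best_bpm, best_energy = bpm, energy
    return best_bpm
```

When the comb's impulses line up with the onsets, the convolution peaks reinforce and the output energy is maximized, which is exactly the alignment argument in the paragraph above.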
DSP Techniques Used
The process of finding the tempo of a song used several DSP techniques discussed in EECS 351. First, several different filter types were used: a band-pass filter to generate the filter bands, a low-pass filter to find the envelope, and a comb filter, a filter type not discussed in class, to determine the tempo of the song. Another DSP technique used was the FFT, applied several times throughout the algorithm to convert the signal between the frequency and time domains depending on which was more appropriate. Rectification, another technique not discussed in class, was used in both full-wave and half-wave forms: full-wave rectification changes negative amplitude values to identical positive values, while half-wave rectification changes negative amplitude values to zero [10]. Finally, convolution was used in the final step between the comb filter and the song. As discussed in class, convolution in the time domain is multiplication in the frequency domain, and this was leveraged in the algorithm to make it computationally simpler.

Modifications Made to the Algorithm
Running the algorithm downloaded off the internet gave very low accuracy, often returning a multiple of the correct tempo or a seemingly random number, so several changes were made to the code to improve it. First, when testing the algorithm we noted that the sampling frequency it assumed was 4096 Hz, while most music actually has a sampling frequency of 44.1 kHz. To account for this, we modified the code to set the sampling frequency to the length of the song's FFT vector. This is the number of bins that the song's frequencies were put into, and it was a good representation of the sampling frequency of the song. Next, we rescaled the six frequency bands generated in the first step of the algorithm to better divide the expanded frequency range. The band edges were recalculated for each song, because the sampling frequency changed with each song and we wanted the band widths to increase geometrically. This way, low frequencies would have narrower bands and higher resolution than high frequencies.
The other change made to the algorithm was testing how many frequency bands to use. The original algorithm divided the frequencies into 6 bands, but changing the sampling frequency from 4096 Hz to about 44.1 kHz made the bands much wider, so we wanted to test the effect of increasing the number of bands. The algorithm was tested with 6 to 30 bands, and the results of these tests suggested that 8 bands was ideal. Increasing the number of bands to 8 increased the number of songs for which the algorithm correctly determined the beat; increasing beyond 8 did not, and only served to slow the algorithm down.
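The geometric rescaling of the band edges described above can be sketched as follows. This is a hypothetical helper of our own, not the group's actual code; the 200 Hz lower edge for the first band is an assumption carried over from the original algorithm's first band boundary.

```python
import numpy as np

def geometric_band_edges(fs, n_bands=8, f_low=200.0):
    """Band edges that grow geometrically from f_low up to fs/2, so low
    frequencies get narrower bands (higher resolution) than high ones.
    The first band always covers 0..f_low."""
    ratio = (fs / 2.0 / f_low) ** (1.0 / (n_bands - 1))
    edges = [0.0] + [f_low * ratio ** k for k in range(n_bands)]
    return edges
```

Recomputing these edges per song from that song's effective sampling frequency keeps the band layout consistent even though the frequency range changes.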
Results
Table 1: Output of the tempo algorithm
A sample of the results of tempo detection is shown in Table 1. For the songs tested, the algorithm got about 50% completely correct. The ones it got wrong were often half or twice the correct bpm, as can be seen in the table for the songs First Love and Anagram. For the song Anagram, we listened to the song with a metronome and determined that the output of the algorithm was fairly accurate, as opposed to the tempo value available online. Further analysis of the results would be to determine whether the algorithm's output is a plausible tempo value by listening to each song with a metronome, as well as to consider ways to modify the algorithm to reduce the occurrences of the estimated bpm being half or twice the actual bpm.