Here’s how the fingerprinting works:
When you’re humming a song to someone, you’re creating a fingerprint because you’re extracting from the music what you think is essential (and if you’re a good singer, the person will recognize the song).
You can think of any piece of music as a time-frequency graph called a spectrogram. On one axis is time, on another is frequency, and on the third is intensity. Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise.
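To make the "horizontal line" idea concrete, here is a minimal sketch that builds a tiny spectrogram of a pure tone using a naive DFT from the standard library. The sample rate, window size, tone frequency and the helper names (`dft_magnitudes`, `spectrogram`) are assumptions chosen for illustration so that the tone lands exactly on one frequency bin; a real implementation would use an FFT and overlapping windows.

```python
import cmath
import math

def dft_magnitudes(window):
    """Magnitudes of the first N/2 bins of a real window (naive O(N^2) DFT)."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2)
    ]

def spectrogram(samples, window_size):
    """Per-window magnitude spectra: result[time_index][frequency_bin]."""
    return [
        dft_magnitudes(samples[start:start + window_size])
        for start in range(0, len(samples) - window_size + 1, window_size)
    ]

# Assumed toy parameters: each bin is SAMPLE_RATE / WINDOW_SIZE = 16 Hz wide.
SAMPLE_RATE = 1024  # Hz
WINDOW_SIZE = 64    # samples per window
TONE_HZ = 128       # falls exactly on bin 128 / 16 = 8

tone = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(4 * WINDOW_SIZE)]
spec = spectrogram(tone, WINDOW_SIZE)

# Index of the most intense frequency bin in each time window.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(peak_bins)  # [8, 8, 8, 8]
```

The strongest bin is the same in every window: on the graph, that is the horizontal line of a continuous pure tone.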
Human ears have more difficulty hearing low sounds (<500Hz) than mid sounds (500Hz-2000Hz) or high sounds (>2000Hz). As a result, the low frequencies of many “raw” songs are artificially boosted before release. If you keep only the most powerful frequencies, you’ll end up with only the low ones, and if 2 songs have the same drum part, their filtered spectrograms might be very close even though the first song has flutes and the second has guitars.
Here is a simple way to keep only strong frequencies while reducing the previous problems:
For each FFT result, you put the 512 bins inside 6 logarithmic bands. For each band, you keep the strongest frequency bin. You then compute the average value of these 6 powerful bins. Finally, you keep only the bins (among those 6) that are above this mean (multiplied by a coefficient). The last step is very important because you might have an a cappella piece involving soprano singers with only mid or mid-high frequencies, or a jazz/rap piece with only low and low-mid frequencies. In those cases, you don’t want to keep a weak frequency just because it happens to be the strongest in its band.
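The filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the band edges in `BANDS` are hypothetical (the text only says "6 logarithmic bands"), and `strongest_bins` and the default `coefficient` of 1.0 are names and values picked for the example, not a reference implementation.

```python
# Hypothetical band edges as (first_bin, last_bin_exclusive); the widths
# roughly double, which is one possible reading of "logarithmic bands".
BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 512)]

def strongest_bins(magnitudes, bands=BANDS, coefficient=1.0):
    """Return the (bin_index, magnitude) pairs kept for one FFT result."""
    # Step 1: keep the strongest bin of each band.
    candidates = []
    for low, high in bands:
        best = max(range(low, high), key=magnitudes.__getitem__)
        candidates.append((best, magnitudes[best]))

    # Step 2: average value of these strongest bins.
    mean = sum(magnitude for _, magnitude in candidates) / len(candidates)

    # Step 3: keep only the candidates above the mean times the coefficient.
    threshold = mean * coefficient
    return [(index, magnitude) for index, magnitude in candidates if magnitude > threshold]

# A made-up frame with energy only in two bands (bins 50 and 100)
# and a weak noise floor everywhere else.
frame = [0.1] * 512
frame[50] = 10.0
frame[100] = 8.0

print(strongest_bins(frame))  # [(50, 10.0), (100, 8.0)]
```

In this made-up frame, four of the six bands contain nothing but the noise floor. Their "strongest" bins are still weak, so the mean-based threshold discards them and only the two real peaks survive.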