MusicBrainz is having grief with their acoustic fingerprint server.
I thought I'd give a modification of the Markov-chaining playlist generator a try for matching tracks.
The good news is that the matching works well. Extracting 180 distinct features yields the correct answer 83.4% of the time when querying for the nearest neighbour; that rises to 87.8% when querying for the 3 nearest neighbours.
The bad news is that it works better the more features are extracted. This is bad because the matching process is finding the nearest neighbours to a vector in an n-dimensional space, where n is the number of features extracted. For example, if 180 features are extracted, then a query to find the closest track in the database translates to finding the nearest neighbouring point in 180-dimensional space. This turns out to be a hard problem. Most spatial indexing systems (R-Trees, X-Trees, M-Trees) suffer from "the curse of dimensionality" -- query times grow with the square of the dimension.
The good news is that there appear to be techniques for avoiding the so-called "curse of dimensionality": the Pyramid Technique, iDistance, and the P+-tree.
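To make the query concrete, here's a minimal brute-force sketch in Python (the function name and the dict-of-histograms shape are mine, not part of any of the tooling below). This linear scan is the baseline the index structures are trying to beat; the point of the curse of dimensionality is that at n=180 they often can't:

```python
import math

def nearest_neighbours(query, database, k=3):
    """Brute-force k-NN over feature histograms: measure the Euclidean
    distance from the query vector to every track in the database and
    keep the k closest. Cost is O(n_tracks * dims) per query."""
    scored = sorted(database.items(),
                    key=lambda item: math.dist(query, item[1]))
    return [track_id for track_id, _ in scored[:k]]
```

Here database maps track ids to 180-element histograms, and query is the histogram of the unknown track.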
The pieces
- HTK for extracting MFCCs.
- treeq for building the feature extraction decision tree, classifying mfccs as features, and then counting the features in a track.
- Tree::M for storing each track as a point in n-space, where the coordinates are determined by the feature counts returned by treeq.
- My collection of 10424 mp3-encoded tracks.
- My collection of 1532 Ogg Vorbis-encoded tracks (a strict subset of the mp3-encoded tracks).
- cdparanoia
- Ogg Vorbis encoder
- LAME encoder
The puzzle
After a few iterations, here's the recipe I came up with (a sketch of the evaluation loop in code follows the list).
- decode mp3s, and extract mfccs
- decode oggs, and extract mfccs
- extract 9 tracks (Painting Daisies -- Beale Street, Meg Lee Chin -- Bottle, Amon Tobin -- Defocus, Apocalyptica -- Faraway, Ani DiFranco -- In the Way, Screaming Headless Torsos -- Mr. Softee's Nightmare, Aphex Twin -- omgya-switch 7, Opeth -- The Leper Affinity, Orbital -- Tunnel Vision), and encode them in a bunch of different ways
- concatenate the different encodings together, and build a decision tree using growtree
- prune the decision tree down to 5, 10, 15, ..., 200 leaf trees
- foreach tree
- run the mfccs through the decision tree, and build a histogram of the number of samples ending up on each leaf
- for the mp3s, insert these histograms as vectors into an n-dimensional M-Tree, where n is the number of bins in the histogram
- for the oggs, do a nearest-neighbour query with each histogram against the M-Tree built from the mp3 corpus
- tally the number of correct matches
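A hedged Python sketch of that loop; the decision tree is modelled as an opaque classify(frame) -> leaf-index function, and LinearIndex is a deliberately dumb stand-in for the Tree::M M-Tree:

```python
import math

class LinearIndex:
    """Stand-in for Tree::M: exact nearest neighbour by linear scan,
    good enough to show the bookkeeping."""
    def __init__(self):
        self.points = {}

    def insert(self, track_id, vector):
        self.points[track_id] = vector

    def nearest(self, query):
        return min(self.points,
                   key=lambda tid: math.dist(query, self.points[tid]))

def histogram(frames, classify, num_leaves):
    """Run each MFCC frame through the decision tree and count how
    many frames land on each leaf; the counts are the track's
    coordinates in num_leaves-dimensional space."""
    counts = [0] * num_leaves
    for frame in frames:
        counts[classify(frame)] += 1
    return counts

def score_tree(classify, num_leaves, mp3_mfccs, ogg_mfccs):
    """Index the mp3 histograms, query with the ogg histograms, and
    return the fraction of tracks matched to themselves."""
    index = LinearIndex()
    for track_id, frames in mp3_mfccs.items():
        index.insert(track_id, histogram(frames, classify, num_leaves))
    hits = sum(index.nearest(histogram(frames, classify, num_leaves)) == tid
               for tid, frames in ogg_mfccs.items())
    return hits / len(ogg_mfccs)
```

The outer loop over pruned trees is then just a call to score_tree for each of the 5, 10, ..., 200 leaf trees.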
The creation of the decision tree determines what constitutes a feature. Care must be taken to ensure that the decision tree creation process is tolerant of encoder artifacts.
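The concatenation step above is one way to buy that tolerance: if every encoding of a track contributes frames to the same class, a split that merely separates encodings of one track gains the tree nothing. A sketch under that assumption (the plain-text MFCC dump and the file-naming scheme are inventions):

```python
import numpy as np

def load_mfccs(path):
    # Assumes the HTK MFCCs have been dumped as plain text, one
    # frame per line -- a stand-in for a real HTK file reader.
    return np.loadtxt(path)

def training_data(tracks, encodings=("wav", "mp3", "ogg")):
    """Pool MFCC frames from several encodings of each track, labelled
    by track rather than by encoding, before growing the tree."""
    frames, labels = [], []
    for label, track in enumerate(tracks):
        for enc in encodings:
            f = load_mfccs(f"{track}.{enc}.mfcc")
            frames.append(f)
            labels.extend([label] * len(f))
    return np.concatenate(frames), np.array(labels)
```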
This could potentially map onto something similar to the current TRM approach:
- user fires up tagger, and adds a bunch of tracks to be identified
- tagger downloads a treeq decision tree from the MB server
- tagger decodes track, and calculates mfccs
- tagger creates a histogram from the mfccs and the decision tree
- tagger sends the histogram to an MB server
- the MB server calculates the k nearest neighbours, and sends the resulting metadata back to the tagger
- the tagger decides if the match is good enough, perhaps by comparing the Euclidean distance between the result vector and the query vector against a threshold
- the tagger submits the histogram and the selected metadata
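A sketch of the tagger side of that exchange; the endpoint URL, payload shape, and threshold value are all invented for illustration (and a fixed distance threshold only makes sense if the histograms are normalised):

```python
import json
import math
import urllib.request

THRESHOLD = 0.25  # invented; would have to be tuned on real data

def identify(histogram, server="http://mb.example.org/knn"):
    """Ask the (hypothetical) MB k-NN endpoint for candidate matches
    and accept the best one only if its vector sits close enough to
    the query histogram."""
    body = json.dumps({"histogram": histogram, "k": 3}).encode()
    req = urllib.request.Request(
        server, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        candidates = json.load(resp)  # [{"metadata": ..., "vector": [...]}, ...]
    best = candidates[0]
    if math.dist(histogram, best["vector"]) <= THRESHOLD:
        return best["metadata"]
    return None  # distance too large; leave the track for manual tagging
```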
There are quite a few improvements left to be made to the basic approach. First, it needs to scale well, both in database size and in number of dimensions; the M-Trees currently being used do not. There is also still a lot of tuning that could be done to the decision tree to improve the quality of the matches.