How Shazam Works
+
Intro to machine learning for multimedia analysis
Perception & Multimedia Computing
Lecture 19
Rebecca Fiebrink
Lecturer, Department of Computing
Goldsmiths, University of London
1
Last time
• Perceptually-motivated features for IR of audio and music
• Evaluation of IR systems
2
Today
• Shazam
• Machine learning intro
• Supervised learning introduction
• Wekinator demo
3
Shazam
4
Using Shazam
1. Open the app
2. Record a few seconds of audio
3. Find out immediately which song you’re listening to (and have the opportunity to buy it)
5
Fingerprinting
Goal: find a function f(song) such that
f(song1) = f(song2)
IF AND ONLY IF
song1 = song2
(*** meaning: song1 and song2 are the same recording, but possibly with different compression, background noise, etc.)
6
Challenge:
Requires the ability to quickly find the matching song in a database of millions!
Can’t just compare f(query) to f(song1), f(song2), … f(songN) for all N songs in the database.
Solution: use hashing
7
Typical hash functions in
computing
Hash table: each data item is mapped onto a fixed-length (often shorter) key.
Facilitates rapid search (only need to search among collisions).
8
http://en.wikipedia.org/wiki/Hash_table
Review: Perceptual hash
• A hash is a function computed on a song: h(song)
• Perceptually identical (or possibly “very similar”) items end up in the same “bucket”, e.g., MP3 and WAV versions of the same song
• A bucket is a specific location in memory / on disk
• When we receive a new query, compute h(query) and look in this “bucket” for possible matches
9
Creating a Shazam hash function
Requirements:
1. Ability to use a short snippet of a song as a query (rather than having to use the whole song)
• Hash should be computable from just a small part of a song, and shouldn’t require knowledge of where in the song the query lies
2. Robust to degradation (compression, background noise, etc.)
• Hash values for the clean DB version of a song should exactly match hash values computed on degraded versions
3. Hash should be “high entropy”: don’t map too many songs to the same bucket! (This makes search take longer.)
10
High-level solution
• For each song in the DB, compute many hash values using slightly different local segments of the song.
• Create on disk a series of “buckets” (hash values), each of which contains a list of songs whose hashes fall into that bucket.
• For each song, also store a list of hashes and the song times associated with them.
• For a query, compute several hash values using slightly different local segments.
• For each hash value, go to the corresponding bucket.
• For all songs in the bucket, determine whether the query matches that song. (*** some magic here)
11
Hash Function
1. Compute the spectrogram
12
Hash Function
2. Identify peaks in spectrogram
Peak = a time-frequency point with higher energy than its
neighbors in a small region
13
Pretty robust to encoding scheme, background noise, EQ
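As a sketch of the peak-picking step (a naive local-maximum test over a magnitude spectrogram stored as a 2-D list; illustrative only, not Shazam’s actual implementation):

```python
def find_peaks(spec, radius=2):
    """Return (time, freq) indices whose magnitude exceeds all
    neighbors within `radius` bins (a simple local-maximum test)."""
    peaks = []
    n_t, n_f = len(spec), len(spec[0])
    for t in range(n_t):
        for f in range(n_f):
            v = spec[t][f]
            neighbors = (
                spec[tt][ff]
                for tt in range(max(0, t - radius), min(n_t, t + radius + 1))
                for ff in range(max(0, f - radius), min(n_f, f + radius + 1))
                if (tt, ff) != (t, f)
            )
            if all(v > n for n in neighbors):
                peaks.append((t, f))
    return peaks
```

A real system would work on dB magnitudes and tune the neighborhood size to control peak density, but the idea is the same.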
Hash Function
3. Find pairs of nearby peaks.
Hash value = (peak1_freq, peak2_freq, time2-time1)
Can pack the three values together into a single 32-bit uint
14
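The 32-bit packing can be sketched as follows; the bit widths here (12 bits per frequency bin, 8 bits for the time delta) are assumed for illustration, not Shazam’s actual allocation:

```python
def pack_hash(f1, f2, dt):
    """Pack (peak1_freq, peak2_freq, time2 - time1) into one 32-bit int.
    Assumed field widths: 12 bits per frequency bin, 8 bits for the delta."""
    assert 0 <= f1 < 4096 and 0 <= f2 < 4096 and 0 <= dt < 256
    return (f1 << 20) | (f2 << 8) | dt
```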
Storing hashes
4. Store within each hash bucket:
(song number, time point of peak1)
e.g., contents of bucket “3235293”:
3235293:
(32572, 39280)
(209371, 94830)
(32572, 3927)
3235294:
(2324, 2323)
15
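The bucket storage above can be sketched as a Python dictionary mapping each hash value to a list of (song number, time) pairs; the numbers mirror the example bucket, and the function names are illustrative:

```python
from collections import defaultdict

index = defaultdict(list)  # hash value -> list of (song_id, time_of_peak1)

def add_song(index, song_id, hashes):
    """hashes: list of (hash_value, time_of_peak1) for one song."""
    for h, t in hashes:
        index[h].append((song_id, t))

# Mirroring the contents of bucket 3235293 above:
add_song(index, 32572, [(3235293, 39280), (3235293, 3927)])
add_song(index, 209371, [(3235293, 94830)])
```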
Receiving a query
Repeat hash process for query:
1. Compute spectral peaks
2. Find pairs of nearby peaks
3. Compute list of hashes: (freq1, freq2, time2-time1)
4. For each hash value:
a. Find hash bucket.
b. For each song in bucket:
Check if query matches song
Want this to be VERY FAST!
16
Checking for a match
Does query match song?
Create two lists:
1. List of (hash values, times) for song
2. List of (hash values, times) for query
Assumption: if the query is a snippet of the song, then
1. they’ll share lots of hash values, and
2. the relative timing of different hashes will be the same!
17
Relative hash timings
If query matches song, we expect to see something like
this:
Song hashes (value, time): (235, 4), (23958, 10), (9384, 23), (39582, 28), (273, 35)
Query hashes (value, time): (23958, 5), (9384, 17), (39582, 23)
18
Query begins 40 seconds into song
19
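The “magic” in the matching step is, in essence, a vote over time offsets: if the query really is a snippet of the song, many of its matching hashes will share the same offset (song time − query time). A minimal sketch (illustrative, not Shazam’s code):

```python
from collections import Counter

def match_score(song_hashes, query_hashes):
    """song_hashes, query_hashes: dicts mapping hash_value -> list of times.
    Returns the size of the largest group of matching hashes that agree
    on a single time offset (song_time - query_time)."""
    offsets = Counter()
    for h, q_times in query_hashes.items():
        for s_time in song_hashes.get(h, []):
            for q_time in q_times:
                offsets[s_time - q_time] += 1
    return max(offsets.values()) if offsets else 0
```

A large score at one offset indicates a true match; random hash collisions scatter across many different offsets and never pile up.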
Results
Robust to noise:
Only 1-2% of hash values
must survive in order to enable
identification
21
Results
Very fast:
On a regular PC, search through
20,000 tracks takes 5-500ms
Very robust: few false positives
24
For more info
Read the academic paper with
details:
http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
25
Intro to machine
learning
26
Feature values give a data item (e.g., song)
a point in feature space
(Figure: songs plotted in feature space; Feature 1 = Average RMS, Feature 2 = Average centroid)
27
Retrieval: Find items closest to a query
(Figure: songs plotted in feature space; Feature 1 = Average RMS, Feature 2 = Average centroid)
28
Classification: Assigning a data point a label
based on its position in feature space
(Figure: the same feature space, with one cluster labelled “David Bowie songs” and another labelled “Lady Gaga songs”; Feature 1 = Average RMS, Feature 2 = Average centroid)
29
You be the classifier: Bowie or Gaga?
(Figure: the same feature space, with the “Lady Gaga songs” cluster labelled)
30
K-nearest-neighbor (kNN) classifier:
Take a vote of k nearest points
K=3
(Figure: the same feature space, with the “Lady Gaga songs” cluster labelled)
32
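The kNN vote can be sketched in a few lines (Euclidean distance, majority vote; purely illustrative):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature vector.
    Returns the majority label among the k nearest training points."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, with k = 3 a query point surrounded by two Gaga songs and one Bowie song is labelled Gaga.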
Supervised learning builds models.
Models are functions.
(Diagram — Training: training data → algorithm → model; Running: inputs → model → outputs)
33
Supervised learning algorithms
build models from data.
Each example is represented as a feature vector, paired with a label:
(.01, .59, .03, 32) → “C Major”
(.05, 1.2, 3.2, 31) → “F minor”
(-.1, .34, .20, 8.2) → “G7”
Training: these labelled examples are the training data given to the algorithm, which builds the model.
Running: a new input (.01, .64, .02, 20) is given to the model, which outputs “C Major”.
34
Classification: Assign 1 of N discrete labels
to each point in feature space
This model: a separating line or hyperplane (a decision boundary) in feature space (axes: feature1, feature2)
35
Regression
This model: a real-valued function of the input features (plot: feature on the x-axis, output on the y-axis)
36
Unsupervised learning
Dataset includes examples, but no labels
Example: Infer clusters from data:
(Figure: clusters in feature space; axes: feature1, feature2)
37
How supervised learning
algorithms work
(the basics)
38
The learning problem
Goal: Build the best** model given the training data
Definition of “best” depends on context,
assumptions…
39
Which classifier is best?
“Underfit” vs. “overfit” (figure)
Competing goals:
• Accurately model the training data
• **Accurately classify unseen data points**
40
Image from Andrew Ng
A simple classifier: nearest neighbor
(Figure: labelled points and an unlabelled “?” query point in feature space; axes: feature1, feature2)
41
Another simple classifier: Decision tree
Images: http://ai.cs.umbc.edu/~oates/classes/2009/ML/homework1.html,
42
http://nghiaho.com/?p=1300
AdaBoost: Iteratively train a “weak” learner
Image from http://www.cc.gatech.edu/~kihwan23/
imageCV/Final2005/FinalProject_KH.htm
43
Support vector machine
Re-map the input space into a higher-dimensional space and find a separating hyperplane there
44
Supervised learning and music
Create classifiers for:
• Genre (rock, pop, …)
• Artist, album
• Pitch (apply frame by frame in audio)
• Mood
• Tag/semantic descriptor (e.g., last.fm tags)
• Instrument identification
• Beatbox vocalization identification &
remapping
• … and many others
45
Practical tips
May have to explore many feature sets
Normalize features! (e.g., all between [0,1])
Using too many features can hurt
performance: “curse of dimensionality”
May have to explore many learning algorithms
(and many parameter settings for each)
Compare using “test accuracy” or cross-
validation, NOT performance on training data
Usually, more examples are better
46
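As an illustration of the normalization tip, min-max scaling maps each feature into [0, 1] using minima and maxima computed from the training set (a common choice; constant features are mapped to 0 in this sketch):

```python
def minmax_fit(data):
    """data: list of feature vectors. Returns per-feature (min, max)."""
    cols = list(zip(*data))
    return [(min(c), max(c)) for c in cols]

def minmax_apply(params, vector):
    """Rescale one feature vector using the fitted (min, max) per feature."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, (lo, hi) in zip(vector, params)
    ]
```

Note that the same fitted parameters must be applied to test/query data, never re-fitted on it.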
Choosing a classifier: Practical considerations
k-Nearest Neighbor
+ Can tune k to adjust smoothness of decision boundaries
- Sensitive to noisy, redundant, irrelevant features; prone to overfitting;
weird in high dimensions
Decision tree:
+ Can prune to reduce overfitting, produces human-understandable
model
- Can still overfit
AdaBoost
+ Theoretical benefits, less prone to overfitting
+ Can tune by changing base learner, number of training rounds
Support Vector Machine
+ Theoretical benefits similar to AdaBoost
- Many parameters to tune; training can take a long time
Other considerations…
47
For more info
Wekinator:
http://wekinator.cs.princeton.edu/
For non-realtime ML: Weka –
http://www.cs.waikato.ac.nz/ml/weka/
A good intro book on machine learning:
Data Mining by
Witten, Frank, and Hall
48
More detailed and technical textbooks
Bishop, 2006: Pattern Recognition and Machine Learning. Springer
Duda, Hart, and Stork, 2001: Pattern Classification. Wiley-Interscience
49
Next week
Visualization
50