
Shazam in 320 bytes per song

I wanted to know if you could build an offline Shazam that runs entirely on a phone — no server, no network. The answer is: mostly yes, and you don’t even need to train a model.

The finding

MERT-v1-95M, a pretrained music transformer, produces embeddings that are already discriminative enough for song identification. No fine-tuning, no adapter, no custom loss function. Just freeze it, mean-pool the output, and search by cosine similarity.
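
That recipe is short enough to sketch in full. Loading follows the m-a-p/MERT-v1-95M model card on HuggingFace; the pooling and search helpers are illustrative, not the repo's actual code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Frozen MERT: no fine-tuning, so eval mode and no gradients anywhere.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True).eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True)

def embed(window):
    """Mean-pool one 5-second, 24 kHz mono window into a 768-dim vector."""
    inputs = processor(window, sampling_rate=24_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

def top1(query, index):
    """Nearest song by cosine similarity; index is an (n_songs, 768) tensor."""
    return int(F.cosine_similarity(query.unsqueeze(0), index).argmax())
```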

On a corpus of 6,839 songs (Billboard Hot 100, 1920–2020s), frozen MERT achieves 96.6% top-1 recall. The interesting part is what happens when you compress it.

The compression pipeline

[Figure: MusicPrint indexing pipeline]

A 3-minute song produces ~175 overlapping 5-second windows, each encoded as a 768-dim float32 vector. That's roughly 540 KB per song (175 × 768 × 4 bytes) — way too much for a phone database.
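
The window count is consistent with a 1-second hop, which I'm inferring from the "1-second grid" mentioned at the end of the post:

```python
import numpy as np

sr = 24_000                                   # MERT's input sample rate
audio = np.zeros(180 * sr, dtype=np.float32)  # stand-in for a 3-minute song
win, hop = 5 * sr, 1 * sr                     # 5-second windows, 1-second hop
windows = [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]
print(len(windows))                           # 176, i.e. the "~175" above
```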

Three compression stages bring it down (all three are sketched in code after the list):

  1. K-means clustering — Most windows in a song are redundant (repeated chorus, sustained sections). Clustering 175 windows to 10 centroids loses nothing at 100 songs and only 3.4% at 7,000 songs.

  2. PCA — The 768-dim embedding space has an effective dimensionality around 256 for fingerprinting. PCA to 256 dims preserves 96.1% recall.

  3. Binary hashing — Take the sign of each PCA dimension (positive → 1, negative → 0). Search with Hamming distance instead of cosine similarity.
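
A minimal sketch of all three stages with scikit-learn and NumPy. The parameter values (10 centroids, 256 dims) come from this post; the function names and bit packing are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def fit_pca(all_window_embeddings: np.ndarray, dims: int = 256) -> PCA:
    """Fit PCA once, on window embeddings pooled across the whole corpus."""
    return PCA(n_components=dims).fit(all_window_embeddings)

def fingerprint(windows: np.ndarray, pca: PCA) -> np.ndarray:
    """(n_windows, 768) float embeddings -> (10, 32) packed bytes = 320 B."""
    # 1. K-means: collapse ~175 redundant windows to 10 centroids.
    centroids = KMeans(n_clusters=10, n_init=10).fit(windows).cluster_centers_
    # 2. PCA: project the centroids from 768 down to 256 dims.
    reduced = pca.transform(centroids)
    # 3. Binary hashing: keep only the sign of each PCA dimension.
    return np.packbits(reduced > 0, axis=1)  # 256 bits -> 32 bytes per row

def identify(query_hash: np.ndarray, index: np.ndarray) -> int:
    """query_hash: (32,) packed bits of one query window.
    index: (n_songs, 10, 32) packed centroid hashes for the whole corpus.
    Returns the song whose nearest centroid has the smallest Hamming distance."""
    xor = np.bitwise_xor(index, query_hash)           # broadcasts over songs
    dists = np.unpackbits(xor, axis=-1).sum(axis=-1)  # (n_songs, 10) popcounts
    return int(dists.min(axis=1).argmin())
```

At query time each clip window gets the same treatment (the same fitted PCA, then signs), and you'd aggregate votes across the clip's windows; that aggregation step is omitted here.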

The surprise: PCA 256 + binary hashing (96.5% recall) actually outperforms raw binary hashing without PCA (95.1%). Removing noise dimensions before binarization makes the sign bits more discriminative.

Config                     Storage/song   Top-1 recall   Index @ 10M songs
Full embeddings            30 KB          96.6%          286 GB
k=10 + PCA 256 + binary    320 B          96.5%          3 GB
k=10 + PCA 128 + binary    160 B          92.0%          1.5 GB

At 320 bytes per song, 10 million songs fit in 3 GB. That’s an iPhone. For context, spectrogram-based approaches like Shazam typically store 8–24 KB per song (based on back-of-envelope math: a 3-minute song produces thousands of 32-bit landmark hashes). MusicPrint is 25–75x smaller.
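
Spelling out that back-of-envelope math (the per-song landmark counts are rough assumptions implied by the 8–24 KB range):

```python
print(10_000_000 * 320 / 1e9)       # 3.2 -> ~3 GB MusicPrint index
print(2_000 * 4, 6_000 * 4)         # 8,000-24,000 B: ~2k-6k 32-bit landmarks
print(8_000 // 320, 24_000 // 320)  # 25x-75x the 320 B fingerprint
```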

What didn’t work

Before discovering frozen MERT was sufficient, I spent time on fine-tuning approaches that all failed:

  • ArcFace with Tanh adapter: hash collapse — the Tanh activation saturated and every song produced the same binary hash. The margin was also accidentally set to 28.6 radians instead of 0.5, a degrees-for-radians mix-up (0.5 rad ≈ 28.6°; see the sketch after this list).
  • ArcFace with MLP adapter: 40–50% recall. It turned out the evaluation was flawed (comparing only 2 clips per song instead of searching a full index).
  • Contrastive loss with full-song training: Seemed to work, but when I fixed the evaluation, frozen MERT without any training matched it.
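
Here is the margin bug from the first bullet in isolation; the variable names are mine:

```python
import math

# ArcFace adds an angular margin m (in radians) to the target class angle:
# logit = s * cos(theta + m). A margin of 28.6 *radians* wraps the unit
# circle ~4.5 times, destroying the geometry the loss depends on.
m_buggy = 28.6                # degrees value passed where radians were expected
m_right = math.radians(28.6)  # ~0.499 rad, i.e. the standard m = 0.5
print(round(m_right, 3))      # 0.499
```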

The lesson: test your evaluation before blaming the model.

What’s left

The 96.5% recall is on 7,000 songs. The real question is whether it holds at 10 million. The embedding space gets more crowded as you add songs, and 3.4% degradation from 100 to 7,000 songs might extrapolate badly.

Other open questions: how does it handle noisy recordings (phone mic in a bar), clips not aligned to the 1-second grid, and corpora of very similar songs (all classical piano, all EDM drops).

The code, experiments, and paper are on GitHub: alainbrown/musicprint