
What If a Hash Function Wanted Collisions?

Locality Sensitive Hashing (LSH) and more!

Published
10 min read
I was reading about universal hashing last week. You know, the standard stuff: pick random coefficients a and b, compute h(x) = (ax + b) mod p, and you get a family of hash functions where collisions are provably rare. It is one of those foundational ideas that shows up everywhere, from hash tables to load balancing to streaming algorithms.
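
To make that concrete, here is a minimal sketch of such a universal family in Python. The prime, table size, and function name are arbitrary choices of mine, not anything from a specific system:

```python
import random

def make_universal_hash(p=2_147_483_647, m=1024):
    """Draw one function h(x) = ((a*x + b) mod p) mod m from a universal
    family. p is a prime larger than any key; a must be nonzero."""
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash()
# For any two distinct keys x != y, the probability over the random draw
# that h(x) == h(y) is at most 1/m. That is the "collisions are rare" part.
```

Everything interesting lives in the random draw of a and b: any single function can be fooled, but an adversary cannot fool the whole family.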

And then a thought hit me. What if I wanted the exact opposite?

What if, instead of scattering keys as far apart as possible, I wanted similar things to land in the same bucket?

That question turns out to have a name: Locality-Sensitive Hashing. And it turns out to be the trick behind some pretty significant systems.

The Problem That Started All of This

Think about what Google had to deal with in, say, 2007. Billions of web pages. A huge chunk of them are duplicates or near-duplicates. Mirror sites, scraped content, syndicated articles, press releases copy-pasted across a hundred news outlets. If you don't catch these, your search index bloats, your rankings get polluted, and users see the same article five times on the first page.

The naive approach is to compare every pair of documents. If you have a million documents, that is about 500 billion comparisons. Even if each comparison takes a microsecond, you are looking at roughly 6 days of compute. For a billion documents, the math just falls apart entirely.

Google needed a way to find near-duplicates without comparing every pair. And they needed it to work on a scale where even O(n log n) felt expensive.

This is also exactly the problem plagiarism detection tools face. Turnitin, Copyscape, and similar services need to take a submitted essay and check it against millions of existing documents. They can't afford to do a full text comparison against every document in their database. They need a shortcut.

The Core Insight

Here is the key idea behind LSH, and I think it is genuinely clever.

A regular hash function tries to make h("cat") != h("bat") even though the inputs differ by one character. That is the whole point of hashing. Uniform distribution, minimal collisions, all of that.

An LSH hash function does the opposite. It is designed so that similar inputs are likely to produce the same hash value. Not guaranteed. Likely. And you can tune exactly how likely.

This gives you a filter. Instead of comparing all pairs, you hash everything, and only compare documents that ended up in the same bucket. Most of the truly dissimilar pairs never even get looked at.

The tradeoff is that you might miss some similar pairs (false negatives) and you might flag some dissimilar pairs (false positives). But you can control both of those error rates by tuning parameters. And you go from O(n^2) comparisons to something much closer to O(n).

Two Flavors of the Same Idea

There are two main algorithms I studied that fall under the LSH umbrella, and they solve slightly different problems. (Illustration generated by Google's Gemini.)

MinHash: "Did someone copy-paste this?"

MinHash is built for detecting structural overlap. It works with sets, and it estimates something called Jaccard similarity, which is just "what fraction of elements do these two sets share?"

The setup works like this. You take a document and break it into overlapping character sequences called shingles. The sentence "the cat sat" with shingle size 5 becomes {"the c", "he ca", "e cat", " cat ", "cat s", "at sa", "t sat"}. Now your document is a set, and you can ask set overlap questions.
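
A shingle function is only a few lines. This sketch (the function name and default are my own) reproduces the example above:

```python
def shingles(text, k=5):
    """Break text into the set of all overlapping k-character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

shingles("the cat sat")
# → {'the c', 'he ca', 'e cat', ' cat ', 'cat s', 'at sa', 't sat'}
```

Note that the output is a set, so repeated shingles collapse, which is exactly what you want for asking overlap questions.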

The problem is that these sets are huge. A typical article might produce tens of thousands of shingles. Comparing two sets directly is fine, but comparing all pairs is not.

MinHash compresses each set into a short signature (say, 128 numbers) with a nice property: the probability that two signatures agree at any position equals the Jaccard similarity between the original sets. So you compare 128 numbers instead of tens of thousands of shingles.

This is not a heuristic. There is a clean theorem behind it. If you pick a random hash function h and look at the minimum hash value across all elements of a set, then:

P( min_hash(A) = min_hash(B) ) = |A ∩ B| / |A ∪ B|

The intuition is straightforward. Pool all shingles from both documents together, randomly order them by h, and look at whichever shingle lands first. That shingle is shared by both documents with probability equal to the fraction of the pool that is shared. That fraction is exactly the Jaccard similarity. Each position in the signature is an independent trial of this experiment, so averaging over 128 positions gives you a tight estimate.
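
The theorem translates almost directly into code. This is a sketch, not a tuned implementation: I simulate the 128 independent hash functions by salting a single blake2b hash with the function index, which is one common trick among several:

```python
import hashlib

def minhash_signature(shingle_set, num_hashes=128):
    """One minimum per simulated hash function: salt the input with the
    function index and keep the smallest digest seen for each salt."""
    sig = []
    for i in range(num_hashes):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the signatures agree ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each signature position is one run of the "which shingle lands first?" experiment, so the agreement fraction is an unbiased estimate of the Jaccard similarity, with error shrinking as you add hash functions.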

And here is where the LSH part comes in. You don't even compare all pairs of signatures. You split each 128-number signature into bands (say 16 bands of 8 numbers each) and hash each band separately. Two documents become candidates only if they match on at least one band. The math works out so that pairs with high similarity almost always match on at least one band, while pairs with low similarity almost never do.

The result is a system with a sharp threshold. Below some similarity level, pairs are almost never detected. Above it, they almost always are. And you can tune exactly where that cutoff sits by adjusting how many bands you use and how many numbers go in each band.
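
The banding step is mechanical, and the "sharp threshold" claim comes from a standard formula: a pair with Jaccard similarity s matches on at least one of b bands of r rows with probability 1 - (1 - s^r)^b, and the cutoff sits roughly at (1/b)^(1/r). A sketch (names are mine):

```python
def band_keys(signature, bands=16, rows=8):
    """Split a signature into bands; each band's tuple is a bucket key.
    Two documents become candidates if they share any band key."""
    assert bands * rows == len(signature)
    return [tuple(signature[b * rows:(b + 1) * rows]) for b in range(bands)]

def detection_probability(s, bands=16, rows=8):
    """Chance that a pair with Jaccard similarity s matches on >= 1 band."""
    return 1 - (1 - s ** rows) ** bands
```

With 16 bands of 8 rows, the threshold lands near s ≈ 0.7: pairs at 0.9 similarity are caught essentially always, pairs at 0.3 almost never, which is exactly the filter behavior described above.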

This is what plagiarism detectors use. If someone copy-pastes a paragraph and changes a few words, the shingle sets still overlap heavily, MinHash catches the overlap, and the LSH index surfaces the pair without having to scan the whole database.

SimHash: "Is this the same article in different words?"

SimHash solves a different problem. It was developed by Moses Charikar, and Google used a variant of it for web-scale deduplication. Where MinHash asks "do these documents share literal text?", SimHash asks "do these documents have similar content?" even if the wording is completely different.

The idea is to compress an entire document into a single 64-bit number (a fingerprint) such that similar documents produce fingerprints that differ in very few bits.

How you get there: take each word in the document, weight it by importance (common words like "the" get low weight, distinctive words get high weight), hash each word to a 64-bit value, and then do a weighted vote across all 64 bit positions. If more weight voted for a 1 in position k, the fingerprint gets a 1 there. Otherwise it gets a 0.

Two documents about the same topic, using similar vocabulary, will have fingerprints that differ in maybe 2 or 3 bits out of 64. Two completely unrelated documents will differ in around 32 bits. You measure similarity by counting the number of differing bits (Hamming distance).
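
The weighted vote is short enough to sketch in full. This is my own minimal version, using blake2b as a stand-in for whatever word hash a real system would use, and taking word weights as a plain dict rather than computing them:

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Weighted bit-vote over the hashes of a document's words.
    weighted_words maps each word to its importance weight."""
    votes = [0.0] * bits
    for word, weight in weighted_words.items():
        h = int.from_bytes(hashlib.blake2b(word.encode(), digest_size=8).digest(), "big")
        for k in range(bits):
            # A 1 bit votes +weight for position k, a 0 bit votes -weight.
            votes[k] += weight if (h >> k) & 1 else -weight
    # Positive total vote → fingerprint bit 1, otherwise 0.
    return sum(1 << k for k in range(bits) if votes[k] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")
```

Because high-weight words dominate the vote at every bit position, changing a few low-weight words rarely flips many bits, which is where the "differ in 2 or 3 bits out of 64" behavior comes from.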

Again, there is a theorem that makes this precise. It comes from a result by Goemans and Williamson, originally in a different context, but the relevant part is this: if you represent two documents as weighted vectors in some high-dimensional feature space, then the probability that a random hyperplane hash disagrees at a single bit position is:

P( bit_i(A) ≠ bit_i(B) ) = θ(A, B) / π

where θ is the angle between the two document vectors. Small angle means similar documents, and similar documents will disagree on very few bits. The expected Hamming distance between two fingerprints is 64 × θ / π. So Hamming distance is not just a rough proxy for similarity. It is a direct, linear function of the angle between the original document representations.

For the index, you use the pigeonhole principle. Split the 64-bit fingerprint into 4 blocks of 16 bits. If two fingerprints differ in at most 3 bits total, then at least one of the four blocks must be identical between them. So you build 4 separate indexes (one per block), and two documents become candidates if they match on any block.
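
The pigeonhole index can be sketched in a few lines too (the helper names and the dict-of-sets index shape are my own choices, not the layout Google used):

```python
def fingerprint_blocks(fp, num_blocks=4, bits=64):
    """Split a 64-bit fingerprint into 4 blocks of 16 bits. If two
    fingerprints differ in <= 3 bits, at least one block is identical."""
    width = bits // num_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(num_blocks)]

def candidates(query_fp, index):
    """index maps (block_id, block_value) -> set of doc ids.
    Union the buckets for all four blocks of the query fingerprint."""
    out = set()
    for key in fingerprint_blocks(query_fp):
        out |= index.get(key, set())
    return out
```

Anything the buckets return is only a candidate; a final exact Hamming-distance check weeds out the false positives that happen to share one block by chance.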

Google used this (and variations of it) to deduplicate their web crawl. When you are indexing billions of pages and a huge fraction of them are just copies of the same content with different ads wrapped around it, SimHash lets you identify the duplicates in time proportional to the number of documents rather than the number of pairs.

Why Not Just Use Embeddings?

This is a fair question. Today, if you wanted to find semantically similar documents, you would probably reach for a sentence embedding model, throw the vectors into a vector database, and call it done.

But consider the constraints Google faced in 2007. Transformer models did not exist. Word2Vec was still six years away. The embedding models we take for granted today simply were not available.

Even today, embeddings are not always the right tool. They are expensive to compute, they require GPU infrastructure, and the vectors are large (768 or 1536 floats per document). SimHash gives you a 64-bit integer. That is 8 bytes. You can fit a billion SimHash fingerprints in about 8 GB of RAM. Try doing that with embedding vectors.

For the specific problem of "is this document a near-copy of something I have already seen?", LSH methods are still hard to beat on cost and simplicity. They don't require a model, they don't require GPUs, and they don't require a vector database. You need a hash function and some hash tables.

Building It for Real

I ended up building both pipelines from scratch to really understand how the pieces fit together. A MinHash-LSH pipeline for plagiarism detection, a SimHash-Pigeonhole pipeline for semantic deduplication, both backed by a custom on-disk hash index (because if you are already going down this rabbit hole, why stop at the algorithm and not also build the storage layer?).

I tested it against 10,000 documents: a mix of real news articles, synthetically mutated copies at various similarity levels, and noise documents. The system picks up copy-paste plagiarism with over 95% recall, and the on-disk index handles single-document queries in under 5ms.

The thing that surprised me most was how well the banded LSH approach works as a filter. Out of the roughly 50 million possible document pairs in a 10k corpus, the index narrows it down to a few thousand candidates. The verification step (which does the expensive exact comparison) only runs on those candidates. That is the real win. Not the hashing itself, but the fact that hashing lets you skip 99.99% of the work.

The Takeaway

I started with a silly question about inverting hash functions and ended up understanding how some of the largest-scale data processing systems actually work.

The core idea is worth remembering: sometimes the best way to solve a search problem is to design a hash function that is deliberately bad at being a hash function. Instead of scattering similar items apart, you scatter them together. And then you just look inside the buckets.

If you want to dig into the code, I have put the full implementation (with the custom disk-based index and all the benchmarks) on GitHub. However, given the coding agents at our disposal, I would rather encourage you to give it a try yourself. If you are still interested in my implementation (which I generated using Claude), feel free to comment. Cheers! 🥂
