- Each key is equally likely to hash to an integer between 0 and M - 1.
- Bins and balls.
- Collision. Two distinct keys hashing to same index.
- Challenge. Deal with collisions efficiently.
- Use an array of M < N linked lists. [H. P. Luhn, IBM 1953]
- Hash: map key to integer i between 0 and M - 1
- Insert: put at front of iᵗʰ chain (if not already there).
- Search: need to search only iᵗʰ chain.
- Proposition
- Under uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N / M is extremely close to 1.
- Proof sketch
import scipy, scipy.stats
x = scipy.linspace(0,30,31)
N= 10000
M= 1000
p = 1.0/M
pmf = scipy.stats.binom.pmf(x,N,p)
import pylab
pylab.plot(x,pmf)
pylab.show()
- Consequence: Number of probes for search/insert is proportional to N / M.
- M too large ⇒ too many empty chains.
- M too small ⇒ chains too long.
- Typical choice: M ~ N / 5 ⇒ constant-time ops.
- Open addressing. [Amdahl-Boehme-Rocherster-Samuel, IBM 1953]
- When a new key collides, find next empty slot, and put it there.
- Hash. Map key to integer i between 0 and M-1.
- Insert. Put at table index i if free; if not try i+1, i+2, etc.
- Search. Search table index i; if occupied but no match, try i+1, i+2, etc.
- Note. Array size M must be greater than number of key-value pairs N.
- Cluster. A contiguous block of items.
- Observation. New keys likely to hash into middle of big clusters.
- Model.
- Cars arrive at one-way street with M parking spaces. Each desires a random space i : if space i is taken, try i + 1, i + 2, etc.
- Q:
- What is mean displacement of a car?
- Half-full.
- With M / 2 cars, mean displacement is ~ 3 / 2.
- Full.
- With M cars, mean displacement is ~ √(πM/8)
-
Proposition
-
Parameters
- M too large ⇒ too many empty array entries.
- M too small ⇒ search time blows up
- Typical choice: α = N / M ~ 1⁄2
- probes for search hit is about 3/2
- probes for search miss is about 5/2