DSA: Hashing
Hashing
Section titled “Hashing”Hashing is a technique to map a key to an index== of a hash table using a ==hash functions== enabling $O(1)$ ==average time for search, insert, and delete operations.
Hash Table: A hash table is a data structure that stores key–value pairs== and uses a hash function to ==map each key to an index in an arrays allowing O(1) average-time search, insertion, and deletion.
Terminology
- Key (k): Input value to be stored/searched
- Hash Function h(k): Maps key k to table index
- Hash Table: Array of size m (table size)
- Slot/Bucket: Each index position in hash table
- Load Factor ($\alpha$): $\alpha = \frac{n}{m}$ where n = number of keys, m = table size
- Collision: When two different keys map to same index
- Clustering: Keys accumulate in consecutive slots
Properties of a Good Hash Function
- Deterministic: Same key always gives same index
- Uniform Distribution: Keys spread evenly across table
- Fast Computation: Should compute in $O(1)$ time
- Minimizes Collisions: Reduces probability of two keys mapping to same slot
- Uses all table positions: No slot should be favored
- Low clustering: Avoids grouping of keys
Hash Functions
1. Division Method:
$$h(k) = k \bmod m$$
- Best practice: m should be prime
- Avoid: Powers of 2 (causes poor distribution)
- Why prime? Better distribution, reduces patterns
- Example: k = 85, m = 11 → h(85) = 85 mod 11 = 8
2. Multiplication Method:
$$h(k) = \lfloor m \times (kA \bmod 1) \rfloor, \quad 0 < A < 1$$
- Knuth suggests: $A \approx 0.6180339887$ (golden ratio - 1)
- Less sensitive to m: m can be power of 2
- Steps:
- Multiply k by A
- Extract fractional part
- Multiply by m
- Take floor
- Example: k = 123, m = 1000, A = 0.618
- $kA = 76.014$
- Fractional part = 0.014
- $0.014 \times 1000 = 14$
- h(123) = 14
3. Mid-Square Method:
- Square the key: $k^2$
- Extract middle r digits
- Use as hash value
- Example: k = 60, $k^2 = 3600$, middle 2 digits = 60
4. Folding Method:
- Fold-Shift: Split key into parts, add them
- Fold-Boundary: Reverse alternate parts before adding
- Example: k = 123456789
- Split: 123 | 456 | 789
- Add: 123 + 456 + 789 = 1368
5. Universal Hashing (Advanced):
$$h(k) = ((ak + b) \bmod p) \bmod m$$
- a, b are random coefficients
- p is prime > universe of keys
- Provides theoretical guarantees
Collision
Section titled “Collision”A collision occurs in hashing when two or more different keys are mapped to the same hash table index by a hash function.
- When $h(k_1) = h(k_2)$ but $k_1 \neq k_2$
- Unavoidable when: $n > m$ (Pigeonhole Principle)
- Types:
- Primary Clustering: Long runs of occupied slots (in Linear Probing)
- Secondary Clustering: Keys follow same probe sequence (in Quadratic Probing)
Collision Resolution Techniques
- A. Separate Chaining (Closed Addressing)
- B. Open Addressing (Open Hashing)
- Linear Probing
- Quadratic Probing
- Double Hashing
A. Separate Chaining (Closed Addressing)
Method:
- Each table slot contains pointer to linked list
- All keys hashing to same index stored in that list
Operations:
- Insert:
1. Compute h(k)2. Insert k at head of list at table[h(k)]3. Time: O(1)- Search:
1. Compute h(k)2. Traverse list at table[h(k)]3. Time: O(1 + α) average- Delete:
1. Compute h(k)2. Search in list at table[h(k)]3. Delete node4. Time: O(1 + α) averageTime Complexity:
- Best Case: $O(1)$ when all keys in different slots
- Average Case: $O(1 + \alpha)$ where $\alpha = n/m$
- Worst Case: $O(n)$ when all keys hash to same slot (single long chain)
Space Complexity:
- $O(n + m)$
Expected chain length:
- $\alpha = n/m$
Advantages:
- Simple implementation
- Table never fills ($\alpha$ can be > 1)
- Deletion is easy
- Less sensitive to hash function quality
Disadvantages:
- Extra memory for pointers
- Cache unfriendly (poor locality)
- Memory overhead per node
GATE Formula:
- Average search time = $1 + \frac{\alpha}{2}$ (successful search)
- Average search time = $\alpha$ (unsuccessful search)
B. Open Addressing (Open Hashing)
Method:
- All keys stored in table itself
- No external storage
- One key per slot
- Critical: $\alpha < 1$ (must have empty slots)
Probe Sequence: $h(k, i)$ where i = 0, 1, 2, …, m-1
General Algorithm:
Insert(k): i = 0 repeat: j = h(k, i) if table[j] is empty or deleted: table[j] = k return j i = i + 1 until i == m1. Linear Probing:
$$h(k, i) = (h(k) + i) \bmod m$$
- Probe sequence: $h(k), h(k)+1, h(k)+2, …, (h(k)+m-1) \mod m$
- Example: $h(k) = 5$, $m = 10$
- Sequence: 5, 6, 7, 8, 9, 0, 1, 2, 3, 4
- Primary Clustering:
- Long runs of occupied consecutive slots form
- Increases average search time
- New keys tend to cluster around existing clusters
- Number of probes (average):
- Successful search: $\frac{1}{2}(1 + \frac{1}{1-\alpha})$
- Unsuccessful search: $\frac{1}{2}(1 + \frac{1}{(1-\alpha)^2})$
- Advantages:
- Best cache performance
- No extra memory
- Disadvantages:
- Severe primary clustering
- Performance degrades rapidly as $\alpha \to 1$
2. Quadratic Probing:
$$h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod m$$
- Common choice: $c_1 = 0, c_2 = 1$ → $h(k, i) = (h(k) + i^2) \bmod m$
- Probe sequence: $h(k), h(k)+1, h(k)+4, h(k)+9, h(k)+16, …$
- Secondary Clustering:
- Keys with same initial hash follow same probe sequence
- Less severe than primary clustering
- Important Theorem:
- If m is prime and $\alpha \leq 0.5$, quadratic probing will find empty slot
- Advantages:
- Eliminates primary clustering
- Better than linear probing
- Disadvantages:
- Secondary clustering
- May not probe all slots
- Requires m to be prime for guarantee
3. Double Hashing:
$$h(k, i) = (h_1(k) + i \times h_2(k)) \bmod m$$
- Uses two hash functions: $h_1(k)$ and $h_2(k)$
- Critical requirement: $h_2(k)$ must never be 0
- Best practice: $h_2(k)$ relatively prime to m
- Common choices:
- $h_1(k) = k \bmod m$
- $h_2(k) = 1 + (k \bmod m’)$ where $m’ < m$
- If m is prime: $h_2(k) = 1 + (k \bmod (m-1))$
- Example:
- m = 13, $h_1(k) = k \bmod 13$, $h_2(k) = 1 + (k \bmod 11)$
- k = 14: $h_1(14) = 1$, $h_2(14) = 4$
- Sequence: 1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10
- Advantages:
- Best distribution among open addressing
- Eliminates both primary and secondary clustering
- Different keys have different probe sequences
- Number of probes (average):
- Successful search: $\frac{1}{\alpha} \ln(\frac{1}{1-\alpha})$
- Unsuccessful search: $\frac{1}{1-\alpha}$
Comparison Table
| Technique | Primary Cluster | Secondary Cluster | Probes (Success) | Probes (Fail) | Cache Performance |
|---|---|---|---|---|---|
| Chaining | No | No | $1 + \alpha/2$ | $\alpha$ | Poor |
| Linear Probing | Yes | No | $\frac{1}{2}(1 + \frac{1}{1-\alpha})$ | $\frac{1}{2}(1 + \frac{1}{(1-\alpha)^2})$ | Best |
| Quadratic Probing | No | Yes | Medium | Medium | Good |
| Double Hashing | No | No | $\frac{1}{\alpha}\ln(\frac{1}{1-\alpha})$ | $\frac{1}{1-\alpha}$ | Good |
Deletion in Hashing
Chaining:
- Simply delete node from linked list
- No special handling needed
- Time: $O(1 + \alpha)$
Open Addressing:
- Problem: Cannot simply mark slot as empty
- Why? Breaks probe sequence for future searches
- Solution: Lazy Deletion (mark as DELETED)
States of a slot:
- EMPTY: Never used
- OCCUPIED: Contains key
- DELETED: Had key, now removed
Search algorithm:
- Continue probing through DELETED slots
- Stop only at EMPTY slot
Insert algorithm:
- Can reuse DELETED slots
- Stop at first DELETED or EMPTY slot
GATE Important: Search stops at EMPTY, not DELETED
Load Factor Effects
Definition: $\alpha = \frac{n}{m} = \frac{\text{number of keys}}{\text{table size}}$
Chaining:
- $\alpha$ can be > 1
- Average chain length = $\alpha$
- Performance degrades linearly with $\alpha$
Open Addressing:
- Must maintain $\alpha < 1$
- Performance degrades exponentially as $\alpha \to 1$
- Recommended: $\alpha \leq 0.7$ for good performance
Critical values:
- $\alpha = 0.5$: Reasonable performance
- $\alpha = 0.7$: Typical rehashing threshold
- $\alpha = 0.9$: Severe performance degradation
- $\alpha = 1.0$: Table full (open addressing fails)
Rehashing
When to rehash:
- When $\alpha$ exceeds threshold (typically 0.7)
- Table becomes too full
- Too many collisions
Process:
- Create new table of size $m’ \approx 2m$ (choose prime)
- Recompute hash for all keys
- Insert all keys into new table
- Delete old table
Time Complexity:
- $O(n)$ to rehash all n keys
- Amortized cost: $O(1)$ per insertion
GATE Note: Rehashing changes all hash values (new table size)
Time Complexity Summary
Average Case (with good hash function):
| Operation | Chaining | Open Addressing |
|---|---|---|
| Insert | $O(1)$ | $O(1/(1-\alpha))$ |
| Search (success) | $O(1 + \alpha)$ | Varies by method |
| Search (fail) | $O(1 + \alpha)$ | $O(1/(1-\alpha))$ |
| Delete | $O(1 + \alpha)$ | $O(1/(1-\alpha))$ |
Worst Case:
- Chaining: $O(n)$ - all keys in one chain
- Open Addressing: $O(n)$ - must probe entire table
Space Complexity:
- Chaining: $O(n + m)$ - table + linked lists
- Open Addressing: $O(m)$ - table only
Applications
- Symbol Tables: Compiler design, variable storage
- Databases: Index structures, quick lookups
- Caching: Web browsers, DNS cache
- Sets/Maps: Dictionary implementations
- Password Storage: Hash + salt
- Checksums: Data integrity verification
- Cryptography: Digital signatures, blockchain
- Routers: IP address lookup
- Spell Checkers: Word lookup
- Duplicate Detection: File systems
GATE Focus Points
- Hash function identification: Division, multiplication, mid-square
- Collision resolution: Identify chaining vs open addressing
- Probe sequence calculation: Linear, quadratic, double hashing
- Number of comparisons: Given $\alpha$ and technique
- Clustering detection: Primary vs secondary
- Lazy deletion: When to stop search in open addressing
- Load factor calculations: Effect on performance
- Rehashing: When triggered, new table size
- Prime table sizes: Why important for division method
- Double hashing requirements: $h_2(k) \neq 0$, relatively prime to m
Common GATE Traps
- Linear Probing Stop Condition:
- Stop at EMPTY slot, NOT at DELETED slot
- DELETED slots must be probed through
- Open Addressing Load Factor:
- Must maintain $\alpha < 1$
- Table can become full
- Prime Table Size:
- Division method works best with prime m
- Avoids patterns and improves distribution
- Double Hashing Independence:
- $h_2(k)$ must be independent of $h_1(k)$
- $h_2(k)$ must never return 0
- Quadratic Probing Guarantee:
- Only guaranteed to find slot if m is prime and $\alpha \leq 0.5$
- Chaining vs Open Addressing:
- Chaining allows $\alpha > 1$
- Open addressing requires $\alpha < 1$
- Deletion in Open Addressing:
- Cannot simply empty the slot
- Must use lazy deletion (DELETED marker)
- Rehashing Effect:
- All hash values change (new m)
- Not just copying keys to new table
Key Formulae (Must Remember)
Load Factor: $$\alpha = \frac{n}{m}$$
Division Method: $$h(k) = k \bmod m$$
Multiplication Method: $$h(k) = \lfloor m \times (kA \bmod 1) \rfloor$$
Linear Probing: $$h(k, i) = (h(k) + i) \bmod m$$
Quadratic Probing: $$h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod m$$
Double Hashing: $$h(k, i) = (h_1(k) + i \times h_2(k)) \bmod m$$
Average Probes (Linear Probing - Success): $$\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$$
Average Probes (Linear Probing - Failure): $$\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$$
Average Search Time (Chaining - Success): $$1 + \frac{\alpha}{2}$$
Average Search Time (Chaining - Failure): $$\alpha$$
Performance Comparison Chart
For $\alpha = 0.5$:
| Method | Successful Search | Unsuccessful Search |
|---|---|---|
| Chaining | 1.25 probes | 0.5 probes |
| Linear Probing | 1.5 probes | 2.5 probes |
| Quadratic Probing | ~1.4 probes | ~2.2 probes |
| Double Hashing | 1.39 probes | 2 probes |
For $\alpha = 0.9$:
| Method | Successful Search | Unsuccessful Search |
|---|---|---|
| Chaining | 1.45 probes | 0.9 probes |
| Linear Probing | 5.5 probes | 50.5 probes |
| Quadratic Probing | ~3 probes | ~10 probes |
| Double Hashing | 2.56 probes | 10 probes |
Final Insights
- Hashing provides $O(1)$ average time, NOT worst case
- Choice of hash function critically affects performance
- Load factor is the single most important performance metric
- Open addressing degrades faster than chaining as $\alpha$ increases
- Double hashing is theoretically best for open addressing
- Linear probing is practically fastest due to cache locality
- Prime table sizes matter for division method and quadratic probing
- Lazy deletion is essential for open addressing
- Rehashing is expensive but necessary for maintaining performance
- Universal hashing provides theoretical worst-case guarantees