Address Deduplication Concepts
Address deduplication identifies duplicate records that represent the same physical location but are written differently.
The Problem
Same address, different representations:
```
123 Main Street, Apt 4B, New York, NY 10001
123 Main St., Apartment 4-B, New York, NY 10001
123 MAIN ST APT 4B NEW YORK NY 10001
```
Without deduplication, these appear as three different customers in your database.
Why Deduplicate?
| Problem | Impact |
|---|---|
| Multiple records | Wasted storage, inconsistent data |
| Duplicate mailings | Customer annoyance, wasted cost |
| Split order history | Poor customer service |
| Inaccurate analytics | Bad business decisions |
Deduplication Strategies
1. Exact Matching (After Normalization)
Normalize addresses, then compare:
```javascript
function exactMatch(addr1, addr2) {
  return normalize(addr1) === normalize(addr2);
}
```
Pros: Fast, no false positives (provided normalization does not discard distinguishing details)
Cons: Misses near-matches such as typos
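The `normalize` function used above is not defined in this document. A minimal sketch of what it might look like follows, assuming that upper-casing, stripping punctuation, and abbreviating a few common words is enough for the sample addresses. The `ABBREVIATIONS` table is hypothetical and far from complete; a production system would typically rely on a full postal abbreviation list or an address-parsing library.

```javascript
// Hypothetical abbreviation table (illustrative only, not exhaustive)
const ABBREVIATIONS = new Map([
  ["STREET", "ST"],
  ["AVENUE", "AVE"],
  ["ROAD", "RD"],
  ["BOULEVARD", "BLVD"],
  ["APARTMENT", "APT"],
  ["SUITE", "STE"],
]);

// Sketch of a normalizer: one canonical upper-case, space-separated string
function normalize(address) {
  return address
    .toUpperCase()
    .replace(/(\d)-([A-Z])\b/g, "$1$2") // join hyphenated unit numbers: 4-B becomes 4B
    .replace(/[.,#]/g, " ")             // treat punctuation as whitespace
    .split(/\s+/)
    .filter(Boolean)                    // drop empty tokens
    .map((token) => ABBREVIATIONS.get(token) ?? token)
    .join(" ");
}
```

With this sketch, all three sample addresses from "The Problem" normalize to `123 MAIN ST APT 4B NEW YORK NY 10001`, so `exactMatch` would treat them as duplicates.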
2. Fuzzy Matching
Score the similarity of two normalized addresses and treat them as a match when the score meets a threshold:
```javascript
function fuzzyMatch(addr1, addr2, threshold = 0.85) {
  const normalized1 = normalize(addr1);
  const normalized2 = normalize(addr2);
  // calculateSimilarity() returns a score between 0 and 1,
  // using one of the similarity algorithms described below
  const similarity = calculateSimilarity(normalized1, normalized2);
  return similarity >= threshold;
}
```
Similarity Algorithms
Levenshtein Distance
Minimum number of single-character edits (insert, delete, replace) needed to transform one string into another:
```
"123 MAIN ST" → "123 MAINE ST"
Edit: Insert 'E' after 'N'
Distance: 1
```
```javascript
function levenshtein(str1, str2) {
  const m = str1.length, n = str2.length;
  // dp[i][j] = edit distance between the first i chars of str1 and the first j chars of str2
  const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));

  // Base cases: converting to or from the empty string
  for (let i = 0; i <= m; i++) dp[i][0] = i;
  for (let j = 0; j <= n; j++) dp[0][j] = j;

  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      if (str1[i-1] === str2[j-1]) {
        dp[i][j] = dp[i-1][j-1];
      } else {
        // 1 + cheapest of delete, insert, replace
        dp[i][j] = 1 + Math.min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]);
      }
    }
  }

  return dp[m][n];
}

// Convert to similarity (0-1)
function levenshteinSimilarity(str1, str2) {
  const distance = levenshtein(str1, str2);
  const maxLen = Math.max(str1.length, str2.length);
  if (maxLen === 0) return 1; // two empty strings are identical; avoids 0 / 0
  return 1 - distance / maxLen;
}
```
Jaro-Winkler Similarity
Better suited to short strings. It accounts for transposed characters and gives extra weight to a shared prefix:
```javascript
function jaroWinkler(str1, str2) {
  // Jaro similarity in [0, 1]: characters match when they are equal and no
  // more than floor(max(len1, len2) / 2) - 1 positions apart.
  // jaro() is an assumed helper that performs the matching and
  // transposition counting.
  const jaroSim = jaro(str1, str2);

  // Winkler adjustment for common prefix (at most 4 characters, scale 0.1)
  let prefix = 0;
  for (let i = 0; i < Math.min(4, str1.length, str2.length); i++) {
    if (str1[i] === str2[i]) prefix++;
    else break;
  }

  return jaroSim + (prefix * 0.1 * (1 - jaroSim));
}
```
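The base Jaro score that the Winkler adjustment above builds on could be sketched as follows. The name `jaro` and its signature are assumptions made for illustration. The sketch follows the standard definition of Jaro similarity: count matching characters within a sliding window, count how many of them are out of order, and average three ratios.

```javascript
// Hypothetical helper: base Jaro similarity in [0, 1]
function jaro(str1, str2) {
  if (str1 === str2) return 1;
  const len1 = str1.length, len2 = str2.length;
  if (len1 === 0 || len2 === 0) return 0;

  // Equal characters only count as a match within this distance of each other
  const matchWindow = Math.max(0, Math.floor(Math.max(len1, len2) / 2) - 1);
  const matched1 = Array(len1).fill(false);
  const matched2 = Array(len2).fill(false);
  let matches = 0;

  for (let i = 0; i < len1; i++) {
    const start = Math.max(0, i - matchWindow);
    const end = Math.min(len2, i + matchWindow + 1);
    for (let j = start; j < end; j++) {
      if (matched2[j] || str1[i] !== str2[j]) continue;
      matched1[i] = true;
      matched2[j] = true;
      matches++;
      break;
    }
  }
  if (matches === 0) return 0;

  // Walk the matched characters of both strings in order;
  // each position that disagrees is half a transposition
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < len1; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (str1[i] !== str2[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;

  return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3;
}
```

For the classic test pair, `jaro("MARTHA", "MARHTA")` evaluates to roughly 0.944: six matching characters, one transposition (`TH` versus `HT`).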
Matching Thresholds
| Threshold | Use Case |
|---|---|
| > 0.95 | High confidence, automated merge |
| 0.85-0.95 | Likely match, flag for review |
| 0.70-0.85 | Possible match, manual review |
| < 0.70 | Different addresses |
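The table above could be applied in code roughly as follows. This is a sketch only: the function name `classifyMatch` and the returned labels are hypothetical, and because the table's ranges share their endpoints, the sketch assumes each boundary value belongs to the higher-confidence band that includes it (so exactly 0.95 and 0.85 both count as "likely match").

```javascript
// Map a similarity score (0-1) to a handling decision.
// Labels are illustrative; the cut-offs mirror the table above.
function classifyMatch(score) {
  if (score > 0.95) return "auto-merge";      // high confidence
  if (score >= 0.85) return "likely-match";   // flag for review
  if (score >= 0.70) return "possible-match"; // manual review
  return "different";
}
```

Keeping the cut-offs in one function makes them easy to tune later, for example after measuring how many flagged pairs reviewers actually merge.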
Component-Based Matching
Compare individual components for better accuracy:
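```javascript
// addr1 and addr2 are parsed addresses: { street, unit, city, state, zip }.
// similarity() is any 0-1 string similarity, e.g. levenshteinSimilarity.
function componentMatch(addr1, addr2) {
  const weights = {
    zip: 0.3,    // Highest weight (tied with street) - most specific
    street: 0.3,
    city: 0.2,
    state: 0.1,
    unit: 0.1
  };

  let score = 0;

  // Exact comparison for coded fields, fuzzy comparison for free-text fields
  if (addr1.zip === addr2.zip) score += weights.zip;
  score += weights.street * similarity(addr1.street, addr2.street);
  score += weights.city * similarity(addr1.city, addr2.city);
  if (addr1.state === addr2.state) score += weights.state;
  if (addr1.unit === addr2.unit) score += weights.unit;

  return score; // weights sum to 1, so the score stays in [0, 1]
}
```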
Best Practices
- Normalize first - Standardize before comparing
- Use component matching - More accurate than full string
- Weight by specificity - ZIP is more specific than state
- Set conservative thresholds - Avoid false merges
- Manual review for edge cases - Flag uncertain matches
- Log decisions - Audit trail for merges
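As one way to act on the last point, each match decision could be recorded with enough context to audit or reverse it later. The sketch below is illustrative only: `auditLog`, `logDecision`, and the entry fields are hypothetical names, and the log is kept in memory, whereas a real system would persist it.

```javascript
// Hypothetical in-memory audit log (a real system would write to durable storage)
const auditLog = [];

// Record which records were compared, how similar they were, and what was decided
function logDecision(recordIdA, recordIdB, score, decision) {
  const entry = {
    recordIds: [recordIdA, recordIdB],
    score,
    decision, // e.g. "auto-merge", "manual-review", "rejected"
    decidedAt: new Date().toISOString(),
  };
  auditLog.push(entry);
  return entry;
}
```

Storing the score alongside the decision makes it possible to revisit past merges if the thresholds are later changed.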