
Address Deduplication Concepts

Address deduplication identifies duplicate records that represent the same physical location but are written differently.

The Problem

Same address, different representations:

123 Main Street, Apt 4B, New York, NY 10001
123 Main St., Apartment 4-B, New York, NY 10001
123 MAIN ST APT 4B NEW YORK NY 10001

Without deduplication, these appear as three different customers in your database.

Why Deduplicate?

| Problem | Impact |
| --- | --- |
| Multiple records | Wasted storage, inconsistent data |
| Duplicate mailings | Customer annoyance, wasted cost |
| Split order history | Poor customer service |
| Inaccurate analytics | Bad business decisions |

Deduplication Strategies

1. Exact Matching (After Normalization)

Normalize addresses, then compare:

javascript
function exactMatch(addr1, addr2) {
  return normalize(addr1) === normalize(addr2);
}

Pros: Fast, no false positives (provided normalization itself does not collapse distinct addresses)
Cons: Misses near-matches such as typos
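
The `normalize` function is not defined in this lesson. A minimal sketch, assuming US-style addresses and a small hand-picked abbreviation table (the `ABBREVIATIONS` map and its entries are illustrative assumptions, not a complete standard), might look like this:

```javascript
// Hypothetical abbreviation table; a production system would need a far
// fuller list of street suffixes and unit designators.
const ABBREVIATIONS = new Map([
  ["STREET", "ST"],
  ["AVENUE", "AVE"],
  ["BOULEVARD", "BLVD"],
  ["ROAD", "RD"],
  ["APARTMENT", "APT"],
  ["SUITE", "STE"],
]);

function normalize(address) {
  return address
    .toUpperCase()
    .replace(/(\d)-([A-Z])\b/g, "$1$2") // join unit forms like "4-B" into "4B"
    .replace(/[^A-Z0-9\s]/g, " ")       // turn remaining punctuation into spaces
    .split(/\s+/)
    .filter(Boolean)                    // drop empty tokens left by extra spaces
    .map((token) => ABBREVIATIONS.get(token) ?? token)
    .join(" ");
}

function exactMatch(addr1, addr2) {
  return normalize(addr1) === normalize(addr2);
}

console.log(
  exactMatch(
    "123 Main Street, Apt 4B, New York, NY 10001",
    "123 Main St., Apartment 4-B, New York, NY 10001"
  )
); // true
```

With this sketch, all three sample addresses from The Problem normalize to `123 MAIN ST APT 4B NEW YORK NY 10001`. Token-level abbreviation is ambiguous in general (`ST` can also mean "Saint"), which is one reason to compare parsed components rather than whole strings.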

2. Fuzzy Matching

Use similarity algorithms:

javascript
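function fuzzyMatch(addr1, addr2, threshold = 0.85) {
  const normalized1 = normalize(addr1);
  const normalized2 = normalize(addr2);
  // calculateSimilarity: any function that returns a score between 0 and 1,
  // such as levenshteinSimilarity from the next section
  const similarity = calculateSimilarity(normalized1, normalized2);
  return similarity >= threshold;
}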

Similarity Algorithms

Levenshtein Distance

The minimum number of single-character edits (insert, delete, replace) needed to transform one string into another:

"123 MAIN ST" → "123 MAINE ST"
Edit: Insert 'E' after 'MAIN'
Distance: 1
javascript
function levenshtein(str1, str2) {
  const m = str1.length, n = str2.length;
  // dp[i][j] = distance between the first i chars of str1 and the first j chars of str2
  const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));

  // Base cases: transforming to or from the empty string
  for (let i = 0; i <= m; i++) dp[i][0] = i;
  for (let j = 0; j <= n; j++) dp[0][j] = j;

  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      if (str1[i-1] === str2[j-1]) {
        dp[i][j] = dp[i-1][j-1];
      } else {
        // 1 + the cheapest of delete, insert, replace
        dp[i][j] = 1 + Math.min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]);
      }
    }
  }

  return dp[m][n];
}

// Convert to similarity (0-1)
function levenshteinSimilarity(str1, str2) {
  const distance = levenshtein(str1, str2);
  const maxLen = Math.max(str1.length, str2.length);
  if (maxLen === 0) return 1; // two empty strings are identical; avoids 0 / 0
  return 1 - distance / maxLen;
}
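
The full matrix is only needed if you want to recover the edit path. When only the distance matters, keeping two rows is enough. The sketch below is a memory-saving variant added for illustration; `levenshteinTwoRow` is a hypothetical name, not part of the lesson's code:

```javascript
function levenshteinTwoRow(str1, str2) {
  // prev holds row i-1 of the DP table, curr holds row i.
  let prev = Array.from({ length: str2.length + 1 }, (_, j) => j);
  let curr = new Array(str2.length + 1).fill(0);

  for (let i = 1; i <= str1.length; i++) {
    curr[0] = i; // i deletions turn the first i chars of str1 into ""
    for (let j = 1; j <= str2.length; j++) {
      const cost = str1[i - 1] === str2[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution, or a free match when cost is 0
      );
    }
    [prev, curr] = [curr, prev]; // the finished row becomes the previous row
  }
  return prev[str2.length];
}

console.log(levenshteinTwoRow("123 MAIN ST", "123 MAINE ST")); // 1
console.log(levenshteinTwoRow("KITTEN", "SITTING"));           // 3
```

For the address pair, a distance of 1 over a maximum length of 12 gives a similarity of 1 - 1/12, roughly 0.92.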

Jaro-Winkler Similarity

Better suited to short strings: it accounts for transposed characters and gives extra credit for a shared prefix:

javascript
function jaroWinkler(str1, str2) {
  // Jaro similarity: counts characters that match within a window of
  // floor(max(len1, len2) / 2) - 1 positions, then penalizes transpositions.
  // jaroSimilarity is an assumed helper that returns that score in [0, 1].
  const jaroSim = jaroSimilarity(str1, str2);

  // Winkler adjustment: boost the score for a common prefix (up to 4 characters)
  let prefix = 0;
  for (let i = 0; i < Math.min(4, str1.length, str2.length); i++) {
    if (str1[i] === str2[i]) prefix++;
    else break;
  }

  return jaroSim + (prefix * 0.1 * (1 - jaroSim));
}
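
The Jaro step is elided in the snippet above. One way to fill it in is sketched below, following the standard textbook definition of Jaro similarity. The names `jaroSimilarity` and `jaroWinklerSketch` and the `prefixScale` parameter are assumptions made for this illustration:

```javascript
function jaroSimilarity(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;

  // Characters count as matching only if they sit within this many positions.
  const matchWindow = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const matched1 = new Array(s1.length).fill(false);
  const matched2 = new Array(s2.length).fill(false);

  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const start = Math.max(0, i - matchWindow);
    const end = Math.min(s2.length - 1, i + matchWindow);
    for (let j = start; j <= end; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Walk both strings' matched characters in order and count disagreements.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;

  return (
    matches / s1.length +
    matches / s2.length +
    (matches - transpositions) / matches
  ) / 3;
}

function jaroWinklerSketch(s1, s2, prefixScale = 0.1) {
  const jaroSim = jaroSimilarity(s1, s2);
  const limit = Math.min(4, s1.length, s2.length);
  let prefix = 0;
  while (prefix < limit && s1[prefix] === s2[prefix]) prefix++;
  return jaroSim + prefix * prefixScale * (1 - jaroSim);
}

console.log(jaroWinklerSketch("MARTHA", "MARHTA").toFixed(4)); // "0.9611"
```

In the `MARTHA` / `MARHTA` pair, all six characters match, one transposition (`TH` versus `HT`) lowers the Jaro score to about 0.9444, and the shared three-character prefix `MAR` lifts the final score to about 0.9611.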

Matching Thresholds

| Threshold | Use Case |
| --- | --- |
| > 0.95 | High confidence, automated merge |
| 0.85-0.95 | Likely match, flag for review |
| 0.70-0.85 | Possible match, manual review |
| < 0.70 | Different addresses |
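
The bands can be turned into a small routing function. This is a sketch: the action labels are hypothetical, and because the table lists 0.85 in two bands, the code assumes a boundary score goes to the higher-confidence band.

```javascript
// Maps a similarity score (0-1) to one of four hypothetical action labels.
function classifyMatch(score) {
  if (score > 0.95) return "auto_merge";      // high confidence
  if (score >= 0.85) return "likely_match";   // flag for review
  if (score >= 0.70) return "possible_match"; // manual review
  return "different";
}

console.log(classifyMatch(0.97)); // "auto_merge"
console.log(classifyMatch(0.85)); // "likely_match"
console.log(classifyMatch(0.50)); // "different"
```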

Component-Based Matching

Compare individual components for better accuracy:

javascript
function componentMatch(addr1, addr2) {
  // Weights sum to 1, so the score stays between 0 and 1
  const weights = {
    zip: 0.3,    // Tied with street for the highest weight - most specific
    street: 0.3,
    city: 0.2,
    state: 0.1,
    unit: 0.1
  };

  let score = 0;

  // similarity() is any 0-1 string similarity, such as levenshteinSimilarity
  if (addr1.zip === addr2.zip) score += weights.zip;
  score += weights.street * similarity(addr1.street, addr2.street);
  score += weights.city * similarity(addr1.city, addr2.city);
  if (addr1.state === addr2.state) score += weights.state;
  if (addr1.unit === addr2.unit) score += weights.unit;

  return score;
}
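
To see how the weights combine, here is a worked example. It is a sketch under assumptions: `tokenSimilarity` is a hypothetical stand-in for the undefined `similarity` helper (a Jaccard index over whitespace-separated tokens), and `componentScore` mirrors `componentMatch` but takes the similarity function as a parameter so it can be swapped out.

```javascript
// Hypothetical similarity: shared tokens divided by total distinct tokens.
function tokenSimilarity(a, b) {
  const setA = new Set(a.split(/\s+/).filter(Boolean));
  const setB = new Set(b.split(/\s+/).filter(Boolean));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const token of setA) if (setB.has(token)) shared++;
  return shared / (setA.size + setB.size - shared);
}

const WEIGHTS = { zip: 0.3, street: 0.3, city: 0.2, state: 0.1, unit: 0.1 };

function componentScore(addr1, addr2, similarity = tokenSimilarity) {
  let score = 0;
  if (addr1.zip === addr2.zip) score += WEIGHTS.zip;
  score += WEIGHTS.street * similarity(addr1.street, addr2.street);
  score += WEIGHTS.city * similarity(addr1.city, addr2.city);
  if (addr1.state === addr2.state) score += WEIGHTS.state;
  if (addr1.unit === addr2.unit) score += WEIGHTS.unit;
  return score;
}

// Same building, different apartment.
const home = { street: "123 MAIN ST", unit: "4B", city: "NEW YORK", state: "NY", zip: "10001" };
const neighbor = { ...home, unit: "4C" };

console.log(componentScore(home, neighbor).toFixed(2)); // "0.90"
```

Two different units in the same building score 0.90 here (0.3 + 0.3 + 0.2 + 0.1 + 0), which lands in the 0.85-0.95 review band rather than the automated-merge band. If your data needs unit-level precision, that may argue for a larger unit weight or for treating a unit mismatch as disqualifying.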

Best Practices

  1. Normalize first - Standardize before comparing
  2. Use component matching - More accurate than full string
  3. Weight by specificity - ZIP is more specific than state
  4. Set conservative thresholds - Avoid false merges
  5. Manual review for edge cases - Flag uncertain matches
  6. Log decisions - Audit trail for merges
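
As an illustration of the last practice, a merge decision can be captured as a plain record. This is only a sketch: `recordDecision`, the field names, and the sample record IDs are hypothetical, and a real system would write to durable storage rather than an in-memory array.

```javascript
// Appends one audit entry describing how a candidate pair was handled.
function recordDecision(auditLog, recordIdA, recordIdB, score, action) {
  const entry = {
    timestamp: new Date().toISOString(), // when the decision was made
    recordIds: [recordIdA, recordIdB],   // the two records that were compared
    score,                               // similarity score behind the decision
    action,                              // e.g. merged automatically or sent to review
  };
  auditLog.push(entry);
  return entry;
}

const auditLog = [];
recordDecision(auditLog, "cust-001", "cust-002", 0.97, "auto_merge");
console.log(auditLog.length); // 1
```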