Address Deduplication Concepts

Address deduplication identifies duplicate records that represent the same physical location but are written differently.

The Problem

Same address, different representations:


1123 Main Street, Apt 4B, New York, NY 10001
2123 Main St., Apartment 4-B, New York, NY 10001
3123 MAIN ST APT 4B NEW YORK NY 10001

Without deduplication, these appear as three different customers in your database.

Why Deduplicate?

Problem	Impact
Multiple records	Wasted storage, inconsistent data
Duplicate mailings	Customer annoyance, wasted cost
Split order history	Poor customer service
Inaccurate analytics	Bad business decisions

Deduplication Strategies

1. Exact Matching (After Normalization)

Normalize addresses, then compare:


javascript
1function exactMatch(addr1, addr2) {
2  return normalize(addr1) === normalize(addr2);
3}

Pros: Fast, no false positives Cons: Misses near-matches

2. Fuzzy Matching

Use similarity algorithms:


javascript
1function fuzzyMatch(addr1, addr2, threshold = 0.85) {
2  const normalized1 = normalize(addr1);
3  const normalized2 = normalize(addr2);
4  const similarity = calculateSimilarity(normalized1, normalized2);
5  return similarity >= threshold;
6}

Similarity Algorithms

Levenshtein Distance

Number of edits (insert, delete, replace) to transform one string into another:


1"123 MAIN ST" → "123 MAINE ST"
2Edit: Replace 'I' with 'INE'
3Distance: 2


javascript
1function levenshtein(str1, str2) {
2  const m = str1.length, n = str2.length;
3  const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
4
5  for (let i = 0; i <= m; i++) dp[i][0] = i;
6  for (let j = 0; j <= n; j++) dp[0][j] = j;
7
8  for (let i = 1; i <= m; i++) {
9    for (let j = 1; j <= n; j++) {
10      if (str1[i-1] === str2[j-1]) {
11        dp[i][j] = dp[i-1][j-1];
12      } else {
13        dp[i][j] = 1 + Math.min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]);
14      }
15    }
16  }
17
18  return dp[m][n];
19}
20
21// Convert to similarity (0-1)
22function levenshteinSimilarity(str1, str2) {
23  const distance = levenshtein(str1, str2);
24  const maxLen = Math.max(str1.length, str2.length);
25  return 1 - distance / maxLen;
26}

Jaro-Winkler Similarity

Better for short strings, considers character transpositions:


javascript
1function jaroWinkler(str1, str2) {
2  // Jaro similarity
3  const matchWindow = Math.floor(Math.max(str1.length, str2.length) / 2) - 1;
4  // ... (complex algorithm)
5
6  // Winkler adjustment for common prefix
7  let prefix = 0;
8  for (let i = 0; i < Math.min(4, str1.length, str2.length); i++) {
9    if (str1[i] === str2[i]) prefix++;
10    else break;
11  }
12
13  return jaroSim + (prefix * 0.1 * (1 - jaroSim));
14}

Matching Thresholds

Threshold	Use Case
> 0.95	High confidence, automated merge
0.85-0.95	Likely match, flag for review
0.70-0.85	Possible match, manual review
< 0.70	Different addresses

Component-Based Matching

Compare individual components for better accuracy:


javascript
1function componentMatch(addr1, addr2) {
2  const weights = {
3    zip: 0.3,      // Highest weight - most specific
4    street: 0.3,
5    city: 0.2,
6    state: 0.1,
7    unit: 0.1
8  };
9
10  let score = 0;
11
12  if (addr1.zip === addr2.zip) score += weights.zip;
13  score += weights.street * similarity(addr1.street, addr2.street);
14  score += weights.city * similarity(addr1.city, addr2.city);
15  if (addr1.state === addr2.state) score += weights.state;
16  if (addr1.unit === addr2.unit) score += weights.unit;
17
18  return score;
19}

Best Practices

Normalize first - Standardize before comparing
Use component matching - More accurate than full string
Weight by specificity - ZIP is more specific than state
Set conservative thresholds - Avoid false merges
Manual review for edge cases - Flag uncertain matches
Log decisions - Audit trail for merges