Address Parsing Strategies
Parsing addresses is challenging because formats vary by country, and even within a country, addresses can be written in multiple ways. This lesson covers strategies for building robust parsers.
The Challenge
Consider these equivalent US addresses:
```
123 Main Street, Apartment 4B, New York, NY 10001
123 Main St., Apt. 4B, New York, NY 10001
123 Main St Apt 4B New York NY 10001
```
All represent the same location, but the parser must handle each format.
Two Approaches to Parsing
1. Regex-Based Parsing
Use regular expressions to match patterns:
```javascript
const US_PATTERN = /^(.+),\s*(.+),\s*([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i;

function parseUS(address) {
  const match = address.match(US_PATTERN);
  if (!match) return null;

  return {
    streetLine: match[1],
    city: match[2],
    state: match[3].toUpperCase(),
    zip: match[4]
  };
}
```
Pros: Fast, no dependencies
Cons: Brittle, hard to maintain, struggles with variations
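To make the brittleness concrete, the sketch below runs the same pattern against the three variants from The Challenge. It repeats `US_PATTERN` so it can run on its own; `tryRegex` is a hypothetical helper name, not part of the lesson's code.

```javascript
// Same pattern as in the regex-based parser, repeated so this snippet is self-contained
const US_PATTERN = /^(.+),\s*(.+),\s*([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i;

// Hypothetical helper: reports whether the pattern matched and what it captured
function tryRegex(address) {
  const match = address.match(US_PATTERN);
  if (!match) return { matched: false };
  return { matched: true, streetLine: match[1], city: match[2] };
}

const variants = [
  '123 Main Street, Apartment 4B, New York, NY 10001',
  '123 Main St., Apt. 4B, New York, NY 10001',
  '123 Main St Apt 4B New York NY 10001'
];

const results = variants.map(tryRegex);
console.log(results);
```

The two comma-separated variants match, although the unit stays inside `streetLine` because the first group is greedy. The third variant has no commas, so the pattern does not match at all.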
2. Token-Based Parsing
Split into tokens and analyze each:
```javascript
function parseTokens(address) {
  const tokens = address.split(/[,\s]+/).filter(Boolean);

  // Find ZIP code (anchor point)
  const zipIndex = tokens.findIndex(t => /^\d{5}(-\d{4})?$/.test(t));

  // State is just before ZIP
  const state = tokens[zipIndex - 1];

  // City is before state (may be multiple tokens)
  // Street is at the beginning
  // ...
}
```
Pros: More flexible, handles variations
Cons: More complex logic, still country-specific
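One way the remaining steps might be filled in is sketched below. It is an illustration under stated assumptions, not a complete parser: the suffix and unit word lists are small made-up samples, the street is assumed to end at the first recognized suffix, and a unit is assumed to be exactly two tokens (for example "Apt 4B"). The name `parseTokensSketch` is hypothetical.

```javascript
// Illustrative word lists (assumptions, far from complete)
const STREET_SUFFIXES = new Set(['street', 'st', 'avenue', 'ave', 'road', 'rd', 'boulevard', 'blvd', 'drive', 'dr']);
const UNIT_WORDS = new Set(['apartment', 'apt', 'suite', 'ste', 'unit']);

// Lowercase and drop a trailing period so "St." and "st" compare equal
const normalizeToken = t => t.toLowerCase().replace(/\.$/, '');

function parseTokensSketch(address) {
  const tokens = address.split(/[,\s]+/).filter(Boolean);

  // Anchor: search for the ZIP from the end so a 5-digit house number
  // at the start is not mistaken for it
  let zipIndex = -1;
  for (let i = tokens.length - 1; i >= 0; i--) {
    if (/^\d{5}(-\d{4})?$/.test(tokens[i])) {
      zipIndex = i;
      break;
    }
  }
  if (zipIndex < 1) return null; // no ZIP found, or nothing before it

  // State is the token just before the ZIP
  const state = tokens[zipIndex - 1].toUpperCase();
  const body = tokens.slice(0, zipIndex - 1);

  // Street ends at the first recognized suffix
  const suffixIndex = body.findIndex(t => STREET_SUFFIXES.has(normalizeToken(t)));
  if (suffixIndex === -1) return null;

  // Optional unit: a unit word plus the token after it
  let cityStart = suffixIndex + 1;
  let unit = null;
  if (cityStart < body.length && UNIT_WORDS.has(normalizeToken(body[cityStart]))) {
    unit = body.slice(cityStart, cityStart + 2).join(' ');
    cityStart += 2;
  }

  return {
    street: body.slice(0, suffixIndex + 1).join(' '),
    unit,
    city: body.slice(cityStart).join(' '),
    state,
    zip: tokens[zipIndex]
  };
}

console.log(parseTokensSketch('123 Main St Apt 4B New York NY 10001'));
```

Unlike the regex, this handles the comma-free variant. It still fails on inputs outside its word lists, and a city such as "St Louis" would confuse the suffix search.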
Country Detection
Before parsing, detect the country:
```javascript
function detectCountry(address) {
  // Check for country names/codes at the end
  if (/\b(USA|United States|US)\s*$/i.test(address)) return 'US';
  if (/\b(Canada|CA)\s*$/i.test(address)) return 'CA';
  if (/\b(United Kingdom|UK|GB)\s*$/i.test(address)) return 'GB';
  if (/\b(Germany|Deutschland|DE)\s*$/i.test(address)) return 'DE';

  // Check postal code patterns
  if (/\b\d{5}(-\d{4})?\b/.test(address)) return 'US'; // or DE/FR
  if (/\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b/i.test(address)) return 'CA';
  if (/\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b/i.test(address)) return 'GB';

  return 'UNKNOWN';
}
```
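Because a bare 5-digit code is shared by the US, Germany, and France, a detector that returns the first match will label a German address as US. One possible alternative, sketched here with a hypothetical `candidateCountries` function, is to return every country the postal code could belong to and let the caller disambiguate. The pattern table reuses the postal patterns from above; its ordering is an assumption.

```javascript
// Most specific patterns first; the ambiguous 5-digit pattern goes last
const POSTAL_PATTERNS = [
  { countries: ['US'], pattern: /\b\d{5}-\d{4}\b/ },                          // ZIP+4
  { countries: ['CA'], pattern: /\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b/i },            // A1A 1A1
  { countries: ['GB'], pattern: /\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b/i },   // e.g. SW1A 2AA
  { countries: ['US', 'DE', 'FR'], pattern: /\b\d{5}\b/ }                     // plain 5 digits
];

// Hypothetical helper: returns all countries consistent with the postal code
function candidateCountries(address) {
  for (const { countries, pattern } of POSTAL_PATTERNS) {
    if (pattern.test(address)) return countries;
  }
  return [];
}

console.log(candidateCountries('Friedrichstraße 123, 10117 Berlin'));
console.log(candidateCountries('456 Queen Street West, Toronto, ON M5V 3A8'));
```

A caller could then narrow an ambiguous result by trying each candidate's parser and keeping the first one that succeeds.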
Parsing by Country
United States
123 Main Street, Apt 4B, New York, NY 10001
Anchor: ZIP code at end (5 digits or ZIP+4)
Pattern: Street, [Unit], City, State ZIP
```javascript
function parseUS(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: State ZIP
  const lastPart = parts.pop();
  const stateZipMatch = lastPart.match(/^([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i);

  if (!stateZipMatch) throw new Error('Invalid US address format');

  const state = stateZipMatch[1].toUpperCase();
  const zip = stateZipMatch[2];

  // Second to last: City
  const city = parts.pop();

  // First: Street
  const street = parts.shift();

  // Remaining: Unit (optional)
  const unit = parts.length > 0 ? parts[0] : undefined;

  return { street, unit, city, state, zip, country: 'US' };
}
```
Canada
456 Queen Street West, Toronto, ON M5V 3A8
Anchor: Postal code (A1A 1A1 format)
Pattern: Street, City, Province PostalCode
```javascript
function parseCA(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: Province + Postal Code
  const lastPart = parts.pop();
  const match = lastPart.match(/^([A-Z]{2})\s+([A-Z]\d[A-Z]\s?\d[A-Z]\d)$/i);

  if (!match) throw new Error('Invalid Canadian address format');

  // City sits just before the province; take it before the street so a
  // street-less address still yields a city
  const city = parts.pop();
  const street = parts.shift();

  return {
    street,
    city,
    province: match[1].toUpperCase(),
    // Normalize to "A1A 1A1" with a single space, whether or not one was typed
    postalCode: match[2].toUpperCase().replace(/^(.{3})\s?/, '$1 '),
    country: 'CA'
  };
}
```
Germany
Friedrichstraße 123, 10117 Berlin
Anchor: PLZ before city (5 digits)
Pattern: Street Number, PLZ City
Note: Street number comes AFTER street name!
```javascript
function parseDE(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: PLZ City
  const lastPart = parts.pop();
  const match = lastPart.match(/^(\d{5})\s+(.+)$/);

  if (!match) throw new Error('Invalid German address format');

  return {
    street: parts[0], // includes street number at end
    plz: match[1],
    city: match[2],
    country: 'DE'
  };
}
```
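Since the house number trails the street name, a follow-up step could separate the two. The helper below is a minimal sketch with a hypothetical name, `splitGermanStreet`. It assumes the number is a run of digits with at most one trailing letter (such as "77a"), and it leaves ranges like "12-14" unsplit.

```javascript
// Hypothetical helper: split "Friedrichstraße 123" into name and house number
function splitGermanStreet(streetLine) {
  const trimmed = streetLine.trim();

  // Lazy name, then whitespace, then digits with an optional letter at the very end
  const match = trimmed.match(/^(.+?)\s+(\d+\s?[a-zA-Z]?)$/);
  if (!match) return { name: trimmed, number: null };

  return { name: match[1], number: match[2] };
}

console.log(splitGermanStreet('Friedrichstraße 123'));
console.log(splitGermanStreet('Unter den Linden 77a'));
```

Anchoring the number at the end of the string keeps digits that belong to the street name, as in "Straße des 17. Juni 135", on the name side.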
United Kingdom
10 Downing Street, London, SW1A 2AA
Anchor: Postcode at end (variable format)
Pattern: Street, City, Postcode
```javascript
function parseGB(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part should be postcode
  const postcode = parts.pop();
  const postcodePattern = /^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$/i;

  if (!postcodePattern.test(postcode)) {
    throw new Error('Invalid UK postcode');
  }

  // City sits just before the postcode; take it before the street so a
  // street-less address still yields a city
  const city = parts.pop();
  const street = parts.shift();

  return {
    street,
    city,
    postcode: postcode.toUpperCase(),
    country: 'GB'
  };
}
```
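The variable part of a UK postcode is the outward code; the inward code is always one digit followed by two letters. A normalizer can use that to format input typed with or without a space. The sketch below uses a hypothetical `splitPostcode` helper and the same simplified pattern as `parseGB`, so it checks the shape of a postcode, not whether the postcode exists.

```javascript
// Hypothetical helper: split a UK postcode into outward and inward codes
function splitPostcode(postcode) {
  // Uppercase and remove all whitespace before matching
  const compact = postcode.toUpperCase().replace(/\s+/g, '');

  // Outward: 1-2 letters, a digit, optional letter/digit. Inward: digit + 2 letters.
  const match = compact.match(/^([A-Z]{1,2}\d[A-Z\d]?)(\d[A-Z]{2})$/);
  if (!match) return null;

  return {
    outward: match[1],
    inward: match[2],
    formatted: `${match[1]} ${match[2]}`
  };
}

console.log(splitPostcode('sw1a2aa'));
console.log(splitPostcode('M1 1AA'));
```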
The Parser Factory Pattern
Create a unified interface:
```javascript
class AddressParser {
  static parsers = {
    US: parseUS,
    CA: parseCA,
    DE: parseDE,
    GB: parseGB
  };

  parse(address, country = null) {
    // Auto-detect country if not provided
    const detectedCountry = country || detectCountry(address);

    const parser = AddressParser.parsers[detectedCountry];
    if (!parser) {
      throw new Error(`Unsupported country: ${detectedCountry}`);
    }

    return parser(address);
  }
}
```
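Callers that prefer a result object over exceptions could wrap the lookup. The following sketch is self-contained and therefore uses placeholder parsers in place of the real `parseUS` and `parseDE`; `safeParse` and the `{ ok, value, error }` shape are assumptions made for illustration.

```javascript
// Placeholder parsers (hypothetical stand-ins for the country parsers above)
const parsers = new Map([
  ['US', address => ({ country: 'US', raw: address })],
  ['DE', address => {
    if (!/\d{5}/.test(address)) throw new Error('Invalid German address format');
    return { country: 'DE', raw: address };
  }]
]);

// Hypothetical wrapper: converts lookup failures and thrown errors into a result object
function safeParse(address, country) {
  const parser = parsers.get(country);
  if (!parser) {
    return { ok: false, error: `Unsupported country: ${country}` };
  }
  try {
    return { ok: true, value: parser(address) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

console.log(safeParse('Friedrichstraße 123, 10117 Berlin', 'DE'));
console.log(safeParse('somewhere', 'FR'));
```

A `Map` is used for the registry so that a country value such as `'constructor'` cannot resolve to a property inherited from `Object.prototype`, which a plain object lookup would allow.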
Handling Edge Cases
Multiple Address Lines
```
123 Main Street
Apt 4B
New York, NY 10001
```
Normalize to single line first:
```javascript
function normalizeLines(address) {
  return address.replace(/\n/g, ', ').replace(/,\s*,/g, ',');
}
```
Abbreviations
123 Main St., Apt. 4B, NYC, NY 10001
Expand or keep as-is? Generally keep as-is during parsing, normalize later.
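A later normalization pass might look like the sketch below, applied to an already parsed field, not to the raw input. The abbreviation table is a small illustrative sample and `expandAbbreviations` is a hypothetical name. Note that it treats "St" as "Street" everywhere, which would be wrong for names like "St Louis".

```javascript
// Illustrative sample table (an assumption, not a complete list)
const ABBREVIATIONS = new Map([
  ['st', 'Street'],
  ['ave', 'Avenue'],
  ['rd', 'Road'],
  ['blvd', 'Boulevard'],
  ['apt', 'Apartment'],
  ['ste', 'Suite']
]);

// Hypothetical helper: expand known abbreviations, with or without a trailing period
function expandAbbreviations(text) {
  // Match a whole alphabetic word, an optional period, then whitespace, comma, or end
  return text.replace(/\b([A-Za-z]+)\.?(?=\s|,|$)/g, (whole, word) =>
    ABBREVIATIONS.get(word.toLowerCase()) ?? whole
  );
}

console.log(expandAbbreviations('123 Main St., Apt. 4B'));
// → "123 Main Street, Apartment 4B"
```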
Missing Components
New York, NY 10001 // No street
Return what you can, flag missing fields:
```javascript
return {
  street: null,
  city: 'New York',
  state: 'NY',
  zip: '10001',
  warnings: ['Missing street address']
};
```
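A permissive US parser that produces results of this shape could be sketched as follows. It is not the lesson's `parseUS`: the name `parseUSPermissive`, the extra `raw` and `unit` fields, and the warning texts other than "Missing street address" are assumptions.

```javascript
// Hypothetical permissive variant: never throws, collects warnings instead
function parseUSPermissive(address) {
  const parts = address.split(',').map(s => s.trim()).filter(Boolean);
  const warnings = [];
  let state = null;
  let zip = null;

  // Anchor on "State ZIP" in the last segment, if present
  const last = parts.length > 0 ? parts[parts.length - 1] : '';
  const match = last.match(/^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/i);
  if (match) {
    state = match[1].toUpperCase();
    zip = match[2];
    parts.pop();
  } else {
    warnings.push('Missing or unrecognized state and ZIP');
  }

  // City is the segment before the anchor; street is the first remaining segment
  const city = parts.length > 0 ? parts.pop() : null;
  const street = parts.length > 0 ? parts.shift() : null;
  const unit = parts.length > 0 ? parts.shift() : null;

  if (!city) warnings.push('Missing city');
  if (!street) warnings.push('Missing street address');

  // Keep the raw input alongside the parsed fields
  return { raw: address, street, unit, city, state, zip, warnings };
}

console.log(parseUSPermissive('New York, NY 10001'));
```

Taking the city before the street means an input with a single segment before the state and ZIP is treated as a city, which matches the example above.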
Best Practices
- Parse permissively, validate strictly - Extract what you can, validate later
- Preserve original - Store raw input alongside parsed result
- Use anchor points - ZIP codes, postcodes are reliable anchors
- Handle failures gracefully - Return partial results with warnings
- Test with real data - Use actual addresses from each country
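The first practice separates extraction from validation. As a rough illustration, a strict check that runs after a permissive parse might look like this. `validateUS` is a hypothetical name, and `KNOWN_STATES` is deliberately a tiny sample; a real list would cover every state and territory.

```javascript
// Illustrative subset only (assumption)
const KNOWN_STATES = new Set(['NY', 'CA', 'TX', 'IL', 'WA']);

// Hypothetical strict validator for an already parsed US address
function validateUS(parsed) {
  const errors = [];

  if (!parsed.street) errors.push('street is required');
  if (!parsed.city) errors.push('city is required');
  if (!KNOWN_STATES.has(parsed.state)) errors.push(`unknown state: ${parsed.state}`);
  if (!/^\d{5}(-\d{4})?$/.test(parsed.zip ?? '')) errors.push('zip must be 5 digits or ZIP+4');

  return { valid: errors.length === 0, errors };
}

console.log(validateUS({ street: null, city: 'New York', state: 'NY', zip: '10001' }));
```

Keeping the two steps apart lets the parser accept messy input while the application decides which missing or malformed fields are actually fatal.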
What's Next
In the workshop, you'll build a multi-country address parser that handles US, Canadian, UK, and German addresses.