15 min lesson

Address Parsing Strategies

Parsing addresses is challenging because formats vary by country, and even within a country, addresses can be written in multiple ways. This lesson covers strategies for building robust parsers.

The Challenge

Consider these equivalent US addresses:

123 Main Street, Apartment 4B, New York, NY 10001
123 Main St., Apt. 4B, New York, NY 10001
123 Main St Apt 4B New York NY 10001

All represent the same location, but the parser must handle each format.

Two Approaches to Parsing

1. Regex-Based Parsing

Use regular expressions to match patterns:

javascript
// Groups: street line(s), city, two-letter state, ZIP or ZIP+4
const US_PATTERN = /^(.+),\s*(.+),\s*([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i;

function parseUS(address) {
  const match = address.match(US_PATTERN);
  if (!match) return null;

  return {
    streetLine: match[1],
    city: match[2],
    state: match[3].toUpperCase(),
    zip: match[4]
  };
}

Pros: Fast, no dependencies

Cons: Brittle, hard to maintain, struggles with variations
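To make the brittleness concrete, the sketch below runs the same pattern against the three equivalent addresses from the start of the lesson. The helper name `tryRegex` and the `variants` array are hypothetical and exist only for this illustration.

```javascript
// Same pattern as in the example above
const US_PATTERN = /^(.+),\s*(.+),\s*([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i;

// Hypothetical helper: returns the captured fields, or null when nothing matches
function tryRegex(address) {
  const match = address.match(US_PATTERN);
  if (!match) return null;
  return {
    streetLine: match[1],
    city: match[2],
    state: match[3].toUpperCase(),
    zip: match[4]
  };
}

const variants = [
  '123 Main Street, Apartment 4B, New York, NY 10001',
  '123 Main St., Apt. 4B, New York, NY 10001',
  '123 Main St Apt 4B New York NY 10001' // no commas at all
];

for (const address of variants) {
  const result = tryRegex(address);
  console.log(result ? `${result.streetLine} | ${result.city}` : 'no match');
}
```

The first two variants match, although the unit ends up folded into `streetLine`. The third variant has no commas, so the pattern does not match it at all.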

2. Token-Based Parsing

Split into tokens and analyze each:

javascript
function parseTokens(address) {
  const tokens = address.split(/[,\s]+/).filter(Boolean);

  // Find ZIP code (anchor point). Search from the end so that a 5-digit
  // house number is not mistaken for the ZIP (findLastIndex needs Node 18+).
  const zipIndex = tokens.findLastIndex(t => /^\d{5}(-\d{4})?$/.test(t));
  if (zipIndex < 1) return null; // no ZIP found, or nothing before it

  // State is just before ZIP
  const state = tokens[zipIndex - 1];

  // City is before state (may be multiple tokens)
  // Street is at the beginning
  // ...
}

Pros: More flexible, handles variations

Cons: More complex logic, still country-specific
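One possible way to fill in the elided steps is sketched below. It is not the lesson's reference solution: the function name `parseUSTokens`, the helper `normalizeToken`, and the two small word lists are assumptions made for illustration. The sketch separates street from city by looking for a unit designator or, failing that, a street suffix.

```javascript
// Hypothetical, deliberately small vocabularies; real data needs many more entries
const UNIT_WORDS = new Set(['apt', 'apartment', 'unit', 'suite', 'ste']);
const STREET_SUFFIXES = new Set(['st', 'street', 'ave', 'avenue', 'rd', 'road', 'blvd', 'dr', 'ln']);

// Compare tokens case-insensitively and without a trailing period ("St." -> "st")
const normalizeToken = (t) => t.toLowerCase().replace(/\.$/, '');

function parseUSTokens(address) {
  const tokens = address.split(/[,\s]+/).filter(Boolean);

  // Anchor on the last ZIP-shaped token
  let zipIndex = -1;
  for (let i = tokens.length - 1; i >= 0; i--) {
    if (/^\d{5}(-\d{4})?$/.test(tokens[i])) { zipIndex = i; break; }
  }
  if (zipIndex < 1) return null;

  const state = tokens[zipIndex - 1].toUpperCase();
  if (!/^[A-Z]{2}$/.test(state)) return null;

  // Everything before the state is street, optional unit, and city
  const rest = tokens.slice(0, zipIndex - 1);
  const unitIndex = rest.findIndex(t => UNIT_WORDS.has(normalizeToken(t)));
  let streetEnd;
  let cityStart;
  let unit = null;

  if (unitIndex !== -1) {
    // Unit designator plus the token after it, e.g. "Apt 4B"
    streetEnd = unitIndex;
    cityStart = Math.min(unitIndex + 2, rest.length);
    unit = rest.slice(unitIndex, cityStart).join(' ');
  } else {
    // No unit: assume the street ends at the first street suffix, if any
    const suffixIndex = rest.findIndex(t => STREET_SUFFIXES.has(normalizeToken(t)));
    streetEnd = cityStart = suffixIndex === -1 ? rest.length : suffixIndex + 1;
  }

  return {
    street: rest.slice(0, streetEnd).join(' ') || null,
    unit,
    city: rest.slice(cityStart).join(' ') || null,
    state,
    zip: tokens[zipIndex]
  };
}
```

With these assumptions, all three address variants from the start of the lesson yield the city `New York`, the state `NY`, and the ZIP `10001`. The word lists are the weak point: an address with neither a unit designator nor a known suffix leaves the city empty.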

Country Detection

Before parsing, detect the country:

javascript
function detectCountry(address) {
  // Check for country names/codes at the end.
  // Caution: two-letter codes can collide with US state abbreviations (CA, DE).
  if (/\b(USA|United States|US)\s*$/i.test(address)) return 'US';
  if (/\b(Canada|CA)\s*$/i.test(address)) return 'CA';
  if (/\b(United Kingdom|UK|GB)\s*$/i.test(address)) return 'GB';
  if (/\b(Germany|Deutschland|DE)\s*$/i.test(address)) return 'DE';

  // Check postal code patterns, most distinctive first, so that a 5-digit
  // house number in a Canadian or UK address is not read as a ZIP code
  if (/\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b/i.test(address)) return 'CA';
  if (/\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b/i.test(address)) return 'GB';

  // Five digits is ambiguous: US ZIP codes, German PLZ, and French postal
  // codes all match, so this check defaults to US
  if (/\b\d{5}(-\d{4})?\b/.test(address)) return 'US';

  return 'UNKNOWN';
}
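Because a 5-digit code alone cannot separate US, German, and French addresses, the position of the code can serve as an extra hint. The following is a minimal sketch of that idea. The function name `candidatesForFiveDigitCode` is hypothetical, and the heuristic is an assumption rather than part of the lesson's detector.

```javascript
// Hypothetical helper: narrows down a 5-digit postal code by where it appears
function candidatesForFiveDigitCode(address) {
  const text = address.trim();
  const lastSegment = text.split(',').pop().trim();

  // US style: two-letter state immediately before a trailing ZIP or ZIP+4
  if (/\b[A-Z]{2}\s+\d{5}(-\d{4})?$/i.test(text)) return ['US'];

  // German and French style: the code comes first, then the city name
  if (/^\d{5}\s+\S/.test(lastSegment)) return ['DE', 'FR'];

  // No usable 5-digit code found
  return [];
}

console.log(candidatesForFiveDigitCode('123 Main Street, Apt 4B, New York, NY 10001'));
console.log(candidatesForFiveDigitCode('Friedrichstraße 123, 10117 Berlin'));
```

The first call reports `US` as the only candidate; the second reports both `DE` and `FR`, since the code position alone cannot tell those two apart. Returning a list of candidates keeps the ambiguity visible to the caller.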

Parsing by Country

United States

123 Main Street, Apt 4B, New York, NY 10001

Anchor: ZIP code at end (5 digits or ZIP+4)

Pattern: Street, [Unit], City, State ZIP

javascript
function parseUS(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: State ZIP
  const lastPart = parts.pop();
  const stateZipMatch = lastPart.match(/^([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i);

  if (!stateZipMatch) throw new Error('Invalid US address format');

  const state = stateZipMatch[1].toUpperCase();
  const zip = stateZipMatch[2];

  // Second to last: City
  const city = parts.pop();

  // First: Street
  const street = parts.shift();

  // Remaining: Unit (optional)
  const unit = parts.length > 0 ? parts[0] : undefined;

  return { street, unit, city, state, zip, country: 'US' };
}

Canada

456 Queen Street West, Toronto, ON M5V 3A8

Anchor: Postal code (A1A 1A1 format)

Pattern: Street, City, Province PostalCode

javascript
function parseCA(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: Province + Postal Code
  const lastPart = parts.pop();
  const match = lastPart.match(/^([A-Z]{2})\s+([A-Z]\d[A-Z]\s?\d[A-Z]\d)$/i);

  if (!match) throw new Error('Invalid Canadian address format');

  // Normalize the postal code to "A1A 1A1": uppercase, one space in the middle
  const compact = match[2].toUpperCase().replace(/\s/g, '');

  return {
    street: parts.shift(),
    city: parts.pop(),
    province: match[1].toUpperCase(),
    postalCode: `${compact.slice(0, 3)} ${compact.slice(3)}`,
    country: 'CA'
  };
}
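The `[A-Z]\d[A-Z]` shape check accepts some codes that cannot exist. Canadian postal codes never use the letters D, F, I, O, Q, or U, and the first letter is never W or Z. A stricter check along those lines might look like the sketch below; the function name `normalizeCanadianPostalCode` is hypothetical.

```javascript
// First letter: no D, F, I, O, Q, U, W, Z. Other letters: no D, F, I, O, Q, U.
const CA_POSTAL_STRICT = /^[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z]\d[ABCEGHJ-NPRSTV-Z]\d$/;

// Hypothetical helper: returns the code as "A1A 1A1", or null if it is not valid
function normalizeCanadianPostalCode(input) {
  // Uppercase and drop spaces or hyphens before checking the shape
  const compact = input.toUpperCase().replace(/[\s-]/g, '');
  if (!CA_POSTAL_STRICT.test(compact)) return null;
  return `${compact.slice(0, 3)} ${compact.slice(3)}`;
}

console.log(normalizeCanadianPostalCode('m5v3a8'));  // M5V 3A8
console.log(normalizeCanadianPostalCode('D5V 3A8')); // null
```

This only checks that the letters are plausible. Whether a given code is actually assigned still requires a lookup against postal data.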

Germany

Friedrichstraße 123, 10117 Berlin

Anchor: PLZ before city (5 digits)

Pattern: Street Number, PLZ City

Note: Street number comes AFTER street name!

javascript
function parseDE(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part: PLZ City
  const lastPart = parts.pop();
  const match = lastPart.match(/^(\d{5})\s+(.+)$/);

  if (!match) throw new Error('Invalid German address format');

  return {
    street: parts[0], // includes street number at end
    plz: match[1],
    city: match[2],
    country: 'DE'
  };
}
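Since the house number trails the street name, a follow-up step can separate the two. The sketch below is one way to do that under simple assumptions: the number is the final run of digits, optionally with a letter suffix or a range. The function name `splitGermanStreet` is hypothetical.

```javascript
// Hypothetical helper: splits "Friedrichstraße 123" into name and house number
function splitGermanStreet(street) {
  const text = street.trim();
  // Lazy street name, then a house number such as "123", "12a", or "12-14"
  const match = text.match(/^(.+?)\s+(\d+\s?[a-z]?(?:\s?[-\/]\s?\d+\s?[a-z]?)?)$/i);
  if (!match) return { streetName: text, houseNumber: null };
  return { streetName: match[1], houseNumber: match[2] };
}

console.log(splitGermanStreet('Friedrichstraße 123'));
console.log(splitGermanStreet('Straße des 17. Juni 135'));
```

The second call shows why the number has to be anchored at the end: the street name itself contains digits, and only the trailing `135` is the house number.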

United Kingdom

10 Downing Street, London, SW1A 2AA

Anchor: Postcode at end (variable format)

Pattern: Street, City, Postcode

javascript
function parseGB(address) {
  const parts = address.split(',').map(s => s.trim());

  // Last part should be postcode
  const postcode = parts.pop();
  const postcodePattern = /^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$/i;

  if (!postcodePattern.test(postcode)) {
    throw new Error('Invalid UK postcode');
  }

  return {
    street: parts.shift(),
    city: parts.pop(), // undefined when only a street was given
    postcode: postcode.toUpperCase(),
    country: 'GB'
  };
}
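UK postcodes are often typed without the space or in lower case. Because the inward code (the part after the space) is always three characters, the canonical form can be rebuilt from the compact one. A minimal sketch, with the hypothetical name `normalizeUKPostcode`:

```javascript
// Hypothetical helper: returns the postcode as "SW1A 2AA", or null if the shape is wrong
function normalizeUKPostcode(input) {
  const compact = input.toUpperCase().replace(/\s+/g, '');
  // Same shape check as the lesson's pattern, applied to the space-free form
  if (!/^[A-Z]{1,2}\d[A-Z\d]?\d[A-Z]{2}$/.test(compact)) return null;
  // The inward code is always the last three characters
  return `${compact.slice(0, -3)} ${compact.slice(-3)}`;
}

console.log(normalizeUKPostcode('sw1a2aa')); // SW1A 2AA
console.log(normalizeUKPostcode('London'));  // null
```

Like the lesson's pattern, this checks only the general shape and does not confirm that the postcode exists.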

The Parser Factory Pattern

Create a unified interface:

javascript
class AddressParser {
  // Register further countries (for example FR) once their parsers exist
  static parsers = {
    US: parseUS,
    CA: parseCA,
    DE: parseDE,
    GB: parseGB
  };

  parse(address, country = null) {
    // Auto-detect country if not provided
    const detectedCountry = country || detectCountry(address);

    const parser = AddressParser.parsers[detectedCountry];
    if (!parser) {
      throw new Error(`Unsupported country: ${detectedCountry}`);
    }

    return parser(address);
  }
}
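When detection returns `'UNKNOWN'`, the factory above throws. One alternative, sketched here as an assumption rather than as part of the lesson's design, is to try each registered parser in turn and keep the first result. The `demoParsers` object holds simplified stand-ins so the example runs on its own; `parseWithFallback` is a hypothetical name.

```javascript
// Simplified stand-in parsers for illustration; each throws on input it cannot handle
const demoParsers = {
  US: (address) => {
    const m = address.match(/,\s*([A-Z]{2})\s+(\d{5})$/i);
    if (!m) throw new Error('no trailing state and ZIP');
    return { state: m[1].toUpperCase(), zip: m[2], country: 'US' };
  },
  DE: (address) => {
    const m = address.match(/,\s*(\d{5})\s+(.+)$/);
    if (!m) throw new Error('no PLZ followed by city');
    return { plz: m[1], city: m[2], country: 'DE' };
  }
};

// Hypothetical fallback: try every parser and return the first success
function parseWithFallback(address, parsers = demoParsers) {
  const errors = [];
  for (const [country, parser] of Object.entries(parsers)) {
    try {
      return parser(address);
    } catch (err) {
      errors.push(`${country}: ${err.message}`);
    }
  }
  throw new Error(`No parser accepted the address (${errors.join('; ')})`);
}

console.log(parseWithFallback('Friedrichstraße 123, 10117 Berlin').country); // DE
```

The order of the parsers matters with this approach, because the first parser that accepts the input wins. Stricter parsers should therefore come first.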

Handling Edge Cases

Multiple Address Lines

123 Main Street
Apt 4B
New York, NY 10001

Normalize to single line first:

javascript
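function normalizeLines(address) {
  return address
    .trim()
    .replace(/\s*\r?\n\s*/g, ', ') // join lines, tolerating Windows line endings
    .replace(/,\s*,/g, ',');       // collapse commas doubled by the join
}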

Abbreviations

123 Main St., Apt. 4B, NYC, NY 10001

Should the parser expand abbreviations or keep them as-is? Generally, keep them as-is during parsing and normalize in a later step.
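A later normalization step could look like the sketch below. The lookup table is a small hypothetical sample, and `expandStreetSuffix` is an illustrative name. Only the final word of the street is expanded, because the same abbreviation can mean something else earlier in the string ("St." for "Saint").

```javascript
// Hypothetical, partial lookup table
const STREET_ABBREVIATIONS = new Map([
  ['st', 'Street'],
  ['ave', 'Avenue'],
  ['rd', 'Road'],
  ['blvd', 'Boulevard'],
  ['dr', 'Drive']
]);

// Expands a trailing street-type abbreviation in an already parsed street field
function expandStreetSuffix(street) {
  const words = street.trim().split(/\s+/);
  const last = words[words.length - 1];
  const key = last.toLowerCase().replace(/\.$/, '');
  // Only the final word is touched, so "St. James St" keeps its leading "St."
  if (STREET_ABBREVIATIONS.has(key)) {
    words[words.length - 1] = STREET_ABBREVIATIONS.get(key);
  }
  return words.join(' ');
}

console.log(expandStreetSuffix('123 Main St.'));   // 123 Main Street
console.log(expandStreetSuffix('1 St. James St')); // 1 St. James Street
```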

Missing Components

New York, NY 10001  // No street

Return what you can, flag missing fields:

javascript
return {
  street: null,
  city: 'New York',
  state: 'NY',
  zip: '10001',
  warnings: ['Missing street address']
};
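A parser that produces this kind of result might be structured as in the sketch below. It is a hedged variant of the comma-based US parser that collects warnings instead of throwing. The name `parseUSPermissive`, the `raw` field, and the warning texts other than "Missing street address" are assumptions.

```javascript
// Hypothetical permissive parser: never throws, reports gaps as warnings
function parseUSPermissive(address) {
  const parts = address.split(',').map(s => s.trim()).filter(Boolean);
  const result = { raw: address, street: null, city: null, state: null, zip: null, warnings: [] };

  // Anchor on "STATE ZIP" in the last part; only consume it when it matches
  const last = parts[parts.length - 1] || '';
  const stateZip = last.match(/^([A-Z]{2})\s+(\d{5}(-\d{4})?)$/i);
  if (stateZip) {
    parts.pop();
    result.state = stateZip[1].toUpperCase();
    result.zip = stateZip[2];
  } else {
    result.warnings.push('Missing or unrecognized state and ZIP');
  }

  // City is the part before the state; street is the first remaining part
  result.city = parts.pop() ?? null;
  if (!result.city) result.warnings.push('Missing city');

  result.street = parts.shift() ?? null;
  if (!result.street) result.warnings.push('Missing street address');

  return result;
}

console.log(parseUSPermissive('New York, NY 10001').warnings); // [ 'Missing street address' ]
```

Keeping the untouched input in `raw` lets a later step re-parse or audit the address.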

Best Practices

  1. Parse permissively, validate strictly - Extract what you can, validate later
  2. Preserve original - Store raw input alongside parsed result
  3. Use anchor points - ZIP codes and postcodes are reliable anchors
  4. Handle failures gracefully - Return partial results with warnings
  5. Test with real data - Use actual addresses from each country
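To illustrate the first practice, validation can live in a separate function that runs after parsing. The sketch below assumes a parsed object with `street`, `city`, `state`, and `zip` fields as used in this lesson. `validateUSAddress` is a hypothetical name, and the state set covers the 50 states plus DC but omits territories.

```javascript
// Two-letter codes for the 50 states plus DC (territories omitted in this sketch)
const US_STATES = new Set([
  'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
  'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
  'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
  'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
  'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC'
]);

// Strict validation of an already parsed address; returns a list of problems
function validateUSAddress(parsed) {
  const errors = [];
  if (!parsed.street) errors.push('street is required');
  if (!parsed.city) errors.push('city is required');
  if (!US_STATES.has(parsed.state)) errors.push(`unknown state: ${parsed.state}`);
  if (!/^\d{5}(-\d{4})?$/.test(parsed.zip ?? '')) errors.push('zip must be 5 digits or ZIP+4');
  return errors;
}

console.log(validateUSAddress({ street: '123 Main Street', city: 'New York', state: 'NY', zip: '10001' })); // []
```

Splitting the two steps means a permissive parser can accept messy input while the caller decides how strict to be.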

What's Next

In the workshop, you'll build a multi-country address parser that handles US, Canadian, UK, and German addresses.