Unicode and Character Handling
International addresses contain characters from many writing systems. Handling these correctly is essential for global applications.
The Unicode Challenge
Addresses may contain:
- Latin with diacritics: Müller, José, Søren
- Japanese: 東京都渋谷区
- Chinese: 北京市朝阳区
- Korean: 서울특별시
- Cyrillic: Москва
- Arabic: الرياض
Unicode Normalization
The same visible character can be encoded in more than one way, and the encodings do not compare equal until they are normalized:
```javascript
// "é" can be:
const composed = '\u00E9';    // Single character (U+00E9)
const decomposed = 'e\u0301'; // e + combining acute accent (U+0301)

composed === decomposed; // false!
composed.length;         // 1
decomposed.length;       // 2

// Normalize to compare
composed.normalize('NFC') === decomposed.normalize('NFC'); // true
```
Normalization Forms
| Form | Name | Use |
|---|---|---|
| NFC | Composed | Default storage |
| NFD | Decomposed | Searching |
| NFKC | Compatibility Composed | Matching |
| NFKD | Compatibility Decomposed | Indexing |
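To make the difference between the canonical forms (NFC, NFD) and the compatibility forms (NFKC, NFKD) concrete, here is a minimal sketch using the built-in `String.prototype.normalize`. The sample strings and variable names are illustrative only. Compatibility normalization is lossy, so it is usually safer to apply it to a copy used for matching than to the stored value.

```javascript
// Fullwidth digits (U+FF11..U+FF13) often appear in Japanese addresses.
const fullwidth = '\uFF11\uFF12\uFF13';
// U+FB01 LATIN SMALL LIGATURE FI
const ligature = '\uFB01';

// Canonical forms leave compatibility characters alone.
const nfcDigits = fullwidth.normalize('NFC');    // still fullwidth

// Compatibility forms fold them to their plain equivalents.
const nfkcDigits = fullwidth.normalize('NFKC');  // '123'
const nfkcLigature = ligature.normalize('NFKC'); // 'fi'

// NFD splits a precomposed letter into base letter + combining mark.
const nfdLength = '\u00E9'.normalize('NFD').length; // 2
```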
Transliteration
Convert non-Latin scripts to Latin:
```javascript
// Romanization examples (input → expected output)
// '東京'   → 'Tokyo'
// 'Москва' → 'Moskva'
// '서울'   → 'Seoul'
```
Why transliterate?
- Shipping labels often require ASCII
- Carrier systems may not support all scripts
- International display compatibility
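For alphabetic scripts such as Cyrillic, transliteration can be approximated with a per-letter lookup table. The sketch below is a deliberately incomplete illustration: `CYRILLIC_TO_LATIN` and `transliterateCyrillic` are hypothetical names, the table covers only a handful of letters, and it does not follow any particular romanization standard. Scripts like Japanese or Chinese cannot be handled this way, because the Latin spelling depends on how each character is read in context, so they typically need a dictionary-backed library or service.

```javascript
// Partial, illustrative table: lowercase Cyrillic letter -> Latin spelling.
const CYRILLIC_TO_LATIN = new Map([
  ['а', 'a'], ['б', 'b'], ['в', 'v'], ['г', 'g'], ['д', 'd'],
  ['е', 'e'], ['ж', 'zh'], ['з', 'z'], ['и', 'i'], ['к', 'k'],
  ['л', 'l'], ['м', 'm'], ['н', 'n'], ['о', 'o'], ['п', 'p'],
  ['р', 'r'], ['с', 's'], ['т', 't'], ['у', 'u'],
]);

function transliterateCyrillic(text) {
  return Array.from(text, (ch) => {
    const lower = ch.toLowerCase();
    const latin = CYRILLIC_TO_LATIN.get(lower);
    // Leave anything the table does not cover untouched.
    if (latin === undefined) return ch;
    // Preserve the capitalization of the source letter.
    return ch === lower ? latin : latin[0].toUpperCase() + latin.slice(1);
  }).join('');
}

transliterateCyrillic('Москва'); // 'Moskva'
```

Leaving unmapped characters in place, rather than dropping them, makes gaps in the table visible in the output instead of silently losing parts of the address.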
Removing Diacritics
```javascript
function removeDiacritics(str) {
  return str
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining diacritical marks
}

removeDiacritics('Müller'); // 'Muller'
removeDiacritics('José');   // 'Jose'
removeDiacritics('naïve');  // 'naive'
```
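One caveat: stripping combining marks only works for letters that Unicode decomposes into a base letter plus a mark. Letters such as ø (as in Søren), ł, æ, and ß have no such decomposition and pass through unchanged. A possible workaround, sketched here with a hypothetical `foldToAscii` helper and a small, non-exhaustive fallback table, is to map those letters explicitly after stripping the marks.

```javascript
// Letters with no canonical decomposition; this list is illustrative, not complete.
const NON_DECOMPOSING = new Map([
  ['\u00F8', 'o'],  // ø
  ['\u00D8', 'O'],  // Ø
  ['\u0142', 'l'],  // ł
  ['\u0141', 'L'],  // Ł
  ['\u00E6', 'ae'], // æ
  ['\u00C6', 'AE'], // Æ
  ['\u00DF', 'ss'], // ß
]);

function foldToAscii(str) {
  return str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    // For each remaining non-ASCII code unit, use the fallback if one exists.
    .replace(/[^\x00-\x7f]/g, (ch) => NON_DECOMPOSING.get(ch) ?? ch);
}

foldToAscii('S\u00F8ren');   // 'Soren'
foldToAscii('Stra\u00DFe');  // 'Strasse'
```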
Script Detection
```javascript
// Returns the first script whose code point range matches any character in
// the text, checked in the order below; falls back to 'latin' if none match.
function detectScript(text) {
  if (/[\u3040-\u309F]/.test(text)) return 'hiragana';
  if (/[\u30A0-\u30FF]/.test(text)) return 'katakana';
  if (/[\u4E00-\u9FFF]/.test(text)) return 'han';
  if (/[\uAC00-\uD7AF]/.test(text)) return 'hangul';
  if (/[\u0400-\u04FF]/.test(text)) return 'cyrillic';
  if (/[\u0600-\u06FF]/.test(text)) return 'arabic';
  return 'latin';
}
```
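Real addresses often mix scripts (Japanese addresses combine Han characters with kana, for example), and hand-written code point ranges miss characters outside the main blocks. As an alternative sketch, assuming a runtime with ES2018 Unicode property escapes, the hypothetical `detectScripts` below reports every script present rather than only the first match.

```javascript
// Script names follow the Unicode Script property; patterns need the `u` flag.
const SCRIPT_PATTERNS = [
  ['hiragana', /\p{Script=Hiragana}/u],
  ['katakana', /\p{Script=Katakana}/u],
  ['han', /\p{Script=Han}/u],
  ['hangul', /\p{Script=Hangul}/u],
  ['cyrillic', /\p{Script=Cyrillic}/u],
  ['arabic', /\p{Script=Arabic}/u],
  ['latin', /\p{Script=Latin}/u],
];

// Returns the names of all scripts found in the text, in table order.
function detectScripts(text) {
  return SCRIPT_PATTERNS
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

detectScripts('東京都渋谷区');  // ['han']
detectScripts('渋谷ヒカリエ');  // ['katakana', 'han']
detectScripts('123');           // [] (digits belong to no specific script)
```

Returning an empty array for digits and punctuation avoids the implicit "everything else is Latin" assumption of a single-value fallback.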
Best Practices
- Normalize on input: use NFC for storage.
- Store original + transliteration: keep both versions.
- Transliterate for shipping: use ASCII for carrier systems.
- Test with real data: use actual international addresses.
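The first three practices can be combined into a single input step. The following is a rough sketch under stated assumptions: `prepareAddressLine` and the `original` / `shippingLabel` field names are hypothetical, and the ASCII version is produced only by stripping diacritics. Lines that still contain non-ASCII characters get `null`, signalling that a real transliteration step is needed.

```javascript
function prepareAddressLine(input) {
  // Normalize on input: NFC is the form kept in storage.
  const original = input.normalize('NFC').trim();

  // Cheap ASCII attempt: decompose, then drop combining marks.
  const stripped = original.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  const isAscii = /^[\x00-\x7f]*$/.test(stripped);

  return {
    original,                                // always keep what the user entered
    shippingLabel: isAscii ? stripped : null, // null means "needs transliteration"
  };
}

prepareAddressLine('Jose\u0301 ').shippingLabel; // 'Jose'
prepareAddressLine('Москва').shippingLabel;      // null
```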