
Unicode and Character Handling

International addresses contain characters from many writing systems. Handling these correctly is essential for global applications.

The Unicode Challenge

Addresses may contain:

  • Latin with diacritics: Müller, José, Søren
  • Japanese: 東京都渋谷区
  • Chinese: 北京市朝阳区
  • Korean: 서울특별시
  • Cyrillic: Москва
  • Arabic: الرياض

Unicode Normalization

The same visible character can be encoded as more than one sequence of code points:

javascript
// "é" can be represented two ways:
const composed = '\u00E9'; // single code point (U+00E9)
const decomposed = 'e\u0301'; // "e" + combining acute accent (U+0301)

composed === decomposed; // false!
composed.length; // 1
decomposed.length; // 2

// Normalize both sides before comparing
composed.normalize('NFC') === decomposed.normalize('NFC'); // true

Normalization Forms

Form   Name                       Use
NFC    Composed                   Default storage
NFD    Decomposed                 Searching
NFKC   Compatibility Composed     Matching
NFKD   Compatibility Decomposed   Indexing
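
To make the difference between the canonical forms (NFC, NFD) and the compatibility forms (NFKC, NFKD) concrete, here is a minimal sketch. The helper name `normalizedForms` is hypothetical; it only wraps the built-in `String.prototype.normalize`.

```javascript
// Hypothetical helper: return all four normalized forms of a string.
function normalizedForms(str) {
  return {
    NFC: str.normalize('NFC'),
    NFD: str.normalize('NFD'),
    NFKC: str.normalize('NFKC'),
    NFKD: str.normalize('NFKD'),
  };
}

// "Tokyo" typed with full-width letters (U+FF34, U+FF4F, U+FF4B, U+FF59, U+FF4F)
const fullWidth = normalizedForms('\uFF34\uFF4F\uFF4B\uFF59\uFF4F');
fullWidth.NFC === 'Tokyo'; // false: canonical forms keep full-width letters
fullWidth.NFKC === 'Tokyo'; // true: compatibility forms fold them to ASCII

const accented = normalizedForms('\u00E9');
accented.NFC.length; // 1 (composed)
accented.NFD.length; // 2 (base letter + combining mark)
```

Compatibility forms discard distinctions such as full-width versus regular letters, which is useful for matching but loses information, so they are usually a poor choice for the stored copy.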

Transliteration

Convert non-Latin scripts to Latin:

javascript
// Romanization examples
const romanized = {
  '東京': 'Tokyo',
  'Москва': 'Moskva',
  '서울': 'Seoul',
};

Why transliterate?

  • Shipping labels often require ASCII
  • Carrier systems may not support all scripts
  • International display compatibility
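
JavaScript has no built-in transliteration function, so production systems typically rely on a dedicated library. As a rough illustration of the idea only, the sketch below romanizes Russian Cyrillic with a lookup table. The function name `transliterateCyrillic` and the table are assumptions made for this example: the mapping is a simplified one, not an implementation of any official romanization standard, and it does not cover Japanese, Chinese, or Korean, which need dictionary-based tools.

```javascript
// Simplified, illustrative Russian-to-Latin table (an assumption, not a standard).
const CYRILLIC_TO_LATIN = new Map([
  ['а', 'a'], ['б', 'b'], ['в', 'v'], ['г', 'g'], ['д', 'd'], ['е', 'e'],
  ['ё', 'yo'], ['ж', 'zh'], ['з', 'z'], ['и', 'i'], ['й', 'y'], ['к', 'k'],
  ['л', 'l'], ['м', 'm'], ['н', 'n'], ['о', 'o'], ['п', 'p'], ['р', 'r'],
  ['с', 's'], ['т', 't'], ['у', 'u'], ['ф', 'f'], ['х', 'kh'], ['ц', 'ts'],
  ['ч', 'ch'], ['ш', 'sh'], ['щ', 'shch'], ['ъ', ''], ['ы', 'y'], ['ь', ''],
  ['э', 'e'], ['ю', 'yu'], ['я', 'ya'],
]);

function transliterateCyrillic(text) {
  let result = '';
  for (const ch of text.normalize('NFC')) {
    const lower = ch.toLowerCase();
    if (!CYRILLIC_TO_LATIN.has(lower)) {
      result += ch; // pass through anything the table does not cover
      continue;
    }
    const latin = CYRILLIC_TO_LATIN.get(lower);
    // Preserve capitalization of the source letter
    result += ch === lower ? latin : latin.charAt(0).toUpperCase() + latin.slice(1);
  }
  return result;
}

transliterateCyrillic('Москва'); // 'Moskva'
transliterateCyrillic('Санкт-Петербург'); // 'Sankt-Peterburg'
```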

Removing Diacritics

javascript
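function removeDiacritics(str) {
  return str
    .normalize('NFD') // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, ''); // drop combining marks (U+0300 to U+036F)
}

removeDiacritics('Müller'); // 'Muller'
removeDiacritics('José'); // 'Jose'
removeDiacritics('naïve'); // 'naive'
removeDiacritics('Søren'); // 'Søren' (ø has no decomposition, so it is unchanged)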
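
Stripping combining marks only works for letters that decompose under NFD. Letters such as ø, ß, and ł are single code points with no decomposition, so they pass through untouched. One possible workaround, sketched here with a hypothetical `EXTRA_FOLDS` table and `foldToAscii` function, is an explicit replacement map applied after the combining marks are removed. The table is a small sample, not a complete list.

```javascript
// Hypothetical sample of letters that NFD does not decompose.
const EXTRA_FOLDS = new Map([
  ['ø', 'o'], ['Ø', 'O'],
  ['ł', 'l'], ['Ł', 'L'],
  ['đ', 'd'], ['Đ', 'D'],
  ['æ', 'ae'], ['Æ', 'AE'],
  ['ß', 'ss'],
]);

function foldToAscii(str) {
  // Step 1: remove combining marks, as in removeDiacritics
  const stripped = str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  // Step 2: replace the remaining special letters via the table
  return Array.from(stripped, (ch) => EXTRA_FOLDS.get(ch) ?? ch).join('');
}

foldToAscii('Søren'); // 'Soren'
foldToAscii('Straße'); // 'Strasse'
foldToAscii('Łódź'); // 'Lodz'
```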

Script Detection

javascript
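function detectScript(text) {
  // Returns the first script whose range matches anywhere in the text,
  // so mixed-script input (e.g. kanji plus kana) reports only one script.
  if (/[\u3040-\u309F]/.test(text)) return 'hiragana';
  if (/[\u30A0-\u30FF]/.test(text)) return 'katakana';
  if (/[\u4E00-\u9FFF]/.test(text)) return 'han';
  if (/[\uAC00-\uD7AF]/.test(text)) return 'hangul';
  if (/[\u0400-\u04FF]/.test(text)) return 'cyrillic';
  if (/[\u0600-\u06FF]/.test(text)) return 'arabic';
  // Fallback: anything unmatched (including digits or empty input) counts as Latin
  return 'latin';
}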
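
Addresses often mix scripts (Japanese text commonly combines kanji, kana, and Latin letters), so reporting every script present can be more useful than returning the first match. The sketch below is one way to do that with Unicode property escapes (`\p{Script=...}` with the `u` flag), which modern JavaScript engines support. The names `SCRIPT_PATTERNS` and `detectScripts` are hypothetical.

```javascript
// One pattern per script; property escapes avoid hard-coded code point ranges.
const SCRIPT_PATTERNS = {
  hiragana: /\p{Script=Hiragana}/u,
  katakana: /\p{Script=Katakana}/u,
  han: /\p{Script=Han}/u,
  hangul: /\p{Script=Hangul}/u,
  cyrillic: /\p{Script=Cyrillic}/u,
  arabic: /\p{Script=Arabic}/u,
  latin: /\p{Script=Latin}/u,
};

// Returns the names of all scripts found, in the order listed above.
function detectScripts(text) {
  return Object.keys(SCRIPT_PATTERNS).filter((name) =>
    SCRIPT_PATTERNS[name].test(text)
  );
}

detectScripts('東京都渋谷区'); // ['han']
detectScripts('渋谷ヒカリエ 11F'); // ['katakana', 'han', 'latin']
detectScripts('12345'); // [] (digits belong to no specific script)
```

Unlike the range-based version, this one returns an empty array instead of assuming Latin when nothing matches.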

Best Practices

  1. Normalize on input - Use NFC for storage
  2. Store original + transliteration - Keep both versions
  3. Transliterate for shipping - ASCII for carrier systems
  4. Test with real data - Use actual international addresses
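
The first three practices can be combined in a single input step. The following is a minimal sketch under stated assumptions: the function name `prepareAddressLine` and the shape of the returned record are hypothetical, and the ASCII version is only produced when removing diacritics is enough. Text in other scripts is flagged for a real transliteration step instead.

```javascript
// Hypothetical input step: normalize, keep the original, derive an ASCII copy.
function prepareAddressLine(raw) {
  // Practice 1: normalize on input, NFC for storage
  const original = raw.normalize('NFC').trim();

  // Practice 3: try a simple ASCII version by stripping combining marks
  const stripped = original.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  const isPrintableAscii = /^[\x20-\x7E]*$/.test(stripped);

  // Practice 2: store both versions side by side
  return {
    original,
    ascii: isPrintableAscii ? stripped : null,
    needsTransliteration: !isPrintableAscii,
  };
}

prepareAddressLine('Jose\u0301 Mu\u0308ller');
// { original: 'José Müller', ascii: 'Jose Muller', needsTransliteration: false }

prepareAddressLine('Москва');
// { original: 'Москва', ascii: null, needsTransliteration: true }
```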
Unicode and Character Handling - Anko Academy