Text Normalization in Python
Many languages, such as Spanish, use characters that have no ASCII representation, such as á, but that Unicode can represent.
To avoid problems, or simply to simplify text processing, these characters can be reduced to their closest ASCII equivalents: Unicode defines decompositions that split many accented characters into a base letter plus combining marks. Below I'll show you a piece of Python code that performs this conversion.
# -*- coding: utf-8 -*-
from unicodedata import normalize

def normalize_text(text):
    # Parentheses let the method chain span several lines.
    return (
        normalize('NFKD', text)     # (1)
        .encode('ASCII', 'ignore')  # (2)
    )
(1) We specify the normal form applied in the normalization, in this case NFKD. More information about normal form types.
(2) We encode the normalized result as ASCII. Any character that cannot be represented in ASCII is simply ignored.
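To make the two steps more concrete, here is a minimal sketch that inspects what NFKD does to a single character before the ASCII encoding is applied. The variable name `decomposed` is just an illustration and is not part of the original function.

```python
from unicodedata import normalize, name

# Step (1): NFKD splits the precomposed character into a base letter
# followed by a combining mark.
decomposed = normalize('NFKD', '\u00e1')  # '\u00e1' is á

print(len(decomposed))
# 2
print([name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']

# Step (2): the combining accent has no ASCII representation,
# so 'ignore' drops it and only the base letter remains.
print(decomposed.encode('ASCII', 'ignore'))
# b'a'
```

This is why the accent disappears while the letter survives: the letter and the accent become separate code points, and only the accent is discarded.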
Running the function
>>> normalize_text('aáaá eéeé iíií oóoó ñnñn AÀAÀ')
b'aaaa eeee iiii oooo nnnn AAAA'
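Note that in Python 3 the result is a bytes object, hence the b prefix. If you would rather get a string back, one possible variant is sketched below. The name `normalize_text_str` is hypothetical and not part of the original code. The sketch also shows a limitation of this approach: characters with no Unicode decomposition, such as ø, are not converted but dropped entirely.

```python
from unicodedata import normalize

def normalize_text_str(text):
    """Hypothetical variant of normalize_text that returns str, not bytes."""
    ascii_bytes = normalize('NFKD', text).encode('ASCII', 'ignore')
    # Decoding is safe here: every remaining byte is plain ASCII.
    return ascii_bytes.decode('ASCII')

print(normalize_text_str('mañana'))
# manana

# NFKD also expands compatibility characters, e.g. the "fi" ligature U+FB01.
print(normalize_text_str('\ufb01n'))
# fin

# ø has no decomposition, so it is removed rather than replaced by o.
print(normalize_text_str('smørrebrød'))
# smrrebrd
```

If losing characters like ø is a problem for your data, you would need an explicit replacement table on top of the normalization.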