How to remove accents with regular expressions in Python?

4

I'm writing a regular expression to replace accented characters, and characters such as ç, with their plain equivalents.

Example:

á = a 
ç = c
é = e 

But my regex only removes those characters instead of replacing them. Any tips?

import re


string_velha = ("Olá você está ????   ")
string_nova = re.sub(u'[^a-zA-Z0-9: ]', '', string_velha.encode().decode('utf-8'))
print(string_nova)

Result:

Ol voc est 
    
asked by anonymous 21.09.2018 / 19:40

3 answers

7

A simple approach uses the unicodedata module, included with Python, to decompose each accented character into its base code point plus combining code points, and then filters out the combining code points to leave a clean string:

import unicodedata
string_velha = "Olá você está????"
string_nova = ''.join(ch for ch in unicodedata.normalize('NFKD', string_velha) 
    if not unicodedata.combining(ch))
print(string_nova)

Result:

Ola voce esta????
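
To make the decomposition step concrete, here is a minimal sketch that lists the code points NFKD produces for a single character. The helper name describe is only illustrative and is not part of the answer above.

```python
import unicodedata

def describe(ch):
    """Return the Unicode names of the code points NFKD produces for ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFKD", ch)]

# The characters are written as escapes so they are guaranteed to be precomposed.
for ch in ("\u00e1", "\u00e7", "\u00e9"):  # á, ç, é
    print(ch, describe(ch))

# unicodedata.combining() returns a non-zero class for the accent marks,
# which is what the filter in the answer relies on.
print(unicodedata.combining("\u0301"))  # COMBINING ACUTE ACCENT
print(unicodedata.combining("a"))       # 0 for a base letter
```

Each accented letter comes back as a base letter followed by a combining mark, so dropping the marks leaves only the base letters.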

Another way is to use unidecode. This external module needs to be installed, and its purpose is precisely to generate an ASCII-only representation of Unicode characters. It covers more characters, but it is an external dependency.

import unidecode
string_nova = unidecode.unidecode(string_velha)
print(string_nova)
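
As a rough illustration of why the extra coverage can matter, the sketch below wraps the unicodedata approach in a helper (strip_accents is a hypothetical name) and shows that letters with no Unicode decomposition, such as ø, ß and æ, pass through unchanged. Those are the kind of cases a transliteration library is meant to handle.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (standard library only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Accented letters that decompose are reduced to their base letter.
print(strip_accents("a\u00e7\u00e3o"))  # ação -> acao

# ø, ß and æ have no decomposition, so they are returned as they came in.
untouched = "\u00f8\u00df\u00e6"
print(strip_accents(untouched) == untouched)  # True
```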
    
21.09.2018 / 20:24
1

Look at the signature of re.sub(pattern, repl, string, count=0, flags=0): the second argument is the string or function used as the replacement whenever the pattern matches in the original string. In this case it is enough to implement a function that will be called on every match, for example:

import re

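def repl(match):
    # Map each accented character to its unaccented equivalent.
    data = {"á": "a", "ç": "c", "ê": "e"}
    # Characters missing from the map are dropped, as in the original regex.
    return data.get(match.group(0), "")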

string_velha = ("Olá você está ????   ")
string_nova = re.sub(u'[^a-zA-Z0-9: ]', repl, string_velha.encode().decode('utf-8'))
print(string_nova)
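
The dictionary above only covers three characters. One possible variation, sketched here under the assumption that the same character class from the question should be kept, is to let the callback decompose each matched character with unicodedata instead of looking it up by hand. The names ALLOWED, fold_char and clean are made up for this example.

```python
import re
import unicodedata

# Characters the question's regex keeps as they are.
ALLOWED = re.compile(r"[a-zA-Z0-9: ]")

def fold_char(match):
    """Replace one disallowed character with its allowed base letters, if any."""
    decomposed = unicodedata.normalize("NFKD", match.group(0))
    return "".join(c for c in decomposed if ALLOWED.fullmatch(c))

def clean(text):
    # re.sub calls fold_char once for every character outside the allowed set.
    return re.sub(r"[^a-zA-Z0-9: ]", fold_char, text)

print(repr(clean("Ol\u00e1 voc\u00ea est\u00e1 ????")))  # 'Ola voce esta '
```

Accented letters are reduced to their base letter, and anything that still falls outside the allowed set after decomposition is removed.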
    
21.09.2018 / 20:08
0

Your code replaces every character matched by the regular expression with '', which removes the accented characters entirely.

If you want to translate each accented character to its unaccented counterpart, you can use an ordinary dictionary and do the replacement with it.

import re

# char codes: https://unicode-table.com/en/#basic-latin
accent_map = {
    u'\u00c0': u'A',
    u'\u00c1': u'A',
    u'\u00c2': u'A',
    u'\u00c3': u'A',
    u'\u00c4': u'A',
    u'\u00c5': u'A',
    u'\u00c6': u'A',
    u'\u00c7': u'C',
    u'\u00c8': u'E',
    u'\u00c9': u'E',
    u'\u00ca': u'E',
    u'\u00cb': u'E',
    u'\u00cc': u'I',
    u'\u00cd': u'I',
    u'\u00ce': u'I',
    u'\u00cf': u'I',
    u'\u00d0': u'D',
    u'\u00d1': u'N',
    u'\u00d2': u'O',
    u'\u00d3': u'O',
    u'\u00d4': u'O',
    u'\u00d5': u'O',
    u'\u00d6': u'O',
    u'\u00d7': u'x',
    u'\u00d8': u'O',
    u'\u00d9': u'U',
    u'\u00da': u'U',
    u'\u00db': u'U',
    u'\u00dc': u'U',
    u'\u00dd': u'Y',
    u'\u00df': u'ss',
    u'\u00e0': u'a',
    u'\u00e1': u'a',
    u'\u00e2': u'a',
    u'\u00e3': u'a',
    u'\u00e4': u'a',
    u'\u00e5': u'a',
    u'\u00e6': u'a',
    u'\u00e7': u'c',
    u'\u00e8': u'e',
    u'\u00e9': u'e',
    u'\u00ea': u'e',
    u'\u00eb': u'e',
    u'\u00ec': u'i',
    u'\u00ed': u'i',
    u'\u00ee': u'i',
    u'\u00ef': u'i',
    u'\u00f1': u'n',
    u'\u00f2': u'o',
    u'\u00f3': u'o',
    u'\u00f4': u'o',
    u'\u00f5': u'o',
    u'\u00f6': u'o',
    u'\u00f8': u'o',
    u'\u00f9': u'u',
    u'\u00fa': u'u',
    u'\u00fb': u'u',
    u'\u00fc': u'u'
}

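def accent_remove(m):
    # Fall back to the original character for code points in the regex
    # range that have no entry in accent_map (e.g. \u00de, \u00f0, \u00f7).
    return accent_map.get(m.group(0), m.group(0))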

string_velha = "Olá você está ????   "
string_nova = re.sub(u'([\u00C0-\u00FC])', accent_remove, string_velha.encode().decode('utf-8'))

print(string_nova)

I put it on Repl.it so you can see it running.
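
If the mapping is a plain one-character-to-string dictionary, the same replacement can also be done without a regex, using str.maketrans and str.translate from the standard library. This is only a sketch with a shortened map. ACCENT_TABLE and remove_accents are illustrative names, and the entries would need to be extended to cover the full table above.

```python
# Shortened, illustrative map; keys are written as escapes so each one is
# guaranteed to be a single precomposed character.
ACCENT_TABLE = str.maketrans({
    "\u00e1": "a",  # á
    "\u00e3": "a",  # ã
    "\u00e7": "c",  # ç
    "\u00e9": "e",  # é
    "\u00ea": "e",  # ê
})

def remove_accents(text):
    """Translate mapped characters; everything else is left untouched."""
    return text.translate(ACCENT_TABLE)

print(remove_accents("Ol\u00e1 voc\u00ea est\u00e1"))  # Ola voce esta
```

Characters that are not in the table are passed through unchanged, so there is no lookup error for unmapped code points.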

    
21.09.2018 / 20:58