How to search while ignoring accents in Python?

15

Suppose I have a list of words in Python (if necessary, already sorted according to the collation rules):

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

Notice that I have not used the cedilla (ç) or the tilde (ã) consistently. How can I search this list for "programação" while ignoring accents, so that the different spellings of the search term all return results? For example:

buscar(palavras, u"programacao")
buscar(palavras, u"programação")

I searched Google for "collation search" and found nothing useful. I also searched for "ignoring accents search" in several variations and even found a MySQL solution (which confirms that collation is indeed the right path), but nothing for Python, only references on how to sort a list, which by itself does not answer the question. The locale module did not offer much help either. How can this be done?

    
asked by anonymous 08.01.2014 / 21:01

3 answers

6

Based on the comment and reference from @bfavaretto, I was able to put together a proof of concept. The solution is to remove the diacritics from both the list to be searched and the search term. To do this, each string is first normalized (NFD) so that the combining characters are represented separately from their base letters, and then those characters are removed (accents and similar marks have the Unicode category Mn).
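To make the normalization step concrete, here is a minimal sketch of what NFD does to the two characters from the question. The helper name `decompose` is hypothetical and exists only for this illustration.

```python
import unicodedata

def decompose(ch):
    """Hypothetical helper: NFD-decompose ch and describe each resulting code point."""
    decomposed = unicodedata.normalize("NFD", ch)
    # Pair each code point's Unicode name with its general category.
    return [(unicodedata.name(c), unicodedata.category(c)) for c in decomposed]

print(decompose("ç"))
# [('LATIN SMALL LETTER C', 'Ll'), ('COMBINING CEDILLA', 'Mn')]
print(decompose("ã"))
# [('LATIN SMALL LETTER A', 'Ll'), ('COMBINING TILDE', 'Mn')]
```

After decomposition the accent is a separate code point in category Mn, so filtering out that category leaves only the base letter.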

I tried to do the replacement with the regex module, without success, so I opted for a separate function. The [binary] search code came from this answer on the English-language Stack Overflow.

import unicodedata
from bisect import bisect_left

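def remover_combinantes(string):
    # NFD splits each accented character into base letter + combining mark(s)
    string = unicodedata.normalize('NFD', string)
    # Drop the combining marks (Unicode category Mn), keeping the base letters
    return u''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')

# Binary search needs sorted input, so sort again after stripping the accents
palavras_norm = sorted(remover_combinantes(x) for x in palavras)

def binary_search(a, x, lo=0, hi=None):   # can't use a to specify default for hi
    hi = hi if hi is not None else len(a) # hi defaults to len(a)
    pos = bisect_left(a, x, lo, hi)       # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def buscar(lista, palavra):
    # lista must already be accent-stripped and sorted (e.g. palavras_norm)
    return binary_search(lista, remover_combinantes(palavra))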

>>> buscar(palavras_norm, u'programacao')
2
>>> buscar(palavras_norm, u'programação')
2
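The search above returns a position in the accent-stripped list. If the original spellings are what you want back, one possible variation is a dictionary keyed by the stripped form. This is only a sketch: `strip_marks`, `build_index` and `search` are hypothetical names, not part of the answer's code, and it assumes Python 3.

```python
import unicodedata
from collections import defaultdict

def strip_marks(text):
    """Remove combining marks (category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def build_index(words):
    """Map each accent-stripped key to the original spellings that share it."""
    index = defaultdict(list)
    for word in words:
        index[strip_marks(word)].append(word)
    return index

def search(index, term):
    """Return the original words matching term, ignoring accents."""
    # .get() avoids creating empty entries for terms that are not found.
    return index.get(strip_marks(term), [])

palavras = ["acentuacao", "divagacão", "programaçao", "taxação"]
index = build_index(palavras)
print(search(index, "programacao"))  # ['programaçao']
print(search(index, "programação"))  # ['programaçao']
```

A dictionary lookup does not need the list to be sorted, at the cost of building the index once up front.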
    
08.01.2014 / 22:10
3

There is an easier solution: install a module that does this work for you. The module is unidecode, which is available for both Python 2 and Python 3.

If you are on a Unix-like system, the simplest way to install it is from the terminal, using pip for Python 2 or pip3 for Python 3:

  • pip install unidecode (for Python 2)

  • pip3 install unidecode (for Python 3)

    This is a practical and complete example using the list you gave:

    import unidecode
    
    palavras = [
        u"acentuacao",
        u"divagacão",
        u"programaçao",
        u"taxação",
    ]
    
    def to_ascii(ls):
        for i in range(len(ls)):
            ls[i] = unidecode.unidecode(ls[i])
    
    to_ascii(palavras)
    print(palavras)
    

    And the output is as follows:

    ['acentuacao', 'divagacao', 'programacao', 'taxacao']
    

    For more information about the module, see its pages on Python's official site. If you want to read or modify the code, the repository is available on GitHub.

    There are also at least two posts on the English-language Stack Overflow that may be useful.

        
    13.01.2015 / 12:38
    1

    You can write a function to remove accents:

    import unicodedata
    
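    def remove_accents(input_str):
        # Decompose accented characters into base character + combining mark
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # Drop everything outside ASCII, then decode back to text so the
        # result compares equal to ordinary strings (not bytes) in Python 3
        only_ascii = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
        return only_ascii
    
    lista = [remove_accents(i) for i in [u'é', u'á']]
    u'e' in lista  # True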
    

    From there it should be easy to adapt this to your needs.
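One caveat worth checking before adopting the encode-to-ASCII approach: it discards every character that has no ASCII decomposition, not just the accents. The sketch below compares it with filtering combining marks only. The names `ascii_fold` and `strip_marks` are hypothetical, and it assumes Python 3.

```python
import unicodedata

def ascii_fold(text):
    """NFKD-normalize, then drop every non-ASCII code point."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_marks(text):
    """NFD-normalize, then drop only combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(ascii_fold("programação"))   # programacao
print(ascii_fold("straße"))        # strae  (ß has no decomposition, so it is dropped)
print(strip_marks("straße"))       # straße (ß is not a combining mark, so it is kept)
```

For Portuguese words both approaches give the same result. The difference only matters if the list may contain letters such as ß or ø.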

        
    29.01.2014 / 15:30