Compare two strings ignoring accents in C

4

I have the following problem, I need to compare two strings ignoring the accent, for example:

                                 Étnico | Brasil

Using an ordinary comparison function, "Étnico" is reported to come before "Brasil", which contradicts the lexicographic order of the words.

I hope I have managed to make my question clear.

Does anyone have an idea how to handle this problem?

    
asked by anonymous 11.07.2015 / 18:50

2 answers

3

Lexicographic ordering, or collation, is highly dependent on the language and alphabet you are using; it is, let's say, a separate problem from choosing a charset, which Unicode has already solved.
For your question I recommend an essential read:

  

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Approaching the problem in C
The recommendation is always to use a Unicode representation instead of literal characters stored in char, mainly because the extended accented Latin characters, for example, are multi-byte in UTF-8; that is, they cannot be correctly represented in a char (-128 to 127) or even an unsigned char (0 to 255).

Using as a reference:

  

É = LATIN CAPITAL LETTER E WITH ACUTE

Its Unicode code point is U+00C9, encoded as the hex bytes C3 89, occupying 2 bytes in UTF-8. It therefore has to be represented by a wide-character type (wchar_t).
Suppose the question revolves around receiving an input, converting it and testing it, as you describe:

  

I need to compare two strings by ignoring the accent


An approach would be this example, using the wide-character I/O functions to replace every 'É' with 'E':

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

// Unicode constant represented by a wide-character type
const wchar_t E_GRANDE_ACENTO = L'\u00C9';

int main(void)
{
    // get the environment's default locale (on Linux normally UTF-8)
    setlocale(LC_ALL, "");
    // fputs for the wide-character type
    fputws(L"Informe a String: ", stdout);

    wchar_t wbuff[128];
    // fgets for the wide-character type
    fgetws(wbuff, 128, stdin);

    size_t len = wcslen(wbuff);
    for (size_t n = 0; n < len; ++n)
    {
        if (wbuff[n] == E_GRANDE_ACENTO)
            wbuff[n] = L'E';
    }

    wprintf(L"%ls\n", wbuff);

    return 0;
}


This is a reference example; for a broader approach to this kind of problem, the unac API mentioned by @Intrusion would be more appropriate.

What about collation of a Unicode stream?
Maybe this is the approach you were hoping for. I recommend the ICU (International Components for Unicode) API: it solves the ordering using existing locale standards, or even with a specific ruleset declared when you create the collator.

Collator example using the ICU C API to sort an array of Unicode strings (bubble sort):

UChar *s[] = { /* list of unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if (U_SUCCESS(status)) {
    for (i = listSize - 1; i >= 1; i--) {
        for (j = 0; j < i; j++) {
            // swap adjacent strings that are out of order
            if (ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_GREATER) {
                UChar *tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    }
    ucol_close(coll);
}
    
14.07.2015 / 00:17
2

The answer to this dilemma depends on the focus of the application, as with any application that has to deal with the particularities of some culture (date, time, language, timezone, etc.).

In the specific case of your question, it is the language and the alphabet in use that determine the set of characters you have to deal with. This becomes clear when you compare an application that has to handle English with one that has to handle Brazilian Portuguese: the accent sets are very different, and in English the task would be relatively much easier.

The next step is to analyze the character set used and ensure that the data (if it does not come from a single source) is at least in a single format (encoding).

If you are programming a web search engine, for example, the content will vary a lot and the work to do what you want will become a project of its own within the software. But if the project is for parsing a particular set of single-source documents, then both the language and the encoding will be very specific and you can solve it more easily.

My suggestion is to start by analyzing the possible data sources and then choose what is most appropriate in the long run. At first glance, two solutions are the most direct answers:

1) Create a character-by-character mapping that takes the string and returns it without the accents.

2) Use something ready-made like unac ( link )

  

unac is a C library that removes accents from characters, regardless   of the character set (ISO-8859-15, ISO-CELTIC, KOI8-RU ...) as long as   iconv (3) is able to convert it to UTF-16 (Unicode).

Good reading on this: link

    
13.07.2015 / 19:00