How to make case-insensitive comparisons in SQLite?

34

As there is an obsession with questions with accents, here's mine:)

SQLite allows you to make direct comparisons or through like with any encoding / charset from byte to . It only allows comparisons to be made regardless of case sensitivity with ASCII characters.

I have some situations that I need to use SQLite with case insensitive and consequently accent insensitive also for Latin characters (ISO 8859-1 or Latin1). I do not need and do not use encoding UTF-8 or UTF-16, not least because there is a lack of implementation of these encodings in SQLite.

The default response from SQLite developers, which is english centric , is that if you need extra handling of ASCII, you should arrange for this. They claim that you already have this available in your application. Which is not true in my case.

I often use SQLite in languages that do not have the proper treatment to normalize the box of characters by taking into account Latin characters, at least not in a way that can be called by SQLite.

If you used UTF8 you would still not use ICU (en) for your weight in memory and processing. It cannon to kill bird and in some cases would bring me unwanted complications.

It would need to be in C to communicate directly with the SQLite API without overhead and to be portable to any language used in the application and also to the operating system, which is another requirement. >

The main requirements:

  • normalize the case handling main characters (disregarding accents) of ISO-8859-1
  • lightness (memory and processing)
  • simplicity (allowing easy deployment and maintenance)
  • Portability (language and operating system)
  • avoid duplicate data in tables (create normalized column)
  • No external dependency or license required to compile with my SQLite application.

It can be a simple library or a function to subscribe to the SQLite API. Preferably it also works with LIKE and FTS , UPPER and LOWER , etc.

I've already researched and it does not solve my problem:

  • link ( only treats ASCII)
  • link (requires ICU)
  • link (requires language not available and I do not know if it solves any problem)
  • link (depends on Python)
  • / a> (promising but does not resolve or does not work as expected)

Reference: SQLite API

    
asked by anonymous 09.01.2014 / 14:01

3 answers

20

Sqlite has case insensitive only for ASCII natively, since one of the goals of the database is to be Lite :) and there is the understanding that any application that depends on specific language comparison will already have a function for this, since the database is just another one of the points where it will be used.

To have an order in the form you want, using your own criteria, simply implement a new collation, using the , which allows you to point to a function of your own in C, which compares strings in the way that you find most convenient.

Once this collation is implemented, simply include it in the desired query:

SELECT nome, endereco FROM cadastro ORDER BY nome COLLATE CollationExemploPtBr

Details of the function to be implemented go from the need of each, the important thing is to know that its function will receive the two strings to be compared, and should return 0 for strings considered equal (action and action, in Portuguese, should return 0), some positive for string1 greater than string2, and negative for string1 smaller than string2.

One of the advantages of having your own collation is that you can not limit yourself to comparing characters by just accenting, but also making & and E have the same "weight" when ordering a string, or anything else that is convenient for the desired result.

  

As long as you call the create_collation function before you start using the database functions themselves, you can use your collation (or several different ones at the same time as you like) typically as an index. This is especially important to maintain DB performance without losing the freedom to sort as you want. These functions work as efficiently or even better than native SQLite, depending only on the quality of the code implemented.

Remember that collation is only part of the process, because when you find a string with sqlite3_create_collation( ) , for example, you must also implement a compatible function, so you can use the LIKE (en) that uses the same principle, point to a function of your code that processes the as you wish.

Here is a simplified example, adapted from a function I use in some applications:

// Atencao: esta implementacao esta muito simplificada,
// e foi postada como mero exemplo. Por ter sido rapidamente
// adaptada e simplificada de um codigo de uso interno para
// ser postada no SO, pode conter erros de tipagem e alguma
// condicao nao tratada (como utf mal formado) ou bugs.
//
// Esta implementacao somente esta considerando os caracteres
// acentuados entre u+0000 e u+00ff, para fins de exemplo, e
// considerando as equivalencias basicas em pt_BR
//
// -- Bacco

sqlite3 *db;
if (SQLITE_OK==sqlite3_open( "databasename.db", &db)
{

   sqlite3_create_collation( db,
                             "CollationExemploPtBr",
                              SQLITE_UTF8,  /* Nota [1] */
                              &example_table_ptbr, /* Nota [2] */
                              &example_collation ); 

// [1] A pergunta original menciona 8859-1. Notar que este codigo
// trata de UTF, mas basta ajustar os "ifs" pra ignorar UTF e
// usar a tabela com 256 caracteres "puros".
//
// [2] Notar que estou usando um cargo pra nao referenciar o
// example_table_ptbr direto no example_collation.
// O SQLite repassa esse pointer ao chamar a funcao indicada. 
// Este e um otimo jeito de usar a mesma funcao com tabelas diferentes.
// Se sua funcao ja tiver a tabela embutida,
// basta usar null neste parametro.

// ... seu codigo aqui ...

}

static const char example_table_ptbr[] = {
    /* u+0000 .. U+007F */
   0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,  0x08,0x09,0x0A,0x0B,0x0C,0x0D,0x0E,0x0F,
   0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,  0x18,0x19,0x1A,0x1B,0x1C,0x1D,0x1E,0x1F,
   0x20,0x21,0x22,0x23,0x24,0x25,'E' ,0x27,  0x28,0x29,0x2A,0x2B,0x2C,0x2D,0x2E,0x2F,
   0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,  0x38,0x39,0x3A,0x3B,0x3C,0x3D,0x3E,0x3F,
   0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,  0x48,0x49,0x4A,0x4B,0x4C,0x4D,0x4E,0x4F,
   0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57,  0x58,0x59,0x5A,0x5B,0x5C,0x5D,0x5E,0x5F,
   0x60,'A' ,'B' ,'C' ,'D' ,'E' ,'F' ,'G' ,  'H' ,'I' ,'J' ,'K' ,'L' ,'M' ,'N' ,'O' ,
   'P' ,'Q' ,'R' ,'S' ,'T' ,'U' ,'V' ,'W' ,  'X' ,'Y' ,'Z' ,0x7B,0x7C,0x7D,0x7E,0x7F,
   /* u+0080 .. U+00FF */
   0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,  0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
   0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,  0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
   0x20,'!' ,'C' ,'L' ,0x20,'Y' ,0x20,0x20,  0x20,'C' ,'A' ,'<' ,0x20,0x20,'R' ,0x20,
   'O' ,0x20,'2' ,'3' ,0x20,'U' ,0x20,0x20,  0x20,'1' ,'O' ,'>' ,0x20,0x20,0x20,'?' ,
   'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' ,  'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
   'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'X' ,  'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'S' ,
   'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' ,  'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
   'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'/' ,  'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'Y' };

int example_collation(void* pArg,int n1,const void* s1,int n2,const void* s2)
{
   int v = 0;
   unsigned char b1;
   unsigned char b2;
   unsigned char *t = (unsigned char *)pArg;
   unsigned char *p1 = (unsigned char *)s1;
   unsigned char *p2 = (unsigned char *)s2;
   unsigned char *c1 = p1 + n1;
   unsigned char *c2 = p2 + n2;

   while( ( p1 < c1 || p2 < c2 ) && v == 0) {
      b1 = *(p1++);
      b2 = *(p2++);

      if ( p1 > c1 )
         b1 = 0x20;

      else if ( b1 < 0x80 )
         b1 = t[ b1 ];

      else if( ( b1 & 0xFE ) == 0xC2 )
         b1 = t[ 0x80 | ( ( b1 << 6 ) & 0x40 ) | ( *(p1++) & 0x3F ) ];

      if ( p2 > c2 )
         b2 = 0x20;

      else if ( b2 < 0x80 )
         b2 = t[ b2 ];

      else if( ( b2 & 0xFE ) == 0xC2 )
         b2 = t[ 0x80 | ( ( b2 << 6 ) & 0x40 ) | ( *(p2++) & 0x3F ) ];

      v = b1 - b2 ;
   }
   return v;
}
    
14.01.2014 / 17:31
3

To switch between comparisons case insensitive and case sensitive there is a configuration :

PRAGMA case_sensitive_like=OFF;

About removing accents, I found this reference with a function that looks promising:

SELECT spellfix1_translit('água'); -- agua 
SELECT * FROM table WHERE spellfix1_tanslit(column) LIKE 'a%'; 

A implementation in question, part of a virtual table to find words "nearby ", uses a mega-table of translations to get the ASCII counterpart of accented characters (that is, does the replacement of Roman characters on the nail). For more details see the translit[] data structure and the transliterate function that I omitted for reasons of space and simplicity (the copyright message of the header is very interesting hehehe).

/*
**    spellfix1_translit(X)
**
** Convert a string that contains non-ASCII Roman characters into 
** pure ASCII.
*/
static void transliterateSqlFunc(
  sqlite3_context *context,
  int argc,
  sqlite3_value **argv
){
  const unsigned char *zIn = sqlite3_value_text(argv[0]);
  int nIn = sqlite3_value_bytes(argv[0]);
  unsigned char *zOut = transliterate(zIn, nIn);
  if( zOut==0 ){
    sqlite3_result_error_nomem(context);
  }else{
    sqlite3_result_text(context, (char*)zOut, -1, sqlite3_free);
  }
}

As for the implementation of the function, if you want something more "robust", since ICU is out of the question, there are other options like unac . Just use the above function as template and replace the call with transliterate with the normalization function you prefer.

    
09.01.2014 / 16:41
0

Although the question has already been answered and @Maniero have made it clear

  

Avoid duplicate data in tables

In some situations it is not possible to create a collate sqlite3_create_collation . And, using data replication can meet other requirements like:

  

lightness and simplicity

And if you are also interested in accent insensitive you can have the alternative with field doubling with the use of slug.

CREATE TABLE 'posts' (
    'id' INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
    'name' TEXT, 
    'name_slug' TEXT);

When entering, you will need to enter the information twice. The first one is raw and the second, slug . In the example it would be name and name_slug . Slugify in C ++

INSERT INTO items ( name, name_slug )
    VALUES ( "O nome do POST é comprido", "o-nome-do-post-e-comprido" );

And when performing comparison or like , use the same search term slugify procedure.

# Post é
SELECT * FROM items WHERE status = 1 AND name_slug like "%post-e%"
    
31.05.2018 / 00:23