Function refactoring to remove punctuation, spaces, and special characters

18

I have this very old function that I use to "clean" the contents of a variable:

Function

function sanitizeString($string) {

    // input array
    $what = array( 'ä','ã','à','á','â','ê','ë','è','é','ï','ì','í','ö','õ','ò','ó','ô','ü','ù','ú','û','À','Á','É','Í','Ó','Ú','ñ','Ñ','ç','Ç',' ','-','(',')',',',';',':','|','!','"','#','$','%','&','/','=','?','~','^','>','<','ª','º' );

    // output array
    $by   = array( 'a','a','a','a','a','e','e','e','e','i','i','i','o','o','o','o','o','u','u','u','u','A','A','E','I','O','U','n','n','c','C','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_' );

    // return the string
    return str_replace($what, $by, $string);
}

Usage

<?php
$pessoa = 'João dos Santos Videira';

$pastaPessoal = sanitizeString($pessoa);

// result
echo $pastaPessoal; // Joao_dos_Santos_Videira
?>

When this function was written, replacing character A with character B was the best option available. But maintaining an input array and a matching output array is not easy, and every now and then an unforeseen case shows up.

Given how PHP has evolved, how can this function be refactored to use built-in language features, or at least to be easier to maintain?

    
asked by anonymous 22.12.2013 / 17:08

5 answers

18

Just use regular expressions!

<?php
function sanitizeString($str) {
    $str = preg_replace('/[áàãâä]/ui', 'a', $str);
    $str = preg_replace('/[éèêë]/ui', 'e', $str);
    $str = preg_replace('/[íìîï]/ui', 'i', $str);
    $str = preg_replace('/[óòõôö]/ui', 'o', $str);
    $str = preg_replace('/[úùûü]/ui', 'u', $str);
    $str = preg_replace('/[ç]/ui', 'c', $str);
    // $str = preg_replace('/[,(),;:|!"#$%&/=?~^><ªº-]/', '_', $str);
    $str = preg_replace('/[^a-z0-9]/i', '_', $str);
    $str = preg_replace('/_+/', '_', $str); // Bacco's idea :) collapses repeated underscores
    return $str;
}
?>

The line just below the commented-out one replaces every character that is not a letter or a digit with "_".
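
One thing to keep in mind with the function above: because the patterns combine the u and i flags, an accented capital such as "É" also matches and comes out as a lowercase "e". A minimal sketch of a case-preserving variant, assuming the file and the input are UTF-8; the name sanitizeStringKeepCase and the extra upper-case classes are illustrative and not part of the answer:

```php
<?php
// Hypothetical variant: separate lower- and upper-case classes and no "i" flag,
// so "É" becomes "E" instead of "e". Assumes UTF-8 source file and input.
function sanitizeStringKeepCase(string $str): string
{
    $map = [
        '/[áàãâä]/u' => 'a', '/[ÁÀÃÂÄ]/u' => 'A',
        '/[éèêë]/u'  => 'e', '/[ÉÈÊË]/u'  => 'E',
        '/[íìîï]/u'  => 'i', '/[ÍÌÎÏ]/u'  => 'I',
        '/[óòõôö]/u' => 'o', '/[ÓÒÕÔÖ]/u' => 'O',
        '/[úùûü]/u'  => 'u', '/[ÚÙÛÜ]/u'  => 'U',
        '/[ç]/u'     => 'c', '/[Ç]/u'     => 'C',
        '/[ñ]/u'     => 'n', '/[Ñ]/u'     => 'N',
    ];
    // Patterns are applied in order, each with its matching replacement.
    $str = preg_replace(array_keys($map), array_values($map), $str);
    $str = preg_replace('/[^a-z0-9]/i', '_', $str); // anything else becomes "_"
    return preg_replace('/_+/', '_', $str);         // collapse runs of "_"
}

echo sanitizeStringKeepCase('João dos Santos Videira'), "\n"; // Joao_dos_Santos_Videira
echo sanitizeStringKeepCase('Ébano (Ç)'), "\n";               // Ebano_C_
```

Keeping the pairs in one array means a new character only needs one new entry, which addresses the maintenance concern raised in the question.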

    
22.12.2013 / 17:35
5

You may want to use the URLify.php library, which is extensively tested against many characters and languages and also supports mappings more complex than 1 character -> 1 character.

It also ignores symbols that it cannot transliterate, which makes it quite robust for producing a URL or a filename.

Here are some examples from the project page:

Cleaning a string for use in a URL or filename

echo URLify::filter (' J\'étudie le français ');
// "jetudie-le-francais"    
echo URLify::filter ('Lo siento, no hablo español.');
// "lo-siento-no-hablo-espanol"

Only transliterating special characters to ASCII

echo URLify::downcode ('J\'étudie le français');
// "J'etudie le francais"
echo URLify::downcode ('Lo siento, no hablo español.');
// "Lo siento, no hablo espanol."

Mapping complex characters to expressions

URLify::add_chars (array (
    '¿' => '?', '®' => '(r)', '¼' => '1/4',
    '½' => '1/2', '¾' => '3/4', '¶' => 'P'
));
echo URLify::downcode ('¿ ® ¼ ½ ¾ ¶');
// "? (r) 1/4 1/2 3/4 P"
    
23.12.2013 / 18:24
5

I think this would be the best and simplest solution to your problem:

$valor = "João dos Santos Videira";
$valor = str_replace(" ","_",preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities(trim($valor))));
// Joao_dos_Santos_Videira

If you want to keep the spaces instead of replacing them with "_", just remove the str_replace call:

$valor = "João dos Santos Videira";
$valor = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities(trim($valor)));
// Joao dos Santos Videira
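
A caveat with the pattern above: it reduces any named entity to its first letter, so "&" (encoded as "&amp;") would come out as "a". A minimal sketch that only touches entities ending in a known accent suffix, assuming UTF-8 input; the helper name stripAccentsViaEntities and the suffix list are illustrative and not part of the answer:

```php
<?php
// Hypothetical helper; assumes UTF-8 input. Only entities that end in a known
// accent suffix are reduced to their base letter, so "&amp;" is left alone.
function stripAccentsViaEntities(string $valor): string
{
    $encoded = htmlentities(trim($valor), ENT_QUOTES, 'UTF-8');
    $plain = preg_replace(
        '/&([a-z])(?:acute|grave|circ|tilde|uml|cedil|ring);/i',
        '$1',
        $encoded
    );
    // Turn any remaining entities (such as "&amp;") back into characters.
    return html_entity_decode($plain, ENT_QUOTES, 'UTF-8');
}

echo stripAccentsViaEntities('João & Conceição'), "\n"; // Joao & Conceicao
```

Spaces can still be turned into "_" afterwards with str_replace, exactly as in the one-liner above.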
    
26.12.2013 / 19:40
5

You can remove accents in PHP quite simply with iconv, which preserves upper and lower case. The //IGNORE flag drops characters that have no transliteration. The preg_replace call that follows then removes spaces and anything that is not A-Z, a-z, 0-9, or a hyphen, leaving a clean string without spaces, symbols, or special characters.

$string = "ÁÉÍÓÚáéíóú! äëïöü";
$string = iconv( "UTF-8" , "ASCII//TRANSLIT//IGNORE" , $string );
$string = preg_replace( array( '/[ ]/' , '/[^A-Za-z0-9\-]/' ) , array( '' , '' ) , $string );

-----------------------------------------------------------------
Input:  ÁÉÍÓÚáéíóú! äëïöü
Output: AEIOUaeiouaeiou

See an example on ideone
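
Because the result of //TRANSLIT depends on the system's iconv implementation, it may help to wrap the idea in a function with a fallback. A minimal sketch, assuming UTF-8 input; the name sanitizeWithIconv, the fallback, and the choice to turn whitespace into "_" (as in the question) are illustrative and not part of the answer:

```php
<?php
function sanitizeWithIconv(string $string): string
{
    // Transliteration quality is platform-dependent; notices are silenced
    // and failure is handled through the false return value instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    if ($ascii === false) {
        $ascii = $string; // fallback: non-ASCII bytes are stripped below
    }
    // Whitespace runs become a single "_" (assumption: mirrors the question).
    $ascii = preg_replace('/\s+/', '_', trim($ascii));
    // Drop everything except ASCII letters, digits, "_" and "-".
    return preg_replace('/[^A-Za-z0-9_\-]/', '', $ascii);
}

echo sanitizeWithIconv('Santos Videira (2013)'), "\n"; // Santos_Videira_2013
```

For accented input the exact letters produced can differ between platforms, so it is worth checking the output on the target server.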

    
28.11.2014 / 05:33
3

You are looking for the strtr() function. Regular expressions then handle the remaining cases:

function sanitizeString($str)
{
    return preg_replace('{\W}', '', preg_replace('{ +}', '_', strtr(
        utf8_decode(html_entity_decode($str)),
        utf8_decode('ÀÁÃÂÉÊÍÓÕÔÚÜÇÑàáãâéêíóõôúüçñ'),
        'AAAAEEIOOOUUCNaaaaeeiooouucn')));
}

PS: I used utf8_decode() because my source files are saved as UTF-8 (on OS X) and the three-argument form of strtr() works byte by byte. You probably do not need it if the file is saved in a single-byte encoding such as ISO-8859-1 or CP1252.
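
Note that utf8_decode() is deprecated as of PHP 8.2. The array form of strtr() replaces whole substrings, so it works on UTF-8 strings without any encoding conversion. A minimal sketch of the same function written that way, assuming the file and the input are UTF-8; the name sanitizeStringWithMap is illustrative and not part of the answer:

```php
<?php
function sanitizeStringWithMap(string $str): string
{
    // Same character pairs as the answer above, written as an explicit map.
    $map = [
        'À' => 'A', 'Á' => 'A', 'Ã' => 'A', 'Â' => 'A', 'É' => 'E',
        'Ê' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Õ' => 'O', 'Ô' => 'O',
        'Ú' => 'U', 'Ü' => 'U', 'Ç' => 'C', 'Ñ' => 'N',
        'à' => 'a', 'á' => 'a', 'ã' => 'a', 'â' => 'a', 'é' => 'e',
        'ê' => 'e', 'í' => 'i', 'ó' => 'o', 'õ' => 'o', 'ô' => 'o',
        'ú' => 'u', 'ü' => 'u', 'ç' => 'c', 'ñ' => 'n',
    ];
    // Decode entities first, then apply the map (multi-byte keys are fine).
    $str = strtr(html_entity_decode($str, ENT_QUOTES, 'UTF-8'), $map);
    $str = preg_replace('/ +/', '_', $str); // runs of spaces become "_"
    return preg_replace('/\W/', '', $str);  // drop remaining non-word bytes
}

echo sanitizeStringWithMap('João dos Santos Videira'), "\n"; // Joao_dos_Santos_Videira
```

Each key sits next to its replacement, so the two lists can no longer drift out of alignment when a character is added.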

    
22.12.2013 / 19:56