How to validate people's names in Brazilian Portuguese?
The Portuguese alphabet is based on the Latin alphabet, which consists of 26 characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
In addition to these characters, Brazilian Portuguese uses the following diacritical marks:
~ (tilde, "til"): nasalizes the vowel "a" and the diphthongs "ae", "oe" and "ao": ã / ãe / õe / ão.
¸ (cedilla): gives the letter "c" the sound of the letter "s" before "a", "o" and "u": ç.
^ (circumflex accent): indicates the stressed syllable and closes the timbre of the vowels "a", "e" and "o" where a written accent is required: â / ê / ô.
´ (acute accent): indicates the stressed syllable and opens the vowel timbre where a written accent is required: á / é / í / ó / ú.
` (grave accent): marks crasis, the contraction of the preposition "a" with the feminine article "a" (as opposed to the masculine "ao"), and appears in the pronouns "àquele", "àquela" and "àquilo": à.
¨ (diaeresis, "trema"): currently used only in Brazilian Portuguese to indicate that the vowel "u" is pronounced in the sequences "qüe", "qüi", "güe" and "güi": ü.
So, in addition to the traditional ranges A-Z and a-z, we must also include the characters ãõ, ç, âêô, à, áéíóú and ü. And, of course, we cannot forget the space character.
The regex, which matches any character outside this set, would look like:
[^a-zA-ZáéíóúàâêôãõüçÁÉÍÓÚÀÂÊÔÃÕÜÇ ]
Also remember to trim the name and collapse repeated spaces before validating:
using System;
using System.Text.RegularExpressions;

public static string TratarNome(string nome)
{
    if (string.IsNullOrWhiteSpace(nome))
        throw new ArgumentException("Um nome em branco foi passado.");

    // Remove leading and trailing whitespace from the name:
    nome = nome.Trim();

    // Replace two or more consecutive spaces with a single one:
    nome = Regex.Replace(nome, "[ ]{2,}", " ", RegexOptions.Compiled);

    // Check for characters that are not valid in the (Brazilian) Portuguese alphabet:
    if (Regex.IsMatch(nome, "[^a-zA-ZáéíóúàâêôãõüçÁÉÍÓÚÀÂÊÔÃÕÜÇ ]", RegexOptions.Compiled))
        throw new ArgumentException("Nome inválido: \"" + nome + "\".");

    return nome;
}
I ran the code above against a database of roughly 100,000 Brazilian names.
It produced the following false positives:
' : SAINT'CLAIR.
- : SAINT-CLAIR.
In addition to the name of our colleague @jpkrohling:
ö : KRÖHLING.
Another curiosity is that a few records contain the no-break space NBSP (160) instead of the ordinary space SP (32). The validation caught this as well (and, in our case, we decided to replace it).
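The NBSP replacement mentioned above could be done before the validation runs. The following is only a minimal sketch: the class and method names (NameWhitespace, NormalizeSpaces) are hypothetical and not part of the original TratarNome method.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameWhitespace
{
    // Hypothetical helper (an assumption, not from the original answer).
    public static string NormalizeSpaces(string name)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));

        // Replace the no-break space (U+00A0, decimal 160)
        // with the ordinary space (U+0020, decimal 32).
        name = name.Replace('\u00A0', ' ');

        // Trim the ends and collapse runs of two or more spaces into one,
        // mirroring the whitespace handling in TratarNome.
        return Regex.Replace(name.Trim(), " {2,}", " ");
    }
}
```

Calling NormalizeSpaces on a name before passing it to TratarNome would let records that differ only by an NBSP go through the character check.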
Handling names, especially internationally, is not a simple task. The validation above would reject relatively common names such as Björk and Marić, as well as less common ones such as Graham-Cumming.
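One way to be more permissive is to rely on Unicode categories instead of listing letters one by one. The sketch below is an assumption about what such a check could look like, not the method used above: the names PermissiveNameCheck and LooksLikeName are hypothetical, and the set of extra characters allowed (space, apostrophe, hyphen) is an illustrative choice.

```csharp
using System.Text.RegularExpressions;

public static class PermissiveNameCheck
{
    // Matches any character that is NOT:
    //   \p{L}  a letter in any script (covers ö, ć, ñ, ...)
    //   \p{M}  a combining mark (accents stored as separate code points)
    //   a space, an apostrophe (' or ’) or a hyphen.
    private static readonly Regex InvalidChar =
        new Regex(@"[^\p{L}\p{M} '’\-]");

    // Hypothetical helper: true when the name is non-blank and contains
    // no character outside the permitted set.
    public static bool LooksLikeName(string name)
    {
        return !string.IsNullOrWhiteSpace(name) && !InvalidChar.IsMatch(name);
    }
}
```

With this class, KRÖHLING, SAINT'CLAIR and SAINT-CLAIR are accepted, while a name containing a digit is still rejected.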
In addition, when being more permissive, beware of opening the door to an XSS attack. One example is the apostrophe: some names contain one, and it is often represented (erroneously?) by the single-quote character (') instead of the correct character (’).
Consider this a warning.
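A common way to reduce the XSS risk is to encode the name whenever it is written into HTML, rather than trying to forbid characters at input time. The sketch below assumes the name is rendered into an HTML page and uses WebUtility.HtmlEncode from System.Net; the wrapper names (NameOutput, ToHtml) are hypothetical.

```csharp
using System.Net;

public static class NameOutput
{
    // Hypothetical helper: encodes a name for safe inclusion in HTML text.
    public static string ToHtml(string name)
    {
        // HtmlEncode replaces characters that are significant in HTML,
        // such as <, >, & and quotes, with character references.
        return WebUtility.HtmlEncode(name);
    }
}
```

This does not replace other context-specific defenses (for example, parameterized queries when the name goes to a database); it only covers the HTML output case.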
In Brazil there are no restrictions on people's names. The law only states that a name must not expose the person to ridicule; anything else is allowed. Besides, nothing prevents a foreigner from living in the country and needing to be registered in your system.
Therefore, a validation rule that considers only the letters of the Latin alphabet (A-Z) and their accented forms is not enough. You would also need exceptions for apostrophes, hyphens, Roman numeral suffixes (e.g. William Gates III), Greek characters... The list of exceptions would be gigantic and would probably still leave something out, causing an error for some specific user.
There is also the problem of encodings: depending on how the related systems handle characters, a name could come out completely unreadable on another system.
In general, if you prevent users from signing up because their name is not accepted, you are losing potential customers. Just validate that the field has been filled in and do not take any risks.
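Under that advice, the whole validation reduces to a presence check. A minimal sketch, with hypothetical names (MinimalNameCheck, Accept) and an illustrative error message:

```csharp
using System;

public static class MinimalNameCheck
{
    // Hypothetical helper: accepts any name that is not blank.
    public static string Accept(string name)
    {
        // The only rule: the field must have been filled in.
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("A name is required.", nameof(name));

        // No character filtering at all; only surrounding whitespace is removed.
        return name.Trim();
    }
}
```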