Is there any problem in using Unicode characters for identifiers in code?

Today it is common for compilers and interpreters to accept source files containing Unicode characters.

This is useful, especially for those of us who write in Portuguese and other languages that go beyond ASCII, for creating strings with accents and for writing better comments in our own language.

But it is unusual to use accented identifiers in code. Some people even advise against it.

I do not usually use them, but they seem to me to read better in cases like this (just an isolated example, not tied to any particular language):

class Validação {
    bool ÉValido;
    ...
}

Is there any technical reason to avoid accents and other Unicode characters in identifiers?

If there is no technical problem, is there any practical reason to avoid them?

Does it depend on the programming language, assuming the language fully supports accented characters?

Does it matter whether the code is proprietary and developed by a small, closed team, or developed widely, possibly in the open?

Is there any specific care we should take when using accents in identifiers?

When does using non-ASCII characters become abuse?

asked by anonymous 21.05.2014 / 16:04

2 answers

score 16

Most modern environments do support working with Unicode, but there is a long way from that to actually using it in code. The first point to consider, before thinking about aesthetics and good practices, is whether your language supports it. Most languages define a finite (and small) set of characters from which source code must be composed, usually a subset of ASCII. For example, the C standard says the following (C11, 5.2.1/3):

Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

Anything outside this set is not guaranteed by the standard to be accepted as written. A compiler may accept more, of course, and most do. But if you want portable code that will work on any platform, you should restrict yourself to it.
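As a rough illustration of that restriction, here is a minimal Python sketch that checks whether a name stays within the letters, digits, and underscore of the basic character set quoted above. The helper name `is_basic_identifier` is made up for this example and is not part of any standard tool.

```python
import re

# Letters, digits and underscore from the basic character set;
# an identifier may not start with a digit.
_BASIC_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_basic_identifier(name: str) -> bool:
    """Hypothetical helper: True if `name` uses only basic-set characters."""
    return _BASIC_IDENTIFIER.fullmatch(name) is not None


print(is_basic_identifier("is_valid"))        # True
print(is_basic_identifier("\u00c9Valido"))    # False: starts with É (U+00C9)
```

A check like this could run as a lint step in a project that wants to stay portable; it says nothing about what a particular compiler actually accepts.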

Another problem is file encoding. Two files from the same program may end up saved with different encodings (for whatever reason). Visually you will see the character É in both, but at compile or run time the compiler or interpreter may see different identifiers. In the end you get an error that is quite hard to track down, since the error message will not help.
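To make the encoding problem concrete, here is a small Python sketch, assuming one file is saved as UTF-8 and another as Latin-1 (the identifier is the one from the question). It only shows what happens to the bytes; how a given compiler reacts depends on the tool.

```python
# The identifier from the question, written with an escape so this
# snippet itself does not depend on the file encoding.
identifier = "\u00c9Valido"  # ÉValido

utf8_bytes = identifier.encode("utf-8")      # É becomes two bytes: C3 89
latin1_bytes = identifier.encode("latin-1")  # É becomes one byte:  C9

print(utf8_bytes != latin1_bytes)  # True: same text, different bytes

# A tool that assumes Latin-1 while reading the UTF-8 file sees a
# different, longer name.
misread = utf8_bytes.decode("latin-1")
print(misread == identifier)           # False
print(len(identifier), len(misread))   # 7 8
```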

A language that broadly supports writing code with non-ASCII characters is Ruby. The parser and other tools were built with this in mind, and there is no restrictive set of allowed characters. This opens the way for some interesting things, as the article "Unicode Whitespace Shenanigans for Rubyists" by Peter Cooper demonstrates:

Using a Unicode space character (the non-breaking space, the same as &nbsp; in HTML):

link

It is not treated as whitespace; it becomes part of the identifier. This lets you write something as confusing as this:

link

And since there is a wealth of space characters to choose from:

link

Using Unicode in a codebase opens the door to bugs that are very tricky to track down. Another clear problem is copying and pasting the code between different tools. You never know what might happen.
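One way to guard against this kind of bug is to scan source text for space characters other than the ordinary ASCII space. The following is only a sketch in Python; the function name `find_unusual_spaces` and the sample text are invented for illustration.

```python
import unicodedata


def find_unusual_spaces(source: str):
    """Return (line, column, character name) for every space separator
    (Unicode category Zs) that is not the plain ASCII space U+0020."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch != " " and unicodedata.category(ch) == "Zs":
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# First line hides a NO-BREAK SPACE between the two words,
# the second line uses a normal space.
sample = "total\u00a0count = 1\ntotal count = 2\n"
print(find_unusual_spaces(sample))  # [(1, 6, 'NO-BREAK SPACE')]
```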

Technical problems aside, there is always the question of language (the spoken kind). If it is a large project, or one that may become open source, it is generally recommended to write the code in English, which removes the need for Unicode in identifiers.

In a small project with a team of few developers, there is plenty of room to define rules and create your own conventions. If everyone agrees, there is no reason not to. Just remember to always weigh the pros and cons of adopting this style.

One case I have seen, and which I consider reasonably valid, is when writing tests. In many frameworks you define a function, member, or method that holds a block of asserts to be executed. When a test fails, the name of that function is usually displayed on screen as the name of the failing test. Since this is a function you never call explicitly, using Unicode spaces in its name can be interesting: it makes the error output much more readable.

21.05.2014 / 16:32

score 22 · Accepted Answer

When it comes to using syntactic elements in general (and not just identifiers) that go beyond ASCII, there are a number of factors to consider:

The compiler needs to provide proper support for Unicode input. This goes beyond simply decoding the characters: you need to know whether support is limited to the BMP or extends to the supplementary planes, whether it handles surrogate pairs, whether it works with combining characters or only with precomposed ones, and whether it accepts escape sequences in source code or not. There may be other considerations; these are just the ones that come to mind.

An example is the way the word "árvore" (Portuguese for "tree") can be represented in Unicode:

'\xe1rvore',    // Latin Small Letter A with acute,              r,v,o,r,e
'a\u0301rvore', // Latin Small Letter A, Combining Acute Accent, r,v,o,r,e

If a library was written in an editor that uses precomposed characters, and the code that tries to use it was written in one that uses combining sequences, the identifier may not be recognized.
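The mismatch between the two forms can be reproduced with plain strings. This Python sketch uses the standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

precomposed = "\u00e1rvore"   # á as a single code point
combining = "a\u0301rvore"    # a followed by COMBINING ACUTE ACCENT

# They render the same but are different sequences of code points.
print(precomposed == combining)            # False
print(len(precomposed), len(combining))    # 6 7

# Normalizing both to the same form (NFC here) makes them compare equal.
normalized = unicodedata.normalize("NFC", combining)
print(normalized == precomposed)           # True
```

Some languages normalize identifiers while parsing (Python, for instance, applies NFKC to them), while others compare identifiers exactly as written, which is where the problem described above shows up.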

Is the language case-sensitive or not? If the answer is no, there is the case-mapping problem: unless the computer where the code is compiled has the same locale as the one where the code was originally written, the same identifier may be interpreted differently when its capitalization is normalized. Example:

"mail".toUpperCase(); // MAİL (Turkish)
                      // MAIL (rest of the world)

Again, if a library was compiled on a computer with the Turkish locale, and whoever uses it does not have that locale, the identifiers may not be recognized (when the compiler tries to normalize them).
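Python's own `str.upper()` does not depend on the locale, so the sketch below imitates the Turkish rule by hand just to show how the two normalizations diverge. The function `upper_turkish` is a hypothetical stand-in for what a locale-aware tool would do, not a real API.

```python
def upper_default(name: str) -> str:
    """Locale-independent uppercasing, as used in most of the world."""
    return name.upper()


def upper_turkish(name: str) -> str:
    """Hypothetical stand-in for Turkish casing: the dotted lowercase i
    maps to İ (U+0130) instead of the plain ASCII I."""
    return name.replace("i", "\u0130").upper()


print(upper_default("mail"))   # MAIL
print(upper_turkish("mail"))   # MAİL
print(upper_default("mail") == upper_turkish("mail"))  # False
```

A case-insensitive language that normalizes identifiers with one rule when building the library and with the other when using it would end up with two different names for the same identifier.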

How hard is it to type Unicode characters? For us, who use Portuguese, typing accented characters is easy, since our own keyboard layout supports it. But if we had to use a library with identifiers in Japanese, for example, how would we do it? Likewise, other people may have difficulty typing accented letters, whereas everyone has at least good ASCII support.

Does this mean that using Unicode identifiers is always bad? No. It depends much more on the scope of the system being developed. As in the case of "writing in Portuguese or not", there are a number of factors that help determine whether it is acceptable for the system to have a more local scope, even though at first this would exclude a global audience (see my answer to the linked question for more details). It is useful to write programs in Portuguese, and it is useful for them to be written in correct Portuguese. So, absent problems to the contrary, I see no reason not to use characters beyond ASCII.

To explain: if the entire development team uses the same text editor or IDE, the first problem (compiler and editor support) practically does not exist (unless you are programming in traditional Chinese); if everyone is in the same locale, the second problem (case normalization) does not apply; and if everyone uses the same keyboard layout, the third (typing difficulty) puts no one at a disadvantage. In other words, these factors are far less relevant to an in-house project than to one open to the public.

Addendum: why didn't I talk about the encoding problem, in the sense of one programmer editing in one encoding and another programmer in another? Because that is a much more general problem, one that affects even the comments in the code. The need for the development team to always use the same encoding applies across the board, so it is not an obstacle to using Unicode identifiers if you want to.