Static string being created with the wrong encoding


Hello,

When creating a string in a Java class (for example: String t = "Ola Java!" ), the compiler seems to be choosing the 'wrong' encoding to interpret the bytes in the source file and build the String (the 'right' encoding should be UTF-8, which is the encoding I'm using in the sources).

To demonstrate the error, I ran the following test:

String t = "ã";
log.debug("t: " + t);
log.debug("t.length(): " + t.length());
log.debug("t.getBytes().length: " + t.getBytes().length);
log.debug("t.getBytes(utf-8).length: " + t.getBytes("utf-8").length);
log.debug("t.getBytes(UTF-8).length: " + t.getBytes("UTF-8").length);
log.debug("t.getBytes(ISO-8859-1).length: " + t.getBytes("ISO-8859-1").length);

(The logging engine I use is commons-logging with log4j support, but the same can be done with System.out.)

The result was as follows:

t: ã
t.length(): 2
t.getBytes().length: 4
t.getBytes(utf-8).length: 4
t.getBytes(UTF-8).length: 4
t.getBytes(ISO-8859-1).length: 2

The first line could be explained by some conversion problem when writing the log file. But the other lines make the problem clear. The second line ( t.length() ) shows that the String was created with two characters, not one, meaning that when the string was created, the two bytes that represent the character ã in UTF-8 were treated as two separate characters (as in some other encoding such as ISO-8859-1).
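
A minimal, self-contained sketch of what is happening (the class name and variable names here are mine, just for illustration; 0xC3 0xA3 is the UTF-8 encoding of ã):

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'ã' (U+00E3) is encoded in UTF-8 as the two bytes 0xC3 0xA3.
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA3};

        // Decoded as UTF-8, the two bytes form a single character.
        String correct = new String(utf8Bytes, "UTF-8");
        System.out.println(correct.length()); // 1

        // Decoded as ISO-8859-1, each byte becomes its own character: "Ã£".
        String misread = new String(utf8Bytes, "ISO-8859-1");
        System.out.println(misread.length()); // 2
    }
}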

I'm looking for some way to force the encoding the compiler uses when interpreting a static string, but I don't think that is a good approach... Is there any way to do this? Or to tell the compiler which encoding should be used when interpreting the static strings in the sources?

    
asked by anonymous 30.03.2014 / 19:10

2 answers


From the javac documentation:

  

-encoding encoding

     

Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.

That is, if it is not explicitly specified, the compiler will use the system default. If you think about it, that makes some sense: it is the charset that editors use by default, and it is the charset that Java applications use by default - and javac is a Java application.

Of course, specifying -encoding on the command line solves it. If you are using a build system (Ant, Maven, Gradle, etc.), specify this option there to ensure that the files will be treated the same way on any platform.
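
For example, directly on the command line, or via Maven's conventional source-encoding property (the file name here is mine, just for illustration):

javac -encoding UTF-8 Teste.java

<!-- pom.xml: Maven's standard property for the source encoding -->
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>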

If you are not using a build system (you should! :), you can use the JAVA_TOOL_OPTIONS environment variable, putting something like -Dfile.encoding=UTF8 in it.
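
On a Unix-like shell that would look something like this (the exact syntax depends on your shell; this is just an illustration):

export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8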

Finally, there is a way to transform your files, regardless of their encoding, into ASCII. The JDK ships with a program called native2ascii . It converts a file from the system's encoding, or from an encoding you specify with -encoding , into an ASCII file that uses the \uxxxx syntax to represent any special characters. For example:

Daniels-MacBook-Pro:debug-service dsobral$ cat Teste.java 
class Teste {
    public String test = "Teste de codificação"
}
Daniels-MacBook-Pro:debug-service dsobral$ native2ascii Teste.java
class Teste {
    public String test = "Teste de codifica\u00e7\u00e3o"
}

In this case, since I did not specify an output file, it printed the result to the console. I've never actually used this program (I use build systems :), but in simple tests it seems to accept the same file as both input and output; still, I would experiment with large files before relying on that.
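
For example, to be explicit about the input encoding and write the result to a separate file (the file names here are mine, just for illustration):

native2ascii -encoding UTF-8 Teste.java TesteAscii.java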

31.03.2014 / 19:19

It was indeed a problem with how the sources were being compiled.

Javac was assuming that all source files were in the same encoding, which was not UTF-8. Apparently, by default, javac uses the OS's default encoding.

To solve the problem, I used javac's -encoding option, which lets you define which encoding should be used when reading the sources (the same option exists in Ant's javac task).
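
In Ant, that would look something like this (a minimal sketch, assuming src/ and build/ directories):

<javac srcdir="src" destdir="build" encoding="UTF-8"/>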

    
31.03.2014 / 01:47