Java String byte array with negative numbers

I'm having trouble determining the encoding of a string.

The input is:

São Paulo

I don't control how this content is originally read, because the text passes through a Lua-to-Java wrapper.

On my side, I have already tried the following "brute force" approach, and I don't think any of the conversions is correct:

byte[] bytes1 = entrada.getBytes();
System.out.println(Arrays.toString(bytes1));
System.out.println(new String(bytes1));
System.out.println(new String(bytes1, StandardCharsets.UTF_8));
System.out.println(new String(bytes1, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes1, StandardCharsets.US_ASCII));

byte[] bytes2 = entrada.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes2));
System.out.println(new String(bytes2));
System.out.println(new String(bytes2, StandardCharsets.UTF_8));
System.out.println(new String(bytes2, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes2, StandardCharsets.US_ASCII));

byte[] bytes3 = entrada.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(bytes3));
System.out.println(new String(bytes3));
System.out.println(new String(bytes3, StandardCharsets.UTF_8));
System.out.println(new String(bytes3, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes3, StandardCharsets.US_ASCII));

byte[] bytes4 = entrada.getBytes(StandardCharsets.US_ASCII);
System.out.println(Arrays.toString(bytes4));
System.out.println(new String(bytes4));
System.out.println(new String(bytes4, StandardCharsets.UTF_8));
System.out.println(new String(bytes4, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes4, StandardCharsets.US_ASCII));

This is the output I get, all of it wrong:

[83, -29, -81, -96, 80, 97, 117, 108, 111]
S㯠Paulo
S㯠Paulo
S㯠Paulo
S���Paulo

[83, -29, -81, -96, 80, 97, 117, 108, 111]
S㯠Paulo
S㯠Paulo
S㯠Paulo
S���Paulo

[83, 63, 80, 97, 117, 108, 111]
S?Paulo
S?Paulo
S?Paulo
S?Paulo

[83, 63, 80, 97, 117, 108, 111]
S?Paulo
S?Paulo
S?Paulo
S?Paulo

Can anyone help me? Thank you in advance.

    
asked by anonymous 28.01.2016 / 21:12

1 answer

If entrada is a String, it has already been decoded, and there is no point in trying to convert it.

It seems to me that you're trying to convert the String into bytes and then the bytes back into a String. This does not work: once the original input has been turned into a String, its bytes have already been decoded, and the bytes you get back are usually not the same as the original ones.
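To illustrate why the round trip cannot help, here is a minimal sketch. It assumes that entrada holds the characters 'S', U+3BE0 and "Paulo", which is what the UTF-8 bytes printed in the question decode to; the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Encodes the text with the given charset and decodes it again with the same one.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Assumption: this is the already-decoded text from the question
        // ('S', U+3BE0, "Paulo"), reconstructed from its printed UTF-8 bytes.
        String entrada = "S\u3BE0Paulo";

        // Encoding only re-expresses the characters the String already holds.
        System.out.println(Arrays.toString(entrada.getBytes(StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(entrada.getBytes(StandardCharsets.ISO_8859_1)));

        // UTF-8 can represent every character, so this round trip returns the same text.
        System.out.println(roundTrip(entrada, StandardCharsets.UTF_8).equals(entrada));

        // ISO-8859-1 cannot represent U+3BE0, so that character becomes '?'.
        System.out.println(roundTrip(entrada, StandardCharsets.ISO_8859_1));
    }
}
```

The two arrays printed here match the ones in the question. Neither conversion brings back the original text: at best you get the same wrong String, at worst you lose even more.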

When you call entrada.getBytes(), Java uses the platform's default charset, so it is no different from the other approaches that pass a charset explicitly. In your output it produces the same bytes as the UTF-8 variant.
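A quick way to check this on your own machine is sketched below. The class and method names are only for illustration; Charset.defaultCharset() is the standard call that reports the default the JVM is using.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // True when getBytes() and getBytes(default charset) produce identical bytes.
    static boolean sameAsDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Shows which charset the no-argument getBytes() relies on in this JVM.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // The literal uses a Unicode escape for the a-with-tilde to avoid
        // depending on the source file's own encoding.
        System.out.println(sameAsDefault("S\u00E3o Paulo"));
    }
}
```

Passing the charset explicitly is usually the safer habit, because the default can differ from one machine to another.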

Negative numbers are normal, since the primitive type byte in Java is signed and ranges from -128 to +127. Any byte whose unsigned value lies between 128 and 255 is therefore printed as a negative number.
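If you want to see the values the way they appear in encoding tables, you can mask each byte to its unsigned form. A small sketch, using the bytes from the question (the class and method names are just for this example):

```java
public class UnsignedBytes {
    // Returns the unsigned value (0..255) of a signed Java byte.
    static int toUnsigned(byte b) {
        return b & 0xFF; // equivalent to Byte.toUnsignedInt(b)
    }

    public static void main(String[] args) {
        byte[] bytes = { 83, -29, -81, -96, 80, 97, 117, 108, 111 };
        for (byte b : bytes) {
            // Print the signed value, the unsigned value and the hex form side by side.
            System.out.printf("%4d -> %3d (0x%02X)%n", b, toUnsigned(b), toUnsigned(b));
        }
    }
}
```

For example, -29 is the same byte as 227 (0xE3), -81 is 175 (0xAF) and -96 is 160 (0xA0).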

The following code decodes the byte array with every charset that Java supports in a given environment:

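import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

// The bytes printed in the question.
byte[] b = new byte[] { 83, -29, -81, -96, 80, 97, 117, 108, 111 };
SortedMap<String, Charset> charsets = Charset.availableCharsets();
for (Map.Entry<String, Charset> entry : charsets.entrySet()) {
    // Decode the same bytes with each charset and print the result.
    System.out.printf("%s: %s%n", entry.getKey(), new String(b, entry.getValue()));
}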

I tested this on a Mac and no encoding was able to decode the o of São, which indicates that the bytes are already corrupted and that the problem lies at some earlier point.

You should ask the "other side" to specify which encoding is used, and the implementation should then follow that specification.
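Once an encoding has been agreed on, one way to check that incoming bytes at least conform to it is a strict decode that reports errors instead of silently replacing them. This is only a sketch: the class and method names are invented here, and the byte arrays are the one from the question plus a hypothetical ISO-8859-1 version of the city name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true if the bytes are well-formed in the given charset.
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)       // fail on invalid sequences
              .onUnmappableCharacter(CodingErrorAction.REPORT)  // fail on unmappable input
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bytes printed in the question.
        byte[] fromQuestion = { 83, -29, -81, -96, 80, 97, 117, 108, 111 };
        // Hypothetical: the city name as ISO-8859-1 bytes (0xE3 is a-with-tilde).
        byte[] latin1 = { 83, (byte) 0xE3, 111, 32, 80, 97, 117, 108, 111 };

        System.out.println(isValid(fromQuestion, StandardCharsets.US_ASCII));
        System.out.println(isValid(fromQuestion, StandardCharsets.UTF_8));
        System.out.println(isValid(latin1, StandardCharsets.UTF_8));
    }
}
```

Keep in mind that this only detects malformed input. The bytes from the question are well-formed UTF-8 and still decode to the wrong text, so a check like this complements the specification but does not replace it.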

Another approach would be to receive the byte array directly from the input, or to use some format that does not decode the bytes before they reach your code.
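If you can get the raw bytes, decoding becomes a single explicit step. The sketch below assumes, purely for illustration, that the other side sends ISO-8859-1; the class name, the method name and the sample bytes are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RawBytesDecode {
    // Decodes raw, untouched bytes with the charset both sides agreed on.
    static String decode(byte[] raw, Charset agreed) {
        return new String(raw, agreed);
    }

    public static void main(String[] args) {
        // Hypothetical raw input: the city name encoded as ISO-8859-1,
        // where 0xE3 is the a-with-tilde.
        byte[] raw = { 83, (byte) 0xE3, 111, 32, 80, 97, 117, 108, 111 };

        // Assumption: ISO-8859-1 is the agreed encoding in this example.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

The point is that the charset is chosen by your code, once, instead of being guessed after some intermediate layer has already decoded the text.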

    
29.01.2016 / 03:00