How to identify whether an XML document has a BOM?

4

I have the following problem regarding XML encoding:

Error: Invalid byte 2 of 3-byte UTF-8 sequence.

This error occurs when I try to canonicalize an XML document.

I do not know exactly what causes the error. My guess is that the String starts with a BOM character. Is there a function or library in Java that can detect whether the XML has a BOM, or a function that removes the BOM?

    
asked by anonymous 04.02.2014 / 11:58

2 answers

3

You can use BOMInputStream from the Apache Commons IO library, which does this job for you. I had the same problem, and I can say with confidence that using this library is the easiest route. One tip, since I have also worked with XML: read the content as a byte array, check it with the suggested API, and only then convert it to a String using the UTF-8 charset, so that accented characters are not lost.

Snippet that converts the source into an InputStream and strips the BOM:

String source = FileUtil.takeOffBOM(IOUtils.toInputStream(attachment.getValue(), "UTF-8"));

Method that removes the BOM:

public static String takeOffBOM(InputStream inputStream) throws IOException {
    // BOMInputStream detects a UTF-8 BOM and, by default, excludes it from the stream
    BOMInputStream bomInputStream = new BOMInputStream(inputStream);
    return IOUtils.toString(bomInputStream, "UTF-8");
}
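
If adding the Apache library is not an option, the same tip (work on the bytes first, then decode as UTF-8) can be sketched with the standard library alone. The class and method names below (BomBytes, hasUtf8Bom, decodeUtf8WithoutBom) are made up for illustration, and the sketch only looks for the UTF-8 BOM, not the UTF-16 or UTF-32 ones:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Commons IO; handles only the UTF-8 BOM.
final class BomBytes {
    private BomBytes() {}

    // True when the array starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] raw) {
        return raw.length >= 3
                && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB
                && (raw[2] & 0xFF) == 0xBF;
    }

    // Decodes the bytes as UTF-8, skipping the BOM if one is present.
    static String decodeUtf8WithoutBom(byte[] raw) {
        int offset = hasUtf8Bom(raw) ? 3 : 0;
        return new String(raw, offset, raw.length - offset, StandardCharsets.UTF_8);
    }
}
```

With the XML content in a byte array, BomBytes.hasUtf8Bom(bytes) answers the "does it have a BOM?" part and BomBytes.decodeUtf8WithoutBom(bytes) gives a String that no longer starts with it.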
    
04.02.2014 / 14:18
1

I adapted the class below from the article found at this link: Removing BOM character from a String in Java

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BOM {
// Works on text whose raw bytes were read as ISO-8859-1, so that a UTF-8 BOM
// shows up as the three characters 0xEF 0xBB 0xBF at the start of the String.
private final String bomString;
private final static Charset ISO_ENCODING = StandardCharsets.ISO_8859_1;
private final static int UTF8_BOM_LENGTH = 3;

public BOM(String text) {
    this.bomString = text;
}

// Returns the text without its leading UTF-8 BOM, or unchanged if there is none.
public String removeBOM() {
    final byte[] bytes = this.bomString.getBytes(ISO_ENCODING);
    if (isUTF8(bytes)) {
        return getSkippedBomString(bytes);
    } else {
        return this.bomString;
    }
}

// Copies everything after the three BOM bytes and turns it back into a String.
private String getSkippedBomString(final byte[] bytes) {
    int length = bytes.length - UTF8_BOM_LENGTH;
    byte[] barray = new byte[length];
    System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
    return new String(barray, ISO_ENCODING);
}

// True when the bytes start with the UTF-8 BOM (EF BB BF).
private boolean isUTF8(byte[] bytes) {
    return bytes.length >= UTF8_BOM_LENGTH &&
        (bytes[0] & 0xFF) == 0xEF &&
        (bytes[1] & 0xFF) == 0xBB &&
        (bytes[2] & 0xFF) == 0xBF;
}

}
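
Note that the class above goes through ISO-8859-1 bytes, so it only finds a BOM in text that was read with that encoding. If the XML was already decoded as UTF-8, the BOM is a single character, U+FEFF, at the start of the String. A minimal sketch for that case, using a hypothetical BomText class that is not part of the article:

```java
// Hypothetical helper for Strings that were decoded as UTF-8 (or UTF-16).
final class BomText {
    private static final char BOM_CHAR = '\uFEFF';

    private BomText() {}

    // True when the first character of the text is the BOM character.
    static boolean startsWithBom(String text) {
        return !text.isEmpty() && text.charAt(0) == BOM_CHAR;
    }

    // Returns the text without a leading BOM character, or unchanged if there is none.
    static String stripBom(String text) {
        return startsWithBom(text) ? text.substring(1) : text;
    }
}
```

Calling BomText.stripBom(xml) before canonicalizing leaves a BOM-free String untouched, so it should be safe to apply unconditionally.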

    
04.02.2014 / 12:42