XML returned by web service with encoding error

2
Hello, I have an XML document returned by a web service with encoding errors. The XML declares its encoding as UTF-8, but accented characters are not displayed correctly and I cannot work out what the correct encoding should be. I have no information about how the data is stored in the original database or anything of the sort; I only have the response. Here is an example of the XML I receive:

<?xml version="1.0" encoding="UTF-8"?>
<test>
<title>Muitos cientistas intrØpidos se aventura no coraĿªo dos dois vulcıes mais explosivos do planeta</title>
</test>

How can I detect or convert the encoding and fix the accent errors? Is there a way to convert these characters after the web service has generated the XML, or can the fix only be made in the web service itself?

Note: I have already tried functions like iconv, utf8_decode and mb_convert_encoding.

Thank you in advance.

    
asked by anonymous 16.02.2017 / 01:40

4 answers

1

There are a couple of typical mistakes that lead to this kind of situation. Here, Latin-1 text was converted to UTF-8, but the converter was told the source was a different encoding (in this case ISO6937).

Solution with iconv:

iconv -f utf8 -t ISO6937 x.xml | iconv -f latin1 -t utf8

Explanation: how do you arrive at this miraculous "ISO6937"? Reverse the process with every known encoding and see which ones hit!

1: List the encodings known to iconv: iconv -l

2: Reverse the process for each of them (creating a file _<encoding> for each one):

# Try every encoding iconv knows; -c drops characters the target cannot represent.
for a in $(iconv -l | cut -d/ -f 1)
do
   iconv -c -f utf8 -t "$a" x.xml | iconv -f latin1 -t utf8 > "_$a"
done

3: Look for the ones that produced the desired result (and choose one):

$ grep -l 'intrépido.*coração.*vulcões' _* 
_ANSI_X3.110
_ANSI_X3.110-1983
_CSA_T500
_CSISO103T618BIT
_CSISO90
_CSISO99NAPLPS
_ISO6937
_ISO_6937
_ISO_6937:1992
.... 

Some of these names are aliases of the same charset.
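
To see why the round trip works, here is a minimal sketch applied to single words from the question. It assumes an iconv build that knows the name ISO6937 (glibc's does; other builds may not), and fix_mojibake is a hypothetical helper name, not something from this answer. The first step turns each garbled character back into the single byte it came from (Ø is byte 0xE9 in ISO 6937); the second step reads that byte as Latin-1, where 0xE9 is é.

```shell
#!/bin/sh
# fix_mojibake: hypothetical helper wrapping the two-step pipeline.
# Reads garbled UTF-8 on stdin, writes repaired UTF-8 on stdout.
# Assumes this iconv supports ISO6937 (glibc does; others may not).
fix_mojibake() {
    iconv -f UTF-8 -t ISO6937 | iconv -f ISO-8859-1 -t UTF-8
}

# Words taken from the garbled title in the question.
printf '%s\n' 'intrØpidos' | fix_mojibake
printf '%s\n' 'coraĿªo' | fix_mojibake
printf '%s\n' 'vulcıes' | fix_mojibake
```

Where the assumption holds, this should print intrépidos, coração and vulcões.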

    
03.03.2017 / 11:58
6

In this case, one solution would be to map the garbled characters to the intended ones and replace them, for example:

<?php 
$arr = array("Ø" => "é","Ŀ" => "ç","ª" => "ã", "ı" => "õ"); 
$word = "Muitos cientistas intrØpidos se aventura no coraĿªo dos dois vulcıes mais explosivos do planeta"; 
echo strtr($word,$arr); 
?> 

See working at Ideone

    
20.02.2017 / 21:34
4

The data is already wrong when you receive it, so it will be virtually impossible to detect the encoding in which the information arrives.

My recommendation is that you check the routine that is generating the XML and make a correction in it.

Some important details: an XML document declares its character encoding in its header, as in <?xml version="1.0" encoding="UTF-8"?>. The routine that reads the data must respect this declaration and use it to decode the data. If the actual encoding differs from the one declared in the file, the XML is invalid.
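
One part of this can be checked mechanically: whether the bytes are well-formed UTF-8 at all. A minimal sketch, assuming an iconv that exits with a non-zero status on invalid input (glibc's does); is_valid_utf8 is a hypothetical helper name. Passing this check does not prove the text is right: the sample from the question is valid UTF-8 that simply contains the wrong characters.

```shell
#!/bin/sh
# is_valid_utf8: hypothetical helper; succeeds if stdin decodes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Raw Latin-1 bytes for "coração" (0xE7 0xE3) are not valid UTF-8.
if printf 'cora\347\343o' | is_valid_utf8; then
    echo "latin1 bytes: valid UTF-8"
else
    echo "latin1 bytes: NOT valid UTF-8"
fi

# The garbled word from the question is still well-formed UTF-8,
# so this check alone cannot reveal that kind of damage.
if printf '%s' 'coraĿªo' | is_valid_utf8; then
    echo "garbled sample: valid UTF-8"
fi
```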

    
20.02.2017 / 13:34
2

When I had this problem, the file that generated my XML was saved with ANSI encoding. I just changed the file format to UTF-8 and it worked.
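
For reference, the same kind of conversion can be done from the command line. A minimal sketch, assuming "ANSI" here means Windows-1252 (the usual meaning on Western European Windows systems, but an assumption about this answer); ansi_to_utf8 is a hypothetical helper name.

```shell
#!/bin/sh
# ansi_to_utf8: hypothetical helper; converts Windows-1252 on stdin to UTF-8.
# Assumption: the "ANSI" file is really Windows-1252.
ansi_to_utf8() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# 0xE7 and 0xE3 are the Windows-1252 bytes for ç and ã.
printf 'cora\347\343o\n' | ansi_to_utf8
```

To convert a whole file, redirect it through the helper, for example ansi_to_utf8 < input.txt > output.txt (both file names are placeholders).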

    
20.02.2017 / 14:07