Get the original HTML entities with JavaScript

13

I need to get all the original HTML entities of a paragraph, especially those for accented characters. The methods I know recover only some of the entities, as in the example below, where ">" is correctly encoded but "ç" is not.

It is important that the code can tell whether an accented character was produced by an entity or not (as in "çã"), because the content comes from an external source and may not follow a consistent pattern.

alert(document.querySelector('p').innerHTML);
<p>situa&ccedil;ão &gt; ativo</p>

Note: as the accepted answer by @mgibsonbr explains, this is not possible. The solution adopted was to use DOMDocument::saveHTML, which interprets entities the same way the browser does, so the data is the same on both the server and the client.

    
asked by anonymous 20.08.2015 / 21:46

2 answers

9

The original HTML entities are not preserved when the browser parses the document's markup, so they are not available to JavaScript or by any other means. According to the specification, during the tokenization step (reading the raw text and producing tokens for later analysis), HTML entities (called character references in the spec) are replaced by the characters they stand for as soon as they are consumed:

  

8.2.4.69 Tokenizing character references

...

The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND character), as follows:

...

"#" (U+0023)

Consume the U+0023 NUMBER SIGN.

...

Consume all characters that match the range of characters listed above (ASCII hex digits or ASCII digits).

...

Otherwise, if the next character is a U+003B SEMICOLON, consume it as well. If it is not, that is a parse error.

...

Otherwise, return a character token for the Unicode character whose code point is that number.

Anything else

Consume as many characters as possible, as long as the characters consumed match one of the identifiers in the first column of the named character references table (case-sensitive).

...

Return one or two character tokens for the character(s) corresponding to the character reference name (given by the second column of the
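To illustrate the rules quoted above, here is a minimal sketch, not the browser's actual implementation. It assumes a tiny hypothetical subset of the named character references table (`NAMED_REFERENCES`) and only handles references that end in a semicolon. The function names are made up for the example. It decodes the references the way the tokenizer would and then re-escapes the text the way text is escaped when it is serialized back to markup, which reproduces what the question observed: `&gt;` comes back, `&ccedil;` does not.

```javascript
// Hypothetical, tiny subset of the named character references table.
const NAMED_REFERENCES = {
  ccedil: "\u00E7", // ç
  atilde: "\u00E3", // ã
  gt: ">",
  lt: "<",
  amp: "&",
};

// Replace each character reference with the character(s) it stands for.
// After this step nothing records whether "ç" came from "&ccedil;" or was typed literally.
function decodeCharacterReferences(raw) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return raw.replace(pattern, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Out-of-range numbers are left untouched in this sketch.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Names missing from the small table above are left untouched.
    return Object.prototype.hasOwnProperty.call(NAMED_REFERENCES, body)
      ? NAMED_REFERENCES[body]
      : match;
  });
}

// Escape text for serialization: only &, non-breaking space, < and > are escaped.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/\u00A0/g, "&nbsp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const parsed = decodeCharacterReferences("situa&ccedil;ão &gt; ativo");
console.log(parsed);             // → "situação > ativo"
console.log(escapeText(parsed)); // → "situação &gt; ativo"
```

Both "ç" and "ã" end up as ordinary characters in the parsed text, so no later step can tell them apart.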

20.08.2015 / 22:44
0

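To complement the answer above: since the content comes from an external source, you can fetch it with an AJAX request and read the raw text from responseText, where the original entities are still intact.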

// JavaScript
// Elements used to display the results
var external = document.getElementById("external");
var innerHTML = document.getElementById("innerHTML");
var responseText = document.getElementById("responseText");

// Simulate the external source with a Blob URL that holds the raw HTML
var blob = new Blob(["<p>situa&ccedil;ão &gt; ativo</p>"], { type: "text/html" });
var url = URL.createObjectURL(blob);

var xmlHttp = new XMLHttpRequest();
xmlHttp.onreadystatechange = function () {
    if (xmlHttp.readyState === 4 && xmlHttp.status === 200) {
        // responseText is the raw text: entities such as &ccedil; are still intact
        responseText.value = xmlHttp.responseText;
        // Once parsed into the DOM, innerHTML no longer has the original entities
        external.innerHTML = xmlHttp.responseText;
        innerHTML.value = external.innerHTML;
    }
};

xmlHttp.open("GET", url, true);
xmlHttp.send();

/* CSS */
div {
    margin-bottom: 5px;
}

label {
    display: inline-block;
    width: 100px;
    text-align: right;
}

input {
    width: 400px;
}

<!-- HTML -->
<div id="external">
    
</div>
<div>
    <label for="innerHTML">innerHTML:</label>
    <input id="innerHTML" type="text" readonly />
</div>
<div>
    <label for="responseText">responseText:</label>
    <input id="responseText" type="text" readonly />
</div>
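Once you have the raw text (for example the value of responseText above), you can look for the entities in it before it is ever parsed. The following is only a sketch: `findCharacterReferences` is a hypothetical helper, and its pattern only recognizes references that end with a semicolon. It also makes no attempt to skip scripts, comments, or attribute values.

```javascript
// Hypothetical helper: list every character reference found in raw (unparsed) HTML text.
function findCharacterReferences(rawHtml) {
  // Matches hexadecimal (&#xE7;), decimal (&#231;) and named (&ccedil;) references.
  const pattern = /&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return Array.from(rawHtml.matchAll(pattern), (m) => ({
    entity: m[0],   // the reference exactly as written in the source
    index: m.index, // position in the raw string
  }));
}

const raw = "<p>situa&ccedil;ão &gt; ativo</p>";
const found = findCharacterReferences(raw);
console.log(found.map((f) => f.entity)); // → [ '&ccedil;', '&gt;' ]
```

The literal "ã" is not reported, which is the distinction the question asks for: the "ç" was written as an entity, the "ã" was not.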
    
21.08.2015 / 01:30