HtmlAgilityPack does not format HTML correctly

In my web service, I always need to format the HTML that I receive. To make sure it is formatted correctly, I use HtmlAgilityPack.

The HTML I receive:

<p>
    <div>
        <b>text:</b> 
        <img alt="" height="362" src="/PublishingImages/imageName.png?RenditionID="16&Width=639&Height=362" width="639" style="BORDER: 0px solid; ">
    </div>
    <div>             <!--- left open like this is correct -->
        <b>text:</b> 
        <div style="text-align:justify;"></div>
        <div style="text-align:justify;"></div>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                <br>
                text
            </span>
        </p>
        <p style="text-align:justify;">
            <br class="ms-rteThemeForeColor-2-0">
            <span class="ms-rteThemeForeColor-2-0">
                text
                <br>
            </span>
        </p>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                <br>
            </span>
        </p>
        <p style="text-align:justify;">
            <span class="ms-rteThemeForeColor-2-0">
                text
            </span>
            <br>
        </p>
    </div>
</p>

My code to format the HTML:

if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("span"))
    HtmlNode.ElementsFlags.Add("span", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["span"] = HtmlElementFlag.Closed;

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;

var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.LoadHtml(myHtml);

foreach (var eachNode in htmlDoc.DocumentNode.SelectNodes("//*"))
{
    var count = 0;
    foreach (var attr in eachNode.Attributes)
        if (attr.Name.ToLower() != "href" && attr.Name.ToLower() != "src" && attr.Name.ToLower() != "alt" && attr.Name.ToLower() != "style")
        {
            attr.Name = "feeds" + count.ToString();
            attr.Value = "";
            count++;
        }
}

var htmlError = htmlDoc.ParseErrors.SafeAny();

if (!htmlError)
    myHtml = htmlDoc.DocumentNode.InnerHtml;

However, HtmlAgilityPack changes the structure of the HTML compared to the input.

The HTML after being formatted by HtmlAgilityPack:

<p>
   <div>
      <b>text:</b>
      <img alt="" feeds0="" src="/PublishingImages/imageName.png?RenditionID=" feeds1="" feeds2="" style="BORDER: 0px solid; " />
   </div>
   <div />             <!--- should not be closed -->
   <b>text:</b>
   <div style="text-align:justify;" />
   <div style="text-align:justify;">          <!--- should not be open -->
      <p style="text-align:justify;">
         <span feeds0="">
            <br />
            text
         </span>
      </p>
      <p style="text-align:justify;">
         <br />
         <span feeds0="">
            text
            <br />
         </span>
      </p>
      <p style="text-align:justify;">
         <span feeds0="">
            <br />
         </span>
      </p>
      <p style="text-align:justify;">
         <span feeds0="">
            text
         </span>
         <br />
      </p>
   </div>
</p>

Why does this happen, and how can I resolve it? I have already found that if I comment out the following code, the HTML is formatted correctly:

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;

But why? I cannot simply leave this code commented out: if the HTML contains a badly closed div, I will have problems later, so every element has to be closed.

NOTE: The self-closed <div /> in the output (the one marked with a comment) should be an opening <div>, as in the HTML I originally receive.

    
asked by anonymous 12.01.2017 / 20:27

1 answer

Solved! I found that the Closed and CanOverlap values can be combined when setting the entry in HtmlNode.ElementsFlags, like this:

if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.CanOverlap & HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.CanOverlap & HtmlElementFlag.Closed;
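
A note on the operator used above, since it affects what this fix actually does. If HtmlElementFlag is a [Flags] enum with one bit per member (which its use here suggests, but which is an assumption in this sketch), then `&` returns only the bits the two values share, while `|` is the operator that sets both. The stand-in enum DemoFlag below, including its names and bit values, is hypothetical and exists only to show the arithmetic:

```csharp
using System;

// Hypothetical stand-in for a [Flags] enum such as HtmlElementFlag.
// The member names and bit values are assumptions for illustration only.
[Flags]
enum DemoFlag
{
    Closed = 4,
    CanOverlap = 8
}

static class Program
{
    static void Main()
    {
        // Bitwise AND keeps only the bits set in BOTH operands.
        // Two different single-bit members share no bits, so the result is 0.
        DemoFlag anded = DemoFlag.CanOverlap & DemoFlag.Closed;
        Console.WriteLine((int)anded);                      // prints 0
        Console.WriteLine(anded.HasFlag(DemoFlag.Closed));  // prints False

        // Bitwise OR sets the bits of both operands.
        DemoFlag ored = DemoFlag.CanOverlap | DemoFlag.Closed;
        Console.WriteLine((int)ored);                       // prints 12
        Console.WriteLine(ored.HasFlag(DemoFlag.Closed));   // prints True
    }
}
```

Under that assumption, the `&` expression assigns an empty flag set to "div" instead of marking it as both Closed and CanOverlap, which would explain why the result matches the output obtained with the div block commented out. Whether `|` would also give the desired output is not something this thread confirms, so it is worth testing both against the sample HTML.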
    
13.01.2017 / 11:27