In my web service, I always need to format the HTML that I receive. To make sure it is formatted correctly, I use HtmlAgilityPack.
The HTML I receive:
<p>
<div>
<b>text:</b>
<img alt="" height="362" src="/PublishingImages/imageName.png?RenditionID="16&Width=639&Height=362" width="639" style="BORDER: 0px solid; ">
</div>
<div> <!---open like this is correct-->
<b>text:</b>
<div style="text-align:justify;"></div>
<div style="text-align:justify;"></div>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
<br>
text
</span>
</p>
<p style="text-align:justify;">
<br class="ms-rteThemeForeColor-2-0">
<span class="ms-rteThemeForeColor-2-0">
text
<br>
</span>
</p>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
<br>
</span>
</p>
<p style="text-align:justify;">
<span class="ms-rteThemeForeColor-2-0">
text
</span>
<br>
</p>
</div>
</p>
My code to format the HTML:
if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
if (!HtmlNode.ElementsFlags.ContainsKey("span"))
    HtmlNode.ElementsFlags.Add("span", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["span"] = HtmlElementFlag.Closed;
if (!HtmlNode.ElementsFlags.ContainsKey("div"))
    HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;
var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.LoadHtml(myHtml);
foreach (var eachNode in htmlDoc.DocumentNode.SelectNodes("//*"))
{
    var count = 0;
    foreach (var attr in eachNode.Attributes)
        if (attr.Name.ToLower() != "href" && attr.Name.ToLower() != "src" && attr.Name.ToLower() != "alt" && attr.Name.ToLower() != "style")
        {
            attr.Name = "feeds" + count.ToString();
            attr.Value = "";
            count++;
        }
}
var htmlError = htmlDoc.ParseErrors.SafeAny();
if (!htmlError)
    myHtml = htmlDoc.DocumentNode.InnerHtml;
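The attribute check inside the loop can also be written as a lookup against a set of allowed names. This is only a sketch of that condition on its own, without HtmlAgilityPack; `AttributeFilter` and `ShouldNeutralize` are hypothetical names introduced here for illustration, and the case-insensitive comparer is assumed to behave like the `ToLower()` comparisons for these ASCII names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AttributeFilter
{
    // Attribute names that are left untouched; lookups ignore case.
    private static readonly HashSet<string> Kept =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "src", "alt", "style" };

    // Returns true when the attribute would be renamed to "feedsN" and emptied.
    public static bool ShouldNeutralize(string attributeName)
    {
        return !Kept.Contains(attributeName);
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints one line per name showing whether it would be neutralized.
        foreach (var name in new[] { "SRC", "height", "style", "class" })
            Console.WriteLine(name + " -> " + AttributeFilter.ShouldNeutralize(name));
    }
}
```

In the loop, the long condition would then become `if (AttributeFilter.ShouldNeutralize(attr.Name))`.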
However, HtmlAgilityPack changes the structure of the HTML slightly compared to the original.
The HTML after being processed by HtmlAgilityPack:
<p>
<div>
<b>text:</b>
<img alt="" feeds0="" src="/PublishingImages/imageName.png?RenditionID=" feeds1="" feeds2="" style="BORDER: 0px solid; " />
</div>
<div /> <!---should not be closed-->
<b>text:</b>
<div style="text-align:justify;" />
<div style="text-align:justify;"> <!---should not be open-->
<p style="text-align:justify;">
<span feeds0="">
<br />
text
</span>
</p>
<p style="text-align:justify;">
<br />
<span feeds0="">
text
<br />
</span>
</p>
<p style="text-align:justify;">
<span feeds0="">
<br />
</span>
</p>
<p style="text-align:justify;">
<span feeds0="">
text
</span>
<br />
</p>
</div>
</p>
Why does this happen, and how can I fix it? I have already found that if I comment out the following code, the HTML is formatted correctly:
if (!HtmlNode.ElementsFlags.ContainsKey("div"))
HtmlNode.ElementsFlags.Add("div", HtmlElementFlag.Closed);
else
HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;
But why? I cannot simply leave this code commented out, because if the HTML contains a badly closed div I will have problems later on, so every element has to be closed.
NOTE: The <div /> that is flagged with a comment in the output above should be just an opening <div>, as in the HTML I originally receive.
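To make the comparison easier to reproduce, the following is a minimal sketch that parses one shortened fragment twice, once with the `div` entry in `HtmlNode.ElementsFlags` and once without it. It assumes the HtmlAgilityPack NuGet package is referenced; the `Sample` fragment and the `DivFlagRepro` class are hypothetical and only modeled on the HTML above, and no particular output is claimed here.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// Hypothetical repro class, not the code from my web service.
public static class DivFlagRepro
{
    // Shortened, hypothetical fragment modeled on the second <div> of the input.
    public const string Sample =
        "<div><b>text:</b><div style=\"text-align:justify;\"></div><p>text</p></div>";

    // Parses the fragment with the same document options as in the question.
    public static string Format(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(html);
        return doc.DocumentNode.InnerHtml;
    }

    public static void Main()
    {
        // ElementsFlags is static, so the entry affects every document parsed afterwards.
        HtmlNode.ElementsFlags["div"] = HtmlElementFlag.Closed;
        Console.WriteLine("With div flag:    " + Format(Sample));

        // Remove the entry again and parse the same fragment a second time.
        HtmlNode.ElementsFlags.Remove("div");
        Console.WriteLine("Without div flag: " + Format(Sample));
    }
}
```

Comparing the two printed lines should show whether the `div` entry alone is responsible for the self-closed `<div />`.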