Read HTML file

2

In my project I need to read an HTML file whose source has the structure of an XML document. I need to read this file, get the values of the XML tags it contains, and then process and save that data in my database.

My system already reads XML files without problems, but I need it to be able to read an HTML file as well.

How can I do this? I have no idea where to start.

Structure of my HTML file

<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body><certidao>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
<subtag></subtag>
</certidao>
</body></html>

I need to read everything inside the root tag certidao and ignore the HTML tags.

The HTML page is saved on the computer, so it has to be read from a file path rather than from a link.

    
asked by anonymous 14.07.2015 / 21:27

1 answer

3

You can use HtmlAgilityPack, which can be installed from the NuGet Package Manager Console:

PM> Install-Package HtmlAgilityPack

Here is a small code example:

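using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("arquivo.html"); // path to the HTML file saved on disk
// "./subtag" is relative to the current certidao node; "//subtag" would search the whole document
foreach (HtmlNode certidao in doc.DocumentNode.SelectNodes("//certidao"))
    foreach (HtmlNode subtag in certidao.SelectNodes("./subtag"))
        Console.WriteLine(subtag.InnerText);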

There is a working example, using a modified version of your data, on DotNetFiddle.
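Since the system in the question already reads XML, another possible route is to cut the certidao element out of the file text and hand it to an XML parser. This is not the HtmlAgilityPack approach; it is a minimal sketch using only System.Xml.Linq. It assumes that the certidao element appears exactly once, that its opening tag has no attributes, and that everything inside it is well-formed XML. The names CertidaoReader and ExtractCertidao are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static class CertidaoReader
{
    // Hypothetical helper: returns the <certidao>...</certidao> fragment as parsed XML.
    // Assumes the fragment is well-formed XML and occurs exactly once in the input.
    public static XElement ExtractCertidao(string html)
    {
        const string open = "<certidao>";
        const string close = "</certidao>";
        int start = html.IndexOf(open, StringComparison.Ordinal);
        int end = html.IndexOf(close, StringComparison.Ordinal);
        if (start < 0 || end < start)
            throw new InvalidDataException("No <certidao> element found.");

        // Keep both the opening and the closing tag so the fragment has a single root.
        string fragment = html.Substring(start, end + close.Length - start);
        return XElement.Parse(fragment);
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the path of the saved HTML file as the first argument.");
            return;
        }

        string html = File.ReadAllText(args[0]);
        XElement certidao = ExtractCertidao(html);

        // Print the text of each direct <subtag> child.
        foreach (XElement subtag in certidao.Elements("subtag"))
            Console.WriteLine(subtag.Value);
    }
}
```

If the content inside certidao is not strictly well-formed XML, XElement.Parse will throw, and an HTML-tolerant parser such as HtmlAgilityPack is the safer choice.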

    
14.07.2015 / 22:07