Extracting a snippet from an HTML string

3

I receive a string that contains HTML.

The HTML includes a table, and this is one of its cells:

<td width="24%" valign="top" border="1" style=" 
        BORDER-RIGHT: windowtext 0.5pt solid; 
        BORDER-TOP: windowtext 0.5pt solid; 
        BORDER-LEFT: windowtext 0.5pt solid; 
        BORDER-BOTTOM: windowtext 0.5pt solid; 
        PADDING-LEFT: 3.5pt; 
        ">

        <font face="Arial" style="font-size: 6pt">
        NÚMERO DE INSCRIÇÃO
        </font>
        <br>

        <font face="Arial" style="font-size: 8pt">

        <b>00.000.000</b><br>

        <b>MATRIZ</b>
        </font>
        <br> 
    </td>

What is the best way to capture just this number: '00.000.000'?

PS: This is the CNPJ data table from the Receita Federal.

    
asked by anonymous 10.11.2014 / 15:31

2 answers

5

I don't know about the best way, but it is common to use an external library such as HTMLAgilityPack to parse the HTML and reliably deliver everything separated into nodes, which makes it easy to search for elements. It seems to be the most widely used library for this kind of task among .NET programmers.

I have seen several other options, but I don't like any of them. I'm not a big fan of this one either, but it's better than nothing.

Any attempt to reinvent the wheel may produce some result, but it takes work and is unlikely to be reliable, much less future-proof. Anything other than a library will be complicated, laborious and unreliable. I'm not saying these libraries are foolproof, but they are a good option.

In any case, this HTML is quite hard to interpret. If it is yours, it would be better to modernize it; HTML is no longer written this way. If you don't control it, understand that the markup can change, and any algorithm you write may become invalid and produce unexpected results. Even with a good parsing library, working without a pattern, without a way to identify the element unambiguously, is very risky.
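As a rough illustration of the library approach, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed and that the label text "NÚMERO DE INSCRIÇÃO" stays stable enough to locate the cell. The class and method names are made up for this example, and the caveats above about markup changing still apply.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class CnpjCellReader
{
    // Hypothetical helper: returns the text of the first <b> inside the first
    // <td> whose text contains the given label, or null if nothing is found.
    public static string ReadFirstBoldAfterLabel(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("td") walks every <td> in document order. With nested
        // tables an outer cell can match first, so this is only a starting point.
        var cell = doc.DocumentNode
            .Descendants("td")
            .FirstOrDefault(td => td.InnerText.Contains(label));

        var bold = cell?.Descendants("b").FirstOrDefault();
        return bold == null ? null : HtmlEntity.DeEntitize(bold.InnerText).Trim();
    }
}
```

Calling `CnpjCellReader.ReadFirstBoldAfterLabel(html, "NÚMERO DE INSCRIÇÃO")` on the cell from the question should give the text of its first bold element. Matching on the label text is just one way to pick the cell; if the page offers an id or another stable attribute, that is a safer anchor.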

    
10.11.2014 / 15:49
2

Edited to include a disclaimer: evidently, at some point in your process the Receita Federal page is being scraped. As Maniero said in his answer and comment, this is not very reliable. My (incomplete) solution below looks for a CPF or CNPJ in any text, which may or may not contain HTML. It is only because of this consideration that I answered the way I did. In general, anyone who parses HTML either doesn't know what they are doing or is desperate #ProntoFalei.

If all you want is to extract a CNPJ, a regular expression can work. Note that the expression helps only because you are not processing the HTML itself, just extracting a number from the text.

The expression you are looking for is something like:

[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+-[0-9]+

And to those who know regex: yes, I know my expression is somewhat lazy. I will upvote anyone who posts an answer with a more precise expression.

Explanation:

  • Each [0-9] block means "one numeric character here";
  • The + means that the element to its left must occur at least once, but may occur several times. A more correct and efficient way to capture a CPF or CNPJ would be to repeat the numeric block, such as [0-9][0-9][0-9]. I leave it to you to do this;
  • The backslash escapes characters that have special meanings, so that their literal values are used (in this case, the . and the forward slash).
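The fixed-width variant that the second bullet leaves as an exercise could look like the sketch below. It assumes the formatted CNPJ layout 00.000.000/0000-00 (groups of 2, 3, 3, 4 and 2 digits). The class and method names are made up for illustration, and `[0-9]{3}` is just shorthand for `[0-9][0-9][0-9]`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CnpjFinder
{
    // Fixed digit counts instead of "+". The lookbehind and lookahead keep the
    // pattern from matching in the middle of a longer run of digits.
    private static readonly Regex Cnpj = new Regex(
        @"(?<![0-9])[0-9]{2}\.[0-9]{3}\.[0-9]{3}/[0-9]{4}-[0-9]{2}(?![0-9])");

    // Returns the first CNPJ-shaped substring, or null if there is none.
    public static string FindFirst(string text)
    {
        Match m = Cnpj.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

For a made-up input such as `"<b>12.345.678/0001-90</b>"`, `CnpjFinder.FindFirst` returns `12.345.678/0001-90`. It only checks the shape of the number, not its check digits, and it does not match the shortened value 00.000.000 shown in the question's sample.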

Note that because the expression contains backslashes, you must either escape them when you put it in a string or prefix the string with an at sign (@) to make it a verbatim string. You can use code similar to the one below:

using System.Text.RegularExpressions;

string input = ""; // this must contain your input text
Regex foo = new Regex(@"[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+-[0-9]+");
Match m = foo.Match(input);

if (m.Success) {
    // Match.Value holds the matched text (same as m.Groups[0].Value).
    string resultado = m.Value; // Assumes a single CNPJ per input.
}

Good luck!

    
10.11.2014 / 15:49