Get TAG value from external HTML link

10

I need to get the value (or values, if there is more than one) of the <link> tag from the HTML of another site.

Attempt:

$url = 'http://localhost/teste/';
$content = trim(file_get_contents($url));
preg_match("/<link(.*?)>/i",$content,$return); 
var_dump($return);

Return:

array (size=2)
  0 => string '<link rel="shortcut icon" href="http://localhost/teste/icon.png">' (length=77)
  1 => string ' rel="shortcut icon" href="http://localhost/teste/icon.png"' (length=71)

I do not know if I made myself clear, but I would like to return the following:

array (size=1)
  0 => 
    array (size=2)
      'rel' => string 'shortcut icon' (length=13)
      'href' => string 'http://localhost/teste/icon.png' (length=31)
    
asked by anonymous 11.02.2014 / 20:28

4 answers

6

In fact, a more comprehensive and therefore more appropriate regular expression (bearing in mind that HTML should really be handled through the DOM) would be:

/<link.*?href="(.*?)".*?>/i

Note that:

  • Since the question is tagged PHP and demonstrates usage with preg_match(), the g modifier does not apply: it is not among the pattern modifiers that PCRE supports.

  • According to the HTML and XHTML specifications, <link> has no value, only attributes; the two specifications differ mainly in how the tag is closed.

  • The href attribute you want will not always be in the same position, not even in HTML you wrote yourself. Hence the pattern allows for anything before and after the attribute.

As for usage, to capture all values, just use preg_match_all().
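
As a rough illustration, the call might look like the sketch below. The two-tag sample markup is hypothetical, and the pattern assumes double-quoted href values with each tag on a single line.

```php
<?php
// Hypothetical sample markup with two <link> tags.
$content = '<link rel="shortcut icon" href="http://localhost/teste/icon.png">'
         . '<link rel="stylesheet" href="http://localhost/teste/style.css">';

// With the default flags, $matches[0] holds the full tags
// and $matches[1] holds the captured href values.
preg_match_all('/<link.*?href="(.*?)".*?>/i', $content, $matches);

print_r($matches[1]);
```

Each entry of $matches[1] corresponds to one <link> tag, in document order.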

[EDIT]

As pointed out by @Sergio, since the first edit to the question the solution presented above no longer applies. The explanation it contains is still valuable, however, so it stays.

I will, however, remove what is superfluous. That content should remain available in the revision history of this answer (assuming that is a site-wide feature).

Please read carefully and notice how things get more complicated when you try to drive a screw with a hammer:

  • First we change the regular expression to find all the attributes.

  • PHP does not automatically capture "groups of groups", that is, you cannot define one group and have it capture every instance of its pattern, so you must separate each key=value pair yourself.

    In PHP this can be done in many ways. One viable alternative would be to remove the spaces between the key=value pairs and use parse_str(). But that already requires a regex, since a simple str_replace() would mangle values such as the one in rel, so let's do it all with regex.

  • We have to iterate over the array produced by preg_match_all(); that is unavoidable. But since I am applying the same routine to each element of the array, mapping its data to something else, I prefer to use array_map():

  • preg_split() does its job, but although it returns an array, that array is not in the format you want, with the attribute names as keys. We can work around that with array_chunk():

  • But array_chunk() produces N arrays inside the one we already had, which is itself inside another. OMFG! I do not want to iterate over all of that! In this case, a neat trick is to transpose the array, using what is probably the most-upvoted practical answer I have ever seen on the English Stack Overflow.

  • When you transpose this array, it looks like this:

    array (size=2)
      0 => 
        array (size=2)
          0 => string 'rel' (length=3)
          1 => string 'href' (length=4)
      1 => 
        array (size=2)
          0 => string 'shortcut icon' (length=13)
          1 => string 'http://localhost/teste/icon1.png' (length=32)
    

    This is something array_combine() handles easily:

    The full code can be copied and seen running at that link.
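
The steps above can be sketched as follows. This is a minimal sketch, not the linked code: the function name extractLinkAttributes and the sample markup are hypothetical. Instead of preg_split(), array_chunk(), and a transpose, it relies on preg_match_all() returning names and values as two parallel arrays (its default PREG_PATTERN_ORDER), which is already the transposed shape that array_combine() needs. It assumes double-quoted attribute values.

```php
<?php
// Hypothetical helper: returns one name => value array per <link> tag.
function extractLinkAttributes(string $html): array
{
    // Step 1: capture the raw attribute string of every <link> tag.
    // [^>]* also matches line breaks, so multi-line tags are covered.
    preg_match_all('/<link\b([^>]*)>/i', $html, $tags);

    // Step 2: map each attribute string to a name => value array.
    return array_map(function (string $attributes): array {
        // $pairs[1] holds the attribute names, $pairs[2] the values:
        // the same layout as the transposed array shown above.
        preg_match_all('/([\w:-]+)\s*=\s*"([^"]*)"/', $attributes, $pairs);

        // A tag with no double-quoted attributes yields an empty array.
        return $pairs[1] ? array_combine($pairs[1], $pairs[2]) : [];
    }, $tags[1]);
}

$html = '<link rel="shortcut icon" href="http://localhost/teste/icon.png">';
var_dump(extractLinkAttributes($html));
```

The result is an array with one entry per tag, each keyed by attribute name, which is the shape requested in the question.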

        
    11.02.2014 / 20:44
    8

    Try to fetch the data from the HTML by navigating the DOM rather than with regular expressions. Hypothetically, there could be a link inside another link, and your expression would then fail.

    There is a relatively old but well-known post about why you should not use regular expressions to parse HTML. Basically, HTML is not a regular language and so, by definition, cannot be parsed by a regular expression.

    link

    This applies, of course, to situations where you can navigate the HTML DOM (which is the case here, since you are using PHP).

    My solution, then, is as follows:

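    <?php
    $html = trim(file_get_contents('http://localhost/teste/'));
    $dom = new DOMDocument;
    // Use loadHTML(), not loadXML(): real-world HTML is rarely well-formed XML.
    // The @ suppresses the warnings raised by malformed markup.
    @$dom->loadHTML($html);
    $links = $dom->getElementsByTagName('link');
    foreach ($links as $link) {
        // DOMElement has no getAttributes() method; iterate its attributes map.
        $attributes = array();
        foreach ($link->attributes as $attribute) {
            $attributes[$attribute->name] = $attribute->value;
        }
        print_r($attributes);
    }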
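
To get the array shape asked for in the question, the loop can collect each element's attributes instead of printing them. A minimal sketch, assuming the page is HTML rather than well-formed XML (hence loadHTML()); the function name linkAttributesFromHtml and the inline sample markup are hypothetical.

```php
<?php
// Hypothetical helper: one name => value array per <link> element.
function linkAttributesFromHtml(string $html): array
{
    $dom = new DOMDocument;

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('link') as $link) {
        $attributes = [];
        // Each entry of the attributes map is a DOMAttr with name and value.
        foreach ($link->attributes as $attribute) {
            $attributes[$attribute->name] = $attribute->value;
        }
        $result[] = $attributes;
    }

    return $result;
}

$html = '<html><head><link rel="shortcut icon" href="http://localhost/teste/icon.png"></head><body></body></html>';
var_dump(linkAttributesFromHtml($html));
```

In real use, $html would come from file_get_contents() as in the snippet above.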
    
        
    11.02.2014 / 20:42
    4

    You could use the PHP Simple HTML DOM Parser class, which has good documentation.

        
    11.02.2014 / 20:47
    3

    I recommend PHP Simple HTML DOM Parser. It is great and very easy to use; I use it in several scripts to analyze HTML from other sites.

    Bruno Augusto's answer is very good. I just want to complement it with a few details that I think are important to take into account. When I need to parse HTML with a regular expression, I try to write a more complete pattern, since HTML is very irregular: attributes have no defined order, and the markup may contain line breaks. In your case I would use this more "complete" regular expression:

    /<link.*?href=\"([^\"]*?)\".*?\/?>/si
    

    Basically, the improvements are three substitutions:

    1 - from (.*?) to ([^\"]*?), which is the right thing to do, since the value cannot contain a " character when the attribute delimiter is also "; the same goes for the ' character.

    2 - from > to \/?>, because there may or may not be a / character before the closing >. The preceding .*? would already consume that slash, so this mainly makes the optional slash explicit.

    3 - from /i to /si, since there may be line breaks between attributes, values, and so on. Tags on real sites are not always written on a single line; one piece may be on one line and another piece on the next.

    If you use the original regular expression suggested by Bruno Augusto, it may fail to find LINK tags that are broken across lines, for example:

    $string = <<<EOF
    <link
    rel="shortcut icon"
    href="http://localhost/teste/icon.png"
    >
    EOF;
    
    if ( preg_match_all( '/<link.*?href="(.*?)".*?>/i', $string, $matches, PREG_SET_ORDER ) ) {
        var_dump( $matches );
        die();
    } else {
        echo 'No tag found.';
        /* This branch runs: no tags are found because of the line breaks inside the LINK tag, which . does not match without the s modifier */
    }
    

    Now, using the same sample with the more complete regular expression I suggested, the tag is found successfully:

    $string = <<<EOF
    <link
    rel="shortcut icon"
    href="http://localhost/teste/icon.png"
    >
    EOF;
    
    if ( preg_match_all( '/<link.*?href=\"([^\"]*?)\".*?\/?>/si', $string, $matches, PREG_SET_ORDER ) ) {
        /* Tags found successfully */
        var_dump( $matches );
        die();
    }
    
        
    14.02.2014 / 09:51