Scan page source and extract the Wikipedia URL

3

I'm having trouble with a regex. :/

I was using the pattern wikipedia\.org[^\" ]+ to pull Wikipedia URLs out of the source code of a Google search results page.

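But the URLs are embedded like this: <a href="/url?q=http://pt.wikipedia.org/wiki/ASP_World_Tour&amp;sa=U&amp;ei=yS6WVOvAA9HLsAShkYGoCw&amp;ved=0CBQQFjAB&amp;usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg" so the match runs all the way to the closing quote and drags the tracking parameters (everything from the first &amp; onward) along with it.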

That is not a valid Wikipedia URL; the correct one would be just http://pt.wikipedia.org/wiki/ASP_World_Tour

    
asked by anonymous 21.12.2014 / 03:28

2 answers

5

Since the Wikipedia URL itself does not contain the ? character, which marks the start of a query string, a simple approach is a regular expression that removes everything from the first & onward:

$url = 'http://pt.wikipedia.org/wiki/ASP_World_Tour&sa=U&ei=yS6WVOvAA9HLsAShkYGoCw&ved=0CBQQFjAB&usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg';

// Drop the first "&" and everything after it
$url = preg_replace('/&.*/', '', $url);

See the example on Ideone:

echo $url; // Output: http://pt.wikipedia.org/wiki/ASP_World_Tour
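
If you would rather start from the raw href exactly as it appears in the question, a possible alternative (not part of the answer above) is to decode the HTML entities and read the q parameter from the query string instead of cutting at the first &. This is only a minimal sketch: the function name extractTargetUrl is made up for illustration, and it assumes the href has the /url?q=... shape shown in the question.

```php
<?php
// Hypothetical helper: returns the value of the "q" parameter of a
// Google redirect href, or null when there is no such parameter.
function extractTargetUrl(string $href): ?string
{
    // The page source escapes "&" as "&amp;", so decode entities first.
    $decoded = html_entity_decode($href);

    // Everything after the first "?" is the query string.
    $queryStart = strpos($decoded, '?');
    if ($queryStart === false) {
        return null;
    }

    // parse_str() splits on "&" and URL-decodes each value.
    parse_str(substr($decoded, $queryStart + 1), $params);

    return isset($params['q']) && is_string($params['q']) ? $params['q'] : null;
}

$href = '/url?q=http://pt.wikipedia.org/wiki/ASP_World_Tour&amp;sa=U&amp;ei=yS6WVOvAA9HLsAShkYGoCw&amp;ved=0CBQQFjAB&amp;usg=AFQjCNFbV5WzVcG-aJbrvGdhbxz3wnPUKg';

echo extractTargetUrl($href), PHP_EOL; // http://pt.wikipedia.org/wiki/ASP_World_Tour
```

Compared with the regex, this also handles a target URL that is percent-encoded inside q, since parse_str() decodes the value.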
    
21.12.2014 / 03:37
2

So you want to get the Wikipedia link out of a Google search, is that right?

Well, I just wrote and tested a solution for you. It may not be the best one, but it works! :D

<?php

    $TermoDeBusca = urlencode('ASP World Tour'); // Search term

    // cURL: fetch the Google results page
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, 'https://www.google.com.br/search?q='.$TermoDeBusca);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36');
    $html = curl_exec($ch);

    // DOM: parse the HTML; real-world markup is rarely valid, so collect parser warnings instead of printing them
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $items = $xpath->query("//h3[contains(@class, 'r')]//a"); // Selects every <a> inside an <h3> whose class contains 'r'

    foreach ($items as $pega) { // Loop over each link

        $link = $pega->getAttribute('href'); // Expected: http://pt.wikipedia.org/wiki/ASP_World_Tour

        // strpos() returns a position (possibly 0) or false, so compare strictly against false.
        // Quick-and-dirty check that $link points to Wikipedia.
        if (strpos($link, 'wikipedia.org') !== false) {
            echo $link.'<br>'; // If it does, print the link
        }

    } // end foreach
?>

I tried to comment the code as much as I could; unfortunately I don't have time for more. I did what I could! :D
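
One thing to keep in mind: the substring check in the loop above matches 'wikipedia.org' anywhere in the link, including inside some other site's query string. If that becomes a problem, a stricter variant could compare the host name instead. The sketch below is only an illustration; isWikipediaUrl is a made-up helper name, and it assumes $link is an absolute URL.

```php
<?php
// Hypothetical helper: true only when the URL's host is wikipedia.org
// or one of its subdomains (pt.wikipedia.org, en.wikipedia.org, ...).
function isWikipediaUrl(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed URL: no host to inspect
    }

    $host   = strtolower($host);
    $suffix = '.wikipedia.org';

    // Accept the bare domain or any host ending in ".wikipedia.org".
    return $host === 'wikipedia.org'
        || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(isWikipediaUrl('http://pt.wikipedia.org/wiki/ASP_World_Tour')); // bool(true)
var_dump(isWikipediaUrl('http://example.com/?ref=wikipedia.org'));       // bool(false)
```

Inside the foreach, the condition would then become if (isWikipediaUrl($link)) { ... }.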

    
21.12.2014 / 06:27