Simple_html_dom: what is the difference between the two URLs?

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";
$URL2 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/nickoff-bingo/3357750";

URL2 works and I can extract data from it; URL1 does not.

<?php 

include "simple_html_dom.php";
$CARDGALGO = file_get_html($URLX); // $URLX stands for $URL1 or $URL2

echo $CARDGALGO;

?>

php html dom web-scraping

asked by anonymous 07.09.2018 / 16:54

1 answer

score 0 · Accepted Answer

I debugged the script and noticed that the page at URL1 exceeds the MAX_FILE_SIZE limit, which is currently 600000 bytes; see line 66 of simple_html_dom.php:

 define('MAX_FILE_SIZE', 600000);
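To confirm that a page hits this limit, you can compare the size of its body with the value of the constant. This is only a minimal sketch: it assumes the library rejects any content longer than MAX_FILE_SIZE bytes, and the helper name exceedsMaxFileSize() is hypothetical, not part of simple_html_dom.

```php
<?php

// Hypothetical helper (not part of simple_html_dom): returns true when the
// page body is longer, in bytes, than the given limit.
function exceedsMaxFileSize(string $html, int $limit = 600000): bool
{
    return strlen($html) > $limit;
}

// Stand-in bodies; in practice $html would come from file_get_contents($URL1).
$small = '<html><body>ok</body></html>';
$large = str_repeat('a', 600001);

var_dump(exceedsMaxFileSize($small)); // bool(false)
var_dump(exceedsMaxFileSize($large)); // bool(true)
```

Running the same check on the bodies of both URLs should show which one is over the limit.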

So you can either increase this limit, or drop the extra library and use PHP's native DOM API (DOMDocument).

Example:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

To get specific elements you can use:

Get by ID: DOMDocument::getElementById
Get all elements of a type: DOMDocument::getElementsByTagName

Get the text of a specific element by ID:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

echo 'Text: ', $doc->getElementById('logo')->textContent, '<br>';

This example reads the following part of the page:

<header id="header" role="banner">
    <div class="hix">
        <a href="greyhounds" id="logo">Ladbrokes</a>
        <div id="nav-mobile-open"></div>
    </div>
</header>
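Note that getElementById() returns null when no element has the given ID, so it is safer to check the result before reading textContent. A minimal offline sketch, assuming the fragment above as input (loaded from a string with loadHTML() instead of from the URL); the variable names are only illustrative:

```php
<?php

// The fragment quoted above, used as a stand-in for the downloaded page.
$html = '<header id="header" role="banner">
    <div class="hix">
        <a href="greyhounds" id="logo">Ladbrokes</a>
        <div id="nav-mobile-open"></div>
    </div>
</header>';

$doc = new DOMDocument;

// libxml may not know HTML5 tags such as <header>, so keep its warnings internal.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById() returns null when the ID is not found.
$logo = $doc->getElementById('logo');
$logoText = $logo !== null ? trim($logo->textContent) : null;

$missing = $doc->getElementById('does-not-exist');

echo 'Text: ', $logoText ?? '(not found)', "\n";
echo 'Missing: ', $missing === null ? '(not found)' : $missing->textContent, "\n";
```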

To get all elements of a given type, such as all links, you could do something like:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

foreach ($doc->getElementsByTagName('a') as $node) {
    echo 'Text: ', $node->textContent, '<br>';
}
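If you also need the link targets, each node returned by getElementsByTagName() is a DOMElement, so its attributes can be read as well. A rough sketch on made-up sample markup (the markup and the $links array are assumptions for illustration, not taken from the real page):

```php
<?php

// Sample markup standing in for the downloaded page (an assumption).
$html = '<p><a href="greyhounds" id="logo">Ladbrokes</a> '
      . '<a href="/results">Results</a> <a>No target</a></p>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$links = [];
foreach ($doc->getElementsByTagName('a') as $node) {
    // Skip anchors that have no href attribute.
    if ($node->hasAttribute('href')) {
        $links[$node->getAttribute('href')] = trim($node->textContent);
    }
}

print_r($links);
```

Here $links maps each href to its link text, and the anchor without an href is left out.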

Using DOMXPath

The most practical way to get specific elements, though, is XPath. On this page the fourth column of each table row holds the trainer's name, so the XPath expression would be something like:

//tr/td[4]

Example:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;
$doc->loadHTMLFile($URL1);

$xpath = new DOMXPath($doc);

$colunas = $xpath->query("//tr/td[4]");

echo 'Trainers:<br>';

foreach ($colunas as $node) {
    $nome = trim($node->textContent);
    echo ' - ', $nome, '<br>';
}
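The same query can be tried offline before pointing it at the real page. The sketch below uses a made-up table (the rows and trainer names are invented, only the column position matches the description above) and additionally drops empty cells and repeated names:

```php
<?php

// Sample table standing in for the real page; all values are made up.
$html = '<table>
    <tr><th>Date</th><th>Track</th><th>Distance</th><th>Trainer</th></tr>
    <tr><td>01/09</td><td>Track A</td><td>480m</td><td> Trainer One </td></tr>
    <tr><td>05/09</td><td>Track B</td><td>500m</td><td>Trainer Two</td></tr>
    <tr><td>09/09</td><td>Track A</td><td>480m</td><td>Trainer One</td></tr>
</table>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

$trainers = [];
foreach ($xpath->query('//tr/td[4]') as $node) {
    $name = trim($node->textContent);
    if ($name !== '') {
        $trainers[] = $name;
    }
}

// Drop repeated names and reindex the array.
$trainers = array_values(array_unique($trainers));

print_r($trainers);
```

The header row is skipped automatically because its cells are th elements, which td[4] does not match.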

Avoiding warnings caused by HTML errors on a page

The pages you linked contain many HTML errors, which can trigger a lot of warnings. To keep them from being displayed, you can simply enable libxml's internal error handling before loading the page and restore the previous state afterwards, like this:

<?php

$URL1 = "http://ladbrokes.365dm.com/greyhounds/profile/dog/oor-millie/3334094";

$doc = new DOMDocument;

$estadoOriginal = libxml_use_internal_errors(true);

$doc->loadHTMLFile($URL1);

libxml_clear_errors();

libxml_use_internal_errors($estadoOriginal);
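If you want to see what was suppressed instead of just discarding it, the collected errors can be read with libxml_get_errors() before the buffer is cleared. A small sketch on deliberately sloppy sample markup (the markup is an assumption, and which errors are reported, if any, depends on the libxml version):

```php
<?php

// Deliberately sloppy markup standing in for the real page (an assumption).
$html = '<div><p>Unclosed paragraph<span>text</div><foo>unknown tag</foo>';

$doc = new DOMDocument;

$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);

// Copy the collected errors before clearing the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    echo 'Line ', $error->line, ': ', trim($error->message), "\n";
}
```

The document is still parsed and usable afterwards; the errors are only kept out of the output.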