How to parse malformed HTML?

11

As part of a procedure, I need to extract the contents of a table present on a page. I'm using cURL to get the raw HTML data and the Simple HTML DOM Parser to parse and render HTML.

<?php

// (...)
require_once('simple_html_dom.php');
// (...)
$objPagina = str_get_html($strPagina);
$objItems =  $objPagina->find('table', 0);
echo $objItems->outertext;

?>

At first everything works as desired. However, in one specific case the HTML received is malformed. In that case the Simple HTML DOM Parser cannot parse the HTML correctly and returns an incorrect result.

The browser displays the content properly, but as far as I know browsers are designed to render malformed HTML correctly. In fact, if I open Firefox's developer tools, copy the HTML displayed there, paste it into a text file, and use that text as the source for the parser, I get the desired result.

Since I cannot modify the HTML I receive, how can I process it programmatically? It looks like I should not use regular expressions.

    
asked by anonymous 29.01.2015 / 17:57

2 answers

4

You can try PHP's tidy extension. With this extension it is possible to validate and repair malformed HTML.

An example (taken from the PHP manual):

// Configuration
$config = array(
           'indent'         => true,
           'output-xhtml'   => true,
           'wrap'           => 200);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Output
echo $tidy;

Note that the extension's official website appears to have last been updated in 2009, so this solution might not solve your problem.
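
To connect this back to the question, here is a minimal sketch of repairing the markup with tidy before extracting the first table. It assumes the tidy and DOM extensions are both loaded. The helper name extractFirstTable is hypothetical, and DOMDocument stands in for the Simple HTML DOM Parser only so that the example is self-contained; the repaired string could just as well be passed to str_get_html().

```php
<?php

// Hypothetical helper: repair malformed HTML with tidy, then return the
// outer HTML of the first <table>, or null if no table is found.
function extractFirstTable(string $html): ?string
{
    // Repair the markup, using the same options as the example above.
    $config = array(
        'indent'       => true,
        'output-xhtml' => true,
        'wrap'         => 200,
    );
    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();
    $clean = tidy_get_output($tidy);

    // Parse the repaired markup. libxml warnings are collected internally
    // instead of being emitted as PHP warnings, then discarded.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($clean);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $doc->getElementsByTagName('table')->item(0);
    return $table === null ? null : $doc->saveHTML($table);
}

// Malformed input: the <td> and <tr> elements are never closed.
$broken = '<html><body><table><tr><td>A<td>B</table><p>after';
echo extractFirstTable($broken);
```

If the result is still wrong for a particular page, inspecting the repaired string before parsing it usually shows whether tidy or the parser is at fault.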

    
29.01.2015 / 20:33
0

Try using xmllint directly.

1) Install xmllint (a small, free tool).

  

Regarding "I need to extract the contents of a table present on a page":

2) Invoke it:

xmllint --html --xpath '//table' 'http://my.remote.page/x.html' > tabelas.txt

(adapt the XPath expression to your needs). If it produces results, call the command from PHP.
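
A minimal sketch of calling the command from PHP follows. It assumes xmllint is on the PATH and that PHP runs under a Unix-like shell (the 2>/dev/null redirect hides the parser's warnings about the malformed markup). The helper names buildXmllintCommand and runXmllint are hypothetical.

```php
<?php

// Hypothetical helper: build the xmllint command line with every argument
// shell-escaped, so the URL and XPath expression cannot inject shell syntax.
function buildXmllintCommand(string $source, string $xpath): string
{
    return 'xmllint --html --xpath ' . escapeshellarg($xpath)
        . ' ' . escapeshellarg($source) . ' 2>/dev/null';
}

// Hypothetical helper: run xmllint and return its standard output,
// or null if the command exits with a non-zero status.
function runXmllint(string $source, string $xpath): ?string
{
    exec(buildXmllintCommand($source, $xpath), $lines, $status);
    return $status === 0 ? implode("\n", $lines) : null;
}

// Show the command that would be executed for the invocation above.
echo buildXmllintCommand('http://my.remote.page/x.html', '//table'), "\n";
```

Calling runXmllint('http://my.remote.page/x.html', '//table') then returns the matching markup as a string, or null when xmllint is missing or reports a failure.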

    
24.02.2015 / 19:45