As part of a procedure, I need to extract the contents of a table from a web page. I'm using cURL to fetch the raw HTML and the Simple HTML DOM Parser to parse it.
<?php
// (...)
require_once('simple_html_dom.php');
// (...)
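// $strPagina holds the raw HTML fetched with cURL.
$objPagina = str_get_html($strPagina);
// Take the first <table> element on the page and print its markup.
$objItems = $objPagina->find('table', 0);
echo $objItems->outertext;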
?>
At first everything works as desired. However, in one specific case the received HTML is malformed, and the Simple HTML DOM Parser cannot parse it correctly, so it returns an incorrect result.
The browser displays the content properly, but as far as I know browsers are designed to render malformed HTML correctly. In fact, if I open Firefox's developer tools, copy the HTML displayed there, save it to a text file, and use that text as the source for the parser, I get the desired result.
Since I cannot modify the HTML I receive, what can I do to process it programmatically? It looks like I should not use regular expressions.
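One direction that may be worth trying is PHP's built-in DOMDocument, whose loadHTML() method relies on libxml2's HTML parser and attempts to recover from broken markup, somewhat like a browser does. The following is only a minimal sketch under assumptions: the malformed sample string is made up for illustration (the real markup is not shown here), and the repaired output may still differ from what Firefox produces.

```php
<?php
// Hypothetical malformed sample (unclosed <td>, <tr> and <table>).
// In the real script this would be the cURL response in $strPagina.
$strPagina = '<html><body><table><tr><td>A<td>B<tr><td>C<td>D</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$blnPrevious = libxml_use_internal_errors(true);

$objDom = new DOMDocument();
// loadHTML() tries to recover from malformed markup while building the tree.
$objDom->loadHTML($strPagina);

// Discard the collected errors and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($blnPrevious);

// Take the first <table>, mirroring find('table', 0) in the original snippet.
$objTable = $objDom->getElementsByTagName('table')->item(0);

// saveHTML($node) serializes just that node; fall back to an empty string.
$strTable = ($objTable !== null) ? $objDom->saveHTML($objTable) : '';

echo $strTable;
```

If the rest of the procedure depends on the Simple HTML DOM Parser, the repaired document from `$objDom->saveHTML()` could presumably be passed to `str_get_html()` instead of the raw cURL response.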