Check whether a string is present and insert custom tags


I would like to take the content of a site, keep only the text, and wrap it in my own tags. In the code I wrote, however, once it finds the text "Art" it never leaves that branch of the if: only the first items are wrapped in li, and everything after that is wrapped in ul.

Could someone help me?


    # Use the cURL extension to download the page
    $url = "www.planalto.gov.br/ccivil_03/constituicao/constituicaocompilado.htm";
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch);

    # Create a DOM parser object
    $dom = new DOMDocument();

    # Parse the downloaded HTML.
    # The @ before the method call suppresses any warnings that
    # loadHTML might throw because of invalid HTML in the page.
    @$dom->loadHTML($html);




    # Iterate over all the <font> tags
    foreach ($dom->getElementsByTagName('font') as $link) {

        $mystring = $link->nodeValue;
        $findme   = 'Art';
        $pos = strpos($mystring, $findme);

        if ($pos === false) {

            echo "<li>";
            echo $link->nodeValue;
            echo "</li>";

        } else {

            echo "</ul>";
            echo "<ul id='' class='artigo'>";
            echo "<li>";
            echo $link->nodeValue;
            echo "</li>";

        }
    }

The end result should look like this:

    <ul id="titulo1" class="titulo">
        <h3>TÍTULO I</h3>
        <p>Dos Princípios Fundamentais</p>
    </ul>
    <ul id="titulo1_artigo1" class="artigo">
        <li>
            <ul class="caput">
                <li>
                    Art. 1º A República ... tem como fundamentos:
                </li>
            </ul>
        </li>
        <li>
            <ul class="incisos">
                <li> I - a soberania;</li>
                <li> II - a cidadania</li>
                <li> III - o pluralismo político.</li>
            </ul>
        </li>
        <li>
            <ul class="paragrafos">
                <li>Parágrafo único. Todo o ... desta Constituição.
                </li>
            </ul>
        </li>

    </ul>
    <ul id="titulo1_artigo2" class="artigo">
        <li>
            <ul class="caput">
                <li>
                    Art. 2º São Poderes da União, independentes e harmônicos entre si, o Legislativo, o Executivo e o Judiciário.
                </li>
            </ul>
        </li>
    </ul>
    
asked by anonymous 18.04.2015 / 21:48

1 answer


Testing your code more closely, I noticed that the text is repeated several times. This happens because getElementsByTagName returns both the parent element and its child elements, and the loop prints nodeValue for each of them, so the text will always repeat. I considered using XPath, but the underlying problem is that this particular HTML document does not split each piece of content into its own block; it relies only on line breaks.
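The repetition can be reproduced with a tiny document. This is only a minimal sketch: the `$sample` markup is a made-up stand-in for the real page, assuming it nests one `font` element inside another.

```php
<?php
// Hypothetical miniature of the page: a <font> nested inside another <font>.
$sample = '<p><font face="Arial"><font size="2">Art. 1 Example text</font></font></p>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

// getElementsByTagName() returns the outer and the inner <font>,
// and the outer element's nodeValue already contains the inner text.
$values = [];
foreach ($dom->getElementsByTagName('font') as $font) {
    $values[] = $font->nodeValue;
}

print_r($values);
```

Both entries of `$values` hold the same text, which is why the same sentence is printed more than once.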

It may be possible to do this with XPath or something similar, but it looks very laborious.

So, with the line breaks in mind, I thought of the following: instead of reading the page as a DOM, you can read it as text, line by line, and detect where each article starts and ends.
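For reference, one way the XPath route might start is to select only the innermost `font` elements, so nested text is not reported twice. This is a sketch under assumptions: the `$sample` markup is invented, and the real page may mix text and nested elements in ways this query does not cover.

```php
<?php
// Hypothetical sample: one nested <font> pair and one plain <font>.
$sample = '<p><font face="Arial"><font size="2">Art. 1 Example text</font></font></p>'
        . '<p><font face="Arial">I - first item</font></p>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

$xpath = new DOMXPath($dom);

// Keep only <font> elements that do not contain another <font>.
$leaves = $xpath->query('//font[not(.//font)]');

$texts = [];
foreach ($leaves as $font) {
    $texts[] = trim($font->nodeValue);
}

print_r($texts);
```

Text placed directly inside an outer `font`, next to a nested one, would be skipped by this query, which is part of why this route gets laborious on a document like this.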

To read line by line, I recommend using tmpfile(), feof() and fgets(). tmpfile() holds the page you downloaded.

// Write the page to a temporary file
$handle = tmpfile();
fwrite($handle, $html);
fseek($handle, 0);

$html = NULL;

$initiate = false;
$inTitle = false;

// Helper that strips tags, including tag fragments cut off at a line boundary, from a line
function removeTags($data) {
    $data = trim($data);
    $data = preg_replace('/[<][^>]+[>]|[<][^<>]+$|^[^<>]+[>]/', '', $data);
    return trim($data);
}

// Walk through the file line by line
while (false === feof($handle)) {
    $buffer = fgets($handle); // Read one line

    // Skip empty lines
    if (trim($buffer) === '') {
        continue;
    }

    // Detect where an article starts
    $findme = strpos($buffer, '>Art.') !== false;

    // Detect a "possible" end of the article or title
    $endLine = stripos($buffer, '</p>') !== false;

    if ($findme) {

        // If at least one article has already been printed, the previous article's items have ended
        if ($initiate) {
            echo '<hr>', PHP_EOL, PHP_EOL;
        }

        // Record that at least one article has been found
        $initiate = true;

        // Record that we are inside the article's title
        $inTitle = true;
        echo '<h1>', removeTags($buffer);
    } else if ($inTitle && $endLine) {
        // Inside the title, and a possible closing of the title was detected
        $inTitle = false;
        echo removeTags($buffer), '</h1>', PHP_EOL;
    } else if ($initiate) {
        // If not inside a title, print the data
        $data = removeTags($buffer);

        // If the cleaned line is empty, skip to the next line
        if ($data === '') {
            continue;
        }

        echo $data, $inTitle ? '' : ('<br>' . PHP_EOL);
    }
}

// Close the temporary file
fclose($handle);
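To make the behaviour of removeTags() easier to follow, here is a small standalone sketch that applies the same regular expression to three made-up lines: a line with complete tags, a line that ends in the middle of a tag, and a line that starts with the tail of a tag. The sample strings are assumptions, not lines taken from the real page.

```php
<?php
// Same helper as above, repeated so that this snippet runs on its own.
function removeTags($data) {
    $data = trim($data);
    $data = preg_replace('/[<][^>]+[>]|[<][^<>]+$|^[^<>]+[>]/', '', $data);
    return trim($data);
}

// Hypothetical sample lines.
$complete = removeTags('<p><font face="Arial">Art. 1 Example</font></p>');
$cutAtEnd = removeTags("Some text<font\n");          // tag opened, closed on the next line
$cutAtStart = removeTags('face="Arial">More text');  // tail of a tag opened on the previous line

echo $complete, PHP_EOL, $cutAtEnd, PHP_EOL, $cutAtStart, PHP_EOL;
```

The three alternatives in the pattern handle, in order, a complete tag, a tag left open at the end of the line, and a tag closed at the start of the line.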

Note that you can replace tmpfile() with fopen() and save the formatted HTML to a file, so you do not need to repeat the whole process on every request.
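A minimal sketch of that idea follows. The file name and the `$formatted` string are placeholders chosen for illustration; in practice `$formatted` would hold whatever the loop produced.

```php
<?php
// Hypothetical cache location; any writable path would do.
$cacheFile = sys_get_temp_dir() . '/constituicao_formatted.html';

// Stand-in for the HTML produced by the line-by-line loop.
$formatted = '<h1>Art. 1</h1>' . PHP_EOL . 'I - example item<br>' . PHP_EOL;

// Write the formatted output once.
$out = fopen($cacheFile, 'w');
fwrite($out, $formatted);
fclose($out);

// On later runs, reuse the saved copy instead of downloading and parsing again.
$cached = is_file($cacheFile) ? file_get_contents($cacheFile) : null;
echo $cached;
```

Collecting the output in a string (or writing to the file handle directly) replaces the echo calls used in the example loop.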

This code is only an example, so it does not do everything you need and still lacks some details. The process is the same, though: use flag variables to detect, for example, where each article starts and ends.
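As a rough illustration of the same flag technique applied to the ul/li output from the question, the sketch below only closes a list when one is already open. The function name `wrapArticles` and the sample lines are hypothetical, and it assumes the lines have already been cleaned of tags.

```php
<?php
// Hypothetical helper: wraps already-cleaned lines in <ul class="artigo"> blocks.
function wrapArticles(array $lines): string
{
    $out = '';
    $open = false; // true while a <ul> is open

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }

        // A line starting with "Art." begins a new article.
        if (strpos($line, 'Art.') === 0) {
            if ($open) {
                $out .= '</ul>' . PHP_EOL; // close the previous article first
            }
            $out .= '<ul class="artigo">' . PHP_EOL;
            $open = true;
        }

        // Lines before the first article are skipped.
        if ($open) {
            $out .= '    <li>' . htmlspecialchars($line) . '</li>' . PHP_EOL;
        }
    }

    if ($open) {
        $out .= '</ul>' . PHP_EOL; // close the last article
    }

    return $out;
}

echo wrapArticles([
    'Preamble text',
    'Art. 1 First article:',
    'I - first item;',
    'Art. 2 Second article.',
]);
```

The nested caput, incisos and paragrafos lists from the desired output would need further flags of the same kind, one per level.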

    
19.04.2015 / 18:04