"Convert" an HTML list to MySQL with PHP

3

I have a large HTML list with about 106000 lines of code. Where these lines are records and these records are subdivided into:

  • 6 lines of information about a game (as name and year of publication)
  • 1 line break

So, each "record" in my HTML has 7 lines and therefore I have over 15,000 different records. The data is presented this way because it was removed from a list on the web and treated with PHP to stay as pleasant as possible.

The HTML data is as follows:

<h4>Jogo: Area 51</h4>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>
<!-- Quebra de linha, não é comentada assim no código original -->

So it's practically unfeasible to put all the data in my database manually and I need a way to at least convert to XML (or a way to already play everything straight into the DB ), so you can somehow get to the DB that is MySQL.

PS: In the middle of the process, it is important that the browser does not crash and stop responding so there is no addition of repeated data!

    
asked by anonymous 10.09.2014 / 07:45

2 answers

2

Quickly made a parser for you:

juegos.txt

<h4>Jogo: Area 51 : teste de : no nome do jogo</h4>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>

<h4>Jogo: Area 52</h4>
<li>Região: 3 - </li>
<li>Sistema: 9 - Sony PlayStation 2</li>
<li>Ano: 2004</li>
<li>Publicadoras: 1191 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1166 - Mesa Logic</li>

<h4>Jogo: Area 53</h4>
<li>Região: 4 - </li>
<li>Sistema: 10 - Sony PlayStation 3</li>
<li>Ano: 2005</li>
<li>Publicadoras: 1192 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1167 - Mesa Logic</li>

PHP:

$lines = file('jogos.txt');

$jogos = array();
foreach ($lines as $line_num => $line) {
    $line = strip_tags($line);
    if (substr($line, 0, 5) == 'Jogo:') $i++;
    list($key, $value) = preg_split('/:/', $line, 2);
    if ($value) $jogos[$i][$key] = $value;
}

print_r($jogos);

Return:

Array
(
    [1] => Array
        (
            [Jogo] => Area 51 : teste de : no nome do jogo
            [Região] => 2 - 
            [Sistema] => 8 - Sony PlayStation
            [Ano] => 2003
            [Publicadoras] => 1190 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1165 - Mesa Logic
        )
    [2] => Array
        (
            [Jogo] => Area 52
            [Região] => 3 - 
            [Sistema] => 9 - Sony PlayStation 2
            [Ano] => 2004
            [Publicadoras] => 1191 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1166 - Mesa Logic
        )
    [3] => Array
        (
            [Jogo] => Area 53
            [Região] => 4 - 
            [Sistema] => 10 - Sony PlayStation 3
            [Ano] => 2005
            [Publicadoras] => 1192 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1167 - Mesa Logic
        )
)

Now just use the array to generate the insert in the BD ...

Note: I ran a file with 11.5mb, 384257 lines and 54894 records and generated the array in less than 15 seconds.

Example with ID separation:

PHP:

$lines = file('jogos.txt');

$jogos = array();
foreach ($lines as $line_num => $line) {
    $line = strip_tags($line);
    if (substr($line, 0, 5) == 'Jogo:') ++$i;
    list($key, $value) = preg_split('/:/', $line, 2);
    $value = trim($value);
    if ($value) {
        if (strpos($value, '-') !== false) {
            if (strpos($value, ',') !== false) {
                $j = 0;
                foreach(explode(', ', $value) as $item) {
                    list($id, $str) = preg_split('/ -/', $item, 2);
                    $str = trim($str);
                    $jogos[$i][$key][$j]['id_'.$key] = $id;
                    if ($str) $jogos[$i][$key][$j][$key] = $str;
                    $j++;
                }
            } else {
                list($id, $str) = preg_split('/ -/', $value, 2);
                $str = trim($str);
                $jogos[$i]['id_'.$key] = $id;
                if ($str) $jogos[$i][$key] = $str;
            }
        } else {
            $jogos[$i][$key] = $value;
        }
    }
}

Return:

Array
(
    [1] => Array
        (
            [Jogo] => Area 51 : teste de : no nome do jogo
            [id_Região] => 2
            [id_Sistema] => 8
            [Sistema] => Sony PlayStation
            [Ano] => 2003
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1190
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1165
            [Desenvolvedora] => Mesa Logic
        )

    [2] => Array
        (
            [Jogo] => Area 52
            [id_Região] => 3
            [id_Sistema] => 9
            [Sistema] => Sony PlayStation 2
            [Ano] => 2004
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1191
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1166
            [Desenvolvedora] => Mesa Logic
        )

    [3] => Array
        (
            [Jogo] => Area 53
            [id_Região] => 4
            [id_Sistema] => 10
            [Sistema] => Sony PlayStation 3
            [Ano] => 2005
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1192
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1167
            [Desenvolvedora] => Mesa Logic
        )

)
    
10.09.2014 / 19:03
2

I would like to propose an alternative that aims to screw with a screwdriver and not with a hammer, that is, to syntactically analyze a hierarchical structure with a real parser .

This is one of the few situations where the excessive verbosity of DOM does not make it difficult to solve the problem. However, for this solution to work accordingly HTML must be semantically formulated. For this reason I'll be assuming an HTML that contains the < UL >:

<h4>Jogo: Area 51</h4>
<ul>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>
</ul>

The solution:

$dom = new DOMDocument;
$dom -> loadHTML( $html );

$data = array();

foreach( $dom -> getElementsByTagName( 'ul' ) as $node ) {

    if( $node -> hasChildNodes() ) {

        foreach( $node -> childNodes as $children ) {

            $nodeValue = trim( $children -> nodeValue );

            if( ! empty( $nodeValue ) ) {

                $structure = preg_split(

                    '/(.*?):\s+(.*?)/', $nodeValue, -1,

                    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
                );

                $data[ spl_object_hash( $node ) ][ $structure [ 0 ] ] = $structure[ 1 ];
            }
        }
    }
}

Iterate all

    ), but that would require additional code and would personally make less logical sense.

    To avoid errors, notices and cia. we will check if there are child-nodes, even if we are seeing that they exist. For this, we use DOMNode :: hasChildNodes () and, if they exist, we will work with the value of property DOMNode :: childNodes

    From that point on, we are no longer analyzing syntactically but manipulating the text of nodes. Let's break every string, already cleaned with trim () , separating the possible labels from their value.

    It is not the scope of the topic to explain the ER, but as you can see it is quite simple.

    By adding in the array $ data we need a way to make each set of information unique. We could do a little gambiar with a manually incremented counter but since we have many objects in the game, I opted for spl_object_hash () that returns a sequence numerical value for each object at runtime, that is, each time you refresh the page, it will be other values.

    This is purely structural, when iterating this array to insert into the bank just ignore the value of the first key. Simple as that!

        
10.09.2014 / 20:46