I cannot extract data from a site or .txt file

Question

Hello, I would like to extract some data from a website or from a .txt file, for example:

#EXTINF:-1 tvg-logo="https://i.imgur.com/rq9vXKI.jpg" group-title="FILMES",Mulher-Maravilha (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4
#EXTINF:-1 tvg-logo="https://i.imgur.com/ftEGyMy.jpg" group-title="FILMES",Guardiões da Galáxia Vol. 2 (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/GRDOESDGLXIAVL2.mp4

I would like to extract

"https://i.imgur.com/rq9vXKI.jpg", Mulher-Maravilha and http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4

"https://i.imgur.com/ftEGyMy.jpg", Guardiões da Galáxia and http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/GRDOESDGLXIAVL2.mp4

I then want to send this to a separate DB, in order. I have already tried regular expressions, but I could not get them to work. The goal is to make it easier to add movies to the site: I submit a list, and each movie is split out with its link, its image and its name. I do not care whether the data comes from a file or a site, as long as everything is separated correctly and inserted into the DB. Can someone please help me? Thank you.

php regex

asked by anonymous 27.12.2017 / 10:36

1 answer

score 0 · Answer 1

You can use a regular expression (regex). Here is one that works with your .txt:

#tvg-logo="(https?://[^\s]+)"(\s+|)group-title="\w+",([\s\S]+?)(https?://[^\s]+)#

Explanation of the regex:

tvg-logo="(https?://[^\s]+)" captures the image (thumbnail) URL. The (\s+|) right after it, before group-title, matches one or more whitespace characters, or nothing at all
group-title="\w+", matches a group name made of word characters, such as group-title="FILMES", or group-title="SERIES",
([\s\S]+?) lazily captures everything after the comma up to the next http link, which is the title plus the line break that follows it
https? matches either http or https
(https?://[^\s]+) captures the entire link, up to the next whitespace character
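To see which capture group holds which value, here is a minimal sketch that applies the same pattern to a single entry with preg_match. The variable names ($entry, $image, $gap, $title, $link) are only illustrative and are not part of the original answer.

```php
<?php
// One entry copied from the playlist in the question.
$entry = '#EXTINF:-1 tvg-logo="https://i.imgur.com/rq9vXKI.jpg" group-title="FILMES",Mulher-Maravilha (2017)' . "\n"
       . 'http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4';

$regex = '#tvg-logo="(https?://[^\s]+)"(\s+|)group-title="\w+",([\s\S]+?)(https?://[^\s]+)#';

if (preg_match($regex, $entry, $m)) {
    // $m[0] is the whole match; the numbered capture groups follow it.
    $image = $m[1];       // group 1: the tvg-logo URL
    $gap   = $m[2];       // group 2: whitespace between the two attributes
    $title = trim($m[3]); // group 3: text after the comma, without the line break
    $link  = $m[4];       // group 4: the stream link

    echo 'image: ', $image, PHP_EOL;
    echo 'title: ', $title, PHP_EOL;
    echo 'link:  ', $link, PHP_EOL;
}
```

Group 2 exists only because (\s+|) is written with parentheses; it is never needed afterwards, which is why the script skips one index when reading the results.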

So the PHP script would look like this:

$txt = file_get_contents('arquivo.txt');

$regex = '#tvg-logo="(https?://[^\s]+)"(\s+|)group-title="\w+",([\s\S]+?)(https?://[^\s]+)#';

if (preg_match_all($regex, $txt, $output)) {
    // Drop the full matches, so $output[0] to $output[3] hold capture groups 1 to 4
    array_shift($output);

    $j = count($output[0]);

    echo '------------------', PHP_EOL;

    for ($i = 0; $i < $j; $i++) {
        $titulo = trim($output[2][$i]); // Get the title
        $imagem = $output[0][$i];       // Get the image
        $url = $output[3][$i];          // Get the URL

        echo 'Titulo: ', $titulo, '<br>';
        echo 'imagem: ', $imagem, '<br>';
        echo 'url: ', $url, '<hr>';
    }
}
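As a possible variation, preg_match_all also accepts the PREG_SET_ORDER flag, which returns one array per match instead of one array per group, so the array_shift step and the index arithmetic are not needed. The sketch below combines it with named groups; the group names (image, title, url), the $movies array and the replacement of (\s+|) by \s* are my own choices, not part of the original answer.

```php
<?php
// Sample data copied from the question (nowdoc, so nothing is interpolated).
$txt = <<<'M3U'
#EXTINF:-1 tvg-logo="https://i.imgur.com/rq9vXKI.jpg" group-title="FILMES",Mulher-Maravilha (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4
#EXTINF:-1 tvg-logo="https://i.imgur.com/ftEGyMy.jpg" group-title="FILMES",Guardiões da Galáxia Vol. 2 (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/GRDOESDGLXIAVL2.mp4
M3U;

// Same pattern, with named groups and \s* in place of the (\s+|) group.
$regex = '#tvg-logo="(?<image>https?://[^\s]+)"\s*group-title="\w+",(?<title>[\s\S]+?)(?<url>https?://[^\s]+)#';

$movies = [];

// PREG_SET_ORDER: $matches[0] is the first movie, $matches[1] the second, and so on.
if (preg_match_all($regex, $txt, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $movies[] = [
            'title' => trim($m['title']),
            'image' => $m['image'],
            'url'   => $m['url'],
        ];
    }
}

foreach ($movies as $movie) {
    echo 'Title: ', $movie['title'], PHP_EOL;
    echo 'Image: ', $movie['image'], PHP_EOL;
    echo 'URL:   ', $movie['url'], PHP_EOL;
    echo '------------------', PHP_EOL;
}
```

The entries keep the order in which they appear in the file, which matters if they are later stored in that order.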

An example test, with the sample data from the question inline:

$txt = '
#EXTINF:-1 tvg-logo="https://i.imgur.com/rq9vXKI.jpg" group-title="FILMES",Mulher-Maravilha (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4
#EXTINF:-1 tvg-logo="https://i.imgur.com/ftEGyMy.jpg" group-title="FILMES",Guardiões da Galáxia Vol. 2 (2017)
http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/GRDOESDGLXIAVL2.mp4
';

$regex = '#tvg-logo="(https?://[^\s]+)"(\s+|)group-title="\w+",([\s\S]+?)(https?://[^\s]+)#';

if (preg_match_all($regex, $txt, $output)) {
    // Drop the full matches, so $output[0] to $output[3] hold capture groups 1 to 4
    array_shift($output);

    $j = count($output[0]);

    echo '------------------', PHP_EOL;

    for ($i = 0; $i < $j; $i++) {
        $titulo = trim($output[2][$i]); // Get the title
        $imagem = $output[0][$i];       // Get the image
        $url = $output[3][$i];          // Get the URL

        echo 'Titulo: ', $titulo, PHP_EOL;
        echo 'imagem: ', $imagem, PHP_EOL;
        echo 'url: ', $url, PHP_EOL;
        echo '------------------', PHP_EOL;
    }
}
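The question also asks for the values to be stored in a database, which the script above does not do. A minimal sketch of that step with PDO and a prepared statement could look like the following. Everything database-related here is an assumption: the table name movies, its columns, the saveMovies helper and the in-memory SQLite connection (which needs the pdo_sqlite extension) are hypothetical, and a real site would use its own DSN and schema.

```php
<?php
/**
 * Hypothetical helper: inserts the parsed movies in the order given.
 * Assumes a table "movies" with columns title, image and url.
 */
function saveMovies(PDO $pdo, array $movies): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO movies (title, image, url) VALUES (:title, :image, :url)'
    );

    // One transaction, so either every movie is stored or none is.
    $pdo->beginTransaction();
    try {
        foreach ($movies as $movie) {
            $stmt->execute([
                ':title' => $movie['title'],
                ':image' => $movie['image'],
                ':url'   => $movie['url'],
            ]);
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }

    return count($movies);
}

// Assumed connection: an in-memory SQLite database, used only for demonstration.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE movies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT, image TEXT, url TEXT
)');

// Values as the loop above produces them ($titulo, $imagem, $url).
$movies = [
    [
        'title' => 'Mulher-Maravilha (2017)',
        'image' => 'https://i.imgur.com/rq9vXKI.jpg',
        'url'   => 'http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/MLHRMRVLHA.mp4',
    ],
    [
        'title' => 'Guardiões da Galáxia Vol. 2 (2017)',
        'image' => 'https://i.imgur.com/ftEGyMy.jpg',
        'url'   => 'http://cdnv4.ec.cx/RedeCanais/RedeCanais/RCFServer1/ondemand/GRDOESDGLXIAVL2.mp4',
    ],
];

$saved = saveMovies($pdo, $movies);
echo $saved, ' movies saved', PHP_EOL;
```

The prepared statement keeps titles containing quotes or other special characters from breaking the query, and the auto-increment id preserves the order of the list.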