Using file_get_contents to extract a specific part of a sitemap

0

Can someone help me?

I have the following code:

 <?php

 $url = file_get_contents('https://www.site.com.br/sitemap.xml');
 echo $url;

 ?>

I need the following:

The sitemap contains several URLs with the following structure: www.site.com.br/numero/123/ (I need to get all the numbers between /numero/ and the next /).

The links are listed one after another, with no separator.

Example: www.site.com.br/numero/123/www.site.com.br/numero/124/www.site.com.br/numero/125/

I need to list it as follows:

123
124
125 
etc...
    
asked by anonymous 22.06.2017 / 01:20

3 answers

2

You can use preg_match_all like this:

<?php

$dados = file_get_contents('https://www.site.com.br/sitemap.xml');

if (preg_match_all('#www\.site\.com\.br/numero/([^/]+)/#', $dados, $matches)) {
    $matches = $matches[1];

    foreach ($matches as $value) {
        echo $value, '<br>', PHP_EOL;
    }
}

The regex is #www\.site\.com\.br/numero/([^/]+)/#. The dots have a \ in front to escape them, because an unescaped dot matches any character (except a line break). Inside the capture group ([^/]+), the [^/] tells preg_match_all to take any character except /. This way it extracts everything that comes after www.site.com.br/numero/ and before the next slash.

Example on IDEONE
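
To see the pattern in isolation, here is a minimal sketch that runs it on a hard-coded string instead of a downloaded sitemap. The sample value of $dados and the $numeros variable are assumptions made for illustration only.

```php
<?php

// Sample input (an assumption): the URLs glued together as in the question.
$dados = 'www.site.com.br/numero/123/www.site.com.br/numero/124/www.site.com.br/numero/125/';

$numeros = [];
if (preg_match_all('#www\.site\.com\.br/numero/([^/]+)/#', $dados, $matches)) {
    // $matches[0] holds the full matches, $matches[1] the first capture group.
    $numeros = $matches[1];
}

echo implode(PHP_EOL, $numeros), PHP_EOL;
```

With this sample string it should print 123, 124 and 125, one per line.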

XML

Now, if the sitemap is actually XML and what you see is this:

www.site.com.br/numero/123/www.site.com.br/numero/124/www.site.com.br/numero/125/

then it is probably just how your browser displays the "XML" without rendering the tags, so preg_match or substr applied to the text as displayed may not behave as expected on the real file. Assuming your XML (if it really is XML) looks more or less like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.site.com.br/numero/123/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.site.com.br/numero/124/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.site.com.br/numero/125/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.site.com.br/numero/126/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

Then you can use DOM or simplexml_load_file (or simplexml_load_string). In this case, using SimpleXML:

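<?php

$urlset = simplexml_load_file('sitemap.xml');
$numeros = [];

foreach ($urlset as $url) {
    // $url->loc is a SimpleXMLElement, so cast it to a string for preg_match
    if (preg_match('#www\.site\.com\.br/numero/([^/]+)/#', (string) $url->loc, $match)) {
        $numeros[] = $match[1];
    }
}

// Print the numbers collected above
foreach ($numeros as $value) {
    echo $value, '<br>', PHP_EOL;
}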

With $url->loc you get the value of the <loc> tag. If your XML has a different format, just replace ->loc with the name of the tag you use.

Example on IDEONE
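
The answer mentions DOM as an alternative without showing it, so here is a rough sketch of that route. It is only an illustration: the inline $xml sample, the shortened pattern and the $numeros variable are assumptions, not part of the original code.

```php
<?php

// Sample sitemap (an assumption), trimmed to two entries.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url><loc>http://www.site.com.br/numero/123/</loc></url>
   <url><loc>http://www.site.com.br/numero/124/</loc></url>
</urlset>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$numeros = [];
// The elements carry no prefix, so the plain tag name 'loc' matches them.
foreach ($doc->getElementsByTagName('loc') as $loc) {
    // textContent is the URL inside <loc>; capture what follows /numero/.
    if (preg_match('#/numero/([^/]+)/#', $loc->textContent, $match)) {
        $numeros[] = $match[1];
    }
}

echo implode(PHP_EOL, $numeros), PHP_EOL;
```

For a real sitemap you would presumably call $doc->load() with the file path or URL instead of loadXML() with an inline string.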

    
22.06.2017 / 05:03
0

This will only work if the sitemap content really has the structure shown in your example. If it is different, you will have to adapt the code.

<?php
$sitemap = file_get_contents('https://www.site.com.br/sitemap.xml');
$lista = array();
$key = 0;
// Compare against false explicitly: strpos() returns 0 for a match at position 0
while (($pos = strpos($sitemap, '/numero/')) !== false) {
    // Drop everything up to and including "/numero/" (8 characters)
    $sitemap = substr($sitemap, $pos + 8);
    // Keep what comes before the next slash
    $lista[$key] = substr($sitemap, 0, strpos($sitemap, '/'));
    $key++;
}
/* At this point the array $lista has the following structure:
array(3) {
  [0]=>
  string(3) "123"
  [1]=>
  string(3) "124"
  [2]=>
  string(3) "125"
}
*/

// Loop through the array to get the value of each key...
foreach($lista as $key => $value) {
    echo $value.'<br/>';
}
?>
    
22.06.2017 / 01:59
0

We can use a small helper function that extracts 3 characters starting at the position where the number begins. Note that this assumes every number has exactly three digits.

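// Returns the first $length characters of $str
function esquerda($str, $length) {
   return substr($str, 0, $length);
}

$url = file_get_contents('https://www.site.com.br/sitemap.xml');

// Compare against false explicitly: strpos() returns 0 for a match at position 0
while (($pos = strpos($url, '/numero/')) !== false) {
    // Skip past "/numero/" (8 characters)
    $url = substr($url, $pos + 8);
    echo esquerda($url, 3);
    echo "<br>";
}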
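
If the numbers are not always three digits long, one possible variation is to measure the run of digits with strspn instead of using a fixed length. This is only a sketch: the sample string in $url and the $numeros array are assumptions for illustration.

```php
<?php

// Sample input (an assumption) with numbers of different lengths.
$url = 'www.site.com.br/numero/7/www.site.com.br/numero/1234/';

$numeros = [];
while (($pos = strpos($url, '/numero/')) !== false) {
    // Drop everything up to and including "/numero/" (8 characters).
    $url = substr($url, $pos + 8);
    // strspn() counts how many leading characters are digits.
    $numeros[] = substr($url, 0, strspn($url, '0123456789'));
}

echo implode('<br>', $numeros);
```

With the sample string above, $numeros ends up holding "7" and "1234".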


22.06.2017 / 02:44