Get external links with cURL PHP

0

I have this code, which fetches the Google home page and prints it.

<?php

$request = curl_init();
curl_setopt_array($request, [
    CURLOPT_URL             => 'https://www.google.com.br',
    CURLOPT_RETURNTRANSFER  => true,
    CURLOPT_SSL_VERIFYPEER  => false,
]);
$response = curl_exec($request);
curl_close($request);

echo $response;

However, external resources such as images do not load. Notice that instead of being requested from google.com / ... they are requested from my vhost, viperfollowdev.com (see the image below).

Is there any way to fix this?

My second attempt was:

<?php

$request = curl_init();

curl_setopt_array($request, array(
    CURLOPT_URL             => 'https://www.instagram.com',
    CURLOPT_RETURNTRANSFER  => true,
    CURLOPT_FOLLOWLOCATION  => true,
    CURLOPT_SSL_VERIFYPEER  => false,
));

$response = curl_exec($request);
curl_close($request);

$response = str_replace('/static/bundles/', 'https://www.instagram.com/static/bundles/', $response);
$response = str_replace('/static/images/', 'https://www.instagram.com/static/images/', $response);
$response = str_replace('/data/manifest.json', 'https://www.instagram.com/data/manifest.json', $response);

echo $response;

The content is fetched, but it still does not display on my page. I replaced the paths with the full URL, but it does not work.

    
asked by anonymous 22.01.2018 / 09:25

2 answers

2

About the error

This happens because Google does not use full URLs in attributes such as src and srcset. Instead, it uses only the path of the file, e.g. /path/to/image.png

As a result, the browser always looks for these files on the site being accessed, which in your case is http://viperfollowdev.com.
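
To make that resolution rule concrete, here is a minimal sketch of how a root-relative path ends up pointing at the local vhost. The helper name resolveRootRelative is hypothetical, and it covers only paths that start with a single slash; real browsers apply the full URL resolution rules.

```php
<?php

/**
 * Resolve a root-relative path (e.g. "/path/to/image.png") against the
 * origin of the page currently being displayed.
 * Hypothetical helper for illustration only: it ignores ports, "../"
 * segments, <base> tags, and every other case a real browser handles.
 */
function resolveRootRelative(string $pageUrl, string $path): string
{
    $parts = parse_url($pageUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: {$pageUrl}");
    }

    // Only the scheme and host of the *displayed* page matter; where the
    // HTML was originally fetched from plays no role.
    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

// The HTML came from Google, but it is served from the local vhost:
echo resolveRootRelative('http://viperfollowdev.com/index.php', '/path/to/image.png'), PHP_EOL;
// prints http://viperfollowdev.com/path/to/image.png
```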

Solutions

To fix this, simply output the line below just before printing the $response variable.

echo '<base href="https://www.google.com.br/" />';

However, this solution does not work in every case. If your HTML already contains a <base> tag, the browser uses only the first one and ignores any that come later.

When the <base> tag is not an option, a regex can solve the problem (or at least part of it).
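
As a rough sketch of the <base> approach, the hypothetical helper below places the tag right after the opening <head> of the fetched HTML instead of echoing it separately. It assumes that $response is output as the entire page, not embedded in another document that already has its own <base>, and it requires PHP 7.4 or later for the arrow function.

```php
<?php

/**
 * Insert a <base> tag right after the opening <head> tag of fetched HTML,
 * unless the markup already declares one.
 * Hypothetical helper; a regex is a rough tool for HTML and will miss
 * unusual markup (for example a document with no <head> at all).
 */
function injectBaseTag(string $html, string $baseUrl): string
{
    // Leave documents alone that already declare a base URL.
    if (preg_match('/<base\b[^>]*\bhref\s*=/i', $html) === 1) {
        return $html;
    }

    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '" />';

    // Insert after the first <head ...> only (limit = 1); fall back to the
    // untouched input if the regex engine reports an error.
    return preg_replace_callback(
        '/<head\b[^>]*>/i',
        static fn (array $m): string => $m[0] . $tag,
        $html,
        1
    ) ?? $html;
}

echo injectBaseTag('<html><head><title>t</title></head></html>', 'https://www.google.com.br/'), PHP_EOL;
```

With the asker's code, the call would be echo injectBaseTag($response, 'https://www.google.com.br/'); in place of echo $response;.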

Regex

(src=|href=|srcset=|url)('|"|\()(\/.*?)('|"|\))

The regex above captures the values of the src, srcset, and href attributes, as well as url(...) values. The latter is for CSS.

Now just use the preg_replace function to rewrite those values.

Example:

<!DOCTYPE html>
<html>
    <head>
        <title>Title of the document</title>
        <base href="https://www.bing.com.br/" />
    </head>

    <body>

        <?php

            $url = "https://www.google.com.br";

            $request = curl_init($url);

            curl_setopt_array($request, [
                CURLOPT_RETURNTRANSFER  => true,
                CURLOPT_AUTOREFERER  => true,
                CURLOPT_SSL_VERIFYPEER  => false,
            ]);
            $response = curl_exec($request);
            curl_close($request);

            echo preg_replace("/(src=|href=|srcset=|url)('|\"|\()(\/.*?)('|\"|\))/", "$1$2{$url}$3$4", $response);

        ?>

        <script src="/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
    </body>
</html>
  

The regex may vary from site to site. Depending on what you want to change, you will need to customize the regex and make it more complete, but the principle is the same.
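
One customization worth considering: the pattern above also matches protocol-relative URLs such as //cdn.example.com/x.js, which would then be prefixed incorrectly. The sketch below is one possible stricter variant, not a complete solution. The helper name absolutizeRootRelative and the cdn.example.com host are made up for illustration, and PHP 7.4 or later is assumed.

```php
<?php

/**
 * Prefix root-relative URLs in src, href, srcset and CSS url() values with
 * the given origin. Protocol-relative URLs ("//host/...") are left alone.
 * Hypothetical helper; like any regex over HTML it is only an approximation
 * (for instance, only the first candidate of a srcset list is rewritten).
 */
function absolutizeRootRelative(string $html, string $origin): string
{
    $origin = rtrim($origin, '/');

    // Group 1: attribute name plus opening quote, or "url(" plus optional quote.
    // Group 2: a path that starts with exactly one slash.
    $pattern = '~((?:src|href|srcset)\s*=\s*["\']|url\(\s*["\']?)(/(?!/)[^"\')\s]*)~i';

    // Fall back to the untouched input if the regex engine reports an error.
    return preg_replace_callback(
        $pattern,
        static fn (array $m): string => $m[1] . $origin . $m[2],
        $html
    ) ?? $html;
}

echo absolutizeRootRelative(
    '<img src="/a.png"> <script src="//cdn.example.com/x.js"></script>',
    'https://www.google.com.br'
), PHP_EOL;
// prints <img src="https://www.google.com.br/a.png"> <script src="//cdn.example.com/x.js"></script>
```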

    
22.01.2018 / 10:58
1

In the first case, you fetched the HTML from the request, but its resources are declared with relative paths that do not exist in your application. You may have already noticed this, given the replacements in your second attempt. Beyond that, the host runs other validations and procedures (sessions, tokens, headers, etc.) that prevent you from serving its content yourself.

My advice: if you want to consume or display content from other sites and services, stick to their APIs, rules, and terms of use.

Take a look at the Google API and the Instagram API. What you are trying to do may even be supported by both platforms.

    
22.01.2018 / 10:54