How to get the HTML code of a page protected with Cloudflare?

I'm trying to get the HTML of a page with Jsoup.

The page is protected by Cloudflare, and instead of the HTML of the site I'm interested in, I get back the HTML of the Cloudflare page (see image below) that appears before the redirect to the target site. I need the HTML of the site that Cloudflare redirects to after that page.


Cloudflare page example (not the site I'm looking for, but it serves as an example).

My code looks like this:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String...args) throws IOException {

        Document document = Jsoup.connect("http://site.com")
                                 .userAgent("Mozilla/5.0")
                                 .timeout(10000)
                                 .get();

        System.out.println(document.html());
    }
}

The output looks something like this:

<html>
 <head>
  <title>You are being redirected...</title> 
  <script> <!-- huge JS code --> </script>
 </head>
 <body></body>
</html>

I thought of setting followRedirects to true, but reading the documentation I saw that this is already the default. I found a question with the same title on Stack Overflow, but the problem there is a different one.

I also tried making two requests, with the second one reusing the cookies from the first, and the result was the same: I land on the same page.

import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String...args) throws IOException{

        final String URL = "http://site.com/";

        // Executing the first request.
        Connection.Response response =
            Jsoup.connect(URL)
                 .timeout(10000)
                 .method(Connection.Method.GET)
                 .execute();

        // Getting the cookies from the response
        Map<String, String> cookies = response.cookies();

        Document doc = Jsoup.connect(URL)
                            .cookies(cookies) // Using the cookies in the 2nd call
                            .get();

        System.out.println(doc.html()); // Fail! Cloudflare blocks me.
    }
}

I will also accept an answer that does not use Jsoup, as long as it solves this problem. I do not need anything complex, only that the HTML is returned as a String.

    
asked by anonymous 18.03.2016 / 00:56

2 answers

I ended up abandoning Jsoup and using a headless browser instead. I chose HtmlUnit, and the code that solved the problem I was running into is this:

import java.io.IOException;
import com.gargoylesoftware.htmlunit.*;

public class Main {
    public static void main(String...args) throws IOException {

        final String URL = "http://site.com/o/clouflare/bloqueando";

        Page page = new WebClient(BrowserVersion.BEST_SUPPORTED).getPage(URL);
        System.out.println(page.getWebResponse().getContentAsString()); // Done!
    }
}

A note about HtmlUnit: it prints every validation error it finds in the document (HTML, CSS and JavaScript) through a Logger. To disable this, I followed this answer and added one line to my code:

Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
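
For context, here is roughly how that line might sit in a complete program. This is only a minimal sketch using java.util.logging, and it assumes HtmlUnit's messages end up in java.util.logging (it logs through Apache Commons Logging, which normally falls back to java.util.logging when no other backend is on the classpath). The class name QuietHtmlUnitLogs and the child logger name are made up for illustration. The logger is kept in a static field because java.util.logging only holds loggers weakly, so a level set on an unreferenced logger can be lost after garbage collection.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogs {

    // Strong reference: java.util.logging only keeps weak references to loggers.
    static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Turns off every message logged under the HtmlUnit package name.
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String... args) {
        silence();

        // Hypothetical child logger name: it has no level of its own,
        // so it inherits Level.OFF from the parent logger above.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.javascript");
        System.out.println(child.isLoggable(Level.SEVERE)); // false with the default logging setup
    }
}
```

Calling silence() before creating the WebClient should keep the warnings out of the console from the first request onward.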
    
18.03.2016 / 05:14

Looking at the Cloudflare HTML on the linked page, I came to the following conclusion: the JavaScript you mentioned in the question is an algorithm that uses some data from the page to perform a calculation and sends the result of that calculation for validation. If the result is correct, you are redirected to the actual page; otherwise you stay in a loop on the Cloudflare page. The calculation is written using jjencode, represented by something like:

+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]))
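
To make the expression above less opaque, here is a small sketch of how it reduces to a number. In JavaScript, `!+[]` and `!![]` both evaluate to true, summing them counts them, and a trailing `+[]` turns the sum into a string, so the two groups concatenate as digits: 4 and 2 give 42. The Java class below (CoercionSketch is a hypothetical name) only handles expressions shaped exactly like this sample, one digit per innermost parenthesised group. It is not a decoder for the whole Cloudflare script, whose full structure is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoercionSketch {

    // Innermost parenthesised groups, e.g. "(!+[]+!![])".
    private static final Pattern GROUP = Pattern.compile("\\(([^()]*)\\)");

    // Each "!+[]" or "!![]" evaluates to true, which counts as 1 when summed.
    private static final Pattern TRUE_TERM = Pattern.compile("!\\+\\[\\]|!!\\[\\]");

    // Assumption: every innermost group encodes one decimal digit, as in the sample.
    static int decode(String expression) {
        StringBuilder digits = new StringBuilder();
        Matcher group = GROUP.matcher(expression);
        while (group.find()) {
            Matcher term = TRUE_TERM.matcher(group.group(1));
            int digit = 0;
            while (term.find()) {
                digit++;
            }
            digits.append(digit);
        }
        return Integer.parseInt(digits.toString());
    }

    public static void main(String... args) {
        String sample = "+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]))";
        System.out.println(decode(sample)); // prints 42
    }
}
```

A group with no true terms, such as `(+[])`, decodes to the digit 0 under the same rule.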

Because this jjencode code changes with each request, you cannot get past Cloudflare without deciphering it every time. I believe getting past it is possible, but it is not trivial. If you are still interested in bypassing Cloudflare this way, here are some interesting links about jjencode:

Encode tool: link

Explanation of how jjencode works: link

    
18.03.2016 / 04:44