How to make a web crawler access pages that need authentication? [closed]


I need to develop a web crawler that would go to a page (which requires login, and I have the credentials), find all the links on that page, and list them somewhere, whether in a memo or even a txt file. It would be a process similar to the Firefox plugin DownThemAll. The site's authentication is simple and done over HTTPS, but I also have the option of typing a captcha to access the page with the files.

asked by anonymous 24.03.2014 / 19:23

1 answer


I have some crawlers in PHP that access pages requiring credentials. The approach depends on each case, since every site has its own form of authentication. In my case, I know the forms involved. For example, one of them accesses a site whose login page contains the following form:

<form class="onclick-submit card grid-3" accept-charset="utf-8" method="post" action="https://painel2.oculto.net/conectorPainel.php" id="frmLogin" >
    <input class="hidden" type="text" name="email" id="txtUserName" value="[email protected]" />
    <input class="hidden" type="password" name="senha" id="txtPassword" value="senha" />
    <input class="hidden" type="checkbox" name="permanecerlogado" tabindex="6" id="chkRemember" checked="checked" />
    <input class="hidden" type="hidden" value="login" name="acao" />
    ...
</form>

In this case, my PHP crawler authenticates with the site before processing the content:

$curl = new cURL(); // cURL here is a custom wrapper class, not the raw curl extension
$curl->post('https://painel2.oculto.net/conectorPainel.php', 'email=[email protected]&senha=senha&permanecerlogado=1&acao=login');
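
If you do not have such a wrapper class at hand, the same request can be made with PHP's built-in curl extension. Below is a minimal sketch; the cookie-jar path is an arbitrary assumption, and the credential values are the placeholders from the form above:

<?php
// Minimal sketch of the same login POST using the raw curl extension.
$ch = curl_init('https://painel2.oculto.net/conectorPainel.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'email'            => '[email protected]',
    'senha'            => 'senha',
    'permanecerlogado' => 1,
    'acao'             => 'login',
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Persist the session cookie so subsequent requests stay authenticated.
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/crawler-cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/crawler-cookies.txt');
$resposta = curl_exec($ch);
curl_close($ch);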

The site will create a session for my subsequent accesses, and the program will have privileged access. I do not even check the site's response, since the chances of login failure are minimal; if access is denied because of some other failure (such as a connection failure or the server being down), the program will stop execution and try again later.
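
Once the session cookie is saved, the crawler can request a protected page and extract its links, which is essentially what the question asks for. A sketch, where the target URL /arquivos.php is hypothetical:

<?php
// Sketch: fetch a protected page with the saved session and write every
// link it contains to a txt file. The target URL is hypothetical.
$ch = curl_init('https://painel2.oculto.net/arquivos.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/crawler-cookies.txt'); // reuse the session
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings from real-world, malformed HTML

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
file_put_contents('links.txt', implode(PHP_EOL, $links));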

Most sites, therefore, require only three basic pieces of information (see the sketch after this list):

  • login
  • password
  • URL
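
As a rough generalization, the login step can often be reduced to a helper that takes just those three values. A sketch, assuming the field names 'email' and 'senha', which of course vary from site to site:

<?php
// Sketch of a generic login step: three inputs, one POST.
// The field names 'email' and 'senha' are assumptions; adjust per site.
function autenticar($url, $login, $senha) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
        'email' => $login,
        'senha' => $senha,
    ]));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/crawler-cookies.txt');
    $resposta = curl_exec($ch);
    curl_close($ch);
    return $resposta;
}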

But this certainly does not work for every site, since some create tokens for each session (e.g. icloud.com) or employ some algorithm that makes automation difficult. In those cases, manual, site-specific programming is required.
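
For the token case, the usual pattern is to GET the login page first, extract the token from the form, and send it along with the credentials. A sketch, where the URL and the hidden field name 'token' are purely hypothetical:

<?php
// Sketch: per-session token handling. The URL and the field name
// 'token' are hypothetical; each site needs its own inspection.
$jar = '/tmp/crawler-cookies.txt';

// 1. GET the login page to receive the session cookie and the token.
$ch = curl_init('https://exemplo.oculto.net/login');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
$html = curl_exec($ch);

// 2. Extract the hidden token from the form markup.
if (!preg_match('/name="token"\s+value="([^"]+)"/', $html, $m)) {
    die('token not found');
}

// 3. POST the credentials together with the token, on the same handle.
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'email' => '[email protected]',
    'senha' => 'senha',
    'token' => $m[1],
]));
curl_exec($ch);
curl_close($ch);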

answered 24.03.2014 / 20:24