Robots that search for and interpret information on other pages are also called web crawlers or spiders.
They are scripts that perform the following steps:
1. Request a URL.
2. Store the returned response in a variable.
3. Interpret the response, that is, parse the HTML.
4. Search for the relevant information.
5. Process the information obtained.
Steps 1 through 3 are easily handled as follows:
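// The scheme (http://) is required; without it the value is treated as a local file path
$url = 'http://www.exemplo.com';
$dom = new DOMDocument('1.0');
// Fetches the URL and parses the returned HTML in one step
$dom->loadHTMLFile($url);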
This gives you an object that lets you navigate the HTML as needed.
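Real pages are often not well-formed, and the parser reports that through PHP warnings. A minimal sketch of how those messages could be collected instead of printed is shown here; the `$html` string is made-up sample markup standing in for a fetched page, not something from a real site.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><p>Some text<div><a href="/a">A</a></div></body></html>';

// Collect libxml parse messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0');
$loaded = $dom->loadHTML($html);

// Read whatever the parser reported, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? "parsed\n" : "failed\n";
echo count($errors) . " parser message(s)\n";
```

The same pattern should work with loadHTMLFile, since both methods go through the same libxml parser.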
For example, getting all the links on a page and displaying their addresses would look like this:
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    echo $href . '<br>';
}
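For step 4 (searching for relevant information), an XPath query can narrow the result instead of looping over every element. The sketch below is only an illustration: the sample markup and the choice to keep only links whose href starts with "http" are assumptions, not part of the original answer.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body>
  <a href="http://www.exemplo.com/a">A</a>
  <a href="/relative">B</a>
  <a name="anchor-only">C</a>
</body></html>';

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);

// DOMXPath runs XPath 1.0 queries against the parsed document.
$xpath = new DOMXPath($dom);

// Keep only anchors whose href attribute starts with "http".
$absolute = [];
foreach ($xpath->query('//a[starts-with(@href, "http")]') as $a) {
    $absolute[] = $a->getAttribute('href');
}

print_r($absolute);
```

Anchors without an href, and relative links, are skipped by the query itself, so the loop body stays trivial.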
An interesting class that helps with handling HTML and can save a lot of code is Simple HTML DOM; a tutorial on how to use it can be found on Make Use Of.
To fill in a form, it is enough to request the URL the form points to using the expected request method; that is, request the URL given in the form's action attribute using the request method given in its method attribute.
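The action, the method, and the field names can be read from the parsed page itself. Here is a minimal sketch, assuming a made-up form; the markup, the field names, and the fallback to GET when no method attribute is present are illustrative choices.

```php
<?php
// Hypothetical form markup standing in for a fetched page.
$html = '<html><body>
  <form action="/search" method="post">
    <input type="text" name="q" value="php">
    <input type="hidden" name="token" value="abc123">
    <input type="submit" value="Go">
  </form>
</body></html>';

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);

// Take the first form on the page.
$form = $dom->getElementsByTagName('form')->item(0);

// Where the form submits to, and with which HTTP method (GET if omitted).
$action = $form->getAttribute('action');
$method = strtoupper($form->getAttribute('method') ?: 'get');

// Collect named inputs with their default values; unnamed inputs are not submitted.
$fields = [];
foreach ($form->getElementsByTagName('input') as $input) {
    $name = $input->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $input->getAttribute('value');
    }
}

echo $method . ' ' . $action . "\n";
print_r($fields);
```

A relative action such as /search still has to be resolved against the page's own URL before it can be requested.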
To simulate this, we change the previous request code to:
$curl = curl_init();
// Set some options - we are passing in a useragent too here
curl_setopt_array($curl, array(
    // Return the content as a string
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => 'http://www.exemplo.com',
    // Name that identifies your robot
    CURLOPT_USERAGENT => 'Name of your crawler',
    // Indicates that the request uses the POST method
    CURLOPT_POST => 1,
    // Parameters that will be sent via POST (array keys must be quoted strings)
    CURLOPT_POSTFIELDS => array(
        'item1' => 'value',
        'item2' => 'value2'
    )
));
// Make the request and save the result in the variable $response
$response = curl_exec($curl);
// Close the request handle
curl_close($curl);
$dom = new DOMDocument('1.0');
// Parse the string returned by the request
// Note that the method changed from loadHTMLFile to loadHTML
$dom->loadHTML($response);
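One detail worth knowing: when CURLOPT_POSTFIELDS receives an array, cURL sends the body as multipart/form-data, whereas a URL-encoded string is sent as application/x-www-form-urlencoded, which is what most plain HTML forms submit. A small sketch of building the options that way follows; the helper name buildPostOptions and the user agent string are hypothetical, and the code assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: builds the cURL options for a URL-encoded POST.
function buildPostOptions(string $url, array $fields, string $userAgent): array
{
    return [
        // Return the response body as a string instead of printing it
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_URL => $url,
        CURLOPT_USERAGENT => $userAgent,
        CURLOPT_POST => true,
        // A string body is sent as application/x-www-form-urlencoded
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ];
}

$options = buildPostOptions(
    'http://www.exemplo.com',
    ['item1' => 'value', 'item2' => 'value2'],
    'MyCrawler/1.0' // illustrative user agent
);

echo $options[CURLOPT_POSTFIELDS] . "\n"; // item1=value&item2=value2
```

The resulting array can be passed to curl_setopt_array as in the code above. It is also worth checking whether curl_exec returned false before handing the response to loadHTML.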
Learn more about CURL