Crawler that detects changes on a page and saves screenshots

-2

How can I make a crawler that lists all the ads posted on a site such as custojusto.pt (for example, in the furniture section, link), records the time at which each classified ad appears as well as the (approximate) hour at which it disappears, and takes a screenshot whenever it detects a change?

    
asked by anonymous 12.02.2014 / 12:30

3 answers

3

PhantomJS is perfect for this. It is not Python, but it is relatively trivial for tasks like these and only requires you to know JavaScript. One of its main advantages is that it has advanced capabilities that no crawler lacking a JavaScript interpreter and a complete rendering engine could offer.

It uses the engine of a WebKit browser (equivalent to Google Chrome) and has a dedicated function for taking screenshots. With that, you make it access the page: if the content loads via AJAX, just add an event listener that notices when something changes; if the page does not load via AJAX, the script has to fetch the page from time to time, compare it with the previous version, and spot the differences (a sketch of that polling loop follows the example below).

Here is an example of how to access a page and take a screenshot of it:

github.js file

var page = require('webpage').create();
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

Then run the file from the command line with:

  

phantomjs github.js
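The example above only takes a screenshot. As a complement, here is a minimal sketch, in Python, of the polling approach described earlier: fetch the listing page periodically, compare a hash of the HTML with the previous check, and call a github.js-style PhantomJS script to save a screenshot when something has changed. The URL, the interval and the screenshot.js script are placeholders, not something taken from the original answer.

import hashlib
import subprocess
import time
import urllib.request

URL = "https://www.custojusto.pt/..."  # placeholder: the listing page to watch
previous_hash = None

while True:
    html = urllib.request.urlopen(URL).read()
    current_hash = hashlib.sha256(html).hexdigest()
    if previous_hash is not None and current_hash != previous_hash:
        # Something changed: record the time and render a screenshot with a
        # github.js-style PhantomJS script pointed at the same URL.
        print("Change detected at", time.strftime("%d.%m.%Y / %H:%M"))
        subprocess.call(["phantomjs", "screenshot.js"])
    previous_hash = current_hash
    time.sleep(300)  # check again in five minutes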

    
12.02.2014 / 13:03
1

For the crawler you can use Scrapy.
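As a sketch of what such a spider could look like (the start URL and the CSS selectors below are hypothetical, since they depend on the site's actual markup):

import scrapy
from datetime import datetime

class AdsSpider(scrapy.Spider):
    name = "ads"
    # Hypothetical category URL; replace it with the real listing page.
    start_urls = ["https://www.custojusto.pt/..."]

    def parse(self, response):
        # Hypothetical selectors; adjust them to the site's real HTML.
        for ad in response.css("div.listing-item"):
            yield {
                "title": ad.css("h2::text").get(),
                "url": ad.css("a::attr(href)").get(),
                "seen_at": datetime.utcnow().isoformat(),
            }

Running it with scrapy runspider ads_spider.py -o ads.json gives a JSON file of the ads seen on each pass, which can then be compared between runs to detect when an ad appears or disappears.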

    
06.03.2014 / 21:45
0

I've already used Ghost.py; it's a fork of phantom from when they decided not to support Python and, as the name suggests, it's a lib for those using Python.

Internally it uses the Qt WebKit module. It may not be the fastest thing in the world, but it runs JS, opens iframes, downloads images and behaves like a browser (or at least tries to), unlike solutions such as mechanize or requests + BeautifulSoup.

It depends on PyQt or PySide; I had some headaches installing PySide, but in the end it works fine.
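For reference, a rough sketch of opening a page and capturing a screenshot with Ghost.py. The method names below (open, capture_to) are how I recall the API of versions from that period; check the project's documentation, since the API has changed across releases, and the URL is only a placeholder:

from ghost import Ghost

ghost = Ghost()
# Open the page; Ghost.py renders it with Qt WebKit, executing the JavaScript.
page, resources = ghost.open("https://www.custojusto.pt/...")
# Save a screenshot of the rendered page.
ghost.capture_to("listing.png")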

I only stumbled upon one bug, which is actually in the Qt WebKit module and from time to time caused problems in my entire process. I worked around it using Python's multiprocessing module, so if the process died for whatever reason it did not stop my entire program.
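A minimal sketch of that isolation pattern with multiprocessing: each check runs in its own process, so a crash inside Qt WebKit only kills that worker and not the main program. check_page is a hypothetical function standing in for the Ghost.py work:

import multiprocessing

def check_page(url):
    # Hypothetical worker: open the page with Ghost.py, compare it with the
    # previous snapshot and capture a screenshot if something changed.
    pass

if __name__ == "__main__":
    proc = multiprocessing.Process(target=check_page,
                                   args=("https://www.custojusto.pt/...",))
    proc.start()
    proc.join(timeout=120)   # wait up to two minutes for the worker
    if proc.is_alive():      # the worker hung or crashed: terminate it
        proc.terminate()
        proc.join()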

Homepage: link

Source code: link

    
12.02.2014 / 16:18