PHP crawler for external sites with the PHPCrawl API


Good evening, everyone. I am new to the subject and am trying to build a search engine (indexer) for external sites with PHP. I found an API called PHPCrawl that provides a crawler, but it seems to search only within one specific site. Could someone who knows this tool tell me whether it can also crawl other external sites, not just the tags within a single one? link <- this is the API. Thank you in advance.

    
asked by anonymous 09.12.2015 / 15:14

1 answer


But this is basically what a crawler is supposed to do. It is up to you to keep a database with the list of sites you want to scan and a cron job to schedule the scans. Each cron run would call this script and pass the site you want to scan as an argument, for example: $crawler->setURL($argv[1]).
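
As a rough sketch of that idea, the script could validate the command-line argument before handing it to the crawler. The helper name resolveStartUrl, the file name crawl.php and the cron line in the comments are hypothetical, chosen only for illustration; only setURL comes from the example above.

```php
<?php

/**
 * Hypothetical helper: read the start URL from the command-line
 * arguments and make sure it is a usable http(s) URL.
 */
function resolveStartUrl(array $argv): string
{
    if (!isset($argv[1])) {
        throw new InvalidArgumentException('Usage: php crawl.php <start-url>');
    }

    $url = trim((string) $argv[1]);

    // Reject anything that is not a syntactically valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        throw new InvalidArgumentException("Not a valid URL: $url");
    }

    // Only crawl web pages, not ftp://, file:// and so on.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        throw new InvalidArgumentException("Unsupported scheme: $scheme");
    }

    return $url;
}

// Demo with a fixed argument list. A real script would pass $argv and
// then give the result to the crawler:
//     $crawler->setURL(resolveStartUrl($argv));
//
// Assumed crontab entry (path and schedule are placeholders):
//     0 * * * * php /path/to/crawl.php https://example.com/
echo resolveStartUrl(['crawl.php', 'https://example.com/']), PHP_EOL;
```

With one cron entry per site (or a small wrapper that reads the next site from your database), each run handles exactly one start URL.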

Do not expect a single PHP request to process numerous sites; this would be very bad for your server. Google, Yahoo and Bing scan different sites periodically in separate routines, and they probably limit how much of one site they scan per hour and continue only later.

If a single request and a single PHP script tried to access multiple URLs, the application would be stuck in a long process that could take hours and, depending on the case, PHP would not be able to free the memory it uses, so processor or memory consumption would keep growing until your server starts to hang.

The most appropriate (not necessarily correct) way is to scan one site at a time, set a limit, and continue from where you left off whenever the limit is reached. Remember that some sites have more than 50,000 pages.
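
To illustrate the "limit and continue" idea independently of any library, here is a minimal sketch that keeps the queue of pending URLs in a JSON file between runs. Everything in it is an assumption for illustration: the function crawlBatch, the state file layout, and the $fetchLinks callable (which stands in for whatever actually downloads a page and extracts its links). PHPCrawl may offer its own limit and resume options, so check its documentation before rolling your own.

```php
<?php

/**
 * Process at most $pageLimit pages of one site, resuming from $stateFile.
 * $fetchLinks is a hypothetical callable: given a URL, it returns the
 * list of URLs found on that page.
 * Returns true when nothing is left to crawl, false if the limit was hit.
 */
function crawlBatch(string $startUrl, string $stateFile, int $pageLimit, callable $fetchLinks): bool
{
    // Load the saved state from a previous run, or start fresh.
    $state = ['queue' => [$startUrl], 'seen' => [$startUrl => true]];
    if (is_file($stateFile)) {
        $saved = json_decode((string) file_get_contents($stateFile), true);
        if (is_array($saved) && isset($saved['queue'], $saved['seen'])) {
            $state = $saved;
        }
    }

    $host = parse_url($startUrl, PHP_URL_HOST);
    $processed = 0;

    while ($state['queue'] !== [] && $processed < $pageLimit) {
        $url = array_shift($state['queue']);
        $processed++;

        foreach ($fetchLinks($url) as $link) {
            // Stay on the same host and skip URLs already queued or visited.
            if (parse_url($link, PHP_URL_HOST) !== $host || isset($state['seen'][$link])) {
                continue;
            }
            $state['seen'][$link] = true;
            $state['queue'][] = $link;
        }
    }

    // Persist the remaining queue so the next cron run can continue.
    file_put_contents($stateFile, json_encode($state), LOCK_EX);

    return $state['queue'] === [];
}

// Demo with an in-memory "site" instead of real HTTP requests.
$site = [
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b', 'https://other.example/x'],
    'https://example.com/a' => ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b' => [],
    'https://example.com/c' => ['https://example.com/'],
];
$fetch = function (string $url) use ($site): array {
    return $site[$url] ?? [];
};

$stateFile = sys_get_temp_dir() . '/crawl-state-demo.json';
@unlink($stateFile);

var_dump(crawlBatch('https://example.com/', $stateFile, 2, $fetch)); // bool(false): limit reached, state saved
var_dump(crawlBatch('https://example.com/', $stateFile, 2, $fetch)); // bool(true): remaining pages done
```

The state file stays in place after the site is finished, so the caller decides when to delete it and start a new scan. For a site with tens of thousands of pages a database table would be a better home for the queue than a JSON file.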

    
09.12.2015 / 15:23