Hello,
I'm developing a multithreaded crawler: each job (thread) deals with X sites, parsing certain content with the jsoup library. The sites are all accessible. The problem is that the final result is never the same: one run resolves 200 contents, the next run only 180. Looking at the logs, I see occasional 500 or 400 status codes, yet on the next execution those same URLs work fine, so every run gives me a different, random total. jsoup code (executed by each thread):
Document doc = null;
try {
    Connection.Response resp = Jsoup.connect( url )
        .userAgent( "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21" )
        .timeout( 5000 )
        .ignoreHttpErrors( true )
        .execute( );
    if( resp.statusCode( ) == 200 )
        doc = resp.parse( );
    else {
        log.info( "return url[" + url + "] statusCode == " + resp.statusCode( ) );
        return;
    }
} catch( Exception e ) {
    log.error( "[Jsoup] get response url[" + url + "] exception = ", e );
    return;
}
String pageTitle = doc.title( ); //get page title
Elements links = doc.select( "img" ); //get all tags <img />
for( Element imgItem : links ) {
    if( numImgsbyUrl != -1 && countImg == numImgsbyUrl )
        break;
    String titleImg = getAttribute( imgItem , "title" );
    String src      = getAttribute( imgItem , "src" );
    String width    = getAttribute( imgItem , "width" );
    String height   = getAttribute( imgItem , "height" );
    String alt      = getAttribute( imgItem , "alt" );
    log.debug( "[Tag Images] title["+titleImg+"] width["+width+"] height["+height+"] alt["+alt+"]" );
    if( !checkTerms( src , titleImg , width , height , alt ) )
        continue;
    resultsImg.add( new ImageSearchResult( titleImg ) );
    if( numImgsbyUrl != -1 ) countImg++;
}
log.debug( "Number of results = [" + resultsImg.size( ) + "] to url[" + url + "]" );
First run I got 183 contents; second run, 233; third run, 203. I have 5 threads running in parallel over 100 sites. I don't know if I'm being blocked because of so many jsoup hits. Any idea what might be happening?
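To test the blocking/rate-limiting hypothesis, I could also throttle the fetches so the target hosts see fewer simultaneous connections, and compare totals. A sketch with a plain java.util.concurrent.Semaphore (the permit count of 2 is an arbitrary guess, and fetchThrottled is a hypothetical wrapper):

import java.util.concurrent.Semaphore;

// at most 2 fetches run at the same time, no matter how many crawler threads exist
static final Semaphore FETCH_SLOTS = new Semaphore( 2 );

void fetchThrottled( String url ) throws InterruptedException {
    FETCH_SLOTS.acquire( ); // block until a slot is free
    try {
        // ... run the jsoup fetch/parse shown above ...
    } finally {
        FETCH_SLOTS.release( ); // always free the slot
    }
}

If the counts stop fluctuating with the throttle on, the sites are probably rejecting me under load.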
Master thread code:
ExecutorService pool = Executors.newFixedThreadPool( NThreads );
CountDownLatch doneSignal;
...
// the SAX parser
UserHandler userhandler = new UserHandler( );
XMLReader myReader = XMLReaderFactory.createXMLReader( );
myReader.setContentHandler( userhandler );
myReader.parse( new InputSource(new URL( url ).openStream( ) ) );
resultOpenSearch = userhandler.getItems( );
...
doneSignal = new CountDownLatch( resultOpenSearch.size( ) );
List< Future< List< ContentsResult > > > submittedJobs = new ArrayList< >( );
for( ItemXML item : resultOpenSearch ) { //search information of tag <img>
Future< List< ContentsResult > > job = pool.submit( new CrawlerParser( item , doneSignal ) );
submittedJobs.add( job );
}
try {
isAllDone = doneSignal.await( timeout , TimeUnit.MILLISECONDS );
if ( !isAllDone )
cleanUpThreads( submittedJobs );
} catch ( InterruptedException e1 ) {
cleanUpThreads( submittedJobs ); // interrupted while waiting: cancel the remaining jobs
}
//get images result to search
for( Future< List< ContentsResult > > job : submittedJobs ) {
try {
// before doing a get you may check if it is done
if ( !isAllDone && !job.isDone( ) ) {
// cancel job and continue with others
job.cancel( true );
continue;
}
List< ContentsResult > result = job.get( ); // wait for a processor to complete
if( result != null && !result.isEmpty( ) ) {
log.debug( "Resultado = " + result.size( ) );
imageResults.addAll( result );
}
} catch (ExecutionException cause) {
log.error( "ContentsResultsController", cause ); // exceptions occurred during execution, in any
} catch (InterruptedException e) {
log.error( "ContentsResultsController", e ); // take care
}
}
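Side note on my master thread: I understand ExecutorService.invokeAll( tasks, timeout, unit ) already blocks until every task finishes or the timeout elapses, and cancels the stragglers itself, so the CountDownLatch plus the manual cancel/cleanup could go away. A sketch of that variant (it assumes a CrawlerParser constructor taking just the item, which my real class may not have):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

List< Callable< List< ContentsResult > > > tasks = new ArrayList< >( );
for( ItemXML item : resultOpenSearch )
    tasks.add( new CrawlerParser( item ) ); // hypothetical Callable-based constructor

try {
    // blocks until all tasks are done or the timeout hits; unfinished tasks get cancelled
    List< Future< List< ContentsResult > > > jobs =
            pool.invokeAll( tasks , timeout , TimeUnit.MILLISECONDS );
    for( Future< List< ContentsResult > > job : jobs ) {
        if( job.isCancelled( ) )
            continue; // timed out before completing
        try {
            List< ContentsResult > result = job.get( ); // already done, returns immediately
            if( result != null && !result.isEmpty( ) )
                imageResults.addAll( result );
        } catch( ExecutionException | InterruptedException cause ) {
            log.error( "ContentsResultsController", cause );
        }
    }
} catch( InterruptedException e ) {
    log.error( "ContentsResultsController", e );
}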