Problem with multithreaded crawler using jsoup


Hello,

I'm developing a multithreaded crawler where each job (thread) handles X sites, parsing certain content with the jsoup library. The sites are all reachable. The problem is that the final result is never the same: on one run the crawler resolves only 180 of 200 contents, and the logs show I'm receiving 500 or 400 status codes for the rest; on the next execution everything runs fine, and running it again gives me yet another random result. Jsoup code (executed by each thread):

Document doc = null;
try {
    Connection.Response resp = Jsoup.connect( url )
            .userAgent( "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21" )
            .timeout( 5000 )
            .ignoreHttpErrors( true )
            .execute( );
    if( resp.statusCode( ) == 200 )
        doc = resp.parse( ); // parse the body only on HTTP 200
    else {
        log.info( "return url[" + url + "] statusCode == " + resp.statusCode( ) );
        return;
    }
} catch( Exception e ) {
    log.error( "[Jsoup] get response url[" + url + "] exception = ", e );
    return;
}
String title = doc.title( ); // page title
Elements links = doc.select( "img" ); // all <img/> tags
for( Element imgItem : links ) {

    if( numImgsbyUrl != -1 && countImg == numImgsbyUrl )
        break;

    // read the attributes the filter needs from the current <img>
    String titleImg = getAttribute( imgItem, "title" );
    String src      = getAttribute( imgItem, "src" );
    String width    = getAttribute( imgItem, "width" );
    String height   = getAttribute( imgItem, "height" );
    String alt      = getAttribute( imgItem, "alt" );

    log.debug( "[Tag Images] title[" + titleImg + "] width[" + width + "] height[" + height + "] alt[" + alt + "]" );
    if( !checkTerms( src, titleImg, width, height, alt ) )
        continue;

    resultsImg.add( new ImageSearchResult( titleImg ) );

    if( numImgsbyUrl != -1 ) countImg++;
}
log.debug( "Number of results = [" + resultsImg.size( ) + "] to url[" + url + "]" );

On the first run I got 183 contents, on the second 233, and on the third 203. I have 5 threads running in parallel over 100 sites. I don't know if I'm being blocked for hitting the sites so often with jsoup. Any idea what might be happening?
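
To test the blocking theory, I could also cap how many requests are in flight at once and space them out; if the counts stabilize, the servers are throttling me. A sketch using a Semaphore, where the permit count, the 500 ms pause, and fetchThrottled are all assumptions (it reuses the hypothetical fetchWithRetry above):

    import java.util.concurrent.Semaphore;

    import org.jsoup.nodes.Document;

    public class ThrottledFetch {

        // Allow at most 2 requests in flight across all crawler threads.
        private static final Semaphore PERMITS = new Semaphore( 2 );

        public static Document fetchThrottled( String url ) throws Exception {
            PERMITS.acquire( );
            try {
                Thread.sleep( 500 ); // space out the hits so no server sees a burst
                return FetchUtil.fetchWithRetry( url, 3 );
            } finally {
                PERMITS.release( );
            }
        }
    }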

Master thread code:
    ExecutorService pool = Executors.newFixedThreadPool( NThreads );
    CountDownLatch doneSignal;
    ...
    // the SAX parser
    UserHandler userhandler = new UserHandler( );
    XMLReader myReader = XMLReaderFactory.createXMLReader( );
    myReader.setContentHandler( userhandler );
    myReader.parse( new InputSource( new URL( url ).openStream( ) ) );
    resultOpenSearch = userhandler.getItems( );
    ...
    doneSignal = new CountDownLatch( resultOpenSearch.size( ) );

    List< Future< List< ContentsResult > > > submittedJobs = new ArrayList< >( );
    for( ItemXML item : resultOpenSearch ) { // search for information in tag <img>
        Future< List< ContentsResult > > job = pool.submit( new CrawlerParser( item, doneSignal ) );
        submittedJobs.add( job );
    }

    try {
        isAllDone = doneSignal.await( timeout, TimeUnit.MILLISECONDS );
        if( !isAllDone )
            cleanUpThreads( submittedJobs );
    } catch( InterruptedException e1 ) {
        cleanUpThreads( submittedJobs ); // interrupted while waiting: cancel what is left
    }

    // collect the image results from each job
    for( Future< List< ContentsResult > > job : submittedJobs ) {
        try {
            // before calling get, check whether the job finished in time
            if( !isAllDone && !job.isDone( ) ) {
                // cancel this job and continue with the others
                job.cancel( true );
                continue;
            }
            List< ContentsResult > result = job.get( ); // wait for the job to complete
            if( result != null && !result.isEmpty( ) ) {
                log.debug( "Result = " + result.size( ) );
                imageResults.addAll( result );
            }
        } catch( ExecutionException cause ) {
            log.error( "ContentsResultsController", cause ); // exception thrown inside the job
        } catch( InterruptedException e ) {
            log.error( "ContentsResultsController", e );
        }
    }
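
As an aside on the master thread: if CrawlerParser is a Callable<List<ContentsResult>>, invokeAll with a timeout would do the waiting and the cancelling in one call, making the CountDownLatch and the manual isDone checks unnecessary. A sketch under that assumption (the item-only constructor is hypothetical):

    // the enclosing method is assumed to declare throws InterruptedException
    List< Callable< List< ContentsResult > > > tasks = new ArrayList< >( );
    for( ItemXML item : resultOpenSearch )
        tasks.add( new CrawlerParser( item ) ); // hypothetical item-only constructor

    // invokeAll blocks until all tasks finish or the timeout expires,
    // and cancels whatever is still running when time runs out
    List< Future< List< ContentsResult > > > jobs =
            pool.invokeAll( tasks, timeout, TimeUnit.MILLISECONDS );

    for( Future< List< ContentsResult > > job : jobs ) {
        if( job.isCancelled( ) ) continue; // timed out, nothing to collect
        try {
            List< ContentsResult > result = job.get( ); // done, returns immediately
            if( result != null && !result.isEmpty( ) )
                imageResults.addAll( result );
        } catch( ExecutionException cause ) {
            log.error( "ContentsResultsController", cause );
        } catch( InterruptedException e ) {
            log.error( "ContentsResultsController", e );
        }
    }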
    
asked by anonymous 08.12.2016 / 22:26

0 answers