Which language performs better for a multithreaded web crawler using parallelism? [closed]


I'm going to start a project where one of the stages will fetch certain information from other companies' websites.

Given that the crawler will go through x sites, and will visit several pages on each site at least once a day, it may be worthwhile to make the crawler multithreaded, using parallelism to speed up the process.
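A minimal sketch of that idea in Python (the names and worker count are illustrative, not from any particular framework): a thread pool fetches several pages concurrently, which helps because crawling is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, max_workers=8):
    """Fetch every URL in parallel; `fetch` is any callable url -> content."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(fetch, urls)))

# In a real crawler, `fetch` would wrap urllib.request or a similar HTTP
# client; a stub keeps the sketch self-contained.
def fake_fetch(url):
    return f"<html>content of {url}</html>"

pages = crawl(["http://a.example", "http://b.example"], fake_fetch)
```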

At this point, I do not know whether Python would serve me as well as Java here.

Question:

Given the tools each language offers, which option does experience show to perform better, Java or Python? Does Python lose out to Java here?

asked by anonymous 09.05.2015 / 16:43

1 answer


It depends on which tool you are going to use, on whether you need to emulate a browser and, if so, whether you also need to execute JavaScript.

I've written a lot of crawlers / web scrapers, mainly in Python, which has libraries like lxml that are very fast. The biggest problems when processing third-party sites are malformed HTML and content rendered by JavaScript.
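As an illustration of the malformed-HTML point (assuming lxml is installed), `lxml.html` repairs a broken document instead of failing, so queries still work:

```python
from lxml import html

# Malformed HTML: unclosed <p> and <b>, no <html>/<body> wrapper.
broken = "<p>hello <b>world"
tree = html.fromstring(broken)       # lxml builds a repaired tree instead of raising
bold_text = tree.findtext(".//b")    # XPath-style queries work on the repaired tree
```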

In the end it is often worth using Selenium, which drives a real browser (you can use Firefox, Chrome, etc.).

Selenium does not have very good performance, simply because browsers take up a lot of memory and consume a lot of CPU; Selenium itself just sends commands to the browser you chose.

But it's worth it, because the browser will execute JavaScript exactly as the site's developers tested it, and the page will behave as if a person were using it. If the information you need is visible to the user, then you can access it with Selenium.

Selenium has bindings for various languages, including Python and Java; you can write your code in either of the two. In the end, the best language is the one you know best or are most productive in.
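For the Python side, a sketch of the Selenium approach (assuming Selenium 4 and a locally installed Firefox; the helper name and selector are illustrative):

```python
def fetch_rendered(url, css_selector):
    """Load a page in headless Firefox so its JavaScript actually runs,
    then return the text of the first element matching `css_selector`.
    Hypothetical helper; requires the `selenium` package and Firefox."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.CSS_SELECTOR, css_selector).text
    finally:
        driver.quit()  # always release the browser process
```

The Java binding is equivalent (`new FirefoxDriver()`, `driver.findElement(By.cssSelector(...))`), which is why the language choice comes down to familiarity.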

10.05.2015 / 20:19