Crawler for site scanning [closed]

Hi everyone, how's it going?

I'd like to create a crawler that scans some specific websites daily and pulls the articles from their home pages into a spreadsheet or something similar. Specifically, I would like to scan news portals.

I am a layman on the subject, so I'd like to know what I need to use (a database, a server, something of the sort) to build it, and which language is best suited for this kind of task.

Thank you very much.

    
asked by anonymous 31.07.2017 / 14:20

1 answer

You can use the following resources:

1 - The Python language for the crawler, using one of these libraries (Scrapy or BeautifulSoup); a minimal sketch follows after this list;

2 - A database of your choice (MySQL, PostgreSQL, ...); if you are comfortable with databases, I suggest a non-relational one (MongoDB, Cassandra, ...), which, depending on the volume of data, can work in a more agile way;

3 - Deploy to a server so the program can run 24 hours a day (for example, Heroku).
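For item 1, here is a minimal sketch of the crawling step using requests and BeautifulSoup. The URL, the `<h2>` tag, and the `fetch_headlines` name are illustrative assumptions; which tag or class actually holds the headlines varies per site, so inspect each portal's HTML first.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example URL; replace it with the news portal you want to scan.
URL = "https://example.com"

def fetch_headlines(url):
    """Download the home page and extract the text of its <h2> tags."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    # The tag/class to target depends on each site's HTML structure.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if __name__ == "__main__":
    for headline in fetch_headlines(URL):
        print(headline)
```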

It is not essential, but if, in addition to the database, you want to store the information in a spreadsheet, that is quite simple to do in Python with the openpyxl library (sketch below).
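A minimal sketch of that spreadsheet step, assuming a list of headline strings like the one returned by the hypothetical `fetch_headlines` above (the function and file names are illustrative):

```python
from openpyxl import Workbook

def save_to_spreadsheet(headlines, filename="headlines.xlsx"):
    """Write one headline per row into a new .xlsx file."""
    wb = Workbook()
    ws = wb.active           # the default worksheet
    ws.title = "Headlines"
    ws.append(["Headline"])  # header row
    for headline in headlines:
        ws.append([headline])
    wb.save(filename)

# Example usage:
# save_to_spreadsheet(fetch_headlines(URL))
```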

If you need a reference, here is a personal project of mine on GitHub that deals with exactly this subject: link

    
31.07.2017 / 14:31