Pass a URL list to Scrapy function

I have a Python API that takes two arguments (a URL and a user-defined word) and returns, as JSON, how many times the specified word appears in the page at that URL.

However, I would like to pass a list of URLs. I would also like to make the request with AsyncIO. Any suggestions?

Here is the code:

from flask import Flask
from flask_restful import Resource, Api, reqparse, abort
import requests

app = Flask(__name__)
api = Api(app)

parser = reqparse.RequestParser()
parser.add_argument('url')
parser.add_argument('word')
parser.add_argument('ignorecase')
	
# Performs a GET on the URL and returns how many times `word` appears in the content
def count_words_in(url, word, ignore_case):
	try:
		r = requests.get(url)
		data = str(r.text)
		if (str(ignore_case).lower() == 'true'):
			return data.lower().count(word.lower())
		else:
			return data.count(word)
	except Exception as e:
		raise e
		
# Prepends 'http://' to the URL when it is missing and returns the valid URL
def validate_url(url):
	if not(url.startswith('http')):
		url = 'http://' + url
	return url
	

class UrlCrawlerAPI(Resource):
	def get(self):
		try:
			args = parser.parse_args()
			valid_url = validate_url(args['url'])
			return { valid_url : { args['word'] : count_words_in(valid_url, args['word'], args['ignorecase']) }}
		except AttributeError:
			return { 'message' : 'Please provide URL and WORD arguments' }
		except Exception as e:
			return { 'message' : 'Unhandled Exception: ' + str(e) }

		
api.add_resource(UrlCrawlerAPI, "/")

if __name__ == '__main__':
	app.run(debug=True)
    
asked by anonymous 28.09.2018 / 16:33

1 answer

You asked two questions in one:

  

I would like to pass a list of URLs.

It looks like you do not need to change much: just pass the list and loop over it.

Maybe change the name of your parameter from url to urls just to be consistent?

# Assumes the argument is declared as parser.add_argument('urls', action='append')
args = parser.parse_args()
valid_urls = [validate_url(url) for url in args['urls']]
for valid_url in valid_urls:
    ...
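
To make the elided loop body concrete, here is a minimal sketch of one way it could look. The helper name `count_words_in_urls` and its `fetch` parameter are my own assumptions, not part of your code: `fetch` stands for any callable that takes a URL and returns the page text (in your API it could be something like `lambda u: requests.get(u).text`), which also lets you try the function without touching the network.

```python
def validate_url(url):
    # Same rule as in the question: prepend 'http://' when it is missing
    if not url.startswith("http"):
        url = "http://" + url
    return url


def count_words_in_urls(urls, word, ignore_case, fetch):
    """Return {url: {word: count}} for every URL in the list.

    `fetch` is a hypothetical callable (url -> page text) supplied by the caller.
    """
    results = {}
    for url in urls:
        valid_url = validate_url(url)
        text = fetch(valid_url)
        # ignore_case arrives as a string from the query string, as in the question
        if str(ignore_case).lower() == "true":
            count = text.lower().count(word.lower())
        else:
            count = text.count(word)
        results[valid_url] = {word: count}
    return results


# Stand-in pages used only for illustration
FAKE_PAGES = {
    "http://a.example": "spam Spam eggs",
    "http://b.example": "eggs only",
}

if __name__ == "__main__":
    print(count_words_in_urls(["a.example", "http://b.example"], "spam", "true", FAKE_PAGES.get))
```

The returned dictionary has the same shape as the single-URL response in your `get` method, just with one key per URL.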
  

I would also like to make the request with AsyncIO. Any suggestions?

You are using Flask, a synchronous framework based on the WSGI standard, which does not fit well with asyncio. Flask view methods do not yield control to the event loop, as asyncio requires, and Flask uses threads to serve multiple requests at the same time.
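
To illustrate what "giving control to the event loop" means, here is a small sketch using only the standard library. The names `cooperative`, `blocking` and `run_pair` are made up for this example; `blocking` stands in for code that never awaits, such as a plain `requests.get(url)` call inside a coroutine.

```python
import asyncio


async def cooperative(name, log):
    for step in range(2):
        log.append(f"{name}{step}")
        # Awaiting hands control back to the event loop so other tasks can run
        await asyncio.sleep(0)


async def blocking(name, log):
    for step in range(2):
        # No await here: this stands in for a blocking call,
        # so the other task cannot run until this one finishes
        log.append(f"{name}{step}")


async def run_pair(worker):
    # Run two copies of the same worker "concurrently" and record the order
    log = []
    await asyncio.gather(worker("a", log), worker("b", log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_pair(cooperative)))
    print(asyncio.run(run_pair(blocking)))
```

With the cooperative worker the two tasks interleave, while with the blocking worker the first task runs to completion before the second one starts. That second behaviour is what happens when synchronous code is called from inside asyncio.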

So you will have some difficulty integrating asyncio with Flask, and you will not gain much, since part of your I/O is not asynchronous. If you prefer to go this route, I suggest taking a look at the flask-aiohttp project, which provides this "glue", but I do not recommend it unless your project has a strong need to reuse code already written for both Flask and asyncio.

If you are just starting out and want to use asynchronous programming, I suggest replacing Flask with a web framework that is also asynchronous. There are several; one successful example in the Python community is Sanic, which is designed to look like Flask, so the switch should not require many changes.
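
Whichever framework you pick, the counting part could be written roughly like the sketch below. It is only an outline under some assumptions: `fetch` is a hypothetical coroutine function (URL in, page text out) that in a real application would wrap an asynchronous HTTP client such as aiohttp, and `fake_fetch` is a stand-in so the example runs without network access.

```python
import asyncio


def count_word(text, word, ignore_case=False):
    # Same counting rule as in the question
    if ignore_case:
        return text.lower().count(word.lower())
    return text.count(word)


async def count_in_url(fetch, url, word, ignore_case=False):
    # `fetch` is an assumed coroutine function: url -> page text
    text = await fetch(url)
    return url, count_word(text, word, ignore_case)


async def count_in_urls(fetch, urls, word, ignore_case=False):
    # gather() starts all downloads concurrently and keeps the input order
    results = await asyncio.gather(
        *(count_in_url(fetch, url, word, ignore_case) for url in urls)
    )
    return {url: {word: count} for url, count in results}


async def fake_fetch(url):
    # Stand-in for a real asynchronous HTTP request
    await asyncio.sleep(0)
    pages = {"http://a.example": "spam Spam eggs", "http://b.example": "eggs"}
    return pages[url]


if __name__ == "__main__":
    urls = ["http://a.example", "http://b.example"]
    print(asyncio.run(count_in_urls(fake_fetch, urls, "spam", ignore_case=True)))
```

The point of passing `fetch` in is that the counting logic stays independent of the HTTP client, so you can swap in a real client later without touching the rest.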

    
28.09.2018 / 21:43