IndexError: list index out of range (index does not exist)

I am getting an error on line 7 of the code below (url=to_crawl[0]): IndexError: list index out of range
import requests

import re

to_crawl=['https://www.globo.com']

crawled=set()

header={'user-agent':'Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0'}

while True:

    url=to_crawl[0]
    try:
        req=requests.get(url, headers=header)

    except:
        to_crawl.remove(url)
        crawled.add(url)
        continue

    html=req.text
    links=re.findall(r'<a href="?\'?(https?:\/\/[^"\'>]*)', html )
    print("Crawling:", url)

    to_crawl.remove(url)
    crawled.add(url)

    for link in links:
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)
    
asked by anonymous 14.09.2018 / 03:45

1 answer

Inside your while loop you take the first item of the to_crawl list and use it for the request, which runs inside the try / except block. When the request fails, execution enters the except block and removes the only URL in the list, leaving it empty. The loop then moves on to the next iteration and again tries to assign the first item of the list to the url variable, but the list is now empty, so to_crawl[0] raises IndexError.

What sends your request into the except block is your header dictionary: there is a … (ellipsis) character in the user-agent string. When I removed it to test, the request succeeded. So you can take it out, leaving the line like this:

header={'user-agent':'Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0'}
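A likely reason the ellipsis breaks the request, offered as an assumption rather than something verified against your setup: Python's HTTP layer encodes string header values as Latin-1, and … (U+2026) has no Latin-1 representation, so an exception is raised before anything is sent. The small check below illustrates this; is_latin1_encodable is a hypothetical helper name, not part of requests.

```python
def is_latin1_encodable(value: str) -> bool:
    """Return True if the string can be encoded as Latin-1 (hypothetical helper)."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# User-agent from the question (with the ellipsis) and from the fix (without it).
broken = "Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0"
fixed = "Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0"

print(is_latin1_encodable(broken))  # False
print(is_latin1_encodable(fixed))   # True
```

Because the bare except in the question swallows every exception, this kind of error stays hidden; printing the exception inside the except block would have shown it directly.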

One possible fix is to check, inside your except block, whether the list is now empty and, if so, end the script, for example:

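try:
    req=requests.get(url, headers=header)
except Exception:
    # Catch Exception rather than using a bare except, so Ctrl+C still stops the script.
    to_crawl.remove(url)
    crawled.add(url)
    # If that was the last URL, there is nothing left to crawl.
    if not to_crawl:
        break
    continue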
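The same idea can be applied to the whole loop: making the loop condition itself depend on the list means the empty case is handled on both the success and the failure path. The following is only a sketch under a few assumptions. crawl, fetch, max_pages and the example.com-style URLs are names introduced here for illustration, and fetch stands for any function that returns the HTML of a URL or raises on failure.

```python
import re
from collections import deque

# Same pattern as in the question, compiled once.
LINK_RE = re.compile(r'<a href="?\'?(https?://[^"\'>]*)')


def crawl(start_urls, fetch, max_pages=100):
    """Crawl until the queue is empty or max_pages URLs have been processed."""
    to_crawl = deque(start_urls)
    queued = set(start_urls)  # every URL ever queued, to avoid duplicates
    crawled = []              # URLs in the order they were processed

    # An empty queue ends the loop, so to_crawl is never indexed when empty.
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        crawled.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # the failed URL has already left the queue
        for link in LINK_RE.findall(html):
            if link not in queued:
                queued.add(link)
                to_crawl.append(link)
    return crawled


# Offline demonstration with made-up pages instead of real requests.
pages = {
    "https://a.example": '<a href="https://b.example">b</a>',
    "https://b.example": '<a href="https://a.example">a</a>',
}


def fake_fetch(url):
    return pages[url]  # a KeyError here simulates a failed request


print(crawl(["https://a.example"], fake_fetch))
# ['https://a.example', 'https://b.example']
```

With requests, fetch could be something like lambda url: requests.get(url, headers=header, timeout=10).text. The max_pages limit is there because a real crawl of a large site would otherwise keep growing the queue.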

I hope this helps! :)

    
14.09.2018 / 06:39