IndexError: list index out of range

Question

I am getting an error on line 7 of the code below, url=to_crawl[0], which says: IndexError: list index out of range

import requests
import re
to_crawl=['https://www.globo.com']
crawled=set()
header={'user-agent':'Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0'}
while True:
    url=to_crawl[0]
    try:
        req=requests.get(url, headers=header)
    except:
        to_crawl.remove(url)
        crawled.add(url)
        continue

    html=req.text
    links=re.findall(r'<a href="?\'?(https?:\/\/[^"\'>]*)', html )
    print("Crawling:", url)

    to_crawl.remove(url)
    crawled.add(url)

    for link in links:
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)

python

asked by anonymous 14.09.2018 / 03:45

1 answer

score 0 · Answer 1

Inside your while loop you take the first item of the to_crawl list and make the request inside the try/except block. When the request raises an exception, the except block removes the only URL in the list, leaving it empty, and continues to the next iteration. That iteration again tries to assign the first item of the list to the url variable, but the list is now empty, so url=to_crawl[0] raises the IndexError.

What sends your request into the except block is the header dictionary: the user-agent string contains a … (ellipsis) character. When I removed it to test, the request succeeded. So you can take it out, leaving the line like this:

header={'user-agent':'Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0'}
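
A likely explanation for why that one character matters, assuming the HTTP layer underneath requests encodes header values as Latin-1 (which is what Python's standard http.client does): the ellipsis is not part of Latin-1, so encoding the header fails before anything is sent. The helper name latin1_encodable below is made up for this illustration and is not part of requests.

```python
def latin1_encodable(value):
    """Return True if `value` can be encoded as Latin-1 (hypothetical helper)."""
    try:
        value.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# The user-agent from the question, with the ellipsis character left as-is
bad = "Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0"
# The same string with the ellipsis removed
good = "Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0"

print(latin1_encodable(bad))   # False
print(latin1_encodable(good))  # True
```

Because the bare except: in the question swallows every exception, this kind of error never shows up in the output; printing the exception inside the except block would make it visible.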

A possible fix for your except block is to check whether the list is empty and, if so, end the script. For example:

try:
    req=requests.get(url, headers=header)
except:
    to_crawl.remove(url)
    crawled.add(url)
    if not to_crawl:
        break
    continue
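
Another way to get the same effect, sketched here under some assumptions, is to let the loop condition do the emptiness check, so the loop ends by itself once nothing is left to crawl. The function crawl, its fetch parameter, and max_pages are names invented for this sketch; fetch stands in for the requests.get call so the loop can be tried without network access.

```python
import re
from collections import deque

# Same link pattern as in the question (the escaped slashes are not needed)
LINK_RE = re.compile(r'<a href="?\'?(https?://[^"\'>]*)')


def crawl(start_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or raises."""
    to_crawl = deque(start_urls)
    queued = set(start_urls)  # every URL ever queued, to skip duplicates
    crawled = []
    failed = []
    # The condition replaces `while True`: an empty queue ends the loop
    while to_crawl and len(crawled) + len(failed) < max_pages:
        url = to_crawl.popleft()
        try:
            html = fetch(url)
        except Exception as exc:
            # Keep the error instead of discarding it silently
            failed.append((url, exc))
            continue
        crawled.append(url)
        for link in LINK_RE.findall(html):
            if link not in queued:
                queued.add(link)
                to_crawl.append(link)
    return crawled, failed
```

With requests, fetch could be a small function that returns requests.get(url, headers=header).text. Returning the failed list alongside the crawled one would have exposed the header problem on the first run.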

I hope this helps! :)