Regular expression (regex) for links in web pages using Python

1

I'm trying to learn how to build a web crawler. Part of the code needs to extract the links on a web page (only links beginning with http or https):

import re   
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

How can I modify this regex, or write a new one, so that it captures only links that start with http or https? I do not want to keep the word "href", just "http://..." or "https://...". Relative links such as "media/test" or "g1/news" are of no use to me.

padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+' ,html)

This pattern (padrao) was not 100% functional either:

link

link "

Some matches came out with a " at the end, which should not happen!

    
asked by anonymous 31.07.2016 / 18:57

2 answers

2
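import urllib, re

# Python 2 code: in Python 3, urlopen lives in urllib.request
url = "http://pt.stackoverflow.com/q/143677"
html = urllib.urlopen(url).read()

# The lookbehind requires href=" or href=' before the link without capturing it;
# the lookahead stops the lazy match right before the closing quote
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)

for url in urls:
    print url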

The https? part of the pattern matches either http or https.
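
To see why the pattern from the question keeps the trailing quote while the lookaround version does not, here is a minimal sketch that runs both against an inline string. The sample markup and the example.com addresses are made up for illustration, and the code is written for Python 3.

```python
import re

# Hypothetical sample markup, standing in for a downloaded page.
html = (
    '<a href="http://example.com/a">A</a> '
    "<a href='https://example.com/b?x=1'>B</a> "
    '<a href="media/test">relative</a> '
    '<a href="g1/news">relative</a>'
)

# Pattern from the question: the character class contains both quote
# characters, so a closing quote can be consumed as part of the match,
# and "href=" is kept because nothing is captured separately.
greedy = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+', html)

# Lookaround pattern: the lookbehind requires href=" (or href=') without
# including it, and the lookahead stops right before the next quote.
urls = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)

print(greedy)
print(urls)
```

The relative links are skipped in both cases because the text right after the opening quote has to start with http:// or https://.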
    
31.07.2016 / 21:30
4

To complement the excellent answer from @zekk, here is a solution for Python 3.x:

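import requests, re  # requests is a third-party package (pip install requests)

url = "http://pt.stackoverflow.com/q/143677"
html = requests.get(url).text  # response body decoded to str

# Same pattern as above: only href values that start with http:// or https://
urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)
print(urls)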
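
As a possible alternative to lookarounds, the first regex in the question can be adapted with a capture group, so that only the URL is returned and unquoted href values are still accepted. This is just a sketch: the names LINK_RE, extract_links and sample are hypothetical, and the re.IGNORECASE flag is an assumption about how tolerant the crawler should be.

```python
import re

# Optional opening quote, then capture only values that begin with
# http:// or https://; the capture ends at a quote, a space or ">".
LINK_RE = re.compile(r'href=[\'"]?(https?://[^\'" >]+)', re.IGNORECASE)


def extract_links(html):
    """Return the absolute http(s) links found in href attributes."""
    # With a single capture group, findall returns only the captured text.
    return LINK_RE.findall(html)


# Hypothetical sample markup for illustration.
sample = (
    '<a href="https://example.com/x">x</a> '
    '<a href=http://example.com/y>y</a> '
    '<a href="media/test">z</a>'
)
print(extract_links(sample))
```

Because the quote is optional here, this variant also picks up href=http://... written without quotes, which the lookbehind version does not match.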
    
31.07.2016 / 21:57