Regular expression (regex) for links in web pages using Python

Question

1

I'm trying to learn how to build a web crawler. Part of the code has to extract the links on a web page (only links beginning with http or https):

import re   
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)

How can I modify this regex, or write a new one, so that it only takes links that start with http or https? I do not want to capture the word "href", just "http://..." or "https://...". Relative links such as "media/test" or "g1/news" are of no use to me.

padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+' ,html)

This pattern was not 100% functional either:

link

link "

Some matches came back with a " at the end, which should not happen!

python regex

asked by anonymous 31.07.2016 / 18:57

2 answers

score 4

To complement the excellent answer from @zekk, here is a solution for Python 3.x:

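import re
import requests

url = "http://pt.stackoverflow.com/q/143677"
# .text gives the page body as str, which re can search directly
html = requests.get(url).text

# Lookbehind/lookahead keep href= and the quotes out of each match
urls = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)
print(urls)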
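
To see what the lookbehind and lookahead do without making a network request, here is a minimal sketch that runs the same pattern on an inline string. The sample markup and the names sample_html and links are assumptions made up for illustration; they are not from the original answer.

```python
import re

# Hypothetical sample markup, used here instead of a live request.
sample_html = (
    '<a href="https://example.com/news">news</a>'
    "<a href='http://example.org/a_b'>a</a>"
    '<a href="media/test">relative</a>'
)

# The lookbehind requires href=" or href=' right before the match.
# The lazy .+? then stops at the next quote, which the lookahead
# checks for without including it in the result.
pattern = re.compile(r'(?<=href=["\'])https?://.+?(?=["\'])')

links = pattern.findall(sample_html)
print(links)
# ['https://example.com/news', 'http://example.org/a_b']
```

The relative link media/test is skipped because it does not start with http:// or https://, and neither href= nor the quotes end up in the results.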

answered 31.07.2016 / 21:57


score 2 · Accepted Answer

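# Python 2 version (urllib.urlopen and the print statement)
import urllib, re

url = "http://pt.stackoverflow.com/q/143677"
html = urllib.urlopen(url).read()

# Lookbehind/lookahead keep href= and the quotes out of each match
urls = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)

for link in urls:
    print link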

In the pattern, https? matches either http or https.
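
As for the stray quote reported in the question: the character class in the second pattern contains ' and " itself, so the closing quote is swallowed as part of the match, and without a capture group the href= prefix is returned too. The sketch below reproduces that on a made-up string (sample_html, with_quote and clean are hypothetical names) and shows one possible variant, written in Python 3, that uses a capture group and a class excluding quotes, spaces and >.

```python
import re

# Hypothetical sample markup: one absolute link and one relative link.
sample_html = '<a href="http://example.com/g1/news">x</a> <a href="g1/news">y</a>'

# Second pattern from the question: the class includes ' and ",
# so the closing quote is matched along with the URL.
with_quote = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+', sample_html)

# Variant: capture only the URL and stop at a quote, space or >.
clean = re.findall(r'href=[\'"]?(https?://[^\'" >]+)', sample_html)

print(with_quote)  # ['href="http://example.com/g1/news"']
print(clean)       # ['http://example.com/g1/news']
```

This variant is just the first regex from the question with https?:// moved inside the capture group, so it is an alternative to the lookaround version above rather than a replacement for it.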