I'm trying to learn how to create a web crawler. Part of the code needs to extract the links on a web page (only links beginning with http or https):
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
How can I modify this, or write a new regex, so that it only takes links that start with http or https? I do not want to keep the "href" part, just "http://..." or "https://...". Relative links are of no use to me, for example: "media/test", "g1/news". This was my attempt:
padrao = re.findall(r'href=[\'"]https?://[\w:/\.\'"_]+', html)
The padrao pattern was not 100% functional either:
link "
Some of the results were left with a " at the end, which should not happen!
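
For reference, here is a minimal sketch of the behaviour I am after. It probably explains the two problems in my attempt: that pattern has no capture group, so "href=" is returned as part of the match, and the character class includes the quote characters, so the closing quote gets swallowed. The sample HTML and the example.com URLs below are made up for illustration, and a regex like this will likely miss edge cases that a real HTML parser would handle.

```python
import re

# Hypothetical sample page, invented for this example.
html = '''
<a href="http://example.com/a">A</a>
<a href='https://example.com/b?x=1'>B</a>
<a href="media/test">C</a>
<a href=g1/news>D</a>
<a href=https://example.com/c>E</a>
'''

# href= followed by an optional quote; the group captures only the URL.
# The URL must start with http:// or https:// and runs until the next
# quote, space, or '>' so the closing quote is never included.
pattern = re.compile(r'href=[\'"]?(https?://[^\'" >]+)', re.IGNORECASE)

# With exactly one capture group, findall returns only the group contents.
urls = pattern.findall(html)
print(urls)
# ['http://example.com/a', 'https://example.com/b?x=1', 'https://example.com/c']
```

The relative links "media/test" and "g1/news" are skipped because they do not start with http:// or https://.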