BeautifulSoup - Real href links


I was studying web scraping with Python and started using the bs4 library (BeautifulSoup). When I started extracting the a tags and their href attributes, I realized that I could not access the link when the href contained something like:

href="/alguma_pagina.php"

In the case above, I cannot simply make a request for the "/alguma_pagina.php" value, since that is not a valid URL on its own.

I need to get the real URL that clicking the link would take me to, not just the value in the href. How can I get this full URL?

Note that the page URL may be of the form "url.com.br/", with or without the trailing slash. The href values can be of the following forms:

"#"
"#alguma_coisa"
"cadastro.php"
"/cadastro.php"
"http://outra_url.com"
"outra_url.com"

and each of these can start or end with a space.

asked by anonymous 07.11.2017 / 08:58

1 answer


Whenever you are on a page and a link is relative to it, the full address is the page's own URL plus that relative path.

Instead of doing an ugly manual string concatenation, you can use the urllib.parse module to build the URL for the new request.

However, as you yourself said, sometimes the URL is not relative. Let's handle that case:

import urllib.parse
urllib.parse.urljoin('http://google.com', 'http://ddg.gg')
# 'http://ddg.gg'

In this case, since both URLs are absolute, urljoin always returns the second one, so you can keep a fixed base URL as the first argument and vary the second.

Another case is joining an absolute base and a relative path with the same function, for example:

urllib.parse.urljoin('http://ddg.gg/', 'teste.php')

The return is 'http://ddg.gg/teste.php', which solves the case of relative URLs.
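Putting this together with the href variations listed in the question, a small sketch (the base URL here is illustrative) that strips the surrounding spaces the question mentions and resolves each href against the page URL:

```python
from urllib.parse import urljoin

base = 'http://url.com.br/'  # illustrative page URL

hrefs = ['#', '#alguma_coisa', 'cadastro.php', '/cadastro.php',
         'http://outra_url.com', ' cadastro.php ']

for href in hrefs:
    # strip() removes the leading/trailing spaces the question warns about;
    # urljoin resolves relative paths and keeps absolute URLs untouched
    print(urljoin(base, href.strip()))
```

Note that urljoin does not help with the last problem case, a bare domain like 'outra_url.com': it is treated as a relative path, as discussed below.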

The only case this function does not solve is when the second string lacks the 'http' prefix; then it simply joins the two strings:

urllib.parse.urljoin('http://ddg.com/', 'teste.com')

The return is 'http://ddg.com/teste.com', and from there it is up to you to decide whether the URL is valid or not.

Another option is to use urlparse:

import urllib.parse
urllib.parse.urlparse('teste.com')
# ParseResult(scheme='', netloc='', path='teste.com', params='', query='', fragment='')

This returns a named tuple whose netloc attribute you can inspect: if it is empty, the URL is not absolute. This solves the same case as before, although I find the first approach more Pythonic.
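As a sketch of that check, a small helper (the name is mine, not from any library) built on urlparse:

```python
from urllib.parse import urlparse

def is_absolute(url):
    # an absolute URL has a network location (domain) after parsing;
    # relative paths and bare domains without a scheme parse as path only
    return bool(urlparse(url.strip()).netloc)

print(is_absolute('http://outra_url.com'))  # True
print(is_absolute('/cadastro.php'))         # False
print(is_absolute('outra_url.com'))         # False: no scheme, parsed as path
```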

If the URL is some odd value without the 'http' prefix, this will still trip you up. What I would recommend: create a list of badwords, i.e. a list of suffix strings such as ['.com', '.net', '.br', '.de'], and do a simple validation to check whether the string ends with one of them. That way you would also know it is not a relative path and could use this criterion to decide whether or not to make the request.
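That badwords idea could be sketched like this (the helper name and the suffix list are illustrative, not exhaustive):

```python
def looks_like_domain(href, badwords=('.com', '.net', '.br', '.de')):
    # hypothetical helper: flag hrefs that look like bare domains,
    # e.g. 'outra_url.com', which urljoin would wrongly treat as a path
    first_segment = href.strip().split('/')[0]
    return any(first_segment.endswith(suffix) for suffix in badwords)

print(looks_like_domain('outra_url.com'))   # True
print(looks_like_domain('cadastro.php'))    # False
print(looks_like_domain('url.com.br/x'))    # True
```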

answered 08.11.2017 / 20:14