How do I get the final URL of a JS redirect?


I am trying to write code that gets the final redirect URL for a list of links. It works for most of the links I need, but not for this one: link

All other links worked with urllib2 or requests.

    s = requests.Session()
    r = s.get(lili[i], headers=headers)
    if lili[i] != r.url:
        print i, r.url

or

    response = urllib2.urlopen(lili[i])
    if lili[i] != response.geturl():
        print i, response.geturl()

Does anyone know how to solve this? I would rather not use Selenium, since it is not viable here (it is very time consuming).

    
asked by anonymous 23.04.2017 / 17:10

1 answer


This is a curious strategy for this type of service, and a well-played one: it is designed to prevent precisely what you are trying to do.

What happens is the following. It looks like a redirect, but it is not an HTTP redirect (status code 301). Inspecting the response body, I was (luckily) able to see what actually happens:

    setTimeout(location.href='https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr?utm_term=22696088&utm_campaign=lomadee&utm_medium=afiliados&utm_source=lomadee&lmdsid='+new Date().getTime().toString().slice(8,12)+'29157007', 500);

This is a redirect, but it only happens on the client side, after the page has been loaded and the JavaScript has been interpreted. That is why requests cannot see it happen: the service does this to "ensure" that the request came from a browser.
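To make the distinction concrete, here is a minimal sketch of a check that separates an HTTP redirect from a client-side one. The function name `classify_redirect` and the sample body are hypothetical, made up for illustration; only the `location.href` pattern comes from the response shown above.

```python
import re

# Matches an assignment such as: location.href='https://...'
JS_REDIRECT = re.compile(r"location\.href\s*=\s*[\"']https?://")

def classify_redirect(status_code, body):
    """Return 'http', 'javascript' or 'none' for a fetched response."""
    if 300 <= status_code < 400:
        return 'http'        # server-side redirect, visible to an HTTP client
    if JS_REDIRECT.search(body):
        return 'javascript'  # only runs once a browser interprets the page
    return 'none'

# Hypothetical body, shaped like the script found in the real response
sample_body = "<script>setTimeout(location.href='https://example.com/', 500);</script>"
print(classify_redirect(200, sample_body))
print(classify_redirect(301, ''))
```

With requests, a 3xx status is only visible if you pass `allow_redirects=False`; otherwise the redirects that were followed are listed in `r.history`, which stays empty for a page like this one.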

Here is a workaround, specific to this service, that uses requests to get the URL to follow next (with urllib2 it would be the same thing):

    import re
    import requests

    req = requests.get('https://redir.lomadee.com/v2/987163d4')
    # Capture the quoted URL literal assigned to location.href in the body
    redi_url = re.findall(r'(?<=location\.href=["\'])https?://.+?(?=["\'])', req.text)

    if redi_url:
        print(redi_url[0]) # https://www.walmart.com.br/dvd-automotivo-pioneer-avh-3880-com-usb-frontal-e-tela-de-7/3820066/pr?utm_term=22696088&utm_campaign=lomadee&utm_medium=afiliados&utm_source=lomadee&lmdsid=

I believe colleagues who are better with regular expressions can help here, since regex does not seem to be the best tool in this context (the full response body is here; it contains the same setTimeout redirect). Feel free to edit the answer.
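The URL captured by the regex stops at `lmdsid=`, because the rest is computed in JavaScript. As a rough sketch, assuming the script always has the shape shown above (single-quoted literals and a `slice` on the millisecond timestamp), the full URL could be rebuilt as below. The names `extract_js_redirect` and `now_ms` are hypothetical, and the sample URL is a shortened stand-in for the real one.

```python
import re
import time

# location.href='<prefix>'+new Date().getTime().toString().slice(a,b)+'<suffix>'
PATTERN = re.compile(
    r"location\.href\s*=\s*'(?P<prefix>https?://[^']+)'"
    r"(?:\s*\+\s*new Date\(\)\.getTime\(\)\.toString\(\)"
    r"\.slice\((?P<start>\d+)\s*,\s*(?P<end>\d+)\)"
    r"\s*\+\s*'(?P<suffix>[^']*)')?"
)

def extract_js_redirect(body, now_ms=None):
    """Return the URL the script would navigate to, or None if not found."""
    match = PATTERN.search(body)
    if match is None:
        return None
    url = match.group('prefix')
    if match.group('start') is not None:
        # Emulate new Date().getTime(): milliseconds since the epoch
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        start, end = int(match.group('start')), int(match.group('end'))
        url += str(now_ms)[start:end] + match.group('suffix')
    return url

# Shortened, hypothetical version of the script in the response body
sample = ("setTimeout(location.href='https://www.walmart.com.br/pr?lmdsid='"
          "+new Date().getTime().toString().slice(8,12)+'29157007', 500);")
print(extract_js_redirect(sample, now_ms=1234567890123))
# https://www.walmart.com.br/pr?lmdsid=901229157007
```

Passing `now_ms` explicitly only makes the result reproducible; in real use it can be left out so that the current time is used, as the browser would. With requests, `req.text` would be passed as `body`.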

    
23.04.2017 / 23:00