Problems with JavaScript, urllib and BeautifulSoup

2

My code is meant to get the link to the video and play it directly in VLC, but I ran into a problem: apparently urllib does not execute JavaScript, and the player is inserted into the page by JavaScript. See my code below:

from bs4 import BeautifulSoup
import urllib.request

webpage = urllib.request.urlopen('http://www.animesproject.com.br/serie/885/2107/Death-Parade-Episodio-01')
soup = BeautifulSoup(webpage, 'html.parser')  # name the parser explicitly to avoid bs4's warning
player = soup.find(id="player_frame")
print(player)

My question is: can this be done using urllib? If not, what other way is there? Is there a framework for this?

Note: the print always outputs None.

    
asked by anonymous 08.03.2015 / 03:09

2 answers

1

Neither urllib nor BeautifulSoup interprets or executes JavaScript.

Some options are:

Selenium: it drives a real browser such as Chrome or Firefox. If you are on Linux, you can run it on a headless display.

PhantomJS: a headless browser that can also be used together with Selenium.

Qt: it has a component based on an old version of WebKit, intended to let you embed an HTML-based component in your window. It is somewhat buggy, and if you do use it I recommend using Python's multiprocessing to run what you need.
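
As a rough illustration of the Selenium option, here is a minimal sketch. It assumes Selenium 4 with Firefox and geckodriver installed, and it uses the browser's own headless mode rather than a headless display. The function names are made up for this example, and it is only an assumption that the `player_frame` element from the question exposes the video address in a `src` attribute.

```python
def make_headless_firefox():
    """Start a headless Firefox (needs selenium, Firefox and geckodriver)."""
    # Imported lazily so get_player_src() below works with any driver object.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # browser headless mode, no display needed
    driver = webdriver.Firefox(options=options)
    driver.implicitly_wait(10)  # wait up to 10 s for elements added by JavaScript
    return driver


def get_player_src(driver, url, element_id="player_frame"):
    """Load url in the browser and return the src attribute of element_id."""
    driver.get(url)  # the browser runs the page's JavaScript
    # "id" is the value of Selenium's By.ID locator strategy.
    element = driver.find_element("id", element_id)
    # Assumption: the element carries the address we want in its src attribute.
    return element.get_attribute("src")
```

To try it, create a driver with `make_headless_firefox()`, pass it to `get_player_src()` together with the episode URL, and call `driver.quit()` when you are done.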

    
08.03.2015 / 12:54
1

In some situations, like this one, you can study the page's source a bit and make the same requests that the JavaScript would make.

Here is a class I implemented that mimics this. It is written for Python 2.7. If you step through each function in a debugger, the flow is easy to follow.

# anime.py

from bs4 import BeautifulSoup
import urllib2
import re


class Anime2MP4(object):
    anime_zero_url = 'http://www.animesproject.com.br/serie/885/2162/Death-Parade-Episodio-00'  # noqa
    anime_url_format = 'http://www.animesproject.com.br/playerv52/player.php?a=0&0={0}&1={1}'  # noqa

    def build_episode_url(self, url_parameters):
        return self.anime_url_format.format(*url_parameters)

    def get_episodes_url(self):
        webpage = urllib2.urlopen(self.anime_zero_url)
        soup = BeautifulSoup(webpage, 'html.parser')
        id_tag = 'serie_lista_episodios'
        episodes = soup.find(id=id_tag).find_all('a', href=True)
        return [ep['href'] for ep in episodes]

    def get_parameters(self):
        pars = []
        for ep in self.episodes:
            ep_split = ep.split('/')
            pars.append((ep_split[2], ep_split[3]))
        return pars

    def get_mp4_episode(self, url, quality='MQ'):
        """
        quality: can be HD or MQ
        """
        webpage = urllib2.urlopen(url)
        html_content = webpage.read()
        pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'  # noqa
        urls = set(re.findall(pattern, html_content))  # unique urls
        urls = filter(lambda s: s.endswith('.mp4'), urls)  # only .mp4
        return next((url for url in urls if quality in url), None)

    def run(self):
        self.episodes = self.get_episodes_url()
        list_episode = map(self.build_episode_url, self.get_parameters())
        mp4_links = map(self.get_mp4_episode, list_episode)
        for num, ep in enumerate(mp4_links):
            print num, ep

if __name__ == '__main__':
    anime = Anime2MP4()
    anime.run()
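
Since the question uses Python 3, the two core steps of the class above (building the player URL from an episode link, then picking the .mp4 link out of the player page) could be sketched with the standard library alone, roughly as follows. The function names and the simplified `MP4_PATTERN` are made up for this example, not part of the class above. The URL template is copied from it and may no longer match the site.

```python
import re
import urllib.request
from urllib.parse import urlparse

# Player URL template copied from the Python 2.7 class above.
PLAYER_URL_FORMAT = 'http://www.animesproject.com.br/playerv52/player.php?a=0&0={0}&1={1}'  # noqa

# Simplified, hypothetical pattern: any http(s) URL up to and including ".mp4".
MP4_PATTERN = re.compile(r'https?://[^\s"\'<>]+?\.mp4')


def build_player_url(episode_url):
    """Turn an episode URL (or path) into the player page URL."""
    # The path looks like /serie/<series id>/<episode id>/<title>.
    parts = urlparse(episode_url).path.split('/')
    return PLAYER_URL_FORMAT.format(parts[2], parts[3])


def fetch_html(url):
    """Download a page and decode it to text (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def find_mp4_link(html, quality='MQ'):
    """Return the first .mp4 URL in html that contains quality, or None."""
    links = sorted(set(MP4_PATTERN.findall(html)))  # unique URLs, stable order
    return next((link for link in links if quality in link), None)
```

A call such as `find_mp4_link(fetch_html(build_player_url(episode_url)))` would then give the link, or None if nothing matches. Assuming the `vlc` command is on your PATH, the result could be handed to VLC with `subprocess.run(['vlc', link])`.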
    
21.04.2015 / 21:01