Questions about the use of BeautifulSoup

0

The code below is meant to get the genres of movies from the IMDb site. However, I'm not sure how to target the specific genres tag: sometimes, instead of the genres, it picks up the keywords tag. I think it just grabs the first matching div.

def get_genero(soup):
    # find() returns only the first <div> whose class attribute matches.
    genero = soup.find('div', attrs={'class': 'see-more inline canwrap'})
    print(genero)
    if genero is not None:
        return [a.text for a in genero.find_all('a')]
    else:
        return None

I only need the genres of IMDb movies. How can I target a specific part of the page with BeautifulSoup?

Link to a sample page from a movie:

link

    
asked by anonymous 26.11.2018 / 01:53

2 answers

2

The problem is the selector you are using: several <div> elements on the page have these three classes together. Ideally, you should write a selector that is as specific as possible to what you want to get (some browsers offer a "Copy selector" or "Copy XPath" option for a specific element in the "Inspect element" view).
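To illustrate the ambiguity, here is a minimal sketch on made-up markup (the HTML below is a simplified assumption, not the real IMDb page). Two blocks carry the same three classes, and find() simply returns the first one:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two blocks share the same three classes.
HTML = """
<div id="titleStoryLine">
  <div class="see-more inline canwrap">
    <h4>Plot Keywords:</h4>
    <a href="/keyword/toy">toy</a>
    <a href="/keyword/bear">bear</a>
  </div>
  <div class="see-more inline canwrap">
    <h4>Genres:</h4>
    <a href="/search/title?genres=animation">Animation</a>
    <a href="/search/title?genres=comedy">Comedy</a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() stops at the first match, which here is the keywords block.
first = soup.find("div", attrs={"class": "see-more inline canwrap"})
first_links = [a.get_text(strip=True) for a in first.find_all("a")]

# find_all() shows that the same class string matches more than one element.
matches = soup.find_all("div", attrs={"class": "see-more inline canwrap"})

print(first_links)
print(len(matches))
```

If the keywords block happens to come first on a given page, the class-based lookup returns keywords instead of genres, which matches the behaviour described in the question.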

Looking at the structure of the page, you can see that the genres are in the fourth <div> inside the element with the id titleStoryLine. You can then use a CSS selector to get the element:

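from requests import get
from bs4 import BeautifulSoup as bs

# Download the page and parse it with the built-in HTML parser.
soup = bs(get('https://www.imdb.com/title/tt4575576/?ref_=adv_li_tt').text, 'html.parser')

# Links inside the fourth <div> of the #titleStoryLine element.
genres = soup.select('#titleStoryLine div:nth-of-type(4) a')

for genre in genres:
    print(genre.text)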

Resulting in:

Animation
Adventure
Comedy
Drama
Family
Fantasy
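
To see how nth-of-type(4) counts elements without hitting the network, here is a small sketch on hypothetical markup (the HTML is an assumption loosely modelled on the structure described above, not a copy of the IMDb page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <h2> followed by four sibling <div> elements.
HTML = """
<div id="titleStoryLine">
  <h2>Storyline</h2>
  <div><p>Plot summary</p></div>
  <div><a href="/keyword/toy">toy</a></div>
  <div><a href="/taglines">Taglines</a></div>
  <div>
    <a href="/search/title?genres=animation">Animation</a>
    <a href="/search/title?genres=comedy">Comedy</a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# nth-of-type(4) counts only <div> siblings, so the <h2> is not counted.
genres = [
    a.get_text(strip=True)
    for a in soup.select("#titleStoryLine div:nth-of-type(4) a")
]
print(genres)
```

Keep in mind that a position-based selector like this depends on the page layout: if a block is added or removed before the genres, the index has to change.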
    
26.11.2018 / 03:34
0

One solution I found was to get the entire Storyline div and then select the <a> elements whose href contains the genre search path, as follows:

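def get_genero(soup):
    # Narrow the search to the Storyline block first.
    storyline = soup.find('div', {'id': 'titleStoryLine'})
    if storyline is None:
        return None
    # Keep only links whose href contains the genre search path.
    genero = storyline.select('a[href*="/search/title?genres"]')
    return [a.text for a in genero]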
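
A quick way to check the href-based approach is to run it on a small piece of made-up markup (the HTML below is an assumption, not the real page). The attribute value is quoted here, since CSS requires quotes around attribute values that are not plain identifiers:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a keyword link and genre links side by side.
HTML = """
<div id="titleStoryLine">
  <a href="/search/keyword?keywords=toy">toy</a>
  <a href="/search/title?genres=animation">Animation</a>
  <a href="/search/title?genres=family">Family</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# [href*="..."] matches links whose href contains the given substring.
genre_links = soup.select('#titleStoryLine a[href*="/search/title?genres"]')
genres = [a.get_text(strip=True) for a in genre_links]

# select() returns an empty list, not None, when nothing matches.
no_match = soup.select('#titleStoryLine a[href*="/no/such/path"]')

print(genres)
print(len(no_match))
```

Because select() never returns None, comparing its result with None does not detect a page without genres; checking for an empty list is what tells you nothing was found.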
    
26.11.2018 / 19:45