How to find a "div" containing the text "Idade:" and extract only the number with bs4 and regex in Python 3

I'm trying to extract the age value from the tag below, across several HTML files that share the same structure.

<div class="col-md-12">

            <div class="section-title line-style">
                       <h3 class="title">Acervo Alto de Pinheiros</h3>
            </div>
            <div>
            <b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
            <div>
               <b>Idade:</b> 9 anos</div>

I'm using a regular expression with find to extract it:

import glob
import os
import re

from bs4 import BeautifulSoup

dir_path = "/home/user/pasta/htmls"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    my_data = (file_name)
    soup = BeautifulSoup(open(my_data, "r").read())

    realizacao= soup.find("div",{"class":"col-md-12"}).find(text=re.compile("<div><b>Realização:</b>(.*?)</div>"))
    print(realizacao)

    idade= soup.find(text=re.compile("<div><b>Idade:</b>(.*?)Anos</div>"))
    print(idade)

But both the realizacao and idade variables come back as None. What could be wrong in the code? I'm new to programming, and Python is the first language I'm learning.

    
asked by anonymous 31.10.2016 / 15:39

2 answers

First, if the intent is just to look something up between tags, there is no need for regex. BeautifulSoup exists precisely so you can handle HTML/XML in a simplified way, without having to resort to regex, which in many cases only brings headaches.
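
As for why the code in the question returns None: when find is given a regex through text= (or string= in newer versions), BeautifulSoup tests the pattern against each text node, and text nodes never contain markup such as <div> or <b>. A minimal sketch of that behavior, assuming a trimmed copy of the question's HTML and the built-in "html.parser":

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (an assumption for this demo).
html = """
<div class="col-md-12">
    <div><b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
    <div><b>Idade:</b> 9 anos</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A pattern that includes tags cannot match: the regex is tested against
# text nodes only, and those hold no "<b>" or "</b>".
with_tags = soup.find(string=re.compile(r"<b>Idade:</b>(.*?)anos"))

# A pattern for the label alone matches the text node inside <b>.
label = soup.find(string=re.compile(r"Idade:"))

print(with_tags)  # None
print(label)      # Idade:
```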

Assuming the files really do share the same structure, that is, a <div class="col-md-12"> tag whose second inner <div> contains the Realização field and whose third contains the age, this should solve it:

# divs is a list of the inner <div> tags
divs = soup.find("div", {"class":"col-md-12"}).find_all("div")

# divs[i].text gives the tag's inner text, ignoring the markup of nested tags (such as <b>)
# .split() returns a list with the string's terms, broken on whitespace
#     (maxsplit=1 means it only breaks at the 1st space)
realizacao = divs[1].text.split(maxsplit=1)[1]
idade      = divs[2].text.split()[1]
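
To check the positional approach end to end, here is a small sketch run against the HTML from the question. The function name extract_fields, the closing </div> added to the sample, and the int() conversion are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample from the question, with a closing </div> added so it is complete.
html = """
<div class="col-md-12">
    <div class="section-title line-style">
        <h3 class="title">Acervo Alto de Pinheiros</h3>
    </div>
    <div>
    <b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
    <div>
       <b>Idade:</b> 9 anos</div>
</div>
"""


def extract_fields(markup):
    """Hypothetical helper: return (realizacao, idade) by <div> position."""
    soup = BeautifulSoup(markup, "html.parser")
    divs = soup.find("div", {"class": "col-md-12"}).find_all("div")
    # divs[0] is the title block, divs[1] the Realização line, divs[2] the age.
    realizacao = divs[1].text.split(maxsplit=1)[1]
    # The second whitespace-separated term is the number; int() is an addition here.
    idade = int(divs[2].text.split()[1])
    return realizacao, idade


realizacao, idade = extract_fields(html)
print(realizacao)  # Tecnisa,Cyrela Brazil Realty
print(idade)       # 9
```

Inside the glob loop from the question, the same helper could be called on the contents of each file. Because it relies on position, it breaks if a file has the <div> elements in a different order.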
    
03.11.2016 / 14:37

The command below returns every text node containing the text "idade" (case-insensitive). You can use .parent on each result to get to the enclosing tags.

result = soup.find_all(string=re.compile("idade", re.I))
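
A sketch of how .parent could then be used to reach the number, assuming the structure shown in the question (the label sits inside <b>, which sits inside the <div> holding the value). The variable names and the \d+ pattern are illustrative choices.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (an assumption for this demo).
html = """
<div class="col-md-12">
    <div>
    <b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
    <div>
       <b>Idade:</b> 9 anos</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

result = soup.find_all(string=re.compile("idade", re.I))

ages = []
for label in result:
    # label is the text node "Idade:"; its parent is <b>, and the parent
    # of <b> is the <div> that also holds " 9 anos".
    div = label.parent.parent
    match = re.search(r"\d+", div.get_text())
    if match:
        ages.append(int(match.group()))

print(ages)  # [9]
```

Unlike indexing by position, this does not depend on where the <div> sits, only on the label text being present.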
    
19.06.2018 / 21:36