I'm trying to extract the age value from the markup below, and I need to do it for several HTML files that share the same structure.
<div class="col-md-12">
<div class="section-title line-style">
<h3 class="title">Acervo Alto de Pinheiros</h3>
</div>
<div>
<b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
<div>
<b>Idade:</b> 9 anos</div>
I'm using a regular expression with find() to extract the values:
import glob
import os
import re
from bs4 import BeautifulSoup

dir_path = "/home/user/pasta/htmls"
for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    # Parse each HTML file in the folder
    with open(file_name, "r") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # These are the two lookups that come back as None
    realizacao = soup.find("div", {"class": "col-md-12"}).find(text=re.compile("<div><b>Realização:</b>(.*?)</div>"))
    print(realizacao)
    idade = soup.find(text=re.compile("<div><b>Idade:</b>(.*?)Anos</div>"))
    print(idade)
But it returns None for both the realizacao and idade variables. What might be wrong in my code? I'm new to programming, and Python is the first language I'm learning.
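One likely cause, as far as I can tell: the text argument of find() is matched against individual text nodes (for example " 9 anos"), never against the raw markup, so a pattern that contains tags such as <div><b> cannot match. The pattern also looks for "Anos" while the page has "anos". Below is a minimal sketch of one alternative, assuming each label sits in a <b> tag that is immediately followed by its value, as in the sample. The helper name field_after_label and the inline html string are hypothetical, used only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question (closing </div> added for completeness).
html = """
<div class="col-md-12">
<div class="section-title line-style">
<h3 class="title">Acervo Alto de Pinheiros</h3>
</div>
<div>
<b>Realização:</b> Tecnisa,Cyrela Brazil Realty</div>
<div>
<b>Idade:</b> 9 anos</div>
</div>
"""


def field_after_label(soup, label):
    """Return the text right after <b>label</b>, or None if it is absent.

    Hypothetical helper: assumes the value is the text node that
    immediately follows the <b> tag.
    """
    b_tag = soup.find("b", string=label)
    if b_tag is None or b_tag.next_sibling is None:
        return None
    return str(b_tag.next_sibling).strip()


soup = BeautifulSoup(html, "html.parser")

realizacao = field_after_label(soup, "Realização:")

# Pull the number out of the text after the label, e.g. "9 anos".
idade_text = field_after_label(soup, "Idade:")
match = re.search(r"\d+", idade_text or "")
idade = int(match.group()) if match else None

print(realizacao)
print(idade)
```

Inside the glob loop, the same helper could be called on the soup built from each file instead of the inline html string.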