How do I find an "li" tag containing the text "dormitórios" in a list of tags in Python 3?

I'm learning to program, and Python is my first language. I'm having trouble extracting this tag:

<li> 4 dormitórios

from the HTML below:

<div class="crop">
<ul style="width: 500px;">
<li><strong>587 m²</strong> de área útil</li>
<li><strong>1089 m²</strong> Total</li>
<li>
<strong>4</strong>
           dormitórios                                            </li>
<li>
<strong>4</strong>
       suítes                                            </li>
<li>
<strong>8</strong>
        vagas                                            </li>
</ul>
</div>

I'm using find with a regex, in the expression below:

bsObj.find("div",{"class":"crop"}).find("ul",li=re.compile("^\d*[0-9](dormitórios)*$"))

But it returns None. What's wrong with the code?

    
asked by anonymous 30.08.2016 / 18:02

1 answer

The <strong> tag inside the <li> gets in the way of the search as you wrote it. You can approach the problem this way instead:

from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('li')  # every <li> in the document
# keep the <li> elements whose text starts with '4' and ends with 'dormitórios'
dorms = [i for i in data if i.text.startswith('4') and i.text.endswith('dormitórios')]
print(dorms)

This prints a list of the <li> elements whose text starts with "4" and ends with "dormitórios".
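As for why the call in the question returns None: keyword arguments such as li=... passed to find are treated as attribute filters, so that call looks for a <ul> that has an attribute named "li". The sketch below shows this and one possible alternative that works on the HTML exactly as posted, line breaks and padding included. The helper name is_dorm_item is hypothetical, and the pattern assumes the label is always spelled "dormitórios".

```python
import re
from bs4 import BeautifulSoup

# HTML as posted in the question, including the line breaks and padding inside <li>.
html = """
<div class="crop">
<ul style="width: 500px;">
<li><strong>587 m²</strong> de área útil</li>
<li><strong>1089 m²</strong> Total</li>
<li>
<strong>4</strong>
           dormitórios                                            </li>
<li>
<strong>4</strong>
       suítes                                            </li>
<li>
<strong>8</strong>
        vagas                                            </li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
crop = soup.find("div", {"class": "crop"})

# li=... is read as "a <ul> whose attribute named li matches", so nothing is found.
no_match = crop.find("ul", li=re.compile(r"dormitórios"))

def is_dorm_item(tag):
    """Hypothetical helper: True for an <li> whose text reads '<digits> dormitórios'."""
    if tag.name != "li":
        return False
    # Strip each text piece and join the pieces with a single space.
    text = tag.get_text(" ", strip=True)
    return re.fullmatch(r"\d+ dormitórios", text) is not None

# find accepts a function and calls it with each tag it visits.
dorm_li = crop.find(is_dorm_item)

print(no_match)
print(dorm_li.get_text(" ", strip=True))
```

Filtering with a function avoids depending on how the whitespace around <strong> happens to be laid out.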

If you want only the text inside the <li>, without the inner tags, you can do this:

from bs4 import BeautifulSoup

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('li')  # every <li> in the document
# .text joins the text of nested tags, so <strong>4</strong>dormitórios becomes '4dormitórios'
dorms = [i.text for i in data if i.text == '4dormitórios']
print(dorms)
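Note that the exact comparison with '4dormitórios' relies on the compacted HTML string used here. With the markup as posted in the question, .text keeps the line breaks and padding. A minimal sketch of the difference, assuming you want the "4 dormitórios" form mentioned in the question:

```python
from bs4 import BeautifulSoup

# One <li> shaped like the one in the question, with line breaks and padding kept.
html = "<ul><li>\n<strong>4</strong>\n           dormitórios        </li></ul>"

li = BeautifulSoup(html, "html.parser").find("li")

raw_text = li.text                         # keeps the newlines and padding
clean_text = li.get_text(" ", strip=True)  # strips each piece, joins with one space

print(repr(raw_text))
print(clean_text)
```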

To get only the number of bedrooms, you can use a regex by itself:

import re

html = '<div class="crop"><ul style="width: 500px;"><li><strong>587 m²</strong> de área útil</li><li><strong>1089 m²</strong>Total</li><li><strong>4</strong>dormitórios</li><li><strong>4</strong>suítes</li><li><strong>8</strong>vagas</li></ul></div>'

# capture whatever sits inside <strong> right before 'dormitório'
dorms = re.findall(r'Total</li><li><strong>(.*?)</strong>dormitório', html)
print(dorms) # ['4']
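The pattern above depends on "Total</li>" coming immediately before the bedrooms item and on there being no whitespace between the tags. A slightly more tolerant variant, offered only as a sketch, lets \s* absorb the line breaks found in the question's HTML and converts the captured digits to int. It still assumes the number is wrapped in a bare <strong> directly inside a bare <li>.

```python
import re

# Fragment of the HTML from the question, with its original line breaks.
html = """<li><strong>1089 m²</strong> Total</li>
<li>
<strong>4</strong>
           dormitórios                                            </li>
<li>
<strong>4</strong>
       suítes                                            </li>"""

# \s* matches any run of spaces and newlines between the tags and the label.
pattern = re.compile(r"<li>\s*<strong>(\d+)</strong>\s*dormitórios")

dorms = [int(n) for n in pattern.findall(html)]
print(dorms)
```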
    
30.08.2016 / 18:18