Difficulty removing a child element, Python


Good morning, friends. I'm having trouble removing a child element. I wrote code that collects the prices of all products on a website (a single page listing the products, not one page per product). That part works without problems. However, when a product goes on sale, the site shows two prices, the old one and the new (discounted) one, and my code pulls both. I am not interested in the old price, so I want to ignore it when pulling the data, but I cannot get that to work. An example of the page source:

<div class="result-actions">
  <span>
    $ 1,98
  </span>
</div>
<div class="result-actions">
  <span>
    <small class="price-before">
      $ 56,70
    </small>
    <span class="price-now">
      $ 39,60
    </span>
  </span>
</div>

Each "result-actions" div represents a product. It was suggested that I pull only "price-now", but then the first product in the example would not be picked up, since it is not on sale and therefore does not contain that class. Here is my code, which tries to delete the child element without success:

with open('Lista.csv') as example_file:
  example_reader = csv.reader(example_file)
  for row in example_reader:
      driver.get(row[0])
      html = driver.page_source
      bs = BeautifulSoup(html, 'html.parser')
      precosLista = bs.findAll('div',{'class':'result-actions'})
      f = open(acha_proximo_nome('Arquivo.csv'), 'wt+', newline='')
      writer = csv.writer(f)

      try:
          for precos in precosLista:
              print(precos.get_text())
              csvPreco = []
              csvPreco.append(clean_up_text(precos.get_text()))
              js = "var aa = document.getElementsByClassName('price-before')[0];aa.parentNode.removeChild(aa)"
              driver.execute_script(js)
              writer.writerow(csvPreco)

      finally:
          f.close()

Without these two lines:

js = "var aa = document.getElementsByClassName('price-before')[0];aa.parentNode.removeChild(aa)"
driver.execute_script(js)

my code works fine but, as I said, it collects everything, including what I do not want. Does anyone have an idea how I can fix this?

    
asked by anonymous 07.11.2018 / 12:41

1 answer


Since you are using BeautifulSoup, you can use the replace_with method that every node has. It replaces the node in the parsed tree with another tag or string. In the example below, I replace each old-price node with an empty string:

import bs4

html = '''<div class="result-actions">
<span>
  $ 1,98
</span>
</div>
<div class="result-actions">
<span>
  <small class="price-before">
    $ 56,70
  </small>
  <span class="price-now">
    $ 39,60
  </span>
</span>
</div>'''

# name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = bs4.BeautifulSoup(html, 'html.parser')
prices = soup.find_all('div', {'class':'result-actions'})

for price in prices:
    # remove the old price
    smalls = price.find_all('small')
    for small in smalls:
        small.replace_with('')

    value = price.find_all('span')[0].text.strip()
    print(value)

For this HTML, the code prints only the current price of each product:

> $ 1,98
> $ 39,60
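
A variation on the same idea, as a minimal sketch: BeautifulSoup's decompose() removes a node from the parsed tree outright, which may read more directly than replacing it with an empty string. The function name extract_current_prices and the variable sample_html are hypothetical, introduced only for illustration. The sketch also assumes the old price always carries the "price-before" class shown in the question. Because BeautifulSoup works on the string already captured from driver.page_source, removing the node in the browser afterwards with execute_script does not change what was parsed, so the removal is done on the parsed tree instead.

```python
from bs4 import BeautifulSoup


def extract_current_prices(html):
    """Return one price string per product, ignoring any old (pre-discount) price."""
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    for product in soup.find_all("div", class_="result-actions"):
        # Drop every old-price node from the parsed tree (nothing happens if there is none).
        for old_price in product.find_all(class_="price-before"):
            old_price.decompose()
        # What remains is the current price; collapse the whitespace around it.
        prices.append(" ".join(product.get_text().split()))
    return prices


# Hypothetical sample mirroring the HTML from the question.
sample_html = """
<div class="result-actions">
  <span>
    $ 1,98
  </span>
</div>
<div class="result-actions">
  <span>
    <small class="price-before">
      $ 56,70
    </small>
    <span class="price-now">
      $ 39,60
    </span>
  </span>
</div>
"""

print(extract_current_prices(sample_html))
```

In the loop from the question, something like extract_current_prices(driver.page_source) could stand in for the findAll call and the execute_script step, with each returned string written as one CSV row.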
    
07.11.2018 / 18:15