I want to scrape a page, but I can not get a text that has "& nbsp" in it

-1

Well, the title says it all, I want to get the price of a product from link , which looks like this: / p>

<span class="nm-price-value" itemprop="price">R$&nbsp;367,60</span>

Using this code:

import bs4
import requests

def get_page(url):
    r = requests.get(url)

    try:
        r.raise_for_status()
        page = bs4.BeautifulSoup(r.text, 'lxml')
        return page

    except Exception as e:
        return False

def pontofrio(search):
    soup = get_page(f'https://search3.pontofrio.com.br/busca?q={search}')

    price = [i.getText() for i in soup.select('.nm-price-value')]
    product = [i.getText().upper() for i in soup.select('.nm-product-name a')]
    link = [i.get('href') for i in soup.select('.nm-product-name a')]

    return product, price, link

Since the product and link lists return normally, but the price list is returning:

['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']

Please help!

    
asked by anonymous 21.10.2018 / 17:13

1 answer

2

Your problem is that the price does not really exist on the page! Cold website pages, like many others on the internet, are made available without the price , and only then is that price placed on the page through javascript code that runs on your browser after the upload.

So by inspecting the page code with your browser, javascript will already have executed and completed the elements dynamically, so you'll find the prices there. Since BeautifulSoup does not execute javascript, the page that it parseed prices do not exist.

This is very common in today's web pages, which are very dynamic. Leaving you with two options:

  • Analyze the javascript code of the page and find out where it takes the price. In this case it seems to come from url https://preco.api-pontofrio.com.br/V1/Produtos/PrecoVenda/ . You can go reading and following the javascript code until you find a way to mimic what it does to recover those prices, what parameters to pass, etc. It is not an easy task but the code would be quite optimized because it would be python code without having to open a real browser, which would be the second option:

  • Use a real browser, which runs javascript. The Selenium library allows you to open and control a real browser window through your script. As the page will open in a browser, javascript will work and you can get these prices. The downside is that opening a browser is cumbersome and slow, as well as loading many unnecessary elements and images into the process, so it would not be as efficient as directly accessing the source.

  • 22.10.2018 / 02:57