Scraping using Selenium and Beautifulsoup


I'm trying to scrape a blog of books; I need to get the titles and categories of every book posted. On the first run I got an AttributeError, which is bound to happen repeatedly because the site is poorly built and the elements I'm after don't always use the same markup. To deal with this I added an except clause so the loop keeps going. Here's my code so far:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
from bs4 import BeautifulSoup
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--test-type')
options.binary_location = '/usr/bin/chromium'
# pass the driver path and options to a single Chrome instance;
# the original code created an extra, unused driver first, and
# chrome_options= is deprecated in favor of options=
driver = webdriver.Chrome(executable_path='/home/porco/Downloads/chromedriver',
                          options=options)

url = 'http://amoraosromances.blogspot.com/'
driver.get(url)

A = []
B = []

while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')

    for div in soup.findAll('div', class_='post hentry'):
        # the try/except must wrap each post, not the whole page:
        # catching AttributeError around the for loop and calling
        # `continue` restarted the while loop without ever clicking
        # Next, so the same page was scraped forever (which is why
        # the URL stopped changing)
        try:
            titulo = div.find('h3', class_='post-title entry-title').text
            temas = div.find('span', class_='post-labels').text
        except AttributeError:
            # malformed post: skip it, keeping A and B the same length
            continue
        A.append(titulo.strip().title())
        B.append(temas.strip().replace('\n', ' ').replace('Marcadores:', '').title())
        print(titulo.strip())

    print('...running...')
    try:
        nextButton = driver.find_element_by_xpath('//*[@id="Blog1_blog-pager-older-link"]')
        nextButton.click()
        time.sleep(2)  # give the next page time to load
    except NoSuchElementException:
        # no "older posts" link: we reached the last page
        break

print('...building .csv and json...')
df = pd.DataFrame(A, columns=['Título'])
df['Tema'] = B

df.to_csv('autor-tema2.csv')
df.to_json('autor-tema2.json', orient='records')

driver.quit()

I have two doubts. The first relates to the except I mentioned above: I'm not sure it's working, because the script has been running for a while and the URL has stopped changing. Is there a way to add some output so I can see what stage the process is at? I added some print() calls, but they weren't very helpful.

The other question is whether I can stop the browser from opening new windows/tabs. The site is full of advertising, and I'm afraid my computer will halt mid-process if those windows open; there are about 3,057 posts. If they can't be prevented, a way to close them as soon as they open would also work.
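On the pop-up question: Selenium can't stop the site from opening new windows, but you can close any stray tabs after each click and refocus the main one. Below is a minimal sketch using the standard `driver.window_handles`, `driver.switch_to.window()` and `driver.close()` calls; the helper names (`handles_to_close`, `close_extra_windows`) are mine, not a Selenium API, and it assumes you save the main window's handle right after `driver.get(url)`.

```python
def handles_to_close(all_handles, main_handle):
    """Every window handle except the main one."""
    return [h for h in all_handles if h != main_handle]

def close_extra_windows(driver, main_handle):
    """Close stray tabs/pop-ups and refocus the main window.

    Call this after each nextButton.click(); `driver` is the
    webdriver instance from the question.
    """
    for handle in handles_to_close(driver.window_handles, main_handle):
        driver.switch_to.window(handle)
        driver.close()
    driver.switch_to.window(main_handle)

# in the script, right after driver.get(url):
# main_handle = driver.current_window_handle
```

You could also pass `--disable-popup-blocking`'s opposite via Chrome preferences, but closing extras after each click is simpler and works even for windows opened by JavaScript.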

Hugs.

asked by anonymous 27.06.2018 / 11:30

1 answer


For output you can use the logging module; it is better than print for this case.

import logging
logger = logging.getLogger(__name__)
logger.warning('Put your log text here')  # self.logger only works inside a class
29.06.2018 / 21:05