Scrapy xpath href or span within div

Hello, I'm trying to write a scraper that has to extract a link and some text, but I'm having a hard time because the page markup varies. There are 3 possible variations:

1.

<div>
<strong>
    <span style="font-family: arial, helvetica, sans-serif;">
        <a href="www...com.br" target="_blank">Edição</a>&nbsp;-&nbsp;
    </span>
</strong>
<span style="font-family: arial, helvetica, sans-serif;">01/12/2017
</span>
</div>

2.

<div>
<span style="font-family: arial, helvetica, sans-serif;">
    <a href="www...com.br">
        <strong>Edição</strong>
    </a>&nbsp;- 04/12/2017
</span>
</div>

3.

<div>
    <a href="www...com.br">
        <strong>Edição</strong>
    </a>&nbsp;- 05/12/2017
</div>

I need to get the URL in the href attribute and the date. I can get the link with:

response.xpath('//a[contains(@href,"www...com.br")]')

However, I can't get the date. I'm looking for a solution that gets both the link and the date across all of these markup variations.

Thanks in advance for your help.

    
asked by anonymous 06.05.2018 / 00:04

2 answers

One option is to use BeautifulSoup, which I find much simpler. The following spider prints the href of every link on the page:

from bs4 import BeautifulSoup
import scrapy

class MgUberlandia(scrapy.Spider):
    name = 'mg_uberlandia'
    start_urls = ['http://www.uberlandia.mg.gov.br/?pagina=Conteudo&id=3077']

    def parse(self, response):
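        # response.text is the decoded body (body_as_unicode() is deprecated);
        # naming the parser explicitly avoids BeautifulSoup's parser warning.
        soup = BeautifulSoup(response.text, 'html.parser')

        # Print the href of every <a> element on the page.
        for link in soup.find_all('a'):
            print(link.get('href'))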
    
08.05.2018 / 21:25

Based on your example, we can see that there are two patterns:

Dates inside a span (cases 1 and 2):

response.xpath('//div/span/text()').extract()

Output:

['01/12/2017\n        ', '\n            ', '\xa0- 04/12/2017\n        ']

Dates placed directly inside the div (case 3):

response.xpath('//div/text()').extract()

Output:

['\n        ', '\n        ', '\n    ', '\n        ', '\n    ', '\n        ', '\xa0- 05/12/2017\n    ']

One strategy to solve the problem would be:

1) Check if the first option is found;

2) If it is not, try the second.

In both cases you will have to clean up the data: strip the \n, perhaps use a regex to match the DD/MM/YYYY pattern, and so on.
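The fallback and the cleanup step could be sketched roughly as below. This is only an illustration using the standard library: the helper names find_dates and extract_dates are made up for this example, and the sample lists are copied from the outputs shown above rather than read from the live page. In a spider you would pass in the results of the two .extract() calls, ideally evaluated relative to each div so that each link is paired with its own date.

```python
import re

# Matches dates written as DD/MM/YYYY, e.g. 04/12/2017.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def find_dates(text_nodes):
    """Return every DD/MM/YYYY date found in a list of extracted strings."""
    dates = []
    for node in text_nodes:
        # Whitespace-only nodes and the leading '\xa0- ' simply do not match.
        dates.extend(DATE_RE.findall(node))
    return dates


def extract_dates(span_texts, div_texts):
    """Try the span results first; fall back to the loose div text."""
    return find_dates(span_texts) or find_dates(div_texts)


# Sample values copied from the two outputs shown above.
span_texts = ['01/12/2017\n        ', '\n            ', '\xa0- 04/12/2017\n        ']
div_texts = ['\n        ', '\n    ', '\xa0- 05/12/2017\n    ']

print(extract_dates(span_texts, div_texts))  # ['01/12/2017', '04/12/2017']
print(extract_dates([], div_texts))          # ['05/12/2017']
```

Because the regex picks out only the date itself, there is no need to strip the \n or the non-breaking space separately.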

I reached these conclusions by testing against an HTML page containing only the examples you pasted here, so the paths may differ on the real page.

    
09.05.2018 / 17:37