Extracting information spread across two pages with Scrapy


I'm not a Python programmer, but I'm trying to work with the Scrapy framework.

The example above is what I need; this currently runs in a Chrome extension.

To explain: I need the post and all the information available for it. Part of the information lives on the category pages (the short description, among other fields), and part lives on the post page itself (the long description). They are different pieces of information about the same post.

My question is about the process: in the first loop I have posts that still need information from a second request, whose parse/extract step would provide the remaining data.

It would end up like this:

 Post.short_desc = ['xxxx']   # from the 1st loop

 Post.long_desc = ['xxx']     # returned by the 2nd loop

How do I do this?

Now it gets a bit more complicated, because inside the second loop I also need to add the Category and Tag URLs to the queue to be processed.

Fila.lista -> Add -> Url

How do I do this?

I don't know how to accomplish this; I'd appreciate any help. Thanks.

    
asked by anonymous 14.05.2016 / 00:39

1 answer


The traditional way of extracting data spread across multiple pages is to use the request meta mechanism to pass data from one request to the next.

It works like this: in the callback that extracts the contents of the first page, you build a dict with the initial data:

def parse_pagina_de_listagem(self, response):
    inicial = dict(
        short_desc=response.css('...').extract(),
        ...
    )
    # grab the URL of the page with the rest of the data
    url = response.css('...').extract_first()

    # build a request, passing the data along via the meta parameter
    request = scrapy.Request(response.urljoin(url), callback=self.parse_restante)
    request.meta['item'] = inicial
    yield request
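
An equivalent way, if you prefer, is to attach the dict when the request is created, by passing it through the Request constructor's meta argument instead of setting it afterwards (a minimal sketch; the '...' selectors are still placeholders):

    # equivalent: attach the item when building the request
    yield scrapy.Request(response.urljoin(url),
                         callback=self.parse_restante,
                         meta={'item': inicial})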

Scrapy will send the request asynchronously and will make the value available again in response.meta when the response arrives.

This way you can retrieve the initial item in the parse_restante callback, and also schedule requests for other pages from within it:

def parse_restante(self, response):
    # retrieve the item from meta
    inicial = response.meta['item']

    # extract the rest of the post's data
    yield dict(
        inicial,
        long_desc=response.css('...').extract_first(),
        ...
    )

    # follow links to other pages, if necessary
    for link in response.css('...').extract():
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_pagina_de_listagem)
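
Putting it all together, a minimal spider skeleton could look like the sketch below. Note that the spider name, the start URL and the CSS selectors ('.short-desc', 'a.post', 'a.categoria') are hypothetical placeholders that you would replace with the real ones from your site:

import scrapy


class PostsSpider(scrapy.Spider):
    # hypothetical spider; name, URLs and selectors are placeholders
    name = 'posts'

    def start_requests(self):
        yield scrapy.Request('http://example.com/categorias',
                             callback=self.parse_pagina_de_listagem)

    def parse_pagina_de_listagem(self, response):
        # first page: collect the data available on the listing/category page
        inicial = dict(
            short_desc=response.css('.short-desc::text').extract(),
        )
        url = response.css('a.post::attr(href)').extract_first()
        if url:
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_restante,
                                 meta={'item': inicial})

    def parse_restante(self, response):
        # second page: complete the item started in the previous callback
        inicial = response.meta['item']
        yield dict(
            inicial,
            long_desc=response.css('.long-desc::text').extract_first(),
        )
        # queue the category/tag pages so they go through the same cycle
        for link in response.css('a.categoria::attr(href)').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_pagina_de_listagem)

Saved as, say, posts_spider.py, it could be run with scrapy runspider posts_spider.py -o posts.json to collect the items into a JSON file.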


14.05.2016 / 01:18