Difficulty with web scraping

4
<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>

Using the BeautifulSoup library I can read the page fragment shown above.

I'm having trouble using Python to read the protocol number '430489' from the HTML above. This number will be used to download a PDF. I want to create a function that takes this number as an argument and automatically downloads the PDF to my Mac.

    

asked by anonymous 26.06.2014 / 05:37

3 answers

3

You can do this in several ways. As mgibsonbr mentioned, you can extract a piece of a string by manipulating the string directly; regular expressions are also commonly used for this purpose.

Assume the variable conteudo stores the following HTML:

conteudo = '''
<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>
'''

String manipulation

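# bs4 is the current BeautifulSoup package (BeautifulSoup 4)
from bs4 import BeautifulSoup

def getProtocol(html):
   # Parse the html argument (not the global variable conteudo)
   soup = BeautifulSoup(html, 'html.parser')
   # Everything after 'AbreArquivo', i.e. ('430489');
   href = soup.a['href'].partition('AbreArquivo')[2]

   # Keep only the digit characters, then join them before converting
   numero = [i for i in href if i.isdigit()]
   return int(''.join(numero))

protocolo = getProtocol(conteudo)
# Make the PDF request here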

Above we used the partition method to split the string at the first occurrence of the separator (in this case AbreArquivo). It returns three parts: the text before the separator, the separator itself, and the text after it.

The string that we want to split looks like this: Javascript:AbreArquivo('430489'); . Taking partition('AbreArquivo')[2], the part after the separator, gives: ('430489');

A list called numero is then built that contains only the digits: we go through the string character by character, and each character that is a digit is added to the list.
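To see each step in isolation, here is a minimal sketch that runs the same operations on the href string directly, without BeautifulSoup. The variable names antes, separador, depois and digitos are illustrative only.

```python
href = "Javascript:AbreArquivo('430489');"

# partition returns a 3-tuple: (text before, separator, text after)
antes, separador, depois = href.partition('AbreArquivo')
print(antes)      # Javascript:
print(separador)  # AbreArquivo
print(depois)     # ('430489');

# Keep only the digit characters, then join them into a single string
digitos = [c for c in depois if c.isdigit()]
numero = int(''.join(digitos))
print(numero)     # 430489

# If the separator is missing, the last two items are empty strings
print('sem separador'.partition('AbreArquivo'))  # ('sem separador', '', '')
```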

Regular Expressions

To extract a number you can use the expression \d+ or [0-9]+, which matches a run of one or more digits.

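# bs4 is the current BeautifulSoup package (BeautifulSoup 4)
from bs4 import BeautifulSoup
import re

def getProtocol(html):
   soup = BeautifulSoup(html, 'html.parser')
   href = soup.a['href']

   # findall returns every run of digits; the first one is the protocol
   numero = re.findall(r'\d+', href)[0]
   return int(numero)

protocolo = getProtocol(conteudo)
# Make the PDF request here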
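The placeholder comment at the end of the snippet leaves the download step open. A minimal sketch of it, using only the standard library, could look like the following. The question does not say which address the site's AbreArquivo function opens, so the URL template is an assumption that the caller has to supply; baixar_pdf, url_modelo, destino and abrir are hypothetical names.

```python
import shutil
import urllib.request

def baixar_pdf(protocolo, url_modelo, destino, abrir=urllib.request.urlopen):
    """Download the document for `protocolo` and save it to `destino`.

    url_modelo is a HYPOTHETICAL template containing '{protocolo}'; the real
    address must be taken from the site's AbreArquivo JavaScript function.
    abrir is the function used to open the URL (replaceable for testing).
    """
    url = url_modelo.format(protocolo=protocolo)
    # Stream the response to disk instead of loading it all into memory
    with abrir(url) as resposta, open(destino, 'wb') as arquivo:
        shutil.copyfileobj(resposta, arquivo)
    return destino
```

Calling it would then be a matter of passing the extracted number, the real URL template once you have found it, and a path on your machine for the saved file.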

Note that if the content you are processing comes in a different format, you will probably have to adapt the string handling or the regular expression accordingly.
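One way to make a format change easier to notice, sketched here under the assumption that the number always appears as the quoted argument of AbreArquivo, is to anchor the expression to that call rather than matching any run of digits. The names PADRAO and extrair_protocolos are illustrative.

```python
import re

# Match only digits that appear as the quoted argument of AbreArquivo(...)
PADRAO = re.compile(r"AbreArquivo\('(\d+)'\)")

def extrair_protocolos(texto):
    """Return every protocol number found in texto, as a list of ints."""
    return [int(n) for n in PADRAO.findall(texto)]

print(extrair_protocolos("Javascript:AbreArquivo('430489');"))  # [430489]
print(extrair_protocolos("pagina.asp?ano=2014"))                # []
```

With a bare \d+ the second call would have returned 2014; here it returns an empty list, which tells you the input did not have the expected shape. Because findall collects every match, the same function also works on a whole page containing several such links.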

    
26.06.2014 / 15:51
2

I'm assuming you've already gotten a reference to the <a> element you want, and that you can also extract the contents of its href attribute (if I'm wrong about these assumptions, add more details to the question). The problem then comes down to extracting the number 430489 from the string Javascript:AbreArquivo('430489'); , right?

There is no general solution for this, since in principle href could contain any valid JavaScript. However, if you know that your HTML will always come in this format, a simple substring operation is enough to extract the desired part:

href = soup.tr.a['href']
arq_str = href[len("Javascript:AbreArquivo('") : -len("');")]
arq_int = int(arq_str)

If you are not familiar with the substring (sublist) operation, x[inicio:fim] creates a new string / list starting at position inicio and ending just before fim . If fim is negative, it counts from the end of the string (i.e. the end position is len(x) + fim ).

Setting inicio = len(prefixo) and fim = -len(sufixo) ensures that only the "middle" is selected, without relying on "magic numbers". Then just convert it to a number, if applicable.
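As a small self-contained sketch of the slicing described above, the snippet below works on the href string directly and adds a format check, since slicing alone would silently return the wrong characters if the prefix or suffix ever changed. The function name extrair and the check itself are additions for illustration.

```python
href = "Javascript:AbreArquivo('430489');"
prefixo = "Javascript:AbreArquivo('"
sufixo = "');"

# A negative end index counts from the end: x[a:-b] is x[a:len(x) - b]
meio = href[len(prefixo):-len(sufixo)]
assert meio == href[len(prefixo):len(href) - len(sufixo)]

def extrair(href):
    # Refuse input that does not have the expected shape instead of
    # silently slicing the wrong characters
    if not (href.startswith(prefixo) and href.endswith(sufixo)):
        raise ValueError('unexpected href format: %r' % href)
    return int(href[len(prefixo):-len(sufixo)])

print(extrair(href))  # 430489
```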

    
26.06.2014 / 06:03
1

Using only BeautifulSoup

from bs4 import BeautifulSoup

conteudo = '''
<tr bgcolor="FFF8DC">
    <td valign="top">25/06/2014 20:37</td>
    <td valign="top">25/06/2014</td>
    <td>
        <a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
        <br>
        Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
        ações de emissão da Companhia em circulação no mercado
    </td>
</tr>
'''
soup = BeautifulSoup(conteudo, 'html.parser')
# Select the first link whose href contains "AbreArquivo",
# then take the text between the single quotes
print(soup.select('a[href*="AbreArquivo"]')[0]['href'].split("'")[1])

# 430489
    
14.03.2018 / 05:33