You can do this in several ways. As mgibsonbr mentioned, regular expressions are commonly used to extract a piece of a string; another option is to manipulate the string directly.
Assume the variable conteudo stores the HTML, with the following content:
conteudo = '''
<tr bgcolor="FFF8DC">
<td valign="top">25/06/2014 20:37</td>
<td valign="top">25/06/2014</td>
<td>
<a href="Javascript:AbreArquivo('430489');">BROOKFIELD INCORPORAÇÕES S.A.</a>
<br>
Disponibilização do Laudo de Avaliação da pretendida oferta pública para a aquisição das
ações de emissão da Companhia em circulação no mercado
</td>
</tr>
'''
String manipulation
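from BeautifulSoup import BeautifulSoup

def getProtocol(html):
    soup = BeautifulSoup(html)
    # Keep only what follows 'AbreArquivo', e.g. ('430489');
    href = unicode(soup.a['href']).partition('AbreArquivo')[2]
    # Keep only the numeric characters, one per list element
    numero = [i for i in href if i.isnumeric()]
    # Join the digits back together and convert them to an integer
    return int(''.join(numero))

protocolo = getProtocol(conteudo)
# Make the PDF request here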
Above we used the partition method to split the string at the first occurrence of the separator (in this case, AbreArquivo).
The string we want to split looks like this: Javascript:AbreArquivo('430489');
Using partition('AbreArquivo')[2] results in: ('430489');
A list called numero is then created that contains only the numbers: we go through the string character by character and check whether each one is a number; if it is, it is added to the list.
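The two steps above can be tried on the href value alone, without BeautifulSoup. This is a minimal sketch: the helper name extract_digits and its separator parameter are hypothetical and not part of the original answer, and returning None when nothing is found is an assumption about how you may want to handle that case.

```python
# Hypothetical helper (not from the original answer): repeats the
# partition + digit-filter steps on a plain string.
def extract_digits(href, separator='AbreArquivo'):
    # partition returns a 3-tuple: (text before, separator, text after)
    before, found, after = href.partition(separator)
    if not found:
        # Separator missing: nothing to extract
        return None
    # Keep only the digit characters of the remainder, e.g. ('430489');
    digits = [ch for ch in after if ch.isdigit()]
    if not digits:
        return None
    return int(''.join(digits))


href = "Javascript:AbreArquivo('430489');"
parts = href.partition('AbreArquivo')
protocolo = extract_digits(href)
```

Here parts holds the three pieces produced by partition, and index [2] is the part after the separator that the answer works with.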
Regular Expressions
To extract a number, you can use the expression \d+ or [0-9]+, which captures one or more digits.
from BeautifulSoup import BeautifulSoup
import re

def getProtocol(html):
    soup = BeautifulSoup(html)
    href = soup.a['href']
    # findall returns every run of digits; the first one is the protocol
    numero = re.findall(r'\d+', href)[0]
    return int(numero)

protocolo = getProtocol(conteudo)
# Make the PDF request here
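As a quick check, both expressions can be tried directly on the href value from the example. For plain ASCII digits like these they give the same result. The variable names below are only illustrative.

```python
import re

href = "Javascript:AbreArquivo('430489');"

# findall returns every non-overlapping match as a list of strings
with_d = re.findall(r'\d+', href)
with_class = re.findall(r'[0-9]+', href)

# The matches are strings, so the first one still needs int()
numero = int(with_d[0])
```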
Note that if the content you are handling comes in a different format, you will probably have to adapt the string manipulation or the regular expression.
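For instance, if the href could contain other numbers, one possible adaptation is to anchor the expression to the AbreArquivo call and to handle the case where nothing matches. This is only a sketch under that assumption: the pattern, the name PROTOCOL_RE and the function extract_protocol are illustrative and not part of the original answer.

```python
import re

# Illustrative pattern (an assumption): the digits inside AbreArquivo('...')
PROTOCOL_RE = re.compile(r"AbreArquivo\('(\d+)'\)")


def extract_protocol(href):
    """Return the protocol number as an int, or None if href does not match."""
    match = PROTOCOL_RE.search(href)
    if match is None:
        return None
    # group(1) is the run of digits captured between the quotes
    return int(match.group(1))
```

Returning None instead of raising lets the caller skip rows whose link has a different shape, which re.findall(...)[0] would turn into an IndexError.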