Increment URL through concatenation and using urllib2.urlop

Question

Increment URL through concatenation and using urllib2.urlop

Navigation

#1 by (0 votes)

1

I'm using the code below to access and do the scrapp of emails that, at least as far as I've seen, are in three identical URLs that only vary Numconsulta_cadastro= . By running the code below, he can pick up the emails from the first page by repeating them three times.

from bs4 import BeautifulSoup
import urllib2
import re
num=0
while num < 3:
strnum = str(num)
html_page = urllib2.urlopen("http://www.fiepb.com.br/industria/pesquisa.php?page=Numconsulta_cadastro="+strnum+"&totalRows_consulta_cadastro=3372&empresa=&cidade=&atividade=&produto=&materiaprima=&classificador=RAZAOSOCIAL&dados=on&Submit=Enviar+Consulta")
num+=1
soup = BeautifulSoup(html_page)
for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
print link.get('href')

linux python-2.7

asked by anonymous 17.10.2014 / 01:31

1 answer

Transactions in SQL Server, how do they work? [closed] Sorting an Array

score 0 · Answer 1

I understand your "question" as: "how can I generalize the code below?" The version below simplifies the execution of your code using str.format (), a method of str objects that allows you to fill in a string with any value, replacing the key pairs with the number parameter entered within the pair (eg "{0 } ").

You should also avoid the while: counter structure in which a WHILE loop is executed while a variable is incremented. It would be the equivalent of using a Ferrari in a pizza delivery service: theoretically possible, but a beautiful (and expensive) waste of resources. Instead, use the "for X in range (limit):" structure, in which Python will generate all elements between 0 and limit, store the current element in X, and execute the next block of code for each possible value of X , in order.

Finally, you should also define your constants at the beginning of the code, making it easy to change them in the future. This greatly facilitates maintenance because it improves readability.

from bs4 import BeautifulSoup
import urllib2
import re
WEBSERVICE = "http://www.fiepb.com.br/industria/pesquisa.php"
QUERY_VARIAVEL = "page=Numconsulta_cadastro="
QUERY_CONSTANTE = "&totalRows_consulta_cadastro=3372&empresa=&cidade=&atividade=&produto=&materiaprima=&classificador=RAZAOSOCIAL&dados=on&Submit=Enviar+Consulta"

for email_num in range(3):
    html_page = urllib2.urlopen("{0}?{1}{2}{3}".format(WEBSERVICE,
                                                       QUERY_VARIAVEL,
                                                       email_num,
                                                       QUERY_CONSTANTE))

    soup = BeautifulSoup(html_page)

    for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
        print link.get('href')

PRO tip: have a look at the requests module:)