Scraping in Python. Assemble an insert [closed]

0

I would like to extract this table using perhaps Scraping in Python:

link

I would like to know if it is possible to extract this table so that the sql of insert is mounted as below:

INSERT INTO TABELA
(CAMPO1, CAMPO2)
VALUES
(1007, 'Contribuinte Individual - Recolhimento Mensal - NIT/PIS/PASEP')

I only have python installed. I have nothing done.

    
asked by anonymous 27.04.2017 / 22:53

1 answer

1

Starting at the beginning - we would normally use the requests and beautifulsoup modules to read the contents of a webpage - but you can also do sim. We have another problem that these recipe pages use an SSL certificate authority that is not set up in browsers or Python installations. With pure Python, we need to create an SSL context and explicitly turn off certificate verification - gets:

import urllib.request
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "https://idg.receita.fazenda.gov.br/orientacao/tributaria/pagamentos-e-parcelamentos/codigos-de-receita/codigos-de-receita-de-contribuicao-previdenciaria"
data = urllib.request.urlopen(url, context=ctx).read().decode("utf-8")

With the requests library installed, it is only:

import requests
url = "https://idg.receita.fazenda.gov.br/orientacao/tributaria/pagamentos-e-parcelamentos/codigos-de-receita/codigos-de-receita-de-contribuicao-previdenciaria"
data = requests.get(url, verify=False).text

The next step is to atone the HTML code of the page - it's easy to check that it's an HTML that, although verbose, has a single table on the page ("table" element), which is well structured with <tr> tags and <td> all closed in a well formed. Without any auxiliary libraries, Python has the html.parser.HTMLParser class that can help us. It is one of the few APIs in the standard library that requires you to make a subclass of an existing class for your use. Fortunately, we only need 3 methods: one that is called when the parser encounters a new tag (whatever), one that is called when you close a tag, and another that is called when the parser finds text content inside the tags. Check out the HTMLParser documentation .

That is - we just put some attributes of state, attributes to save the recovered data, and some "if" 's in the methods that check the start and end of the tag - and thus, when we are in the method that is called with the text content of the tags, a single "if" checks if we are inside a table cell, and if it saves the data. This breakdown ignores all the crazy tags on that page - "strong", "p", "span" within the "td" s:

from html.parser import HTMLParser

class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.table_data = []
        self.inside_table = False
        self.inside_tr = False
        self.inside_td = False
        self.tmp_row_data = []
        self.tmp_cell_data = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.inside_table = True
        elif tag == "tr":
            self.inside_tr = True
        elif tag == "td":
            self.inside_td = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.inside_table = False
        elif tag == "tr":
            self.inside_tr = False
            self.table_data.append(self.tmp_row_data)
            self.tmp_row_data = []
        elif tag == "td":
            self.inside_td = False
            self.tmp_row_data.append(" ".join(self.tmp_cell_data).strip())
            self.tmp_cell_data = []

    def handle_data(self, data):
        if self.inside_td:
            self.tmp_cell_data.append(data)

And eat this class in place, we can have a list containing lists with each row of the page table:

parser = TableParser()
parser.feed(data)
lines = parser.table_data[1:]

And now, to "mount the inserts", just use a "for" om the db api of Python - and make the calls to the bank. That changes a little from bank to bank. To just mount the inserts, as you said, we can write an insert string in the text file:

with open("codigo_receita.sql", "wt") as file_:
    for item, codigo, especificacao in lines:
          especificao = especificacao.replace("'", "''")
          file_.write(f"""INSERT INTO TABELA (CAMPO1, CAMPO2) VALUES ({codigo}, '{especificacao}');\n""")
    
29.04.2017 / 07:17