Web Scraping - convert HTML table into python dict


I'm trying to turn an HTML table into a Python dict. I ran into some problems and would appreciate your help.

This is as far as I could get:

def impl12(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    import csv

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    data = soup.find_all(id="countries")[0].get_text(separator='; ')

    #print('\ndata >>>', data, type(data))
    data = data.split('; ')

    content = csv.DictReader(data)
    for linha in content:
        print(linha)
    
asked by anonymous 01.08.2017 / 16:59

1 answer


BeautifulSoup is being used almost redundantly there - you only use it to find the start of the table, then use "brute force" to join all the elements separated by ";" and treat the result as plain text. That way you do not preserve the table structure, and it becomes hard to tell what is a table header and what is content.

Nothing will create the headers for you "by magic". The csv module has tools for extracting dictionaries from a structured text file on disk. Even if the call to get_text("; ") turned your data into well-formed CSV - which does not happen, because the line breaks a CSV file requires will not be there (except by coincidence of the HTML formatting) - you would have to pass DictReader an iterator that yields one line at a time. By splitting on ";", your iterator yields one cell at a time, so DictReader returns a dictionary with the contents of each cell, with no way of knowing what is a header and what is not.
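To illustrate the difference: DictReader expects an iterable of text lines, where the first line supplies the field names. A minimal stdlib-only sketch (the data below is made up, not scraped from the page):

```python
import csv

# DictReader consumes an iterable of text lines; the first line
# becomes the header, and each following line becomes one dict:
lines = [
    "ISO-3166-alpha2;Country;Capital",
    "BR;Brazil;Brasília",
    "DE;Germany;Berlin",
]
rows = list(csv.DictReader(lines, delimiter=";"))
print(rows[0]["Country"])  # -> Brazil

# Splitting the whole text on "; " instead yields one *cell* per
# iteration, so DictReader treats each cell as a full record:
cells = "ISO-3166-alpha2; Country; BR; Brazil".split("; ")
broken = list(csv.DictReader(cells))
print(broken[0])  # a single, meaningless key per dict
```

This is exactly why the get_text('; ').split('; ') approach cannot work: the iterable has no notion of rows.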

There is no ready-made formula for this kind of thing - every page is different, and writing parsing code that works on the first try just by "looking at the HTML" is very hard. It is best to work in Python's interactive mode: retrieve the page data with requests.get, create the soup object, and experiment with the various methods of that soup object and the structure of the page until you figure out how to get your data into the shape you want.

In this case, you would see that once you find the "countries" table, iterating over it with a for loop alternately returns a table row (including the header row) and a whitespace-only text string.
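That alternation is easy to see with a small, made-up table (this is just a sketch to show the child types, not the real geonames page):

```python
from bs4 import BeautifulSoup

# the whitespace between the <tr> tags becomes text children of <table>
html = """<table id="countries">
<tr><th>Country</th><th>Capital</th></tr>
<tr><td>Brazil</td><td>Brasília</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find(id="countries")
for child in table:
    # children alternate between NavigableString (whitespace) and <tr> Tags
    print(type(child).__name__)
```

NavigableString is a subclass of str, which is why the isinstance(row, str) check below is enough to skip the whitespace.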

Maybe it's possible to do something like this then:

def importa(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    from collections import OrderedDict

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    table = soup.find_all(id="countries")[0]

    result = []

    headers = None
    for row in table:
        # skip the whitespace strings between the rows - they are not html tags
        if isinstance(row, str):
            continue
        # assume the first row with content holds the headers
        if not headers:
            # build a list with the text content of each cell tag in the row,
            # skipping any whitespace strings between the cells as well:
            headers = [cell.get_text() for cell in row if not isinstance(cell, str)]
            continue

        row_contents = [cell.get_text() for cell in row if not isinstance(cell, str)]
        data_dict = OrderedDict(zip(headers, row_contents))
        result.append(data_dict)

    return result


from pprint import pprint
pprint(importa())

(It works here - note the use of OrderedDict so that each dictionary keeps its columns in the table's order when printed.)

    
01.08.2017 / 23:24