BeautifulSoup is being used redundantly there: you only use it to find the beginning of the table, then use "brute force" to join all the elements with ";" and treat the result as plain text. That way the table structure is not preserved, and it is hard to tell what is a table header and what is content.
Nothing will create the headers for you "by magic". The csv module has tools for extracting dictionaries from a structured text file on disk. Even if the call to get_text("; ") turned your data into a well-structured CSV file (which it does not, because the line breaks a CSV file requires will not be there, except by coincidence of the HTML formatting), you would have to pass DictReader an iterator that delivers one line at a time. By splitting on ";", your iterator delivers one cell at a time instead. So it returns a dictionary for each cell, with no way of knowing what is a heading and what is not.
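A minimal sketch of the difference, using only the standard library. The sample text is hypothetical (it is not taken from the real page) and only assumes that the original code split the flattened text on "; " before handing it to DictReader:

```python
import csv

# Hypothetical sample data standing in for the output of get_text("; ").
text = ("Code; Country; Capital; "
        "AD; Andorra; Andorra la Vella; "
        "AE; United Arab Emirates; Abu Dhabi")

# Splitting on "; " hands DictReader one cell at a time: the first cell
# becomes the only header and every other cell becomes a "row" of its own.
cells = text.split("; ")
per_cell = list(csv.DictReader(cells))

# What DictReader expects: an iterable of lines, the first one holding
# the headers and each following one holding a full row.
lines = [
    "Code;Country;Capital",
    "AD;Andorra;Andorra la Vella",
    "AE;United Arab Emirates;Abu Dhabi",
]
per_row = list(csv.DictReader(lines, delimiter=";"))

print(per_cell[:2])
print(per_row[0])
```

In the first case each dictionary holds a single cell keyed by the first cell's text; in the second, each dictionary is a full row keyed by the real headers.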
There is no ready-made formula for this kind of thing: every page is different, and it is very hard to just "look at the HTML" and write parsing code that works the first time. It is best to work in Python's interactive mode: retrieve the page data with requests.get, create the soup object, and experiment with that object's methods and the structure of the page until you find out how to get the data in the shape you want.
In this case, you would see that once you find the "countries" table, iterating over it (that is, over its children) with a "for" alternately returns a table row (including the header row) and a text string containing only whitespace.
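That alternation can be reproduced without touching the network. The sketch below uses a small hypothetical HTML snippet (not the real page) and assumes the 'html.parser' backend, which keeps the line breaks between rows as text nodes:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical miniature of the page: one header row and one data row.
html = """<table id="countries">
<tr><th>Code</th><th>Country</th></tr>
<tr><td>AD</td><td>Andorra</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="countries")

# Record what each child of the table is: a tag name, or "text" for
# the whitespace strings that sit between the rows.
kinds = []
for child in table:
    if isinstance(child, Tag):
        kinds.append(child.name)
    else:
        kinds.append("text")

# Keeping only the Tag children leaves just the rows.
rows = [child for child in table if isinstance(child, Tag)]

print(kinds)
print(len(rows))
```

Filtering on the child type like this is what lets the parsing code ignore the whitespace and deal only with the rows.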
Maybe it's possible to do something like this then:
    import requests
    from bs4 import BeautifulSoup
    from collections import OrderedDict
    from pprint import pprint


    def importa(url='http://www.geonames.org/countries/', tmout=2):
        page = requests.get(url=url, timeout=tmout)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = soup.find_all(id="countries")[0]
        result = []
        headers = None
        for row in table:
            # Skip the children that are not HTML tags (whitespace strings)
            if isinstance(row, str):
                continue
            # Build a list with the text of each cell tag in the row,
            # ignoring any whitespace between the cells
            cells = [cell.get_text() for cell in row.find_all(['th', 'td'])]
            # Assume the first row with content holds the headers
            if not headers:
                headers = cells
                continue
            result.append(OrderedDict(zip(headers, cells)))
        return result


    pprint(importa())
(It works here. Note the use of OrderedDict, which keeps the columns in the order they appear on the page and makes the dictionaries easier to read.)