Convert HTML5 table + images to CSV or SQL

0

I have a big problem!

I have roughly 1.5 million records, including images, in an HTML5 table (the trouble starts right there: the browser cannot render all the images).

My idea was to convert this table to CSV, load that into a MySQL database, and then write an algorithm for the display page.

How can I convert this, from <table>, <tr>, <td>, some <a> and also <img> elements, to an Excel file?

Is there any other alternative? Here is a summarized example of what each table row looks like:

<tr>
    <td class="">
        <a href="#">Processo 2333382</a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
</tr>

In short, I need to get every image link, together with its process number, into the database, or rather into a CSV file.

    
asked by anonymous 19.10.2017 / 07:34

2 answers

3

You did not specify the expected output, so I'm assuming that, for the following input file input.html:

<tr>
    <td class=""> <a href="#">Processo 1</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
<tr>
    <td class=""> <a href="#">Processo 2</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>

the output should be a CSV file, output.csv, in the following format:

 Processo 1,LINK DA IMAGEM,LINK DA IMAGEM,LINK DA IMAGEM
 Processo 2,LINK DA IMAGEM,LINK DA IMAGEM,LINK DA IMAGEM

The Python script below does this conversion:

from lxml import html
import csv

# Read the input file into s
with open('input.html', 'r') as myfile:
    s = myfile.read()

# Parse the HTML and find every table row (<tr>), at any depth
page = html.fromstring(s)
rows = page.iter('tr')

# Extract the content of each row
data = []
for row in rows:
    datarow = []
    for c in row:
        # If the cell holds an image, save its link
        imgel = c.find('a/img')
        if imgel is not None:
            datarow.append(imgel.get('src'))
        # Otherwise save the cell text (the process name)
        else:
            datarow.append(c.text_content().strip())
    data.append(datarow)

# Write the output to a CSV file (Python 3: text mode with newline='')
with open('output.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(data)
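
If lxml is not installed, the same extraction can be sketched with only the standard library. This is a minimal illustration, not the script above: it swaps lxml for html.parser, and the names RowExtractor and html_rows_to_csv are made up for the example. It assumes each <td> holds either text or a single <img>, as in the sample rows.

```python
import csv
import io
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collects one list per <tr>: the image src if the cell has an <img>, else the cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the <tr> currently being read
        self._text = None   # text fragments of the current <td>
        self._src = None    # src of the <img> found in the current <td>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._text, self._src = [], None
        elif tag == "img" and self._text is not None:
            self._src = dict(attrs).get("src")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Prefer the image link; fall back to the stripped cell text
            cell = self._src if self._src is not None else "".join(self._text).strip()
            self._row.append(cell)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._text = None, None


def html_rows_to_csv(html_text):
    """Return the table rows found in html_text as CSV text."""
    parser = RowExtractor()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = '''<tr>
    <td class=""> <a href="#">Processo 1</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>'''
print(html_rows_to_csv(sample), end="")  # Processo 1,LINK DA IMAGEM
```

Because HTMLParser.feed accepts the document in pieces, a file with 1.5 million rows could be fed chunk by chunk instead of being read into memory at once, provided the finished rows are also written out as they complete.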
    
19.10.2017 / 16:54
0

Excel VBA

This can be done in Excel VBA with a regular expression.

Regex

The pattern is: (?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))

The pattern can be validated on Regex101 with the following test code:

const regex = /(?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))/g;
const str = `<tr>
    <td class=""> <a href="#">Processo 1</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> 
</a> 
</td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
<tr>
    <td class=""> <a href="#">Processo 2</a> </td>
    <td>
 <a>
 <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the 'm'-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
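
The same pattern can also be tried quickly in Python, whose re module accepts it unchanged. The sketch below is only an illustration on a single-line row; the sample strings, including the misc/a.png link, are invented for the example.

```python
import re

# The pattern from the answer, unchanged
PATTERN = re.compile(r'(?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))')

row = ('<tr> <td class=""> <a href="#">Processo 1</a> </td> '
       '<td> <a> <img src="LINK DA IMAGEM"> </a> </td> </tr>')

# With a single capture group, findall returns the captured text of each match
matches = PATTERN.findall(row)
print(matches)  # ['Processo 1', 'LINK DA IMAGEM']

# Caveat: [<img src="]* is a character class, not the literal prefix <img src=",
# so it also swallows leading link characters that belong to that set
tricky = PATTERN.findall('<td> <a> <img src="misc/a.png"> </a> </td>')
print(tricky)  # ['/a.png']
```

The second call shows a limitation worth checking against the real data: a link that begins with any of the characters i, m, g, s, r, c (or a space, =, ", <) loses those leading characters. Uppercase placeholders like LINK DA IMAGEM are not affected.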

Enable Regex in Excel

  • RegEx support has to be enabled first: turn on the Developer tab in Excel.
  • In the 'Developer' tab, click 'Visual Basic' and the VBA window will open.
  • Go to 'Tools' -> 'References...' and a window will open.
  • Look for 'Microsoft VBScript Regular Expressions 5.5' and check that option.

VBA Code

Just change the variable str to hold the string you want to process; it can come from an Excel cell, a Range, an Array, or plain text, as in the example.

This code does not do everything that was asked, but I suggest breaking the problem down into several smaller problems and assembling the code that accomplishes the task you want.

Dim str As String
Dim objRegExp As Object
Dim objMatches As Object
Dim m As Object

str = "<tr> <td class=""""> <a href=""#"">Processo 1</a> </td> <td> <a> <img src=""LINK DA IMAGEM""> </a> </td>  <td> <a> <img src=""LINK DA IMAGEM""></a></td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td></tr><tr>    <td class=""""> <a href=""#"">Processo 2</a> </td><td> <a> <img src=""LINK DA IMAGEM""> </a> </td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td></tr>"
Set objRegExp = CreateObject("VBScript.RegExp") 'New regexp
objRegExp.Pattern = "(?:<a.*?>\s*[<img src=""]*)(.+?)(?=""?>?\s*(?:<\/a>|$))"
objRegExp.Global = True
Set objMatches = objRegExp.Execute(str)
If objMatches.Count <> 0 Then
    For Each m In objMatches
        'Print the captured group (process name or image link)
        Debug.Print m.SubMatches(0)
    Next
End If
    

Result

Using the string from @klaus's answer.

        
    20.10.2017 / 19:35