Python: Clean HTML code

Using Python, what is the easiest way to strip the tag attributes that Microsoft tools add to HTML?

So far I've been trying with Beautiful Soup, but I'm open to any suggestions! :D

The input looks like this:

<p style="text-decoration: underline;">Hello <strong>World!</strong></p>
<p style="color: #228;">How are you today?</p>
<table style="width: 300px; text-align: center;" border="1" cellpadding="5">
<tr>
<th width="75"><strong><em>Name</em></strong></th>
<th colspan="2"><span style="font-weight: bold;">Telephone</span></th>
</tr>
<tr>
<td>John</td>
<td><a style="color: #F00; font-weight: bold;" href="tel:0123456785">0123 456 785</a></td>
<td><img width="25" height="30" src="images/check.gif" alt="checked" /></td>
</tr>
</table>

And this is the output I want:

<p>Hello <strong>World!</strong></p>
<p>How are you today?</p>
<table border="1" cellpadding="5">
<tr>
<th width="75"><strong><em>Name</em></strong></th>
<th colspan="2"><span>Telephone</span></th>
</tr>
<tr>
<td>John</td>
<td><a href="tel:0123456785">0123 456 785</a></td>
<td><img width="25" height="30" src="images/check.gif" alt="checked" /></td>
</tr>
</table>
    
asked by anonymous 22.02.2018 / 10:07

1 answer

You can use re.sub() for this.

Example to remove style attributes:

import re

html_string = "[put your HTML here]"
# Remove every double-quoted style attribute, including empty ones (style="")
html_no_style = re.sub(r'\s+style="[^"]*"', '', html_string)

Test it against several different HTML files to check whether the regex needs refining, for example to handle attribute values written in single quotes.
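As an illustration of that kind of refinement, here is a minimal sketch of a more tolerant pattern. It is not part of the original answer: the names STYLE_ATTR and strip_style are hypothetical, and the pattern assumes the style attribute may be written in any letter case, with double quotes, single quotes, or no quotes at all.

```python
import re

# Hypothetical helper, not from the original answer.
# Matches leading whitespace, the attribute name "style" (any case),
# and a value in double quotes, single quotes, or unquoted.
STYLE_ATTR = re.compile(
    r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def strip_style(html_string):
    """Return html_string with every style attribute removed."""
    return STYLE_ATTR.sub("", html_string)


sample = (
    '<p style="text-decoration: underline;">Hello <strong>World!</strong></p>'
    "<a style='color: #F00;' href=\"tel:0123456785\">0123 456 785</a>"
)
print(strip_style(sample))
```

Other attributes such as width, border, and href are left alone, which matches the output requested in the question. Keep in mind that a regex does not know where tags begin and end, so it would also alter visible text that happens to contain something like style="..."; if that is a concern, an HTML parser such as Beautiful Soup is the safer choice.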

    
22.02.2018 / 12:14