bs4: How to wrap incomplete HTML in "html" and "body" tags?

3

Hello, I came across incomplete HTML documents in which the "html" and "body" tags are missing.

Here is the code I've implemented:

import bs4

content='''
<head>
 <title>
  my page
 </title>
</head>
  <table border="0" cellpadding="0" cellspacing="0">
   <tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''

soup = bs4.BeautifulSoup(content, 'html.parser')

I tried the snippet below, which raises an error:

tag = soup.new_tag('html')
tag.wrap(soup)
  

ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree.

I also tried this, which scrambles the order of the tags:

for item in soup.find_all():
    tag.append(item.extract())
soup = tag

<body>
 <head>
 </head>
 <title>
  my page
 </title>
 <div>
 </div>
 <center>
 </center>
 <table border="0" cellpadding="0" cellspacing="0">
 </table>
 <tr>
 </tr>
 <td>
 </td>

How can I use bs4 to wrap the markup in 'html' and 'body' tags?

    
asked by anonymous 19.06.2018 / 15:47

1 answer

1

For this you will need the html5lib parser, which builds the document the way a browser would and adds the missing html and body tags:

pip install html5lib

I tried it in my console and this was the result:

In [2]: import bs4

In [3]: content = '''
<head>
 <title>
  my page
 </title>
</head>
  <table border="0" cellpadding="0" cellspacing="0">
   <tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''

In [4]: soup = bs4.BeautifulSoup(content, 'html5lib')

In [5]: soup
Out[5]: 
<html><head>
 <title>
  my page
 </title>
</head>
  <body><table border="0" cellpadding="0" cellspacing="0">
   <tbody><tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </tbody></table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
</body></html>
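
If installing html5lib is not an option, the wrapping can also be done by hand with html.parser. As far as I can tell, the two attempts in the question fail for these reasons: wrap() only works on an element that already has a parent in the tree, and find_all() returns every descendant, so extracting each one flattens the nesting. The sketch below moves only the top-level nodes instead. The shortened content string and the rule "head goes into html, everything else into body" are my own assumptions for illustration, not something bs4 does for you:

```python
import bs4

# Shortened stand-in for the markup in the question (assumption for brevity).
content = """<head><title>my page</title></head>
<table><tr><td><p>cell</p></td></tr></table>
<p>#1</p>
<p>#2</p>
"""

soup = bs4.BeautifulSoup(content, "html.parser")

html = soup.new_tag("html")
body = soup.new_tag("body")

# Snapshot the top-level nodes first: moving a node mutates soup.contents.
for node in list(soup.contents):
    node.extract()
    if isinstance(node, bs4.Tag) and node.name == "head":
        # Keep <head> as the first child of <html>.
        html.insert(0, node)
    else:
        # Everything else (tags and whitespace strings) goes into <body>.
        body.append(node)

html.append(body)
soup.append(html)

print(soup.prettify())
```

Because each top-level node is moved together with its children, the table keeps its tr and td nesting. Unlike html5lib, this does not insert tbody or repair any other broken markup.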
    
23.06.2018 / 00:02