Correcting HTML from a page using Ruby / Nokogiri

I'm having some trouble consuming HTML generated by a third-party page, because the HTML is missing some closing tags.

For example:

<div>
  <li>
    <div>
      <div>test
        test
      </div>
      <li>
        <div>test 
          <div>test2</div>
        </div>

Parsing it with Nokogiri:

require 'nokogiri'
html = Nokogiri::HTML(File.open('origem.html'))

The result is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><div>
      <li>
        <div>
          <div>test
            test
          </div>
          <li>
            <div>test 
              <div>test2</div>
            </div>
    </li>
    </div>
    </li>
    </div></body></html>

The second <li> ends up nested inside the first one, whereas the correct result would look something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
<li>
  <div>
    <div>test
      test
    </div>
  </div>
</li>
<li>
  <div>test 
    <div>test2</div>
  </div>
</li>
</div>
</body></html>
    
asked by anonymous 15.09.2015 / 19:27

1 answer

This was answered on Stack Overflow (SO).

Basically, use the Nokogumbo gem together with Nokogiri: it parses the document as HTML5, so the missing closing tags are repaired by the same error-recovery rules that browsers such as Google Chrome apply.

It works beautifully!

    
16.09.2015 / 13:20