Error converting HTML to PDF using XMLWorkerHelper

When exporting an HTML file to PDF using iTextSharp and XMLWorker, an error occurs in some situations saying that a certain tag is not closed. While searching, I found this post.

My application queries an SQL table that stores saved HTML forms. When I try to convert them to PDF, an error occurs saying that a certain tag is not closed. Below is the code I use to export to PDF:

public ActionResult GetPdfFileZiped(ProcessamentoRegistros pProcessamentoRegistros)
        {
                // XMLWorkerHelper.GetInstance().ParseXHtml(pw, doc, srHtml) (further down)
                // throws because the stored HTML is sometimes not well formed.
                pProcessamentoRegistros.IdProcessamentoDiario = 1;
                pProcessamentoRegistros.IdRegistro = 1;
                pProcessamentoRegistros.IdServico = 2;
                ProcessamentoRegistros _processamento = _IRepositorio.ObterProcessamentoRegistros(pProcessamentoRegistros);

                var doc = new Document(PageSize.A4.Rotate());
                var stream = new MemoryStream();
                var pw = PdfWriter.GetInstance(doc, stream);
                var minhaStringHTML = @_processamento.DocumentoHtml.Trim();

                doc.Open();

                using (var srHtml = new StringReader(minhaStringHTML))
                {
                    XMLWorkerHelper.GetInstance().ParseXHtml(pw, doc, srHtml); // <-- THE ERROR OCCURS HERE
                }
                doc.Close();

                using (var compressedFileStream = new MemoryStream())
                {
                    using (var zipArchive = new ZipArchive(compressedFileStream, ZipArchiveMode.Update, false))
                    {
                        var zipEntry = zipArchive.CreateEntry("MeuPDFZipado.pdf");                        
                        using (var originalFileStream = new MemoryStream(stream.ToArray()))
                        {
                            using (var zipEntryStream = zipEntry.Open())
                            {
                                originalFileStream.CopyTo(zipEntryStream);
                            }
                        }
                    }
                    return new FileContentResult(compressedFileStream.ToArray(), "application/zip") { FileDownloadName = "Filename.zip" };
                }
        }

For example, the img tag below is not closed, and I have no control over how the HTML is formatted. The same error occurs with some other tags:

<IMG border="0" src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/caixa.gif" width=180 height=44>

Below is the full HTML:

<HTML><HEAD><META NAME="GENERATOR" Content="Microsoft Visual Studio 6.0">
<script language=javascript>
//function MudarPagina() {
//  window.history.back();
//}
</script>
</HEAD>
<!--body bgcolor=white onBlur=MudarPagina();-->
<body bgcolor=white>
    <FORM method="post" style="BACKGROUND-COLOR: white">
    <!--FORM name="Imprimir" method="post" style="BACKGROUND-COLOR: white"-->
<br>    
<table>
<tr>
<td align=center><a href="javascript:window.print();"><IMG src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/botimprimir.gif" border=0></a><a href="javascript:window.history.back();"><IMG src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/botvoltar.gif" border=0></a></td></tr><tr><td><table width="75%" CELLSPACING=0 CELLPADDING=10 border=1 align=center bordercolorlight="#FFFFFF" bordercolordark="#CCCCCC">


<tr>
<td>    

    <TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0 style="color: black" class=txtcentral>
        <tr>
            <td align=left><IMG border="0" src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/caixa.gif" width=180 height=44></td></tr><tr><td colspan=2>&nbsp</td></tr><tr><td align=rigth><span style="font-size: 13pt" align=center><strong>Certificado de Regularidade do FGTS - CRF</strong></span></td>
        </tr>
    </table>

    <TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0 style="color: black" class=txtcentral>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD width=22%><font style=" font-family: Verdana;font-size:10pt"><strong>Inscrição:</strong></font></TD>
            <TD ><font style=" font-family: Verdana;font-size:8pt">08439659/0001-50</font></TD>
        </tr>
        <tr>
            <td width=22% valign=top nowrap><font style=" font-family: Verdana;font-size:10pt"><strong>Razão Social:</strong></font></TD>
            <td><font style=" font-family: Verdana;font-size:8pt">CPFL ENERGIAS RENOVAVEIS S A</font></TD>
        </tr>

        <tr>
            <td width=22% nowrap><font style=" font-family: Verdana;font-size:10pt"><strong>Nome Fantasia:</strong></font></TD>
            <td ><font style=" font-family: Verdana;font-size:8pt">CPFL RENOVAVEIS</font></TD>
        </tr>

        <tr>
            <td width=22% valign=top><font style=" font-family: Verdana;font-size:10pt"><strong>Endereço:</strong></font></TD>
            <td ><font style=" font-family: Verdana;font-size:8pt">AV DOUTOR CARDOSO DE MELO   1184   ANDAR 7 / VILA OLIMPIA / SAO PAULO / SP / 4548-004</font></TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD colspan=2 style="text-align: justify"><font style=" font-family: Verdana;font-size:10pt">A Caixa Econômica Federal, no uso da atribuição que lhe confere o Art. 7, da
            Lei 8.036, de 11 de maio de 1990, certifica que, nesta data, a empresa acima identificada
            encontra-se em situação regular perante o Fundo de Garantia do Tempo de Serviço - FGTS.
            </font>
            </TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <td style="text-align: justify" colspan=2><font style=" font-family: Verdana;font-size:10pt">O presente Certificado não servirá de prova contra cobrança de quaisquer débitos referentes
            a contribuições e/ou encargos devidos, decorrentes das obrigações com o FGTS.</font>
            </td>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>


        <tr>
            <td colspan=2><font style=" font-family: Verdana;font-size:10pt"><strong>Validade: </strong>28/02/2017 a 29/03/2017</font></TD>
        </tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <td colspan=2><font style=" font-family: Verdana;font-size:10pt"><strong>Certificação Número: </strong>2017022805233090232330</font></TD></TR>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD colspan=2><font style=" font-family: Verdana;font-size:10pt">Informação obtida em 15/03/2017, às 17:14:51.</font></TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD style="text-align: justify" colspan=2><font style=" font-family: Verdana;font-size:10pt">A utilização deste Certificado
                para os fins previstos em Lei está condicionada à verificação de
                autenticidade no site da Caixa: <strong>www.caixa.gov.br</strong></font></TD>
            </tr>
    </TABLE>
</form>

</td></tr></table>

</td>
</tr>

</table> 

<script language=javascript>
//window.print();
</script>   
</BODY>
</HTML>

How can I get around this problem? Is it possible to parse the HTML and transform it into XHTML? Are there any other free alternatives for converting this HTML to PDF while keeping the tag styles?

    
asked by anonymous 29.05.2017 / 18:49

1 answer

How can I get around this?

The correct way to get around your problem is to attack it at the root. That is, you should fix your HTML documents so that the tool can work correctly. For example, you can use the W3C Validator to check whether a given HTML document has errors.
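
As a quick local check before reaching for an online validator, the stored markup can be run through .NET's own XML parser. This is only a minimal sketch: `WellFormedCheck` and `FindFirstXmlError` are hypothetical names, not part of iTextSharp, and the check covers XML well-formedness only (unclosed tags, unquoted attributes, undefined entities such as `&nbsp`), not full HTML validation as performed by the W3C Validator.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Hypothetical helper: returns null when the markup parses as
    // well-formed XML, otherwise the parser's error message.
    public static string FindFirstXmlError(string markup)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip any DOCTYPE instead of resolving it
            XmlResolver = null                    // never fetch external resources
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(markup), settings))
            {
                while (reader.Read()) { } // walk the whole document
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message; // describes the first problem the parser hit
        }
    }
}
```

Calling `WellFormedCheck.FindFirstXmlError(minhaStringHTML)` before `ParseXHtml` and logging any non-null result should show which stored forms need repair and roughly where the first problem is.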

  

Is it possible to parse the HTML and transform it into XHTML?

I have no experience with the tool myself, but try TidyManaged.

Below is an example of its use:

using System;
using TidyManaged;

public class Test
{
  public static void Main(string[] args)
  {
    using (Document doc = Document.FromString("<hTml><title>test</tootle><body>asd</body>"))
    {
      doc.ShowWarnings = false;
      doc.Quiet = true;
      doc.OutputXhtml = true;
      doc.CleanAndRepair();
      string parsed = doc.Save();
      Console.WriteLine(parsed);
    }
  }
}

The resulting XHTML will look something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>test</title>
</head>
<body>
asd
</body>
</html>

It is probably also possible to do something similar with the W3C API.

  

Are there any other free alternatives for converting this HTML to PDF while keeping the tag styles?

The problem is not the PDF generation but the HTML itself (the root problem, as I mentioned before). If something prevents you from correcting the HTML at its source, you can try a tool like the one mentioned above to parse your HTML and repair the errors it finds. That is not 100% reliable, though: some errors may go undetected or be repaired in a way you did not intend.
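
Putting the two pieces together, a repair step could run just before the `ParseXHtml` call from the question. The following is a sketch only, assuming the TidyManaged calls behave as in the example above and that the iTextSharp and XMLWorker assemblies are referenced; `HtmlToPdf` and `Convert` are hypothetical names, and I have not verified how well the repaired markup renders for these particular forms.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

public static class HtmlToPdf
{
    // Hypothetical helper: repairs the markup with TidyManaged,
    // then renders the resulting XHTML with XMLWorker.
    public static byte[] Convert(string rawHtml)
    {
        string xhtml;
        // Fully qualified to avoid a clash with iTextSharp.text.Document.
        using (var tidy = TidyManaged.Document.FromString(rawHtml))
        {
            tidy.ShowWarnings = false;
            tidy.Quiet = true;
            tidy.OutputXhtml = true; // ask Tidy for XHTML output
            tidy.CleanAndRepair();
            xhtml = tidy.Save();
        }

        using (var stream = new MemoryStream())
        {
            var doc = new Document(PageSize.A4.Rotate());
            var writer = PdfWriter.GetInstance(doc, stream);
            doc.Open();
            using (var reader = new StringReader(xhtml))
            {
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, reader);
            }
            doc.Close();
            return stream.ToArray(); // still valid after the stream is closed
        }
    }
}
```

In the action from the question, the returned bytes would take the place of `stream.ToArray()` when filling the ZIP entry. If Tidy cannot repair a particular form, `ParseXHtml` may still throw, so it is worth keeping the original error handling around the call.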

    
29.05.2017 / 19:25