Garbled text when extracting from a PDF generated by LaTeX

4

In my program I need to open a PDF file and extract the text it contains. However, the extracted text comes out garbled. For example:
Thanks to my family for not being? measure effort
when the correct text would be:
Thanks to my family for not measuring effort

This only happens when the PDF was generated by LaTeX. When it was generated by Word, the text comes out fine. The code I'm using to read the PDF is:

int i = 1; // n is the number of pages
PdfReader reader = new PdfReader(diretorio);
while(i<=n){
   conteudo+=PdfTextExtractor.getTextFromPage(reader, i);
   i++;
}

I know it has to do with encoding, but I don't know how to solve it or where to start.
Keep in mind that the PDFs will not be generated by me.

    
asked by anonymous 29.04.2015 / 03:20

2 answers

3

Solution 1: Sidestep the problem :)

Edit the LaTeX source and add:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}    %% <<<<<< this line
\begin{document}
...

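The generated PDF no longer has the problem! With the T1 font encoding, each accented character is a single glyph in the font instead of a base letter with a separately placed accent, so text extractors see the character whole.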

Solution 2:

Write a postprocessor that takes the mangled text extracted from the PDF and restores the accented characters through successive replacements. Bad idea ...

Update:

If solution 1 is not an option: I do not know how to do this decently. I usually use tools like pdftotext, which, applied to a bad PDF coming from LaTeX (MPVL), gives output like the following: pdftotext mpvl.pdf

Jo˜ao 
Resumo
fam´ılia esfor¸co

and after piping it through | fix-mpvl

João
Resumo
família esforço

In my case, fix-mpvl does many things, among them:

#!/usr/bin/perl 
use utf8::all;

while(<>){
  s/eˆ/ê/g; s/ˆe/ê/g;
  s/aˆ/â/g; s/ˆa/â/g;
  s/oˆ/ô/g; s/ˆo/ô/g;
  s/e´/é/g; s/´e/é/g;
  s/a´/á/g; s/´a/á/g;
  s/o´/ó/g; s/´o/ó/g;
  s/u´/ú/g; s/´u/ú/g;
  s/a˜/ã/g; s/˜a/ã/g;
  s/o˜/õ/g; s/˜o/õ/g;
  s/n˜/ñ/g; s/˜n/ñ/g;
  s/ı´/í/g; s/´ı/í/g;
  s/c¸/ç/g; s/¸c/ç/g;
  print $_;
}
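
Since the question uses Java, the same idea can be sketched there. This is only a rough port of the substitutions listed above, not the full fix-mpvl script. The class name AccentFixer and the method fix are made up for illustration, and the sketch assumes the extractor emits the same detached accent characters as the pdftotext sample (U+02C6, U+00B4, U+02DC, U+00B8, plus the dotless i U+0131). iText may emit something different, so inspect the actual output first.

```java
/**
 * Hypothetical post-processor modeled on the Perl substitutions above.
 * Assumption: accents arrive as the separate characters U+02C6 (circumflex),
 * U+00B4 (acute), U+02DC (tilde) and U+00B8 (cedilla), next to the letter.
 */
final class AccentFixer {

    // Each row: base letter, detached accent, composed character.
    // The order follows the Perl script, because the order decides
    // ambiguous cases where an accent sits between two letters.
    private static final String[][] RULES = {
        {"e", "\u02C6", "\u00EA"},      // e circumflex
        {"a", "\u02C6", "\u00E2"},      // a circumflex
        {"o", "\u02C6", "\u00F4"},      // o circumflex
        {"e", "\u00B4", "\u00E9"},      // e acute
        {"a", "\u00B4", "\u00E1"},      // a acute
        {"o", "\u00B4", "\u00F3"},      // o acute
        {"u", "\u00B4", "\u00FA"},      // u acute
        {"a", "\u02DC", "\u00E3"},      // a tilde
        {"o", "\u02DC", "\u00F5"},      // o tilde
        {"n", "\u02DC", "\u00F1"},      // n tilde
        {"\u0131", "\u00B4", "\u00ED"}, // dotless i + acute -> i acute
        {"c", "\u00B8", "\u00E7"},      // c cedilla
    };

    /** Applies every rule in order and returns the repaired text. */
    static String fix(String text) {
        String result = text;
        for (String[] rule : RULES) {
            String letter = rule[0];
            String accent = rule[1];
            String composed = rule[2];
            // As in the Perl: letter+accent first, then accent+letter.
            result = result.replace(letter + accent, composed)
                           .replace(accent + letter, composed);
        }
        return result;
    }

    public static void main(String[] args) {
        // Inputs taken from the pdftotext sample above.
        System.out.println(fix("Jo\u02DCao"));
        System.out.println(fix("fam\u00B4\u0131lia esfor\u00B8co"));
    }
}
```

In the question's loop this would be applied to the accumulated text, for example conteudo = AccentFixer.fix(conteudo). Like the Perl excerpt, it only covers lowercase letters and is just as fragile: it is a workaround, not a real fix for the encoding.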
    
29.04.2015 / 13:18
1

You can try another library, Apache PDFBox. The advantage is that its features are already available from the jar on the command line, so you can test whether it works on your files first. If it does, you can then integrate the classes into your source code.

You can download it directly from the Maven repository.

Download the jar and run the following command on your PDF file:

java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <inputfile> [Text file]

You will have to look up the parameters yourself in the PDFBox documentation, since I don't know the specifics of your project. Take a look at the -encoding option; the answer to your question may be there. By default, LaTeX uses the OT1 font encoding, if I'm not mistaken.

If you can extract the text correctly with the command, then you can add the library as a dependency.

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.9</version>
</dependency>

Then follow an example that performs text extraction in Java.

    
05.05.2015 / 20:32