How to use the JavaScript library pdf.js in Selenium with Java through the JavascriptExecutor interface


I found this library, which does exactly what I need: it extracts the text from a PDF and returns it as a String.

From what I researched, the version I found seems to be the most recent release of pdf.js. However, I cannot manage to open the PDF file in the browser, get the library loaded, and then use its methods to copy the text.

I searched a lot, and I am admittedly no JavaScript expert, but I found an approach that looks like the ideal way to implement this. However, I could not adapt it to Selenium's JavascriptExecutor.

Here is my attempt at calling the index page of the first example:

    driver.get("file:///C:/Users/user/Desktop/arquivo.pdf");

    JavascriptExecutor jse = (JavascriptExecutor) driver;

    String script1 = "id=\"pdf-js\"";
    String script2 = "src=\"projeto/src/test/resources/js/pdf.js\"";
    String script3 = "PDFJS.workerSrc = cslight/src/test/resources/js/pdf.js";
    String script4 = "src=\"/projeto/src/test/resources/js/app.js\"";
    String script5 = "var app = new App;";

    jse.executeScript(script1);
    jse.executeScript(script2);
    jse.executeScript(script3);
    jse.executeScript(script4);
    jse.executeScript(script5);

Here is the error:

    Exception in thread "main" org.openqa.selenium.WebDriverException: unknown error: PDFJS is not defined
      (Session info: chrome=65.0.3325.181)
      (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7), platform=Windows NT 10.0.14393 x86_64) (WARNING: The server did not provide any stacktrace information)
    Command duration or timeout: 0 milliseconds
    Build info: version: '3.5.3', revision: 'a88d25fe6b', time: '2017-08-29T12:42:44.417Z'
    System info: host: 'NC0048', ip: '10.13.30.196', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_161'
    Driver info: org.openqa.selenium.chrome.ChromeDriver
    Capabilities [{mobileEmulationEnabled=false, hasTouchScreen=false, platform=XP, acceptSslCerts=false, acceptInsecureCerts=false, webStorageEnabled=true, browserName=chrome, takesScreenshot=true, javascriptEnabled=true, platformName=XP, setWindowRect=true, unexpectedAlertBehaviour=, applicationCacheEnabled=false, rotatable=false, networkConnectionEnabled=false, chrome={chromedriverVersion=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7), userDataDir=C:\Users\ICARO~1.PRA\AppData\Local\Temp\scoped_dir17892_11337}, takesHeapSnapshot=true, pageLoadStrategy=normal, unhandledPromptBehavior=, databaseEnabled=false, handlesAlerts=true, version=65.0.3325.181, browserConnectionEnabled=false, nativeEvents=true, locationContextEnabled=true, cssSelectorsEnabled=true}]
    Session ID: 757fa21a22500f6618317bc12d5799ce
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:215)
        at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:167)
        at org.openqa.selenium.remote.http.JsonHttpResponseCodec.reconstructValue(JsonHttpResponseCodec.java:40)
        at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:82)
        at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:45)
        at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:164)
        at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:82)
        at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:646)
        at org.openqa.selenium.remote.RemoteWebDriver.executeScript(RemoteWebDriver.java:582)
        at br.com.conductor.test.GenericTester.tester(GenericTester.java:40)
        at br.com.conductor.test.GenericTester.main(GenericTester.java:61)

    
asked by anonymous 12.04.2018 / 16:28

1 answer


Here are two libraries you can add to your Maven project to read PDFs, iText and Apache PDFBox:

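    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.9</version>
    </dependency>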


    package testcases;

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
    import org.apache.pdfbox.io.RandomAccessRead;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.junit.Test;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class PdfTest {

        // Remote PDF read by the iText example
        private final String pdfUrl = "http://files.isec.pt/DOCUMENTOS/SERVICOS/BIBLIO/teses/Tese_Mest_Marcio-Carvalho.pdf";
        // Local PDF read by the PDFBox example
        private final String pdfPath = "/home/diamaral/Documentos/diamaral/test.pdf";

        @Test
        public void lerConteudoPdfUsandoApiIText() throws IOException {
            PdfReader pdfReader = new PdfReader(pdfUrl);
            try {
                // Extract the text of page 1 only (iText page numbers start at 1)
                System.out.println("\n\n---------API ITEXT-----------------------------"
                        + PdfTextExtractor.getTextFromPage(pdfReader, 1));
            } finally {
                pdfReader.close();
            }
        }

        @Test
        public void lerPdfUsandoApiPdfBox() throws IOException {
            RandomAccessRead doc = new RandomAccessBufferedFileInputStream(new File(pdfPath));
            PDFParser parser = new PDFParser(doc);
            parser.parse();
            // try-with-resources closes the document even if text extraction fails
            try (PDDocument pdfDoc = parser.getPDDocument()) {
                PDFTextStripper stripper = new PDFTextStripper();
                // getText returns the text of the whole document
                System.out.println("\n\n---------API PDFBOX-----------------------------"
                        + stripper.getText(pdfDoc));
            }
        }
    }
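The PDFBox test above reads from a local path, while the question opens the PDF in the browser through a URL. One way to bridge the two, sketched here with only the JDK, is to copy the URL's content to a temporary file first. The class name `PdfDownloads` and the method `downloadToTempFile` are hypothetical helpers made up for this illustration, not part of Selenium, iText, or PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of any library mentioned in this thread).
final class PdfDownloads {

    private PdfDownloads() {
    }

    // Copies the resource behind pdfUrl (http://, https:// or file:///)
    // into a new temporary file and returns the path of that file.
    static Path downloadToTempFile(String pdfUrl) throws IOException {
        Path target = Files.createTempFile("downloaded-", ".pdf");
        try (InputStream in = URI.create(pdfUrl).toURL().openStream()) {
            // createTempFile already created an empty file, so overwrite it
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

With this in place, `new File(pdfPath)` in the PDFBox test could be replaced by something like `PdfDownloads.downloadToTempFile(driver.getCurrentUrl()).toFile()`. Note that the download happens outside the browser, so it does not carry the browser's cookies or login session; a PDF that requires authentication would need a different approach.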

    
14.04.2018 / 18:12