Sometimes we need to verify the content of a PDF file, but Selenium WebDriver doesn’t have any direct methods for that.
To extract the PDF content we can use a library such as Apache PDFBox.
Simply download the .jar files and add them to the build path of your Eclipse project.
Here is a sample script that extracts the text from a sample PDF file (it uses the PDFBox 1.x API):


import java.io.BufferedInputStream;
import java.net.URL;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

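public class pdfReader {

    WebDriver driver;

    @BeforeClass
    public void setUp() {
        driver = new FirefoxDriver();
    }

    @AfterClass
    public void tearDown() {
        driver.quit();
    }

    // Uses the PDFBox 1.x API. In PDFBox 2.x, PDFTextStripper moved to
    // org.apache.pdfbox.text and PDFParser no longer accepts an InputStream.
    @Test
    public void testPDFReader() throws Exception {
        // Open the PDF in the browser.
        driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

        // Crude fixed wait for the page to finish loading.
        Thread.sleep(5000);

        // Download the same document over a separate connection
        // (the browser's cookies and session are not sent here).
        URL url = new URL(driver.getCurrentUrl());
        BufferedInputStream fileToParse = new BufferedInputStream(
                url.openStream());

        try {
            PDFParser parser = new PDFParser(fileToParse);
            parser.parse();

            // Extract the plain text of the whole document and print it.
            String output = new PDFTextStripper().getText(parser.getPDDocument());
            System.out.println(output);
            parser.getPDDocument().close();
        } finally {
            // Close the stream even if parsing fails.
            fileToParse.close();
        }
    }
}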

Simply load the page that serves the PDF file. The browser opens the PDF, then we read its URL from the driver, download the file from that URL, parse it with PDFParser, and that’s it!
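
The script above only prints the extracted text. To actually verify it, one option is to compare it against an expected phrase after collapsing whitespace, because text extracted from a PDF usually contains line breaks and extra spaces that depend on the layout. The sketch below is only an illustration: the class PdfTextCheck, its methods, and the sample string are made up for this example and are not part of Selenium or PDFBox.

```java
import java.util.Locale;

// Hypothetical helper for checking extracted PDF text; not part of
// Selenium or PDFBox.
final class PdfTextCheck {

    private PdfTextCheck() {
    }

    // Collapses every run of whitespace (spaces, tabs, line breaks)
    // into a single space and trims both ends.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns true if the normalized text contains the normalized
    // phrase, ignoring case.
    static boolean containsPhrase(String extractedText, String expectedPhrase) {
        String haystack = normalize(extractedText).toLowerCase(Locale.ROOT);
        String needle = normalize(expectedPhrase).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by PDFTextStripper.getText(...).
        String output = "Sample   report\r\nTotal\tamount:\n 120,00 EUR\n";
        System.out.println(containsPhrase(output, "total amount: 120,00 eur")); // true
        System.out.println(containsPhrase(output, "Total amount: 999,00 EUR")); // false
    }
}
```

In the test method, the System.out.println(output) line could then be replaced with a TestNG assertion such as Assert.assertTrue(PdfTextCheck.containsPhrase(output, "the text you expect")), so that the test fails when the PDF does not contain the expected text.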

Posted By István Lackó

    5 Responses to “How to read text from PDF file using Java and Selenium Webdriver”

  1. Rahul says:

    java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

    But if I don't cast it, I get a syntax error.

    Thread.sleep(5000);
    URL url = new URL(d.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());

    PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
    parser.parse();

    String output = new PDFTextStripper().getText(parser.getPDDocument());
    System.out.println(output);
    parser.getPDDocument().close();
