Sometimes we need to verify the content of a PDF, but Selenium WebDriver doesn’t have any direct methods for that.
To extract the PDF content we can use a library such as Apache PDFBox.
Simply download the .jar files and add them to your Eclipse build path (classpath).
Here is a sample script that extracts the text from a sample PDF file:


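import java.io.BufferedInputStream;
import java.net.URL;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

// Written against the PDFBox 1.x API: PDFParser(InputStream) and
// org.apache.pdfbox.util.PDFTextStripper were changed in PDFBox 2.x.
public class PdfReader {

    WebDriver driver;

    @BeforeClass
    public void setUp() {
        driver = new FirefoxDriver();
    }

    @AfterClass
    public void tearDown() {
        driver.quit();
    }

    @Test
    public void testPDFReader() throws Exception {
        // Open the page; the browser ends up on the URL of the PDF itself.
        driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

        // Crude fixed wait so the browser has time to load the PDF.
        Thread.sleep(5000);
        URL url = new URL(driver.getCurrentUrl());

        // Download the PDF ourselves; the stream is closed automatically.
        try (BufferedInputStream fileToParse = new BufferedInputStream(url.openStream())) {
            PDFParser parser = new PDFParser(fileToParse);
            parser.parse();

            PDDocument document = parser.getPDDocument();
            try {
                // Extract the text of all pages and print it.
                String output = new PDFTextStripper().getText(document);
                System.out.println(output);
            } finally {
                // Release the document even if text extraction fails.
                document.close();
            }
        }
    }
}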

Simply load the page that serves the PDF file. The browser opens the PDF, then we read the current URL, open a stream to it, and parse the content with PDFParser. That’s it!
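
The script above only prints the extracted text, so here is a minimal sketch of how the verification step itself could look. It uses only the standard Java library. The class `PdfTextChecks` and its methods are hypothetical helpers invented for this illustration; they are not part of Selenium or PDFBox. The sketch assumes two things: that a well-formed PDF starts with the marker `%PDF-`, which lets us notice when the URL returned an HTML page instead of a PDF, and that extracted text may contain irregular line breaks and spacing, so it is safer to normalize whitespace before comparing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for this post; not part of Selenium or PDFBox.
final class PdfTextChecks {

    private PdfTextChecks() {
    }

    // Returns true if the data starts with the ASCII marker "%PDF-",
    // which is how a well-formed PDF file begins.
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Collapses runs of spaces, tabs and line breaks into a single space,
    // trims the ends and lower-cases the result.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Returns true if the expected phrase occurs in the extracted text,
    // ignoring case and differences in whitespace.
    static boolean containsText(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }
}
```

In the test method you could then replace the `System.out.println(output)` call with something like `Assert.assertTrue(PdfTextChecks.containsText(output, "expected phrase"))`, where the phrase is whatever your PDF is supposed to contain.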

6 thoughts to “How to read text from PDF file using Java and Selenium Webdriver”

  • Rahul

    java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

    But if I don’t cast it, I get a syntax error.

    Thread.sleep(5000);
    URL url = new URL(d.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());

    PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
    parser.parse();

    String output = new PDFTextStripper().getText(parser.getPDDocument());
    System.out.println(output);
    parser.getPDDocument().close();

  • Sameera Bhatt

    Hi,

    Thank you for sharing this awesome information with us!!!
    You always write some very interesting topics that give boosts to the reader.
