How to read text from PDF file using Java and Selenium Webdriver

istvan.lacko | September 30, 2014 | Java, Selenium

Sometimes we need to verify the content of a PDF, but Selenium WebDriver has no direct method for that.
To extract the content of a PDF we can use a library such as Apache PDFBox.
Simply download the .jar files and add them to the build path (classpath) of your Eclipse project.
Here is a sample script that extracts the text from a sample PDF file:


    import java.io.BufferedInputStream;
    import java.net.URL;

    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.testng.annotations.AfterClass;
    import org.testng.annotations.BeforeClass;
    import org.testng.annotations.Test;

    public class PdfReader {

        private WebDriver driver;

        @BeforeClass
        public void setUp() {
            driver = new FirefoxDriver();
        }

        @AfterClass
        public void tearDown() {
            driver.quit();
        }

        @Test
        public void testPDFReader() throws Exception {
            // Open the PDF in the browser.
            driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

            // Crude fixed wait for the page to finish loading.
            Thread.sleep(5000);

            // Download the same URL again, outside the browser, as a stream.
            URL url = new URL(driver.getCurrentUrl());
            BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
            try {
                // PDFBox 1.8.x API: here PDFParser accepts an InputStream.
                PDFParser parser = new PDFParser(fileToParse);
                parser.parse();

                PDDocument document = parser.getPDDocument();
                try {
                    String output = new PDFTextStripper().getText(document);
                    System.out.println(output);
                } finally {
                    // Always release the document, even if extraction fails.
                    document.close();
                }
            } finally {
                fileToParse.close();
            }
        }
    }
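
The script above only prints the extracted text. Since the goal is to verify PDF content, a check is usually the next step. Extracted text tends to carry line breaks and spacing that follow the page layout rather than the sentence, so a plain `contains` can miss a phrase that wraps across lines. The sketch below is one possible approach and not part of the original script: `PdfTextCheck`, `normalize`, and `containsPhrase` are hypothetical names, and the sample string is made up to stand in for the `output` variable.

```java
import java.util.Locale;

// Hypothetical helper (not part of PDFBox or Selenium): compares text
// while ignoring layout-driven whitespace and letter case.
final class PdfTextCheck {

    private PdfTextCheck() {
    }

    // Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the phrase occurs in the text once both are normalized and lower-cased.
    static boolean containsPhrase(String extractedText, String phrase) {
        String haystack = normalize(extractedText).toLowerCase(Locale.ROOT);
        String needle = normalize(phrase).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by PDFTextStripper.getText(...).
        String output = "Invoice  number:\n  12345\r\nTotal\tamount due";
        System.out.println(containsPhrase(output, "invoice number: 12345")); // true
        System.out.println(containsPhrase(output, "paid in full"));          // false
    }
}
```

Inside the TestNG test, the `System.out.println(output)` line could then be replaced by something like `Assert.assertTrue(PdfTextCheck.containsPhrase(output, expected))`, where `expected` is whatever phrase the PDF is supposed to contain.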

Simply load the page that serves the PDF file. The browser opens the PDF, then we read its URL, download it as a stream, parse it with PDFParser, and that’s it!
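
One thing to keep in mind with this approach: `url.openStream()` is a second request made by the JVM, not by the browser, so it does not carry the browser's cookies or login session. If the server answers with an HTML login or error page, the parser fails with a message that says little about the real cause. A cheap guard is to look at the first bytes, since a PDF file starts with the marker `%PDF-`. The following is a minimal sketch under that assumption: `PdfHeaderCheck`, `readAll`, and `looksLikePdf` are hypothetical names, and the byte arrays in `main` stand in for a downloaded response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: buffers a download and checks the PDF file marker.
final class PdfHeaderCheck {

    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfHeaderCheck() {
    }

    // Read the whole stream into memory so it can be inspected and then parsed.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return buffer.toByteArray();
    }

    // True if the data begins with "%PDF-". Strict on purpose: some viewers
    // tolerate leading junk before the marker, this check does not.
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < PDF_MARKER.length; i++) {
            if (data[i] != PDF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for url.openStream(): a PDF-like body and an HTML error page.
        byte[] pdfBody = "%PDF-1.4\n...".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlBody = "<html>Please log in</html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikePdf(readAll(new ByteArrayInputStream(pdfBody))));  // true
        System.out.println(looksLikePdf(readAll(new ByteArrayInputStream(htmlBody)))); // false
    }
}
```

In the test, the bytes from `readAll(url.openStream())` could be checked first and, if they pass, wrapped in a `ByteArrayInputStream` and handed to the parser. Buffering the whole file in memory is a trade-off that suits small test documents better than very large ones.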

6 thoughts to “How to read text from PDF file using Java and Selenium Webdriver”

Rahul January 28, 2016 at 12:39 pm

java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

But if I don't cast it, I get a syntax error.

Thread.sleep(5000);
URL url = new URL(d.getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());

PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();

String output = new PDFTextStripper().getText(parser.getPDDocument());
System.out.println(output);
parser.getPDDocument().close();
- kunal May 30, 2016 at 10:51 am
  
  Rahul, did you find a way around this? I am also facing the same type cast error.
  - Manoj June 2, 2016 at 6:47 pm
    
    Hi Rahul/Kunal
    
    Did you find a way around this? I am also facing the same type cast error.
    - Tihomir Turzai July 3, 2016 at 2:23 pm
      
      Hi,
      
      I made an updated post which uses the newer version of Apache PDFBox; you will find the solution to your problem there.
      
      https://blog.wedoqa.com/2016/06/extracting-data-from-pdf-files-in-selenium-webdriver-environment/
Sameera Bhatt September 4, 2017 at 4:58 am

Hi,

Thank you for sharing this awesome information with us!!!
You always write some very interesting topics that give boosts to the reader.