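Sometimes we need to verify the content of a PDF, but Selenium WebDriver doesn’t have any direct method for that.
To extract the PDF content we can use a library such as Apache PDFBox.
Simply download the .jar files and add them to your Eclipse classpath.
Here is a sample script that extracts the text from a sample PDF file. It is written against the PDFBox 1.x API (org.apache.pdfbox.util.PDFTextStripper and a PDFParser constructor that accepts an input stream); newer PDFBox versions changed these classes, as the comments below show.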
    import java.io.BufferedInputStream;
    import java.net.URL;

    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.testng.annotations.AfterClass;
    import org.testng.annotations.BeforeClass;
    import org.testng.annotations.Test;

    public class pdfReader {

        WebDriver driver;

        @BeforeClass
        public void setUp() {
            driver = new FirefoxDriver();
        }

        @AfterClass
        public void tearDown() {
            driver.quit();
        }

        @Test
        public void testPDFReader() throws Exception {
            driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");
            Thread.sleep(5000);
            URL url = new URL(driver.getCurrentUrl());
            BufferedInputStream fileToParse = new BufferedInputStream(
                    url.openStream());
            PDFParser parser = new PDFParser(fileToParse);
            parser.parse();
            String output = new PDFTextStripper().getText(parser.getPDDocument());
            System.out.println(output);
            parser.getPDDocument().close();
        }
    }
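Simply load the page that serves the PDF file. The browser opens the PDF, then we read the current URL, open a stream to it, parse the stream with PDFParser and extract the text with PDFTextStripper, and that’s it! Note that url.openStream() fetches the file again from Java, outside the browser session, so this works only when the URL is reachable without the browser’s cookies or login.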
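The script above only prints the extracted text. To actually verify it, a small helper can compare the text against an expected phrase while ignoring the line breaks and extra spaces that text extraction tends to introduce. The sketch below is one possible approach, not part of PDFBox or Selenium: the class name PdfTextChecks and its methods are hypothetical, and it uses only the Java standard library. It also includes a cheap check for the "%PDF-" signature, which can help detect the case where the URL returned an HTML page instead of a PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; the names are assumptions, not a library API.
final class PdfTextChecks {

    private PdfTextChecks() { }

    /** Returns true if the bytes begin with "%PDF-", the signature a well-formed PDF starts with. */
    static boolean startsWithPdfSignature(byte[] head) {
        byte[] signature = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head == null || head.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (head[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Collapses every run of whitespace (including line breaks) into one space and lower-cases the text. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true if the extracted text contains the expected phrase after normalization. */
    static boolean containsPhrase(String extractedText, String expectedPhrase) {
        return normalize(extractedText).contains(normalize(expectedPhrase));
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of PDFTextStripper.
        String extracted = "Invoice  No. 12345\nTotal amount:\n 99.00 EUR";
        System.out.println(containsPhrase(extracted, "total amount: 99.00 eur"));
        System.out.println(startsWithPdfSignature("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In the test method, the println call could then be replaced with something like Assert.assertTrue(PdfTextChecks.containsPhrase(output, "expected phrase")) from org.testng.Assert, so the test fails when the PDF does not contain the expected text.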
6 thoughts to “How to read text from PDF file using Java and Selenium Webdriver”
java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
But if I don't cast it, I get a syntax error.
Thread.sleep(5000);
URL url = new URL(d.getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();
String output = new PDFTextStripper().getText(parser.getPDDocument());
System.out.println(output);
parser.getPDDocument().close();
Rahul did u get a way around this? I am also facing the same type cast error
Hi Rahul/Kunal
Did u get a way around this? I am also facing the same type cast error
Hi,
I made an updated post which uses the newer version of Apache PDFBox; you will find the solution to your problem there.
https://blog.wedoqa.com/2016/06/extracting-data-from-pdf-files-in-selenium-webdriver-environment/
Hi,
Thank you for sharing this awesome information with us!!!
You always write some very interesting topics that give boosts to the reader.