Posts Tagged ‘ PDF ’

Extracting data from PDF files in a Selenium WebDriver environment

From time to time we face a situation where we have to verify the contents of a PDF file.
There is a nice tool in the Apache toolbox, called Apache PDFBox, which can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.

The current major version is 2, which came with lots of changes (for example, PDFTextStripper moved from the org.apache.pdfbox.util package to org.apache.pdfbox.text), so we have to extract the data in a slightly different way than before.

	import java.io.BufferedInputStream;
	import java.io.InputStream;
	import java.net.URL;
	import org.apache.pdfbox.pdmodel.PDDocument;
	import org.apache.pdfbox.text.PDFTextStripper;

	@Test
	public void testPDFReader() throws Exception {
		// open the page that serves the example PDF document
		driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

		URL url = new URL(driver.getCurrentUrl());

		// try-with-resources closes both the stream and the document, even on failure
		try (InputStream fileToParse = new BufferedInputStream(url.openStream());
				PDDocument document = PDDocument.load(fileToParse)) {
			// extract the text of all pages as a single string
			String output = new PDFTextStripper().getText(document);
			System.out.println(output);
		}
	}
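
Printing the text is only the first half of a verification. Extracted PDF text usually contains line breaks and irregular spacing, so it can help to normalize whitespace before looking for expected phrases. The following is a minimal sketch of that idea, assuming the string returned by PDFTextStripper is passed in; the class PdfTextChecks and its methods are hypothetical helpers, not part of PDFBox or Selenium.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking extracted PDF text; not part of PDFBox or Selenium.
final class PdfTextChecks {

	private PdfTextChecks() {
		// static helpers only
	}

	// Collapse every run of whitespace (spaces, tabs, line breaks) into a single space.
	static String normalize(String extracted) {
		return extracted.replaceAll("\\s+", " ").trim();
	}

	// Return the expected phrases that do NOT occur in the extracted text.
	// Both sides are normalized, so line wrapping in the PDF does not matter.
	static List<String> missingPhrases(String extracted, List<String> expected) {
		String haystack = normalize(extracted);
		List<String> missing = new ArrayList<>();
		for (String phrase : expected) {
			if (!haystack.contains(normalize(phrase))) {
				missing.add(phrase);
			}
		}
		return missing;
	}
}
```

In the test above, the println call could then be replaced by an assertion from your test framework that PdfTextChecks.missingPhrases(output, expectedPhrases) is empty, where expectedPhrases is a list you define for the document under test.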

You can download an example project from here
Further information can be found on the Apache PDFBox website.

How to read text from PDF file using Java and Selenium Webdriver

Sometimes we need to verify the content of a PDF, but Selenium WebDriver doesn’t have any direct methods to do that.
If we want to extract the PDF content, we can use, for example, Apache PDFBox.
Simply download the .jar files and add them to the build path (classpath) of your Eclipse project.
Here is a sample script that extracts text from a sample PDF file: