From time to time we face a situation where we have to verify the contents of a PDF file.
Apache PDFBox, a nice tool from the Apache toolbox, can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.

The current major version is 2, which came with lots of changes, so we have to extract the data in a slightly different way than before.

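	// Needs: java.net.URL, java.io.BufferedInputStream,
	// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
	@Test
	public void testPDFReader() throws Exception {
		// Open the page that serves the example PDF document
		driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

		// Read the PDF directly from the URL the browser ended up on
		URL url = new URL(driver.getCurrentUrl());

		// try-with-resources closes the document and the stream even if parsing fails
		try (BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
				PDDocument document = PDDocument.load(fileToParse)) {
			// Extract the text of the whole document as a single String
			String output = new PDFTextStripper().getText(document);
			System.out.println(output);
		}
	}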
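
The test above only prints the extracted text. To actually verify the contents, one option is to normalize the whitespace (extracted PDF text often breaks lines in unexpected places) and then look for the phrases we expect. The following is only a minimal sketch of that idea: the class PdfTextCheck, its methods and the sample invoice text are all made up for illustration and are not part of PDFBox or of the example project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: checks extracted PDF text for expected phrases.
public class PdfTextCheck {

	// Collapse every run of whitespace (spaces, tabs, line breaks) into a single space.
	public static String normalize(String text) {
		return text.replaceAll("\\s+", " ").trim();
	}

	// Return the expected phrases that do not occur in the extracted text.
	public static List<String> missingPhrases(String extractedText, List<String> expectedPhrases) {
		String haystack = normalize(extractedText);
		List<String> missing = new ArrayList<>();
		for (String phrase : expectedPhrases) {
			if (!haystack.contains(normalize(phrase))) {
				missing.add(phrase);
			}
		}
		return missing;
	}

	public static void main(String[] args) {
		// Made-up sample standing in for the String returned by getText(document).
		String output = "Invoice  number:\n 12345\nTotal amount:\t99.00 EUR";
		List<String> missing = missingPhrases(output,
				Arrays.asList("Invoice number: 12345", "Total amount: 99.00 EUR", "Paid"));
		// Only "Paid" is reported, because the other two phrases match after normalization.
		System.out.println("Missing phrases: " + missing);
	}
}
```

In a real test, the output variable from the PDFBox call would be passed in instead of the sample string, and the test would fail if the returned list is not empty.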

You can download an example project from here.
Further information can be found on the Apache PDFBox website.

Posted By Tihomir Turzai

    One Response to “Extracting data from pdf files in selenium webdriver environment”

  1. Raman Sivasankar says:

    Thanks so much for this working solution.
