Extracting data from PDF files in a Selenium WebDriver environment

From time to time we face a situation where we have to verify the contents of a PDF file.
There is a nice tool in the Apache toolbox, called Apache PDFBox, which can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.

The current major version is 2, which came with lots of changes, so we have to extract the data in a slightly different way than before. For example, PDFTextStripper now lives in the org.apache.pdfbox.text package instead of org.apache.pdfbox.util.

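	// Imports needed at the top of the test class (PDFBox 2.x package names):
	// import java.io.BufferedInputStream;
	// import java.io.InputStream;
	// import java.net.URL;
	// import org.apache.pdfbox.pdmodel.PDDocument;
	// import org.apache.pdfbox.text.PDFTextStripper;

	@Test
	public void testPDFReader() throws Exception {
		// page with example pdf document
		driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

		// download the PDF again from the URL the browser ended up on
		URL url = new URL(driver.getCurrentUrl());

		// try-with-resources closes both the stream and the document,
		// even if parsing or text extraction throws
		try (InputStream fileToParse = new BufferedInputStream(url.openStream());
				PDDocument document = PDDocument.load(fileToParse)) {
			// extract the text of all pages as one string
			String output = new PDFTextStripper().getText(document);
			System.out.println(output);
		}
	}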
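
The test above only prints the extracted text. To actually verify the contents, the output has to be compared with something we expect. The extracted text usually contains line breaks and extra spaces that follow the page layout, so an exact string comparison tends to be fragile. Below is a minimal sketch of one way to handle that, assuming a whitespace-insensitive "contains" check is good enough for the test. The class PdfTextCheck and its methods are hypothetical helpers written for this post; they are not part of PDFBox or Selenium, and the sample string is made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox or Selenium.
final class PdfTextCheck {
    // Matches any run of spaces, tabs and line breaks
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PdfTextCheck() {
    }

    /** Collapses every run of whitespace into a single space and trims the ends. */
    static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    /** Returns true if the phrase occurs in the text once both are whitespace-normalized. */
    static boolean containsPhrase(String extractedText, String expectedPhrase) {
        return normalize(extractedText).contains(normalize(expectedPhrase));
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by PDFTextStripper.getText(document)
        String output = "Order  confirmation\r\nTotal amount:\n  42.00 EUR\n";
        System.out.println(containsPhrase(output, "Total amount: 42.00 EUR"));
    }
}
```

In the test method, the println call could then be replaced by an assertion from your test framework wrapped around PdfTextCheck.containsPhrase(output, "some expected phrase").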

You can download an example project from here.
Further information can be found on the Apache PDFBox website.