From time to time we face a situation where we have to verify the contents of a PDF file.
Apache PDFBox, a handy tool from the Apache toolbox, can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.

The current major version is 2, which came with lots of changes, so we have to extract the data in a slightly different way than before.

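	// Imports needed: java.io.BufferedInputStream, java.net.URL,
	// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper

	public void testPDFReader() throws Exception {
		// the driver is assumed to be already on the page with the example pdf document
		URL url = new URL(driver.getCurrentUrl());
		BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());

		PDDocument document = null;
		try {
			// PDFBox 2.x loads the document directly from the stream
			document = PDDocument.load(fileToParse);
			String output = new PDFTextStripper().getText(document);
			// verify the extracted text here, e.g. with an assertion on output
		} finally {
			// always release the document and the underlying stream
			if( document != null ) {
				document.close();
			}
			fileToParse.close();
		}
	}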
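
Once the text is extracted, the verification itself is plain string handling. Text taken from a PDF often contains line breaks and repeated spaces where the page layout wrapped, so a direct `contains` check can fail even when the phrase is visibly there. The following is a minimal sketch of one way to make the check more tolerant. `PdfTextChecks`, `normalize` and `containsPhrase` are hypothetical names used for illustration; they are not part of PDFBox or Selenium.

```java
import java.util.Locale;

/** Hypothetical helper for checking extracted PDF text; not a PDFBox or Selenium class. */
final class PdfTextChecks {

    private PdfTextChecks() {
        // static helpers only
    }

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into a single space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Case-insensitive check that the phrase occurs in the extracted text, ignoring layout whitespace. */
    static boolean containsPhrase(String extractedText, String phrase) {
        String haystack = normalize(extractedText).toLowerCase(Locale.ROOT);
        String needle = normalize(phrase).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }
}
```

In a test, the string returned by `PDFTextStripper.getText` (the `output` variable) could be passed as the first argument, and the result fed to whatever assertion the test framework provides. Whether ignoring case and whitespace is acceptable depends on how strict the verification needs to be.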

You can download an example project from here
Further information can be found on the Apache PDFBox website.

Posted By Tihomir Turzai

    3 Responses to “Extracting data from pdf files in selenium webdriver environment”

  1. Raman Sivasankar says:

    Thanks so much for this working solution.

  2. Muthu says:

    I am getting ‘End-of-File, expected line’ error.

  3. jawed akhtar says:

    Sir, I want to read only the header or a particular line from a PDF. Is that possible or not?
