From time to time we face a situation where we have to verify the contents of a PDF file.
There is a nice tool in the Apache toolbox, called Apache PDFBox, that can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.

At the time of writing, the current major version is 2, which came with lots of changes, so we have to extract the data in a slightly different way than before.

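	// Imports used by this test method (PDFBox 2.x package names);
	// the @Test import depends on your test framework, and "driver" is the WebDriver field of the test class
	import java.io.BufferedInputStream;
	import java.net.URL;
	import org.apache.pdfbox.pdmodel.PDDocument;
	import org.apache.pdfbox.text.PDFTextStripper;

	@Test
	public void testPDFReader() throws Exception {
		// Open the page that serves the example PDF document
		driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");

		// Request the same URL again outside the browser to get the raw PDF bytes
		URL url = new URL(driver.getCurrentUrl());

		// try-with-resources closes the document and the stream even if parsing fails
		try (BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
				PDDocument document = PDDocument.load(fileToParse)) {
			// Extract the text of all pages as a single string
			String output = new PDFTextStripper().getText(document);
			System.out.println(output);
		}
	}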

You can download an example project from here.
Further information can be found on the Apache PDFBox website.

14 thoughts to “Extracting data from pdf files in selenium webdriver environment”

  • Raman Sivasankar

    Thanks so much for this working solution.

  • Muthu

    I am getting ‘End-of-File, expected line’ error.

    • SeleniumTest

      Did you find a solution? I don’t see the error when the PDF has a single page; the issue appears when there is more than one page with a page break.

      • Dane Dukić

        I am successfully reading PDFs with more than one page. Can you please provide an example of a PDF where this error occurs?

    • Dane Dukić

      I am getting the same error when accessing a site where the PDF is only hosted.
      Example: https://www.docdroid.net/file/view/RhMSbpc/pdf.pdf

      The link to the PDF document is hidden in one of the responses, so we need to search for it manually.
      We found this link: https://www.docdroid.net/file/view/RhMSbpc/pdf.pdf?e=1506417888&s=0158adc91289638078f003d29e36bf98

      It’s not hard to find the links manually for a small number of PDFs, and as a temporary solution this works great. The second solution is to download each PDF and then read it from the file system.
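
The second approach (download first, then read from disk) can be sketched in plain Java as follows. This is only a minimal sketch: the class name PdfDownloader and both method names are made up for illustration. The header check rests on an assumption, namely that a parse error like the one above may be caused by the server returning an HTML viewer page instead of the PDF itself. A real PDF file starts with the bytes %PDF-, so checking them before handing the file to PDFBox gives a clearer failure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox or Selenium
class PdfDownloader {

    // Copies whatever the URL serves into a temporary .pdf file and returns its path
    static Path downloadToTempFile(URL source) throws IOException {
        Path target = Files.createTempFile("downloaded-", ".pdf");
        try (InputStream in = source.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Cheap sanity check: returns true only if the file starts with the bytes "%PDF-"
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or end of file
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && Arrays.equals(head, magic);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PdfDownloader <direct-link-to-pdf>");
            return;
        }
        Path pdf = downloadToTempFile(new URL(args[0]));
        System.out.println(pdf + " looks like a PDF: " + looksLikePdf(pdf));
    }
}
```

If the check passes, the returned path can be opened in PDFBox 2.x with PDDocument.load(path.toFile()) and processed exactly like the stream in the test method of the post.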

  • jawed akhtar

    Sir, I want to read only the header or a particular line from a PDF. Is that possible or not?

  • Dane Dukić

    If your PDF is tagged, you have to look for the header and footer artifacts. Otherwise, you’ll need to write some code that guesses which parts of the content stream are headers and footers based on hierarchy.
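
For the simpler case of picking one particular line, a rough alternative is to work on the plain text that PDFTextStripper has already returned (the output string in the test method of the post). The sketch below is only that, a sketch: PdfTextLines and its methods are hypothetical helpers, not PDFBox API, and treating the first non-blank line as the header is a guess that depends entirely on the layout of the document. It does not inspect tags or artifacts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper that works on already extracted text, not on the PDF itself
class PdfTextLines {

    // Splits the extracted text into trimmed lines and drops the empty ones
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...)
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Naive "header": the first non-blank line, if there is one (an assumption about the layout)
    static Optional<String> firstLine(String text) {
        List<String> lines = nonBlankLines(text);
        return lines.isEmpty() ? Optional.empty() : Optional.of(lines.get(0));
    }

    // Returns the first line that contains the given keyword, if any
    static Optional<String> firstLineContaining(String text, String keyword) {
        for (String line : nonBlankLines(text)) {
            if (line.contains(keyword)) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(document)
        String output = "  ACME Invoice  \r\n\r\nInvoice number: 42\nTotal: 10 EUR\n";
        System.out.println(firstLine(output).orElse("(no text)"));
        System.out.println(firstLineContaining(output, "Total").orElse("(not found)"));
    }
}
```

In a test, the returned line can then be compared against the expected value with the assertion method of whichever test framework is in use.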
