From time to time we face a situation where we have to verify the contents of a PDF file.
The Apache toolbox has a nice tool called Apache PDFBox that can do the hard work for us.
Besides data extraction, it allows the creation of new PDF documents and the manipulation of existing ones.
The current major version is 2, which came with lots of changes, so we have to extract the data in a slightly different way than before.
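    // Imports at the top of the test class
    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.junit.Test;

    @Test
    public void testPDFReader() throws Exception {
        // Page with the example PDF document
        driver.get("http://www.vandevenbv.nl/dynamics/modules/SFIL0200/view.php?fil_Id=5515");
        URL url = new URL(driver.getCurrentUrl());

        // try-with-resources closes the document and the stream, even if parsing fails
        try (InputStream fileToParse = new BufferedInputStream(url.openStream());
             PDDocument document = PDDocument.load(fileToParse)) {
            // Extract the text of the whole document
            String output = new PDFTextStripper().getText(document);
            System.out.println(output);
        }
    }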
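The test prints the extracted text but does not check it. To actually verify the contents, one option is to normalize the whitespace and look for an expected phrase. The following is a minimal sketch that uses only the Java standard library. The class name `PdfTextCheck` and its methods are hypothetical helpers, not part of PDFBox; the `extracted` argument is assumed to be the string returned by `PDFTextStripper.getText`.

```java
import java.util.Locale;

// Hypothetical helper (not part of PDFBox): compares extracted PDF text
// against an expected phrase while ignoring line breaks and extra spaces.
final class PdfTextCheck {

    private PdfTextCheck() {
    }

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns true if the expected phrase occurs in the extracted text,
    // ignoring case and whitespace differences.
    static boolean containsPhrase(String extracted, String expected) {
        String haystack = normalize(extracted).toLowerCase(Locale.ROOT);
        String needle = normalize(expected).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }
}
```

In a JUnit test this could be used as `assertTrue(PdfTextCheck.containsPhrase(output, "expected phrase"))`. Normalizing first is a design choice: text extraction tends to break lines where the PDF layout does, so an exact string comparison is often too strict.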
You can download an example project from here
Further information can be found on the Apache PDFBox website.
14 thoughts to “Extracting data from pdf files in selenium webdriver environment”
Thanks so much for this working solution.
I am getting an ‘End-of-File, expected line’ error.
Did you find a solution? I do not see the issue when the PDF has a single page; it occurs when the PDF has more than one page with a page break.
I can successfully read PDFs with more than one page. Can you please provide an example of a PDF where this error occurs?
I get the same error when accessing a site where the PDF is only hosted.
Example: “https://www.docdroid.net/file/view/RhMSbpc/pdf.pdf”
The link to the PDF document is hidden in one of the responses, so we need to manually search for it.
We found this link: “https://www.docdroid.net/file/view/RhMSbpc/pdf.pdf?e=1506417888&s=0158adc91289638078f003d29e36bf98”
It’s not hard to manually find links for a small number of PDFs, and as a temporary solution this works great. The second solution is to download each PDF and then read it from the file system.
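The second solution could look roughly like the sketch below, which uses only the Java standard library. It downloads the URL to a temporary file and then checks that the file starts with the bytes `%PDF-`, as a well-formed PDF does. This check is an assumption about the cause of the error: if the URL returns a viewer page instead of the document itself, the parser is given HTML, and the check fails before PDFBox is involved. The class name `PdfDownload` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper: downloads a URL to disk and checks the PDF signature.
final class PdfDownload {

    // A well-formed PDF file starts with these bytes.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfDownload() {
    }

    // Copies the content behind the URL into a new temporary file and returns its path.
    static Path downloadToTempFile(URL url) throws IOException {
        Path target = Files.createTempFile("download-", ".pdf");
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Returns true if the file begins with the "%PDF-" signature.
    static boolean isPdfFile(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // end of file reached before the signature was complete
                }
                filled += n;
            }
        }
        return filled == head.length && Arrays.equals(head, PDF_MAGIC);
    }
}
```

If `isPdfFile` returns true, the downloaded file can be passed to `PDDocument.load(file.toFile())` in PDFBox 2 and processed as in the test from the post. If it returns false, the URL most likely points at a page around the PDF rather than at the PDF.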
Sir, I want to read only the header or a particular line from a PDF. Is that possible or not?
If your PDF is tagged, you have to look for the header and footer artifacts. Otherwise, you’ll need to write some code that guesses which parts of the content stream are headers and footers based on hierarchy.
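For the untagged case, one possible guess is sketched below. It is a simple heuristic chosen for illustration, not something PDFBox provides and not the hierarchy-based approach itself: extract the text page by page, then treat a first line that repeats on more than half of the pages as the header. The class name `HeaderGuess` is hypothetical, and the per-page strings are assumed to come from `PDFTextStripper` with `setStartPage` and `setEndPage` set to the same page number.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical heuristic: a line that opens most pages is probably a header.
final class HeaderGuess {

    private HeaderGuess() {
    }

    // Returns the first non-blank line of a page, trimmed, if there is one.
    static Optional<String> firstLine(String pageText) {
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return Optional.of(trimmed);
            }
        }
        return Optional.empty();
    }

    // Returns the first line that opens more than half of the pages.
    // With fewer than two pages there is nothing to compare, so no guess is made.
    static Optional<String> guessHeader(List<String> pageTexts) {
        if (pageTexts.size() < 2) {
            return Optional.empty();
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String page : pageTexts) {
            firstLine(page).ifPresent(line -> counts.merge(line, 1, Integer::sum));
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() * 2 > pageTexts.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

This will miss headers that change from page to page, for example when they contain a page number, so treat the result as a starting point rather than a reliable answer.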