PDF is one of the most widely used document formats. It supports text, graphics, tables, annotations, and a range of other elements. In some cases, you may need to extract data from the tables in PDF documents such as invoices. This article shows how to extract data from tables in PDF documents programmatically using Java.
- Java API to Extract Data from Tables in PDF
- Extract Data from Tables in PDF using Java
- Extract Table from a Specific Area of PDF Page
Java API to Extract Data from Tables in PDF
Aspose.PDF for Java is a feature-rich API for creating and manipulating PDF documents. We will use it to extract data from tables in PDF documents. You can either download the API’s JAR or add it to a Maven-based Java application using the following configuration. In pom.xml, the repository entry belongs inside the repositories element and the dependency entry inside the dependencies element.
<repository>
<id>AsposeJavaAPI</id>
<name>Aspose Java API</name>
<url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-pdf</artifactId>
<version>21.9</version>
</dependency>
Extract Data from Tables in PDF using Java
The following are the steps to extract data from tables in a PDF using Java.
- First, load the PDF document using the Document class.
- Loop through the pages of the PDF using the Document.getPages() collection.
- In each iteration, initialize a TableAbsorber object and visit the current page using the TableAbsorber.visit(Page) method.
- In a nested loop, iterate through the tables in the TableAbsorber.getTableList() collection.
- For each AbsorbedTable in the collection, iterate through the rows returned by AbsorbedTable.getRowList().
- For each AbsorbedRow in the collection, iterate through the cells returned by AbsorbedRow.getCellList().
- Finally, loop through the TextFragment collection of each AbsorbedCell and print the text.
The following code sample shows how to extract text from tables in PDF using Java.
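The nested traversal in the steps above can be sketched without the library on the classpath. This is a minimal illustration, not the Aspose.PDF API itself: plain Java lists are hypothetical stand-ins for the collections returned by TableAbsorber.getTableList(), AbsorbedTable.getRowList() and AbsorbedRow.getCellList(), and plain strings stand in for the text fragments of a cell. The class name, the method names and the sample data are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free sketch of the nested traversal. Plain lists are hypothetical
// stand-ins: a table is a list of rows, a row is a list of cells, and a cell
// is a list of text fragments (here just strings).
class TableTextSketch {

    // Concatenates the text fragments of one cell.
    static String cellText(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            sb.append(fragment);
        }
        return sb.toString();
    }

    // Renders one row as cell texts separated by '|'.
    static String rowText(List<List<String>> cells) {
        List<String> parts = new ArrayList<>();
        for (List<String> cell : cells) {
            parts.add(cellText(cell));
        }
        return String.join("|", parts);
    }

    // Returns one output line per row of a table.
    static List<String> tableLines(List<List<List<String>>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<List<String>> row : rows) {
            lines.add(rowText(row));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one absorbed table.
        List<List<String>> header = Arrays.asList(Arrays.asList("Item"), Arrays.asList("Qty"));
        List<List<String>> first = Arrays.asList(Arrays.asList("Pen"), Arrays.asList("1", "0"));
        List<List<List<String>>> table = Arrays.asList(header, first);

        for (String line : tableLines(table)) {
            System.out.println(line);
        }
    }
}
```

Running the sketch prints `Item|Qty` followed by `Pen|10`. In a real program, the outer loops would run over the pages and the absorbed tables, and each inner list would be read from the corresponding Aspose.PDF collection instead of being built by hand.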
Extract Table from a Specific Area of PDF Page
The following are the steps to extract a table from a specific part of a page in a PDF using Java.
- First, load the PDF document using the Document class.
- Then, select the desired Page from the Document.getPages() collection.
- Get the Square annotation of the page, which marks the desired region.
- Initialize a TableAbsorber object and visit the selected page using the TableAbsorber.visit(Page) method.
- Iterate through the tables in the TableAbsorber.getTableList() collection.
- For each AbsorbedTable, check whether it lies inside the desired page region.
- For each AbsorbedTable inside the region, iterate through the rows returned by AbsorbedTable.getRowList().
- For each AbsorbedRow in the collection, iterate through the cells returned by AbsorbedRow.getCellList().
- Finally, loop through the TextFragment collection of each AbsorbedCell and print the text.
The following code sample shows how to extract tables from a specific region of the PDF page.
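The region check in the steps above comes down to rectangle containment. The sketch below illustrates that test on its own, under stated assumptions: Region is a hypothetical stand-in for a PDF rectangle (it is not an Aspose.PDF class), the coordinates are made up, and in a real program the two rectangles would come from the Square annotation and from each absorbed table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a PDF rectangle given by its lower-left (llx, lly)
// and upper-right (urx, ury) corners; in default PDF page coordinates the
// origin is the lower-left corner and values grow to the right and upward.
final class Region {
    final double llx, lly, urx, ury;

    Region(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // True when 'inner' lies strictly inside this region on all four sides.
    boolean contains(Region inner) {
        return llx < inner.llx && lly < inner.lly
            && urx > inner.urx && ury > inner.ury;
    }
}

class RegionFilterSketch {

    // Keeps only the table rectangles that fall inside the marked region.
    static List<Region> tablesInside(Region marked, List<Region> tableRects) {
        List<Region> kept = new ArrayList<>();
        for (Region rect : tableRects) {
            if (marked.contains(rect)) {
                kept.add(rect);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up coordinates: one marked region and two candidate tables.
        Region marked = new Region(50, 400, 550, 700);
        Region insideTable = new Region(60, 420, 540, 680);
        Region outsideTable = new Region(60, 100, 540, 300);

        System.out.println(marked.contains(insideTable));  // true
        System.out.println(marked.contains(outsideTable)); // false
    }
}
```

The comparison is strict, so a table whose border coincides exactly with the edge of the marked region is rejected. If such tables should be accepted, the comparisons can be relaxed to `<=` and `>=`.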
Get a Free License
You can use Aspose.PDF for Java without evaluation limitations using a temporary license.
Conclusion
In this article, you have learned how to extract data from tables in a PDF using Java, and how to extract tables from a specific region of a PDF page. You can learn more about Aspose.PDF for Java in the documentation, and you can ask questions via our forum.