PDF is one of the most widely used document formats across many fields. It is often used for documents such as invoices, where the data is laid out in tabular form. In such cases, you may need to parse the PDF and read the table data programmatically. This article covers how to extract data from PDF tables using C#.
- C# API to Extract Tables from PDF
- Extract Data from PDF Tables in C#
- Extract Table from a Specific Area of Page
C# API to Extract Tables from PDF
To extract data from the tables in PDF files, we will use Aspose.PDF for .NET, an API that provides a wide range of PDF manipulation features. You can either download the API or install it from NuGet using the Package Manager Console.
PM> Install-Package Aspose.PDF
Extract Data from PDF Tables in C#
The following are the steps to extract data from tables in a PDF using C#.
- Load the PDF document using the Document class.
- Loop through the pages of the PDF using the Document.Pages collection.
- In each iteration, initialize a TableAbsorber object and visit the current page using the TableAbsorber.Visit(Page) method.
- In a nested loop, iterate through the tables in the TableAbsorber.TableList collection.
- For each AbsorbedTable in the collection, iterate through the collection of rows in AbsorbedTable.RowList.
- For each AbsorbedRow in the collection, iterate through the collection of cells in AbsorbedRow.CellList.
- Finally, loop through the TextFragments collection of each AbsorbedCell and print the text.
The following code sample shows how to extract text from a PDF table in C#.
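A minimal sketch of the steps above, assuming the Aspose.PDF package is installed and the input is stored at a hypothetical path, `invoice.pdf`. The class name `ExtractTables` is also illustrative. Each fragment's text is assembled from its segments before being printed.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTables
{
    static void Main()
    {
        // Assumption: "invoice.pdf" is a placeholder path; replace it with your own file.
        Document pdfDocument = new Document("invoice.pdf");

        foreach (Page page in pdfDocument.Pages)
        {
            // Detect the tables on the current page.
            TableAbsorber absorber = new TableAbsorber();
            absorber.Visit(page);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        // A cell may hold several text fragments.
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            // Join the fragment's segments into one string.
                            StringBuilder text = new StringBuilder();
                            foreach (TextSegment segment in fragment.Segments)
                            {
                                text.Append(segment.Text);
                            }
                            Console.WriteLine(text.ToString());
                        }
                    }
                }
            }
        }
    }
}
```

The sketch prints one line per text fragment. If you need one line per row, you could collect the cell texts and join them with a separator before printing.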
Extract Table from a Specific Area of Page
The following are the steps to extract a table from a specific part of the page in a PDF using C#.
- Load the PDF document using the Document class.
- Select the desired Page from the Document.Pages collection.
- Get the Square annotation on the page, which marks the region of interest.
- Initialize a TableAbsorber object and visit the page using the TableAbsorber.Visit(Page) method.
- Iterate through the tables in the TableAbsorber.TableList collection.
- If a table lies inside the annotation's rectangle, perform the following steps:
  - Iterate through the collection of rows in AbsorbedTable.RowList.
  - For each AbsorbedRow in the collection, iterate through the collection of cells in AbsorbedRow.CellList.
  - Finally, loop through the TextFragments collection of each AbsorbedCell and print the text.
The following code sample shows how to extract a table from a specific region of a PDF page.
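A minimal sketch of the region-based steps, under a few assumptions: the input path `invoice.pdf` and the class name `ExtractTableFromRegion` are hypothetical, the region is marked by the first Square annotation on page 1, and a table counts as "in the region" only when its rectangle lies entirely inside the annotation's rectangle.

```csharp
using System;
using System.Linq;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractTableFromRegion
{
    static void Main()
    {
        // Assumption: "invoice.pdf" is a placeholder path; replace it with your own file.
        Document pdfDocument = new Document("invoice.pdf");

        // Pages are 1-indexed in Aspose.PDF.
        Page page = pdfDocument.Pages[1];

        // Assumption: the first Square annotation on the page marks the region.
        SquareAnnotation square = page.Annotations
            .FirstOrDefault(ann => ann.AnnotationType == AnnotationType.Square)
            as SquareAnnotation;

        if (square == null)
        {
            Console.WriteLine("No Square annotation found on the page.");
            return;
        }

        TableAbsorber absorber = new TableAbsorber();
        absorber.Visit(page);

        foreach (AbsorbedTable table in absorber.TableList)
        {
            // Keep the table only if it lies fully inside the annotation.
            bool isInRegion =
                square.Rect.LLX < table.Rectangle.LLX &&
                square.Rect.LLY < table.Rectangle.LLY &&
                square.Rect.URX > table.Rectangle.URX &&
                square.Rect.URY > table.Rectangle.URY;

            if (!isInRegion)
            {
                continue;
            }

            foreach (AbsorbedRow row in table.RowList)
            {
                foreach (AbsorbedCell cell in row.CellList)
                {
                    foreach (TextFragment fragment in cell.TextFragments)
                    {
                        // Join the fragment's segments into one string.
                        StringBuilder text = new StringBuilder();
                        foreach (TextSegment segment in fragment.Segments)
                        {
                            text.Append(segment.Text);
                        }
                        Console.WriteLine(text.ToString());
                    }
                }
            }
        }
    }
}
```

The strict containment test skips tables that only partly overlap the annotation. If partial overlap should count, the comparison would need to be relaxed to an intersection check.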
Get a Free License
You can use Aspose.PDF for .NET without evaluation limitations using a temporary license.
Conclusion
In this article, you have learned how to extract data from tables in a PDF using C#. You have also seen how to extract a table from a specific region of a PDF page. You can explore more about the C# PDF API using the documentation, and you can post your queries on our forum.