The PDF file format is popular because its contents cannot be modified as easily as those of word processing formats. However, you can still extract text from PDF documents with the Aspose.PDF for .NET API. In this article, we explore a few of the many ways to search and extract text from PDF files using C#. We will cover the following features:
- Extract Text from a Whole PDF Document using C#
- Search and Extract Text from a Specific Page in PDF Document using C#
- Search and Extract Text from Specific Region of PDF Page using C#
- Search or Extract Specific Text from PDF file with Regular Expressions using C#
- Extract Text from Table Cells in PDF Document using C#
- Search & Extract Highlighted Text from PDF Document using C#
- Extract PDF Text with Low Memory Consumption using C#
Extract Text from a Whole PDF Document using C#
Reading text contents from a PDF document is a frequently used feature. You can extract all text from all pages of the document by following the steps below:
- Load PDF document
- Create TextAbsorber object
- Accept the absorber for all pages
- Write extracted text to a file and save it
The code snippet below follows these steps and shows how to extract text from a whole PDF document using C#:
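As a minimal sketch of these steps, assuming the Aspose.PDF NuGet package is referenced. The file names `input.pdf` and `extracted-text.txt` are hypothetical placeholders:

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractAllText
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // TextAbsorber accumulates the text of every page it visits.
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for all pages of the document.
            pdfDocument.Pages.Accept(textAbsorber);

            // Write the extracted text to a file (hypothetical output path).
            File.WriteAllText("extracted-text.txt", textAbsorber.Text);
        }
    }
}
```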
Search and Extract Text from a Specific Page in PDF Document using C#
Having covered text extraction at the document level, let us extract text from a specific page of a PDF document. Follow the steps below:
- Load PDF document
- Create TextAbsorber object
- Accept the absorber for a specific page
- Write extracted text to a file and save it
The following code snippet applies these steps to read text from a single page of a PDF file using C#:
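A minimal sketch, again with hypothetical file names. Note that the `Pages` collection in Aspose.PDF is 1-based, so `Pages[1]` is the first page:

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractPageText
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for one page only.
            // Page numbering starts at 1, not 0.
            pdfDocument.Pages[1].Accept(textAbsorber);

            // Write the text of that page to a file (hypothetical output path).
            File.WriteAllText("page-1-text.txt", textAbsorber.Text);
        }
    }
}
```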
Search and Extract Text from Specific Region of PDF Page using C#
Let us take text extraction one step further and read text from a specific area of a page. Two facts matter here: the basic unit of measurement in Aspose.PDF for .NET is the point, where 72 points equal 1 inch, and the bottom-left corner of the page is the origin (0,0). The steps to read text from a specific area of a page in a PDF document are:
- Load source PDF document
- Create TextAbsorber object
- Set TextSearchOptions and specify rectangular coordinates
- Accept the absorber for specific page
- Write extracted text to a file
The code snippet below demonstrates these steps to extract text from a specific region using C#:
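A minimal sketch of region-limited extraction. The file names and the chosen region are assumptions for illustration: the rectangle below spans from 1 inch to 7.5 inches horizontally and from 8 inches to 10 inches vertically, measured from the bottom-left corner and converted to points:

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractRegionText
{
    // 72 points equal 1 inch.
    const double PointsPerInch = 72;

    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber absorber = new TextAbsorber();

            // Only consider text that lies inside the given rectangle.
            absorber.TextSearchOptions.LimitToPageBounds = true;

            // Rectangle(lower-left x, lower-left y, upper-right x, upper-right y),
            // in points, with the origin at the bottom-left corner of the page.
            absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(
                1 * PointsPerInch,    // left   = 72
                8 * PointsPerInch,    // bottom = 576
                7.5 * PointsPerInch,  // right  = 540
                10 * PointsPerInch);  // top    = 720

            // Accept the absorber for the first page.
            pdfDocument.Pages[1].Accept(absorber);

            // Write the text found in the region (hypothetical output path).
            File.WriteAllText("region-text.txt", absorber.Text);
        }
    }
}
```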
Search or Extract Specific Text from PDF file with Regular Expressions using C#
You can extract text from a PDF document that matches a pattern, for instance specific words or numbers. To do so, you design a regular expression and the API uses it to find matching text in the PDF document. The steps to search and extract specific text from PDF files are:
- Load input PDF document
- Create TextFragmentAbsorber object
- Design and set a regular expression to match some text
- Loop through each extracted text fragment
The following C# code snippet uses a regex that matches text containing four consecutive digits, for instance 1999 or 2000, across all pages of the document.
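A minimal sketch of such a search. It uses `TextFragmentAbsorber`, the absorber class that accepts a search pattern and exposes each match as a `TextFragment`. The input path and the pattern `\d{4}` are assumptions for illustration:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractWithRegex
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Match any run of four digits, such as 1999 or 2000.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"\d{4}");

            // Passing true tells the absorber to treat the phrase as a regular expression.
            absorber.TextSearchOptions = new TextSearchOptions(true);

            // Search all pages of the document.
            pdfDocument.Pages.Accept(absorber);

            // Loop through each matched text fragment.
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}
```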
Extract Text from Table Cells in PDF Document using C#
Extracting text from tables on a PDF page works a little differently. The previous examples used the TextAbsorber class, whereas tables are read with the TableAbsorber class. Follow the steps below to extract text from table objects:
- Load source PDF file
- Create an object of TableAbsorber class
- Iterate through AbsorbedTable
- Loop through each cell in each row of Table
The code snippet below follows these steps and extracts the text from table cells in a PDF document using C#:
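A minimal sketch of these steps, assuming a hypothetical `input.pdf` that contains at least one table. It prints each cell's text followed by a `|` separator, one table row per output line:

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTableText
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // TableAbsorber detects tables on the page it visits.
                TableAbsorber absorber = new TableAbsorber();
                absorber.Visit(page);

                foreach (AbsorbedTable table in absorber.TableList)
                {
                    foreach (AbsorbedRow row in table.RowList)
                    {
                        StringBuilder line = new StringBuilder();
                        foreach (AbsorbedCell cell in row.CellList)
                        {
                            // A cell can hold several fragments, each with several segments.
                            foreach (TextFragment fragment in cell.TextFragments)
                            {
                                foreach (TextSegment segment in fragment.Segments)
                                {
                                    line.Append(segment.Text);
                                }
                            }
                            line.Append('|');
                        }
                        Console.WriteLine(line);
                    }
                }
            }
        }
    }
}
```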
Search & Extract Highlighted Text from PDF Document using C#
Highlighted text is stored as an annotation in PDF files. Such annotations reference marked text, which distinguishes them from the conventional text in a document. The steps below describe how to read highlighted text using C#:
- Load source document
- Loop through all the annotations
- Filter TextMarkupAnnotation
- Retrieve highlighted text fragments
The following code snippet is based on the steps above and can be used to get highlighted text from PDF files:
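A minimal sketch, assuming a hypothetical `input.pdf` with text markup annotations. It walks every page rather than a single one, which is a choice made here for illustration:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractHighlightedText
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                // Loop through all annotations on the page.
                foreach (Annotation annotation in page.Annotations)
                {
                    // Keep only text markup annotations (highlights and similar markup).
                    if (annotation is TextMarkupAnnotation markup)
                    {
                        // Retrieve the text fragments covered by the markup.
                        TextFragmentCollection fragments = markup.GetMarkedTextFragments();
                        foreach (TextFragment fragment in fragments)
                        {
                            Console.WriteLine(fragment.Text);
                        }
                    }
                }
            }
        }
    }
}
```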
Extract PDF Text with Low Memory Consumption using C#
The following are two approaches to reducing memory consumption while extracting text from PDF documents using C#.
i) Using Reset() and FreeMemory() Methods
Text extraction can consume a lot of memory and processor time, particularly when the input file is large and contains a lot of text, because the TextFragmentAbsorber object keeps all found text fragments in memory.
The solution we recommend is to call the absorber.Reset() method after processing each page. Moreover, if you only perform read operations, you can also free the memory held by page objects with the page.FreeMemory() method. Follow the steps below to keep resource usage minimal:
- Load source PDF document
- Specify TextEditOptions
- Read TextFragment
- Call Reset() method
- Call FreeMemory() method
The following code snippet demonstrates this page-by-page text extraction from PDF documents using C#:
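A minimal sketch of the pattern, assuming a hypothetical large file named `large-input.pdf`. It only counts fragments; in practice you would read whatever you need from the fragments before calling `Reset()`. The commented-out `GC.Collect()` call is one possible way to trigger garbage collection explicitly:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class LowMemoryExtraction
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("large-input.pdf"))
        {
            // Specify TextEditOptions when creating the absorber.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(
                new TextEditOptions(TextEditOptions.FontReplace.RemoveUnusedFonts));

            int fragmentCount = 0;
            foreach (Page page in pdfDocument.Pages)
            {
                page.Accept(absorber);

                // Read what you need from the fragments of this page.
                fragmentCount += absorber.TextFragments.Count;

                // Drop the fragments collected so far so they do not accumulate.
                absorber.Reset();

                // Release memory held by the page object (read-only scenarios).
                page.FreeMemory();

                // Optional: force a collection to lower peak memory at the cost of time.
                // GC.Collect();
            }

            Console.WriteLine("Text fragments found: " + fragmentCount);
        }
    }
}
```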
We tested this code snippet with a large sample file containing 450 pages, 24500 text fragments, and many raster and vector images. The process consumed about 500MB of memory. As a further tip, you can invoke the .NET garbage collector explicitly to reduce peak memory consumption to around 250MB, at an additional cost of about 10 seconds of processing time.
ii) Using MemorySaving enumeration
The Aspose.PDF for .NET API supports a memory saving mode for reading text from a PDF file, selected through the MemorySaving value of the TextExtractionOptions.TextFormattingMode enumeration. The steps for this approach are:
- Load source PDF file
- Create TextDevice object
- Set TextExtractionOptions
- Extract text and write text to a file
The code snippet below demonstrates the memory saving approach using C#:
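A minimal sketch of this approach, with hypothetical file names. It processes one page at a time through a `TextDevice` and decodes the device output as UTF-16 (`Encoding.Unicode`), which is assumed here to be the device's default encoding:

```csharp
using System.IO;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

class MemorySavingExtraction
{
    static void Main()
    {
        // Hypothetical input path; replace with your own file.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            StringBuilder builder = new StringBuilder();

            foreach (Page page in pdfDocument.Pages)
            {
                using (MemoryStream textStream = new MemoryStream())
                {
                    // Create the text device and select the memory saving mode.
                    TextDevice textDevice = new TextDevice();
                    textDevice.ExtractionOptions = new TextExtractionOptions(
                        TextExtractionOptions.TextFormattingMode.MemorySaving);

                    // Extract the text of this page into the stream.
                    textDevice.Process(page, textStream);

                    // Decode the bytes written by the device (assumed UTF-16).
                    builder.Append(Encoding.Unicode.GetString(textStream.ToArray()));
                }
            }

            // Write the collected text to a file (hypothetical output path).
            File.WriteAllText("extracted-text.txt", builder.ToString());
        }
    }
}
```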
Conclusion
This article has explored only a few of the many use cases for extracting text from PDF documents using C# with the Aspose.PDF for .NET API. You can learn more via the API documentation. If you have any questions, please reach out to us at the Free Support Forums.