Text extraction from documents (PDFs, word processing files, web pages, and so on) has many uses in the world of digital information: parsing documents, performing text analysis, retrieving information, storing documents’ content in databases, and more. PDF in particular is one of the most widely used formats for keeping and sharing digital information, which makes PDF documents a huge source of information. Parsing or extracting text from PDF documents is therefore a common step in text analysis scenarios.
To help you automate PDF parsing in C++ applications, this article demonstrates how to extract text from PDF documents using C++. It covers the following text extraction scenarios:
- Extract text from a PDF document using C++.
- Extract text from particular pages in a PDF document using C++.
- Extract text page by page from a PDF document using C++.
C++ PDF Reader and Text Extractor Library
To extract text from PDF documents, we’ll use Aspose.PDF for C++, a PDF library for creating, converting, and parsing PDF documents. You can download the library files as well as the running code samples from the Downloads section.
Extract Text from PDF using C++
Aspose.PDF for C++ lets you parse PDF documents in a few simple steps. The steps for extracting text from a PDF document are as follows.
- Create an object of the PdfExtractor class.
- Load the PDF document using the PdfExtractor->BindPdf() function.
- Extract the text from the PDF document using the PdfExtractor->ExtractText() function.
- Save the extracted text into a MemoryStream object.
- Read the text as a string from the MemoryStream.
The following code sample shows how to extract text from a PDF using C++.
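To make the order of calls concrete without depending on the library, here is a minimal sketch that uses a hypothetical stand-in class, `StubPdfExtractor`. It is not the Aspose.PDF API: the names `BindPdf` and `ExtractText` are borrowed from the steps above, while `GetText`, every signature, and the use of `std::ostringstream` in place of a MemoryStream are assumptions made for illustration. A "document" here is simply a list of page strings.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDF extractor; NOT the Aspose.PDF API.
// It keeps page texts in memory so the call sequence can run on its own.
class StubPdfExtractor {
public:
    // "Bind" a document. Here a document is just a list of page texts.
    void BindPdf(std::vector<std::string> pages) {
        pages_ = std::move(pages);
        extracted_ = false;
    }

    // A real extractor would parse the PDF here; the stub only sets a flag.
    void ExtractText() { extracted_ = true; }

    // Write the extracted text to a stream (assumed name and signature).
    void GetText(std::ostream& out) const {
        assert(extracted_ && "call ExtractText() before GetText()");
        for (const std::string& page : pages_) {
            out << page << '\n';
        }
    }

private:
    std::vector<std::string> pages_;
    bool extracted_ = false;
};

// Runs the five steps in order and returns the whole text as one string.
std::string ExtractAllText(const std::vector<std::string>& pages) {
    StubPdfExtractor extractor;   // 1. create the extractor object
    extractor.BindPdf(pages);     // 2. load the document
    extractor.ExtractText();      // 3. extract the text
    std::ostringstream memory;    // 4. save the text into an in-memory stream
    extractor.GetText(memory);
    return memory.str();          // 5. read the stream back as a string
}
```

For instance, calling `ExtractAllText({"Hello", "World"})` returns the two page texts, each followed by a newline. With the real library, the stub would be replaced by the PdfExtractor class and a file path, following the library's documentation.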
Extract Text from Particular Pages in PDF using C++
Sometimes you need to extract text from only a few pages of a PDF. In that case, you can specify a page range by setting the starting and ending page numbers. The following are the steps to extract text from particular pages in a PDF document.
- Create an object of the PdfExtractor class.
- Load the PDF document using the PdfExtractor->BindPdf() function.
- Set the starting and ending page numbers using the PdfExtractor->set_StartPage() and PdfExtractor->set_EndPage() functions, respectively.
- Extract the text from the PDF using the PdfExtractor->ExtractText() function.
- Save the extracted text into a MemoryStream object.
- Read the text as a string from the MemoryStream.
The following code sample shows how to extract text from particular pages of a PDF using C++.
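The page-range steps can be sketched in the same library-free way. The class below, `StubPdfExtractor`, is a hypothetical stand-in and not the Aspose.PDF API: `set_StartPage`, `set_EndPage`, `BindPdf`, and `ExtractText` are named after the steps above, but `GetText`, all signatures, and the choice of 1-based inclusive page numbers clamped to the document length are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDF extractor; NOT the Aspose.PDF API.
class StubPdfExtractor {
public:
    // "Bind" a document (a list of page texts) and select all pages.
    void BindPdf(std::vector<std::string> pages) {
        pages_ = std::move(pages);
        start_ = 1;
        end_ = static_cast<int>(pages_.size());
        extracted_ = false;
    }

    // Page numbers are assumed to be 1-based and inclusive.
    void set_StartPage(int page) { start_ = page; }
    void set_EndPage(int page) { end_ = page; }

    // Collect the text of the selected pages, clamping the range
    // to the pages that actually exist.
    void ExtractText() {
        selected_.clear();
        const int first = std::max(start_, 1);
        const int last = std::min(end_, static_cast<int>(pages_.size()));
        for (int p = first; p <= last; ++p) {
            selected_.push_back(pages_[p - 1]);
        }
        extracted_ = true;
    }

    // Write the extracted text to a stream (assumed name and signature).
    void GetText(std::ostream& out) const {
        assert(extracted_ && "call ExtractText() before GetText()");
        for (const std::string& page : selected_) {
            out << page << '\n';
        }
    }

private:
    std::vector<std::string> pages_;
    std::vector<std::string> selected_;
    int start_ = 1;
    int end_ = 0;
    bool extracted_ = false;
};

// Extracts the text of pages startPage..endPage as a single string.
std::string ExtractPageRange(const std::vector<std::string>& pages,
                             int startPage, int endPage) {
    StubPdfExtractor extractor;
    extractor.BindPdf(pages);            // load the document
    extractor.set_StartPage(startPage);  // set the page range before
    extractor.set_EndPage(endPage);      // extracting the text
    extractor.ExtractText();
    std::ostringstream memory;           // in-memory stream for the result
    extractor.GetText(memory);
    return memory.str();
}
```

The point of the sketch is the ordering: the range is set after the document is loaded and before the text is extracted, so only the selected pages reach the stream. How the real library treats out-of-range page numbers is not covered here; check its documentation.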
Extract Page by Page Text from PDF in C++
Instead of extracting all the text from a PDF document at once, you can extract the text of each page separately. The following are the steps to perform page-by-page text extraction from a PDF.
- Create an object of the PdfExtractor class.
- Load the PDF document using the PdfExtractor->BindPdf() function.
- Call the PdfExtractor->ExtractText() function to retrieve the text from the PDF document.
- Loop through the pages using the PdfExtractor->HasNextPageText() function.
- Extract each page's text into a memory stream using the PdfExtractor->GetNextPageText() function.
- Read the text from the memory stream.
The following code sample shows how to extract text page by page from a PDF in C++.
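The loop described in the steps above can be illustrated without the library. In this sketch, `StubPdfExtractor` is a hypothetical stand-in, not the Aspose.PDF API: the method names `HasNextPageText` and `GetNextPageText` come from the steps, while their signatures, the internal page cursor, and the use of `std::ostringstream` as the memory stream are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDF extractor; NOT the Aspose.PDF API.
class StubPdfExtractor {
public:
    // "Bind" a document. Here a document is just a list of page texts.
    void BindPdf(std::vector<std::string> pages) {
        pages_ = std::move(pages);
        next_ = 0;
        extracted_ = false;
    }

    // A real extractor would parse the PDF here; the stub resets its cursor.
    void ExtractText() {
        extracted_ = true;
        next_ = 0;
    }

    // True while at least one page has not been read yet.
    bool HasNextPageText() const {
        return extracted_ && next_ < pages_.size();
    }

    // Write the next page's text to the stream and advance the cursor.
    void GetNextPageText(std::ostream& out) {
        assert(HasNextPageText() && "no more pages to read");
        out << pages_[next_];
        ++next_;
    }

private:
    std::vector<std::string> pages_;
    std::size_t next_ = 0;
    bool extracted_ = false;
};

// Returns one string per page, in page order.
std::vector<std::string> ExtractPageByPage(const std::vector<std::string>& pages) {
    StubPdfExtractor extractor;
    extractor.BindPdf(pages);   // load the document
    extractor.ExtractText();    // extract the text once, up front

    std::vector<std::string> result;
    while (extractor.HasNextPageText()) {
        std::ostringstream memory;          // fresh stream for each page
        extractor.GetNextPageText(memory);  // fill it with the next page
        result.push_back(memory.str());     // read the page text back
    }
    return result;
}
```

Using a fresh stream on every iteration keeps one page's text from leaking into the next. If a single stream were reused, it would need to be cleared before each call.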
Learn more about Aspose.PDF for C++
You can learn more about Aspose.PDF for C++ in its documentation.