Text is often extracted from documents for further processing, such as text analysis or classification. Alongside PDF and Word documents, PowerPoint files are a common source for text extraction. This article shows you how to extract text from PowerPoint files in Python, either from a specific slide or from the whole presentation.
Python Library to Extract Text from PowerPoint Files
To extract text from PowerPoint files, we will use Aspose.Slides for Python via .NET. It is a Python library for creating, updating, manipulating, and converting PowerPoint presentations. You can install it from PyPI using the following pip command.
> pip install aspose.slides
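To confirm that the installation worked, you can check whether Python is able to locate the package. The snippet below is a minimal sketch that uses only the standard library; the helper name is_installed is our own and is not part of Aspose.Slides.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the given module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here, "aspose") is missing.
        return False


print("aspose.slides installed:", is_installed("aspose.slides"))
```

If this prints False, rerun the pip command in the same environment your Python interpreter uses.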
Extract Text from PowerPoint Files in Python
Depending on the scenario, you may need to extract text from the whole PowerPoint presentation or only from specific slides. The following sections demonstrate both cases.
Extract Text from a Specific Slide
The following are the steps to extract text from a specific slide of a PowerPoint presentation in Python.
- First, use the PresentationFactory().get_presentation_text(string, TextExtractionArrangingMode) method to get all types of text in the presentation.
- After that, use the slide's index to extract the text of a specific slide from the slides_text array.
- The following are the types of text you can extract:
  - Slide’s Text
  - Notes
  - Slide layout text
  - Slide master text
The following code sample shows how to extract text from a specific PPT slide in Python.
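Here is a minimal sketch of these steps rather than a definitive implementation. The helper names load_slides_text and slide_text_parts and the script name extract_slide.py are our own. The attribute names text, notes_text, layout_text, and master_text are the ones we expect each slides_text entry to expose, so please check them against the API reference for your version of the library.

```python
import sys


def load_slides_text(path):
    """Return the per-slide text records of the presentation at `path`."""
    # Imported here so the rest of the module loads without the library.
    import aspose.slides as slides

    text = slides.PresentationFactory().get_presentation_text(
        path, slides.TextExtractionArrangingMode.UNARRANGED
    )
    return text.slides_text


def slide_text_parts(slide_text):
    """Collect the four kinds of text held by one slides_text entry.

    The attribute names used below are assumptions; verify them in the docs.
    """
    return {
        "slide": slide_text.text,
        "notes": slide_text.notes_text,
        "layout": slide_text.layout_text,
        "master": slide_text.master_text,
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python extract_slide.py presentation.pptx 0
    slides_text = load_slides_text(sys.argv[1])
    slide_index = int(sys.argv[2])  # slide indices start at 0
    for kind, value in slide_text_parts(slides_text[slide_index]).items():
        print(f"--- {kind} ---")
        print(value)
```

Keeping the Aspose call in one function and the attribute access in another makes it easy to change only one of them if the API differs in your version.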
Text Extraction from Whole PowerPoint File in Python
The following steps demonstrate how to extract text from all the slides of a PowerPoint presentation.
- First, use the PresentationFactory().get_presentation_text(string, TextExtractionArrangingMode) method to get all types of text in the presentation.
- Load the presentation into a Presentation object.
- Iterate over the slide indices, from 0 up to the number of slides in the presentation.
- Extract the text of each slide from the slides_text array using its index.
The following code sample shows how to extract text from a PPTX (or PPT) file in Python.
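The steps above can be sketched as follows. This is an illustration under stated assumptions: the helper names collect_slide_texts and extract_presentation_text and the script name extract_all.py are ours, the text attribute on each slides_text entry is assumed, and we assume the slide collection supports len(). Please adjust these if your version of the library differs.

```python
import sys


def collect_slide_texts(slides_text, slide_count):
    """Return the main text of each slide, in slide order."""
    return [slides_text[index].text for index in range(slide_count)]


def extract_presentation_text(path):
    """Extract the text of every slide in the presentation at `path`."""
    # Imported here so the helper above can be used without the library.
    import aspose.slides as slides

    # Get all types of text in the presentation.
    text = slides.PresentationFactory().get_presentation_text(
        path, slides.TextExtractionArrangingMode.UNARRANGED
    )
    # Load the presentation to find out how many slides it has.
    # Assumption: the slide collection supports len().
    with slides.Presentation(path) as presentation:
        slide_count = len(presentation.slides)
    return collect_slide_texts(text.slides_text, slide_count)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python extract_all.py presentation.pptx
    texts = extract_presentation_text(sys.argv[1])
    for number, slide_text in enumerate(texts, start=1):
        print(f"--- Slide {number} ---")
        print(slide_text)
```

The same loop can read notes, layout, or master text by swapping the attribute accessed in collect_slide_texts.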
Get a Free License
You can use Aspose.Slides for Python via .NET without evaluation limitations by getting a temporary license.
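Once you have a license file, it needs to be applied before you use the library. The sketch below shows one way to do that, assuming a license file named Aspose.Slides.lic (a hypothetical name) in the working directory. The helper name apply_license is our own, and the up-front file check is added for illustration.

```python
from pathlib import Path


def apply_license(license_path):
    """Apply an Aspose.Slides license from the given file path."""
    path = Path(license_path)
    if not path.is_file():
        # Fail early with a clear message before touching the library.
        raise FileNotFoundError(f"License file not found: {path}")

    # Imported here so the file check works even without the library.
    import aspose.slides as slides

    lic = slides.License()
    lic.set_license(str(path))


# Example (requires the library and a real license file):
# apply_license("Aspose.Slides.lic")
```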
Conclusion
In this article, you have learned how to extract text from PowerPoint files in Python. You have seen how to extract text from a specific slide or from all the slides in a PowerPoint presentation. You can also explore other features of Aspose.Slides for Python in the documentation and share your questions with us via our forum.