Apache Hadoop can archive big data across many nodes through its distributed file system (HDFS). It also includes the MapReduce framework, whose APIs let developers analyze the archived data. Big data may be structured or unstructured and may arrive in any file format. With this in mind, we have released the first version of the Aspose for Hadoop project, which enables developers to work with a number of file formats. The initial version supports the following formats:
- Microsoft Word (DOC)
- WordprocessingML (DOCX, XML)
- Rich Text Format (RTF)
- HTML, XHTML and MHTML
- OpenDocument (ODT)
- Microsoft Excel (XLS)
- SpreadsheetML (XLSX, XML)
- OpenDocument Spreadsheet (ODS)
- PresentationML (PPTX, XML)
- Outlook Emails (MSG)
Using the Aspose for Hadoop project, Hadoop developers can parse text from any of the above formats. The text can then be used in MapReduce analysis algorithms or for any other purpose, depending on the use case. The project comes with two packages:
- com.aspose.hadoop.core – Provides Aspose for Java wrapper classes to parse text from the above formats. The package also includes a couple of classes that override Hadoop input formats so that binary sequence files can be created.
- com.aspose.hadoop.examples – Provides mapper examples for creating and converting binary sequence files for all the supported formats into text sequence files.
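The flow described above (binary records in, extracted text out, analysis on the text) can be sketched in plain Java without any Hadoop or Aspose dependency. This is only an illustration under stated assumptions: `TextExtractor`, `toTextRecords`, and `countWords` are hypothetical names invented for this sketch and are not the project's API, and the placeholder extractor simply decodes bytes as UTF-8 where a real one would parse DOC, XLS, MSG, and so on.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ExtractThenCount {

    /** Hypothetical stand-in for a format-specific parser; not the project's API. */
    interface TextExtractor {
        String extract(byte[] documentBytes);
    }

    /** Conversion step: turn each (file name, raw bytes) record into (file name, text). */
    static Map<String, String> toTextRecords(Map<String, byte[]> binaryRecords,
                                             TextExtractor extractor) {
        Map<String, String> textRecords = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> record : binaryRecords.entrySet()) {
            textRecords.put(record.getKey(), extractor.extract(record.getValue()));
        }
        return textRecords;
    }

    /** Analysis step: count lower-cased words across all extracted text. */
    static Map<String, Integer> countWords(Map<String, String> textRecords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : textRecords.values()) {
            for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder extractor (assumption): treats the bytes as UTF-8 text.
        TextExtractor plainText = bytes -> new String(bytes, StandardCharsets.UTF_8);

        // Stand-in for a binary sequence file: file name -> raw document bytes.
        Map<String, byte[]> binaryRecords = new LinkedHashMap<>();
        binaryRecords.put("report.doc", "Hadoop stores data".getBytes(StandardCharsets.UTF_8));
        binaryRecords.put("notes.rtf", "Hadoop analyzes data".getBytes(StandardCharsets.UTF_8));

        Map<String, Integer> counts = countWords(toTextRecords(binaryRecords, plainText));
        System.out.println(counts); // prints {analyzes=1, data=2, hadoop=2, stores=1}
    }
}
```

In an actual Hadoop job, the conversion step would presumably live in a mapper that reads a binary sequence file and writes a text sequence file, and the counting step would be an ordinary MapReduce job over that text; the project readme is the authoritative description of how the real classes fit together.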
The project can be downloaded from GitHub: https://github.com/asposemarketplace/Aspose_for_Hadoop/
The project readme file explains the project flow and usage. If you have any questions or suggestions, or if anything is unclear, we would like to hear from you. You can post in our forums or on the GitHub project repository.