Textract Plus

As undesirable as it might be, useful information is often embedded in Word documents, PowerPoint presentations, PDFs, and similar files. This so-called “dark data” would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats individually, this package provides a single interface for extracting content from any supported file type, without any irrelevant markup.

This package provides two primary ways to do this: the command line interface

textractplus path/to/file.extension

or the Python package

# some python file
import textractplus as tp
text = tp.process("path/to/file.extension")
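When many files need processing, it can help to wrap the call in a small batch helper. The following is a minimal sketch, not part of textract-plus: `extract_many` and `to_text` are hypothetical names, and the helper assumes the extractor may return either bytes (as the upstream textract package does) or str. The extractor is passed in as a plain callable so the helper does not depend on any particular library.

```python
from typing import Callable, Dict, Iterable, Tuple, Union


def to_text(raw: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Normalize an extractor result to str.

    Assumption: the extractor may hand back bytes or str.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Union[str, bytes]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` on each path, collecting text and error messages.

    Returns (texts, errors): both are dicts keyed by path. A failure on
    one file is recorded and does not stop the remaining files.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = to_text(extract(path))
        except Exception as exc:  # a missing backend tool, a corrupt file, etc.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With the package installed, the call would look like `texts, errors = extract_many(["a.pdf", "b.docx"], tp.process)`. Failures are collected rather than raised because extraction relies on external tools that may be missing for some formats.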

Currently supporting

textract-plus supports a growing list of file types for text extraction. If you don’t see your favorite file type here, please recommend it by mentioning it on the issue tracker or by contributing a pull request.

.csv via python builtins
.tsv and .tab via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2ascii
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd

Extended with Textract Plus

.dotx via docx2python
.docm via docx2python
.pptm via python-pptx
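
The lists above amount to a mapping from file extension to extraction backend. The sketch below illustrates that idea with a plain lookup table. It is not the library's internal implementation: `BACKENDS` and `backend_for` are hypothetical names, and only a subset of the listed formats is included.

```python
from pathlib import Path

# Subset of the supported formats listed above, keyed by lowercase extension.
# Illustrative only; the real package may organize its parsers differently.
BACKENDS = {
    ".csv": "python builtins",
    ".doc": "antiword",
    ".docx": "python-docx2txt",
    ".epub": "ebooklib",
    ".html": "beautifulsoup4",
    ".htm": "beautifulsoup4",
    ".jpg": "tesseract-ocr",
    ".jpeg": "tesseract-ocr",
    ".pdf": "pdftotext (default) or pdfminer.six",
    ".pptx": "python-pptx",
    ".xlsx": "xlrd",
    ".xls": "xlrd",
    # Formats added by Textract Plus
    ".dotx": "docx2python",
    ".docm": "docx2python",
    ".pptm": "python-pptx",
}


def backend_for(filename: str) -> str:
    """Return the backend name for a file, based on its extension alone."""
    ext = Path(filename).suffix.lower()  # case-insensitive match
    try:
        return BACKENDS[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}") from None
```

Looking up by lowercase suffix means `report.PDF` and `report.pdf` resolve to the same backend, and an unknown extension fails early with a clear message.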
