Installation¶
One of the main goals of textract-plus is to make it as easy as possible to
start using textract-plus (meaning that installation should be as quick and
painless as possible). This package is built on top of several python
packages and other source libraries. Assuming you are using pip
or
easy_install
to install textract-plus, the python packages
are all installed by default with textract-plus. The source libraries are a
separate matter though and largely depend on your operating system.
Ubuntu / Debian¶
There are two steps required to run this package on Ubuntu/Debian. First you must install some system packages using the apt-get package manager before installing textract-plus from pypi.
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
pip install textract-plus
Note
It may also be necessary to install zlib1g-dev
on Docker
instances of Ubuntu. See issue #19 for details
OSX¶
These steps rely on you having homebrew installed
as well as the cask plugin (brew tap caskroom/cask
). The basic idea is to first install
XQuartz before
installing a bunch of system packages before installing textract-plus from
pypi.
brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig
pip install textract-plus
Note
pstotext is
not currently a part of homebrew so .ps
extraction must be
enabled by manually installing from source.
Note
Depending on how you have python configured on your system with homebrew, you may also need to install the python development header files for textract-plus to properly install.
FreeBSD¶
Setting up this package on FreeBSD pretty much follows the steps for
Ubuntu / Debian while using pkg
as package manager.
pkg install lang/python38 devel/py-pip textproc/libxml2 textproc/libxslt textproc/antiword textproc/unrtf \
graphics/poppler print/pstotext graphics/tesseract audio/flac multimedia/ffmpeg audio/lame audio/sox \
graphics/jpeg-turbo
pip install textract-plus
Don’t see your operating system installation instructions here?¶
My apologies! Installing system packages is a bit of a drag and its hard to anticipate all of the different environments that need to be accomodated (wouldn’t it be awesome if there were a system-agnostic package manager or, better yet, if python could install these system dependencies for you?!?!). If you’re operating system doesn’t have documenation about how to install the textract-plus dependencies, please contribute a pull request with:
A new section in here with the appropriate details about how to install things. In particular, please give instructions for how to install the following libraries before running
pip install textract-plus
:libxml2 2.6.21 or later is required by the
.docx
parser which uses lxml via python-docx.libxslt 1.1.15 or later is required by the
.docx
parser which users lxml via python-docx.python header files are required for building lxml.
antiword is required by the
.doc
parser.pdftotext is optionally required by the
.pdf
parser (there is a pure python fallback that works if pdftotext isn’t installed).pstotext is required by the
.ps
parser.tesseract-ocr is required by the
.jpg
,.png
and.gif
parser.sox is required by the
.mp3
and.ogg
parser. You need to install ffmpeg, lame, libmad0 and libsox-fmt-mp3, before building sox, for these filetypes to work.
Add a requirements file to the requirements directory of the project with the lower-cased name of your operating system (e.g.
requirements/windows
) so we can try to keep these things up to date in the future.