unstructured/docs/source/installing.rst

55 lines
1.9 KiB
ReStructuredText
Raw Normal View History

2022-06-29 14:35:19 -04:00
Installation
============
You can install the library by cloning the repo and running ``make install`` from the
root directory. Developers can run ``make install-local`` to install the dev and test
requirements alongside the base requirements. Specific parsing capabilities may require
extra dependencies, as documented below. If you want a minimal installation without any
parser specific dependencies, run ``make install-base``.
Logging
-------
You can set the logging level for the package with the ``LOG_LEVEL`` environment variable.
By default, the log level is set to ``WARNING``. For debugging, consider setting the log
level to ``INFO`` or ``DEBUG``.
=================
NLTK Dependencies
=================
The `NLTK <https://www.nltk.org/>`_ library is used for word and sentence tokenziation and
part of speech (POS) tagging. Tokenization and POS tagging help to identify sections of
narrative text within a document and are used across parsing families. The ``make install``
command downloads the ``punkt`` and ``averaged_perceptron_tagger`` depdenencies from ``nltk``.
If they are not already installed, you can install them with ``make install-nltk``.
======================
XML/HTML Depenedencies
======================
For XML and HTML parsing, you'll need ``libxml2`` and ``libxlst`` installed. On a Mac, you can do
that with:
.. code:: console
$ brew install libxml2
$ brew install libxslt
================
PDF Dependencies
================
Currently, PDF parsing capabilities rely on the
`Detectron2 <https://github.com/facebookresearch/detectron2>`_ object detection model. The
``make install-local`` command installs all of the dependencies for Detectron2. If you
need to parse PDFs and Detectron2 is not already installed, you can install it with
``make install-detectron2``.
Also ensure that you have ``poppler`` installed on your system. On a Mac, you can run:
.. code:: console
$ brew install poppler