Installation
============

You can install the library by cloning the repo and running ``make install`` from the
root directory. Developers can run ``make install-local`` to install the dev and test
requirements alongside the base requirements. If you want a minimal installation without any
parser-specific dependencies, run ``make install-base``.
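
For example, a full installation from a fresh clone might look like the following
(the repository URL is shown as an assumption; substitute the location you are
cloning from):

.. code:: console

    $ # repository URL is an assumption; substitute your clone source
    $ git clone https://github.com/Unstructured-IO/unstructured.git
    $ cd unstructured
    $ make install
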

Logging
-------

You can set the logging level for the package with the ``LOG_LEVEL`` environment variable.
By default, the log level is set to ``WARNING``. For debugging, consider setting the log
level to ``INFO`` or ``DEBUG``.
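
For example, in a POSIX shell you can enable debug logging for a single run
(``your_script.py`` is a hypothetical placeholder for whatever entry point you use):

.. code:: console

    $ # your_script.py is a hypothetical placeholder
    $ LOG_LEVEL=DEBUG python your_script.py
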

=================
NLTK Dependencies
=================

The `NLTK <https://www.nltk.org/>`_ library is used for word and sentence tokenization and
part of speech (POS) tagging. Tokenization and POS tagging help to identify sections of
narrative text within a document and are used across parsing families. The ``make install``
command downloads the ``punkt`` and ``averaged_perceptron_tagger`` dependencies from ``nltk``.
If they are not already installed, you can install them with ``make install-nltk``.
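
If you would rather fetch the models directly, the same resources can be downloaded
through ``nltk``'s standard downloader:

.. code:: console

    $ python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"
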

=====================
XML/HTML Dependencies
=====================

For XML and HTML parsing, you'll need ``libxml2`` and ``libxslt`` installed. On a Mac, you can do
that with:

.. code:: console

    $ brew install libxml2
    $ brew install libxslt
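
On Debian/Ubuntu-based Linux distributions, the equivalent development packages are
typically available through ``apt`` (the package names below are the common ones;
verify them against your distribution):

.. code:: console

    $ # common Debian/Ubuntu package names; verify for your distro
    $ sudo apt-get install libxml2-dev libxslt1-dev
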

========================
Huggingface Dependencies
========================

The ``transformers`` library requires the Rust compiler to be present on your system in
order to properly ``pip`` install. If a Rust compiler is not available on your system,
you can run the following command to install it:

.. code:: console

    $ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
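
After the installer finishes, you may need to restart your shell (or run
``source "$HOME/.cargo/env"``) before the toolchain is on your ``PATH``. You can
confirm the compiler is available with:

.. code:: console

    $ rustc --version
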
Additionally, some tokenizers in the ``transformers`` library require the ``sentencepiece``
library. This is not included as an ``unstructured`` dependency because it only applies
to some tokenizers. See the
`sentencepiece install instructions <https://github.com/google/sentencepiece#installation>`_ for
information on how to install ``sentencepiece`` if your tokenizer requires it.
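
In many environments a prebuilt wheel is available, in which case a plain ``pip``
install is sufficient (see the linked instructions above if you need to build from
source):

.. code:: console

    $ pip install sentencepiece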