mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-16 13:35:57 +00:00

* add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile
152 lines
6.7 KiB
ReStructuredText
152 lines
6.7 KiB
ReStructuredText
Installation
|
|
============
|
|
|
|
You can install the library by cloning the repo and running ``make install`` from the
|
|
root directory. Developers can run ``make install-local`` to install the dev and test
|
|
requirements alongside the base requirements. If you want a minimal installation without any
|
|
parser specific dependencies, run ``make install-base``.
|
|
|
|
|
|
Installation with ``conda`` on Windows
|
|
--------------------------------------
|
|
|
|
You can install and run ``unstructured`` on Windows with ``conda``, but the process
|
|
involves a few extra steps. This section will help you get up and running.
|
|
|
|
* Install `Anaconda <https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html>`_ on your Windows machine.
|
|
* Install Microsoft C++ Build Tools using the instructions in `this Stackoverflow post <https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst>`_. C++ build tools are required for the ``pycocotools`` dependency.
|
|
* Run ``conda env create -f environment.yml`` using the ``environment.yml`` file in the ``unstructured`` repo to create a virtual environment. The environment will be named ``unstructured``.
|
|
* Run ``conda activate unstructured`` to activate the virtualenvironment.
|
|
* Run ``pip install unstructured`` to install the ``unstructured`` library.
|
|
|
|
===============================================
|
|
Setting up ``unstructured`` for local inference
|
|
===============================================
|
|
|
|
If you need to run model inferences locally, there are a few additional steps you need to
|
|
take. The main challenge is installing ``detectron2`` for PDF layout parsing. ``detectron2``
|
|
does not officially support Windows, but it is possible to get it to install on Windows.
|
|
The installation instructions are based on the instructions LayoutParser provides
|
|
`here <https://layout-parser.github.io/tutorials/installation#for-windows-users>`_.
|
|
|
|
* Run ``pip install pycocotools-windows`` to install a Windows compatible version of ``pycocotools``. Alternatively, you can run ``pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"`` as outlined in `this GitHub issue <https://github.com/cocodataset/cocoapi/issues/169#issuecomment-462528628>`_.
|
|
* Run ``git clone https://github.com/ivanpp/detectron2.git``, then ``cd detectron2``, then ``pip install -e .`` to install a Windows compatible version of the ``detectron2`` library.
|
|
* Install the a Windows compatible version of ``iopath`` using the instructions outlined in `this GitHub issue <https://github.com/Layout-Parser/layout-parser/issues/15#issuecomment-819546751>`_. First, run ``git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8``. Then on line 753 in ``iopath/iopath/common/file_io.py`` change ``filename = path.split("/")[-1]`` to ``filename = parsed_url.path.split("/")[-1]``. After that, navigate to the ``iopath`` directory and run ``pip install -e .``.
|
|
* Run ``pip install unstructured[local-inference]``. This will install the ``unstructured_inference`` dependency.
|
|
|
|
At this point, you can verify the installation by running the following from the root directory of the ``unstructured`` `repo <https://github.com/Unstructured-IO/unstructured>`_:
|
|
|
|
|
|
.. code:: python
|
|
|
|
from unstructured.partition.pdf import partition_pdf
|
|
|
|
partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)
|
|
|
|
|
|
====================
|
|
Installing PaddleOCR
|
|
====================
|
|
|
|
PaddleOCR is another package that is helpful to use in conjunction with ``unstructured``.
|
|
You can use the following steps to install ``paddleocr`` in your ``unstructured`` ``conda``
|
|
environment.
|
|
|
|
* Run ``conda install -c esri paddleocr``
|
|
* If you have the Windows version of ``detectron2`` cloned and installed locally, change the name of ``detectron2/tools`` to ``detectron2/detectron2_tools``. Otherwise, you will hit the module name conflict error described in `this issue <https://github.com/PaddlePaddle/PaddleOCR/issues/1024>`_.
|
|
* Set the environment variable ``KMP_DUPLICATE_LIB_OK`` to ``"TRUE"``. This prevents the ``libiomp5md.dll`` linking issue described `in this issue on GitHub <https://github.com/PaddlePaddle/PaddleOCR/issues/4613>`_.
|
|
|
|
|
|
At this point, you can verify the installation using the following commands. Choose a
|
|
``.jpg`` image that contains text.
|
|
|
|
.. code:: python
|
|
|
|
import numpy as np
|
|
from PIL import Image
|
|
from paddleocr import PaddleOCR
|
|
|
|
filename = "path/to/my/image.jpg"
|
|
img = np.array(Image.open(filename))
|
|
ocr = PaddleOCR(lang="en", use_gpu=False, show_log=False)
|
|
result = ocr.ocr(img=img)
|
|
|
|
|
|
|
|
Logging
|
|
-------
|
|
|
|
You can set the logging level for the package with the ``LOG_LEVEL`` environment variable.
|
|
By default, the log level is set to ``WARNING``. For debugging, consider setting the log
|
|
level to ``INFO`` or ``DEBUG``.
|
|
|
|
|
|
Extra Dependencies
|
|
-------------------
|
|
|
|
==================
|
|
Filetype Detection
|
|
==================
|
|
|
|
The ``filetype`` module in ``unstructured`` uses ``libmagic`` to detect MIME types. For
|
|
this to work, you'll need ``libmagic`` installed on your computer. On a Mac, you can run:
|
|
|
|
.. code:: console
|
|
|
|
$ brew install libmagic
|
|
|
|
One Debian, run:
|
|
|
|
.. code:: console
|
|
|
|
$ sudo apt-get install -y libmagic-dev
|
|
|
|
|
|
If you are on Windows using ``conda``, run:
|
|
|
|
.. code:: console
|
|
|
|
$ conda install -c conda-forge libmagic
|
|
|
|
|
|
=================
|
|
NLTK Dependencies
|
|
=================
|
|
|
|
The `NLTK <https://www.nltk.org/>`_ library is used for word and sentence tokenziation and
|
|
part of speech (POS) tagging. Tokenization and POS tagging help to identify sections of
|
|
narrative text within a document and are used across parsing families. The ``make install``
|
|
command downloads the ``punkt`` and ``averaged_perceptron_tagger`` depdenencies from ``nltk``.
|
|
If they are not already installed, you can install them with ``make install-nltk``.
|
|
|
|
======================
|
|
XML/HTML Depenedencies
|
|
======================
|
|
|
|
For XML and HTML parsing, you'll need ``libxml2`` and ``libxlst`` installed. On a Mac, you can do
|
|
that with:
|
|
|
|
|
|
.. code:: console
|
|
|
|
$ brew install libxml2
|
|
$ brew install libxslt
|
|
|
|
========================
|
|
Huggingface Dependencies
|
|
========================
|
|
|
|
The ``transformers`` requires the Rust compiler to be present on your system in
|
|
order to properly ``pip`` install. If a Rust compiler is not available on your system,
|
|
you can run the following command to install it:
|
|
|
|
.. code:: console
|
|
|
|
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
|
|
|
Additionally, some tokenizers in the ``transformers`` library required the ``sentencepiece``
|
|
library. This is not included as an ``unstructured`` dependency because it only applies
|
|
to some tokenizers. See the
|
|
`sentencepiece install instructions <https://github.com/google/sentencepiece#installation>`_ for
|
|
information on how to install ``sentencepiece`` if your tokenizer requires it.
|