docs: instructions on how to install on Windows + conda (#129)

* add environment.yml
* instructions on how to install base package and detectron2
* added instructions on paddleocr
* remove covers
* install -> to install
* specified the shell
* updated example snippets
* update environment.yml
* updated the repo reference
* no more ands!
2026-01-07 21:00:29 +00:00 · 2023-01-05 11:21:44 -05:00 · 2023-01-05 11:21:44 -05:00 · 33b983fbf0
commit 33b983fbf0
parent 5a47eb06e9
2 changed files with 88 additions and 0 deletions
--- a/docs/source/installing.rst
+++ b/docs/source/installing.rst
@@ -6,6 +6,73 @@ root directory. Developers can run ``make install-local`` to install the dev and
 requirements alongside the base requirements. If you want a minimal installation without any
 parser specific dependencies, run ``make install-base``.

+
+Installation with ``conda`` on Windows
+--------------------------------------
+
+You can install and run ``unstructured`` on Windows with ``conda``, but the process
+involves a few extra steps. This section will help you get up and running.
+
+* Install `Anaconda <https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html>`_ on your Windows machine.
+* Install Microsoft C++ Build Tools using the instructions in `this Stack Overflow post <https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst>`_. The C++ build tools are required to build the ``pycocotools`` dependency.
+* Run ``conda env create -f environment.yml`` using the ``environment.yml`` file in the ``unstructured`` repo to create a virtual environment. The environment will be named ``unstructured``.
+* Run ``conda activate unstructured`` to activate the virtual environment.
+* Run ``pip install unstructured`` to install the ``unstructured`` library.
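
After the steps above, a quick sanity check can confirm which environment is active and whether the package is importable. The sketch below uses only the standard library. The helper name ``check_install`` is hypothetical and not part of ``unstructured``. It assumes ``conda activate`` has set the ``CONDA_DEFAULT_ENV`` variable in the current shell.

```python
import importlib.util
import os
import sys


def check_install(package: str = "unstructured") -> dict:
    """Report the active conda environment and whether `package` is importable.

    `check_install` is an illustrative helper, not part of `unstructured`.
    """
    return {
        # conda sets CONDA_DEFAULT_ENV when an environment is activated
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Path of the interpreter that is actually running this script
        "python": sys.executable,
        # find_spec returns None when a top-level package is not installed
        "installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_install())
```

If ``conda_env`` is not ``unstructured`` or ``installed`` is ``False``, the most likely cause is that the environment was not activated in the shell where ``pip install`` was run.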
+
+===============================================
+Setting up ``unstructured`` for local inference
+===============================================
+
+If you need to run model inference locally, there are a few additional steps.
+The main challenge is installing ``detectron2`` for PDF layout parsing. ``detectron2``
+does not officially support Windows, but it can be installed with the steps below.
+These instructions are based on the ones LayoutParser provides
+`here <https://layout-parser.github.io/tutorials/installation#for-windows-users>`_.
+
+* Run ``pip install pycocotools-windows`` to install a Windows-compatible version of ``pycocotools``. Alternatively, you can run ``pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"`` as outlined in `this GitHub issue <https://github.com/cocodataset/cocoapi/issues/169#issuecomment-462528628>`_.
+* Run ``git clone https://github.com/ivanpp/detectron2.git``, then ``cd detectron2``, then ``pip install -e .`` to install a Windows-compatible version of the ``detectron2`` library.
+* Install a Windows-compatible version of ``iopath`` using the instructions outlined in `this GitHub issue <https://github.com/Layout-Parser/layout-parser/issues/15#issuecomment-819546751>`_. First, run ``git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8``. Then, on line 753 of ``iopath/iopath/common/file_io.py``, change ``filename = path.split("/")[-1]`` to ``filename = parsed_url.path.split("/")[-1]``. After that, navigate to the ``iopath`` directory and run ``pip install -e .``.
+* Run ``pip install "unstructured[local-inference]"``. This will install the ``unstructured_inference`` dependency.
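
The one-line ``iopath`` edit above is easier to follow with a concrete value. The sketch below assumes the motivation is that a download URL can carry a query string such as ``?dl=1``, which would otherwise end up in the cached file name (``?`` is not allowed in Windows file names). The URL is made up for illustration, and ``urlparse`` stands in for however ``iopath`` builds ``parsed_url``.

```python
from urllib.parse import urlparse

# Hypothetical download URL with a query string, for illustration only.
path = "https://example.com/models/config.yml?dl=1"

# Original expression: the query string leaks into the file name.
filename_before = path.split("/")[-1]

# Patched expression: only the path component of the URL is used.
parsed_url = urlparse(path)
filename_after = parsed_url.path.split("/")[-1]

print(filename_before)  # config.yml?dl=1
print(filename_after)   # config.yml
```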
+
+At this point, you can verify the installation by running the following from the root directory of the ``unstructured`` `repo <https://github.com/Unstructured-IO/unstructured>`_:
+
+
+.. code:: python
+
+	from unstructured.partition.pdf import partition_pdf
+
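+	# url=None is intended to run the layout model locally rather than through a hosted API
+	elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)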
+
+
+====================
+Installing PaddleOCR
+====================
+
+PaddleOCR is another package that is helpful to use in conjunction with ``unstructured``.
+You can use the following steps to install ``paddleocr`` in your ``unstructured`` ``conda``
+environment.
+
+* Run ``conda install -c esri paddleocr``.
+* If you have the Windows version of ``detectron2`` cloned and installed locally, change the name of ``detectron2/tools`` to ``detectron2/detectron2_tools``. Otherwise, you will hit the module name conflict error described in `this issue <https://github.com/PaddlePaddle/PaddleOCR/issues/1024>`_.
+* Set the environment variable ``KMP_DUPLICATE_LIB_OK`` to ``"TRUE"``. This prevents the ``libiomp5md.dll`` linking issue described `in this issue on GitHub <https://github.com/PaddlePaddle/PaddleOCR/issues/4613>`_.
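
One way to set the variable for a single Python session, rather than system-wide, is to assign it through ``os.environ`` before ``paddleocr`` is imported. This is a minimal sketch. Setting the variable in the shell or in the Windows environment settings should work as well.

```python
import os

# Set the variable before importing paddleocr so that it is already
# visible when the OpenMP runtime DLLs are loaded.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

print(os.environ["KMP_DUPLICATE_LIB_OK"])
```

Placing the assignment at the very top of the script, ahead of any other imports, avoids depending on the order in which libraries load their native dependencies.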
+
+
+At this point, you can verify the installation using the following commands. Choose a
+``.jpg`` image that contains text.
+
+.. code:: python
+
+	import numpy as np
+	from PIL import Image
+	from paddleocr import PaddleOCR
+
+	filename = "path/to/my/image.jpg"
+	img = np.array(Image.open(filename))
+	ocr = PaddleOCR(lang="en", use_gpu=False, show_log=False)
+	result = ocr.ocr(img=img)
+
+
+
 Logging
 -------

@@ -13,6 +80,10 @@ You can set the logging level for the package with the ``LOG_LEVEL`` environment
 By default, the log level is set to ``WARNING``. For debugging, consider setting the log
 level to ``INFO`` or ``DEBUG``.

+
+Extra Dependencies
+-------------------
+
 =================
 NLTK Dependencies
 =================
--- a/environment.yml
+++ b/environment.yml
@@ -0,0 +1,17 @@
+name: unstructured
+
+channels:
+  - defaults
+  - anaconda
+  - conda-forge
+  - pytorch
+
+dependencies:
+  - python=3.8
+  - pytorch=1.12.1
+  - pywin32
+  - poppler
+  - torchvision
+  - pip:
+    - huggingface-hub
+    - layoutparser