* Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
You can install and run ``unstructured`` on Windows with ``conda``, but the process
involves a few extra steps. This section will help you get up and running.
* Install `Anaconda <https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html>`_ on your Windows machine.
* Install Microsoft C++ Build Tools using the instructions in `this Stackoverflow post <https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst>`_. C++ build tools are required for the ``pycocotools`` dependency.
* Run ``conda env create -f environment.yml`` using the ``environment.yml`` file in the ``unstructured`` repo to create a virtual environment. The environment will be named ``unstructured``.
* Run ``conda activate unstructured`` to activate the virtualenvironment.
* Run ``pip install unstructured`` to install the ``unstructured`` library.
===============================================
Setting up ``unstructured`` for local inference
===============================================
If you need to run model inferences locally, there are a few additional steps you need to
take. The main challenge is installing ``detectron2`` for PDF layout parsing. ``detectron2``
does not officially support Windows, but it is possible to get it to install on Windows.
The installation instructions are based on the instructions LayoutParser provides
* Run ``pip install pycocotools-windows`` to install a Windows compatible version of ``pycocotools``. Alternatively, you can run ``pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"`` as outlined in `this GitHub issue <https://github.com/cocodataset/cocoapi/issues/169#issuecomment-462528628>`_.
* Run ``git clone https://github.com/ivanpp/detectron2.git``, then ``cd detectron2``, then ``pip install -e .`` to install a Windows compatible version of the ``detectron2`` library.
* Install the a Windows compatible version of ``iopath`` using the instructions outlined in `this GitHub issue <https://github.com/Layout-Parser/layout-parser/issues/15#issuecomment-819546751>`_. First, run ``git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8``. Then on line 753 in ``iopath/iopath/common/file_io.py`` change ``filename = path.split("/")[-1]`` to ``filename = parsed_url.path.split("/")[-1]``. After that, navigate to the ``iopath`` directory and run ``pip install -e .``.
* Run ``pip install unstructured[local-inference]``. This will install the ``unstructured_inference`` dependency.
At this point, you can verify the installation by running the following from the root directory of the ``unstructured```repo <https://github.com/Unstructured-IO/unstructured>`_:
..code:: python
from unstructured.partition.pdf import partition_pdf
PaddleOCR is another package that is helpful to use in conjunction with ``unstructured``.
You can use the following steps to install ``paddleocr`` in your ``unstructured````conda``
environment.
* Run ``conda install -c esri paddleocr``
* If you have the Windows version of ``detectron2`` cloned and installed locally, change the name of ``detectron2/tools`` to ``detectron2/detectron2_tools``. Otherwise, you will hit the module name conflict error described in `this issue <https://github.com/PaddlePaddle/PaddleOCR/issues/1024>`_.
* Set the environment variable ``KMP_DUPLICATE_LIB_OK`` to ``"TRUE"``. This prevents the ``libiomp5md.dll`` linking issue described `in this issue on GitHub <https://github.com/PaddlePaddle/PaddleOCR/issues/4613>`_.
At this point, you can verify the installation using the following commands. Choose a