docs: add a quick start page to the readme and docs (#240)

* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
Matt Robinson 2023-02-17 17:13:28 -05:00 committed by GitHub
parent 601f250edc
commit 7472e1bb21
2 changed files with 77 additions and 7 deletions

@@ -49,11 +49,48 @@ about. Bricks in the library fall into three categories:
- :performing_arts: ***Staging bricks*** that format data for downstream tasks, such as ML inference
and data labeling.
<br></br>
## :eight_pointed_black_star: Installation
## :eight_pointed_black_star: Quick Start
To install the library, run `pip install unstructured`.
Use the following instructions to get up and running with `unstructured` and test your
installation.
## :coffee: Getting Started
- Install the Python SDK with `pip install unstructured[local-inference]`
    - If you do not need to process PDFs or images, you can run `pip install unstructured`
- Install the following system dependencies if they are not already available on your system.
  Depending on what document types you're parsing, you may not need all of these.
    - `libmagic-dev` (filetype detection)
    - `poppler-utils` (images and PDFs)
    - `tesseract-ocr` (images and PDFs)
    - `libreoffice` (MS Office docs)
- Run the following to install NLTK dependencies. `unstructured` will handle this automatically
  soon.
    - `python -c "import nltk; nltk.download('punkt')"`
    - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
- If you are parsing PDFs, run the following to install the `detectron2` model, which
  `unstructured` uses for layout detection:
    - `pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"`
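As a rough sanity check on the steps above, the sketch below reports which prerequisites appear to be present. The executable names (`pdftotext`, `tesseract`, `soffice`) are assumptions about what each system package typically installs and may differ by platform. `libmagic` is a shared library rather than a command, so it is not covered here.

```python
import importlib.util
import shutil

# Assumed mapping from the system packages listed above to an executable
# each one typically provides; adjust the names for your platform.
EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}

# Importable module names for the Python-level dependencies.
MODULES = ["unstructured", "nltk", "detectron2"]


def check_prerequisites():
    """Return a dict mapping each dependency name to True if it was found."""
    found = {}
    for package, executable in EXECUTABLES.items():
        # shutil.which returns None when the executable is not on PATH.
        found[package] = shutil.which(executable) is not None
    for module in MODULES:
        # find_spec returns None when a top-level module cannot be located.
        found[module] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'found' if ok else 'MISSING':<8}{name}")
```

A missing entry is not necessarily a problem: as noted above, you only need the dependencies for the document types you plan to parse.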
At this point, you should be able to run the following code:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/fake-email.eml")
```
And if you installed with `local-inference`, you should be able to run this as well:
```python
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper.pdf")
```
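To see what `partition` returned, one option is to count the elements by type and print a short preview. This is a minimal sketch, not part of the library: `summarize` and `preview` are hypothetical helpers, and they assume each element is an object whose class name indicates its category and whose `str()` gives its text.

```python
import os
from collections import Counter


def summarize(elements):
    """Count elements by class name (hypothetical helper)."""
    return Counter(type(element).__name__ for element in elements)


def preview(elements, limit=5, width=60):
    """Return the first `limit` elements as short "ClassName: text" strings."""
    # Assumes str(element) yields the element's text.
    return [f"{type(el).__name__}: {str(el)[:width]}" for el in elements[:limit]]


if __name__ == "__main__":
    path = "example-docs/fake-email.eml"
    try:
        from unstructured.partition.auto import partition
    except ImportError:
        print("unstructured is not installed; skipping the demo")
    else:
        if os.path.exists(path):
            elements = partition(filename=path)
            print(summarize(elements))
            print("\n".join(preview(elements)))
        else:
            print(f"{path} not found; run this from the repository root")
```

If the counts are all zero or the preview is empty, the file was probably not parsed, which usually points back to a missing dependency from the list above.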
## :coffee: Installation Instructions for Local Development
The following instructions are intended to help you get up and running with `unstructured`
locally if you are planning to contribute to the project.
* Using `pyenv` to manage virtualenvs is recommended but not necessary
* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.

@@ -1,10 +1,43 @@
Installation
============
You can install the library by cloning the repo and running ``make install`` from the
root directory. Developers can run ``make install-local`` to install the dev and test
requirements alongside the base requirements. If you want a minimal installation without any
parser specific dependencies, run ``make install-base``.
Quick Start
-----------

Use the following instructions to get up and running with ``unstructured`` and test your
installation.

* Install the Python SDK with ``pip install unstructured[local-inference]``

  * If you do not need to process PDFs or images, you can run ``pip install unstructured``

* Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.

  * ``libmagic-dev`` (filetype detection)
  * ``poppler-utils`` (images and PDFs)
  * ``tesseract-ocr`` (images and PDFs)
  * ``libreoffice`` (MS Office docs)

* Run the following to install NLTK dependencies. ``unstructured`` will handle this automatically soon.

  * ``python -c "import nltk; nltk.download('punkt')"``
  * ``python -c "import nltk; nltk.download('averaged_perceptron_tagger')"``

* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:

  * ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``

At this point, you should be able to run the following code:

.. code:: python

  from unstructured.partition.auto import partition

  elements = partition(filename="example-docs/fake-email.eml")

And if you installed with ``local-inference``, you should be able to run this as well:

.. code:: python

  from unstructured.partition.auto import partition

  elements = partition("example-docs/layout-parser-paper.pdf")

Installation with ``conda`` on Windows