
Quick Start
===========

Installation
------------

This guide offers concise steps to install ``unstructured`` and validate the installation. For a more comprehensive installation guide, refer to `this page <https://unstructured-io.github.io/unstructured/installing.html>`__.

1. **Installing the Python SDK**: 
   You can install the core SDK using pip:
   
   .. code-block:: bash

      pip install unstructured

   Plain text files, HTML, XML, JSON, and Emails are immediately supported without any additional dependencies.

   If you need to process other document types, you can install the required extras by following :doc:`../installation/full_installation`.

2. **System Dependencies**:
   Ensure the following system dependencies are installed. Your requirements may vary depending on the document types you're handling:

   - ``libmagic-dev``: Essential for filetype detection.
   - ``poppler-utils``: Needed for images and PDFs.
   - ``tesseract-ocr``: Needed for images and PDFs.
   - ``libreoffice``: For MS Office documents.
   - ``pandoc``: For EPUBs, RTFs, and Open Office documents. To handle RTF files, you need version ``2.14.2`` or newer. Running `this script <https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh>`__ will install the correct version for you.
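
One way to see which of these tools are already available is to look for their executables on ``PATH``. The following is a minimal sketch and not part of ``unstructured``: the helper name ``find_missing_tools`` is hypothetical, and the executable names (``pdftoppm`` for ``poppler-utils``, ``tesseract``, ``soffice`` for LibreOffice, ``pandoc``) are assumptions that may differ by platform. ``libmagic`` is a shared library rather than an executable, so this check does not cover it.

.. code-block:: python

   import shutil

   # Assumed executable names for each dependency; adjust for your platform.
   TOOLS = {
       "poppler-utils": "pdftoppm",
       "tesseract-ocr": "tesseract",
       "libreoffice": "soffice",
       "pandoc": "pandoc",
   }


   def find_missing_tools(tools=TOOLS, which=shutil.which):
       """Return the dependency names whose executable is not found on PATH."""
       return [name for name, exe in tools.items() if which(exe) is None]


   print("Missing:", find_missing_tools() or "none")

A missing entry only matters if you plan to process the document types that depend on it.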

Validating Installation
-----------------------

After installation, confirm the setup by running the following Python code:

.. code-block:: python

   from unstructured.partition.auto import partition
   elements = partition(filename="example-docs/eml/fake-email.eml")

If you've opted for the "local-inference" installation, you should also be able to execute:

.. code-block:: python

   from unstructured.partition.auto import partition
   elements = partition(filename="example-docs/layout-parser-paper.pdf")

If these code snippets run without errors, congratulations! Your ``unstructured`` installation is successful and ready for use.
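
If a snippet fails with an import error, it can help to confirm which version of the package, if any, is installed in the active environment. The following is a minimal sketch using only the standard library; the helper name ``installed_version`` is hypothetical.

.. code-block:: python

   from importlib import metadata


   def installed_version(package):
       """Return the installed version of a package, or None if it is missing."""
       try:
           return metadata.version(package)
       except metadata.PackageNotFoundError:
           return None


   print("unstructured version:", installed_version("unstructured"))

A result of ``None`` suggests that ``pip install unstructured`` ran in a different environment from the one executing the code.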


The following sections cover basic concepts and usage patterns in ``unstructured``.
After reading them, you should be able to:

* Partition a document with the ``partition`` function.
* Understand how documents are structured in ``unstructured``.
* Convert a document to a dictionary and/or save it as JSON.

The example documents in this section come from the
`example-docs <https://github.com/Unstructured-IO/unstructured/tree/main/example-docs>`_
directory in the ``unstructured`` repo.

Before running the code in this section, make sure you've installed the ``unstructured`` library
and all dependencies using the instructions in the `Quick Start <https://unstructured-io.github.io/unstructured/installing.html#quick-start>`_ section.

Partitioning a document
-----------------------

This section covers the most important part of the library: partitioning a document.
Partitioning reads in a source document, splits it into sections, categorizes those sections,
and extracts the text associated with each one. ``unstructured`` uses different partitioning methods
depending on the document type; those are covered in a later section. For now, we'll use the simplest API in the library,
the ``partition`` function, which detects the filetype of the source document and routes it to the appropriate
partitioning function. You can try it out by running the code below.

.. code:: python

	from unstructured.partition.auto import partition

	elements = partition(filename="example-10k.html")


You can also pass in a file as a file-like object using the following workflow:

.. code:: python

	with open("example-10k.html", "rb") as f:
	    elements = partition(file=f)
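
``partition`` returns a list of elements. A quick way to get a feel for the result is to count the elements by class name. The sketch below is illustrative only: ``summarize_elements`` is a hypothetical helper, not part of the library, and it relies only on each element being an ordinary Python object.

.. code-block:: python

   from collections import Counter


   def summarize_elements(elements):
       """Count elements by the name of their class (e.g. Title, NarrativeText)."""
       return Counter(type(element).__name__ for element in elements)


   # Example usage, assuming ``elements`` came from ``partition`` as shown above:
   # for name, count in summarize_elements(elements).most_common():
   #     print(name, count)
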


The ``partition`` function uses `libmagic <https://formulae.brew.sh/formula/libmagic>`_ for filetype detection. If ``libmagic`` is
not present and you pass a filename, ``partition`` falls back to detecting the filetype from the file extension.
``libmagic`` is required if you'd like to pass a file-like object to ``partition``.
We highly recommend installing ``libmagic``, because file detection may behave differently
when it is not installed.
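
To illustrate what extension-based detection means, the sketch below guesses a MIME type from a filename using Python's standard ``mimetypes`` module. This is only an analogy: it is not how ``unstructured`` implements its fallback, and ``guess_from_extension`` is a hypothetical helper. It also hints at why a file-like object is harder to identify: without a name there is no extension to inspect, so the content itself has to be examined.

.. code-block:: python

   import mimetypes


   def guess_from_extension(filename):
       """Guess a MIME type from the file extension alone, without reading the file."""
       mime_type, _encoding = mimetypes.guess_type(filename)
       return mime_type


   print(guess_from_extension("example-10k.html"))  # text/html
   print(guess_from_extension("no_extension"))      # None
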


Quickstart Tutorial
-------------------

If you're eager to dive in, head over to the `Getting Started <https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=jZp37lfueaeZ>`__ notebook on Google Colab for a hands-on introduction to the ``unstructured`` library. In a few minutes, you'll have a basic workflow set up and running!

For more detailed information about specific components or advanced features, explore the rest of the documentation.