
Document Elements
=================

Introduction
************

The ``unstructured`` library aims to simplify and streamline the preprocessing of structured and unstructured documents for downstream tasks. No matter where your data lives or what format it is in, Unstructured's toolkit transforms and preprocesses it into an easily digestible and usable format.

Document elements
*****************

When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Each element exposes the fields below, and its ``type`` is currently one of the listed element types:

* ``type``

  * ``FigureCaption``

  * ``NarrativeText``

  * ``ListItem``

  * ``Title``

  * ``Address``

  * ``Table``

  * ``PageBreak``

  * ``Header``

  * ``Footer``

  * ``UncategorizedText``

  * ``Image``

  * ``Formula``

* ``element_id``

* ``metadata`` - see: :ref:`Metadata page <metadata-label>`

* ``text``


Other element types may be added in the future.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
Document elements have a ``str`` representation. You can print them using the snippet below.

.. code:: python

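    from unstructured.partition.auto import partition

    # Partition the document into a list of Element objects
    elements = partition(filename="example-10k.html")

    # Print the first five elements using their str representation
    for element in elements[:5]:
        print(element)
        print("\n")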


One helpful aspect of document elements is that they allow you to cut a document down to the elements that you need for your particular use case.
For example, if you're training a summarization model you may only want to include narrative text for model training.
The output of the first snippet typically includes titles and other content that may not be suitable for a summarization model.
The following code shows how you can limit your output to narrative text with more than two sentences, so that only ``NarrativeText`` elements are printed.

.. code:: python

	from unstructured.documents.elements import NarrativeText
	from unstructured.partition.text_type import sentence_count

	for element in elements[:100]:
	    if isinstance(element, NarrativeText) and sentence_count(element.text) > 2:
	        print(element)
	        print("\n")

Tables
******

For ``Table`` elements, the raw text of the table is stored in the element's ``text`` attribute, and an HTML representation
of the table is available in the element metadata under ``element.metadata.text_as_html``. For most document types where
table extraction is available, the ``partition`` function extracts tables automatically if they are present.
For PDFs and images, table extraction requires a relatively expensive call to a table recognition model, so for those
document types it is an option you need to enable. To extract tables from PDFs or images,
pass in ``infer_table_structure=True``. The following example requires the ``pdf`` extra, which can be installed with ``pip install "unstructured[pdf]"``:

.. code:: python

    from unstructured.partition.pdf import partition_pdf

    filename = "example-docs/layout-parser-paper.pdf"

    elements = partition_pdf(filename=filename, infer_table_structure=True)
    tables = [el for el in elements if el.category == "Table"]

    print(tables[0].text)
    print(tables[0].metadata.text_as_html)

The text will look like this:

.. code:: text

	Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents


And the ``text_as_html`` metadata will look like this:

.. code:: html

	<table><thead><th>Dataset</th><th>| Base Model’</th><th>| Notes</th></thead><tr><td>PubLayNet</td><td>[38] F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset [31]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>
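
If you need the table as rows rather than as an HTML string, the ``text_as_html`` value can be parsed with the standard library. The snippet below is a minimal sketch, not part of ``unstructured``: the ``TableRows`` class and the shortened ``html`` sample are hypothetical names introduced for illustration, and the parser assumes a flat table without nested tables or merged cells.

.. code:: python

    from html.parser import HTMLParser


    class TableRows(HTMLParser):
        """Hypothetical helper: collect cell text, grouped by table row."""

        def __init__(self):
            super().__init__()
            self.rows = []      # finished rows, each a list of cell strings
            self._row = []      # cells of the row currently being read
            self._cell = None   # text fragments of the open cell, or None

        def handle_starttag(self, tag, attrs):
            if tag in ("thead", "tr"):
                self._flush_row()
            elif tag in ("th", "td"):
                self._cell = []

        def handle_data(self, data):
            if self._cell is not None:
                self._cell.append(data)

        def handle_endtag(self, tag):
            if tag in ("th", "td") and self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
            elif tag in ("thead", "tr", "table"):
                self._flush_row()

        def _flush_row(self):
            # Close the current row, if it has any cells
            if self._row:
                self.rows.append(self._row)
                self._row = []


    # Shortened, hypothetical sample in the same shape as the output above
    html = (
        "<table><thead><th>Dataset</th><th>Notes</th></thead>"
        "<tr><td>PRImA [3]</td>"
        "<td>Layouts of scanned modern magazines and scientific reports</td></tr>"
        "</table>"
    )

    parser = TableRows()
    parser.feed(html)
    parser.close()

    # Treat the first row as the header and map every other row onto it
    header, *body = parser.rows
    records = [dict(zip(header, row)) for row in body]
    print(records)

Because the HTML comes from a recognition model, header and cell text can contain stray characters, so it is worth validating the parsed rows before relying on them.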


Converting Elements to Dictionary or JSON
*****************************************

The final step in the process for most users is to convert the output to JSON.
You can convert a list of document elements to a list of dictionaries using the ``convert_to_dict`` function.
The workflow for using ``convert_to_dict`` appears below.


.. code:: python


	from unstructured.staging.base import convert_to_dict

	convert_to_dict(elements)
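
Once the elements are plain dictionaries, ordinary Python tools apply. The sketch below assumes each dictionary carries the four fields listed earlier (``type``, ``element_id``, ``text``, and ``metadata``); the sample values, including the IDs and the ``page_number`` metadata key, are hypothetical stand-ins for the output of ``convert_to_dict``.

.. code:: python

    from collections import defaultdict

    # Hypothetical sample shaped like the output of convert_to_dict
    element_dicts = [
        {
            "type": "Title",
            "element_id": "id-1",
            "text": "Risk Factors",
            "metadata": {"page_number": 1},
        },
        {
            "type": "NarrativeText",
            "element_id": "id-2",
            "text": "Our business is subject to many risks.",
            "metadata": {"page_number": 1},
        },
        {
            "type": "NarrativeText",
            "element_id": "id-3",
            "text": "These risks may affect our results.",
            "metadata": {"page_number": 2},
        },
    ]

    # Group the text of each element under its element type
    text_by_type = defaultdict(list)
    for item in element_dicts:
        text_by_type[item["type"]].append(item["text"])

    for element_type, texts in text_by_type.items():
        print(element_type, len(texts))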


The ``unstructured`` library also includes utilities for saving a list of elements to JSON and reading
a list of elements from JSON, as seen in the snippet below:

.. code:: python

    from unstructured.staging.base import elements_to_json, elements_from_json


    filename = "outputs.json"
    elements_to_json(elements, filename=filename)
    elements = elements_from_json(filename=filename)
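
The saved file can also be consumed without ``unstructured`` installed, for example in a downstream service. The following sketch assumes the file is a JSON array of element dictionaries with ``type`` and ``text`` keys; the ``sample`` data is hypothetical and is written to a temporary directory only to keep the example self-contained.

.. code:: python

    import json
    import tempfile
    from pathlib import Path

    # Hypothetical stand-in for a file produced by elements_to_json
    sample = [
        {"type": "Title", "element_id": "id-1", "text": "Overview", "metadata": {}},
        {
            "type": "NarrativeText",
            "element_id": "id-2",
            "text": "This section describes the product.",
            "metadata": {},
        },
    ]

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "outputs.json"
        path.write_text(json.dumps(sample, indent=2), encoding="utf-8")

        # Read the file back with the standard library only
        loaded = json.loads(path.read_text(encoding="utf-8"))

    # Keep only the narrative text for downstream use
    narrative = [item["text"] for item in loaded if item["type"] == "NarrativeText"]
    print(narrative)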


Unique Element IDs
******************

By default, the element ID is a SHA-256 hash of the element's text, its position on the page, the page number, and the name of the document file. This ensures that the ID is deterministic and unique at the document level.
To obtain globally unique IDs in the output (UUIDs), you can pass
``unique_element_ids=True`` into any of the partition functions. This can be helpful
if you'd like to use the IDs as a primary key in a database, for example.

.. code:: python

    from unstructured.partition.text import partition_text

    elements = partition_text(text="Here is some example text.", unique_element_ids=True)
    elements[0].id

Element ID Design Principles
""""""""""""""""""""""""""""""""""""

#. A partitioning function can assign only one of two available ID types to a returned element: a hash or a UUID.
#. All elements that are returned come with an ID, which is never None.
#. No matter which type of ID is used, it will always be in string format.
#. Partitioning a document returns elements with hashes as their default IDs, ensuring they are both deterministic and unique within a document.
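
The difference between the two ID types can be illustrated with the standard library. This is a sketch of the principles above, not the library's implementation: the function names, the field order, and the separator are assumptions, and the real hash input and output length may differ.

.. code:: python

    import hashlib
    import uuid


    def sketch_hash_id(text, sequence_number, page_number, filename):
        """Hypothetical helper: the same inputs always give the same ID."""
        # Join the fields with a separator so adjacent values cannot run together
        data = "\x00".join([filename, str(page_number), str(sequence_number), text])
        return hashlib.sha256(data.encode("utf-8")).hexdigest()


    def sketch_uuid_id():
        """Hypothetical helper: a random ID that differs on every call."""
        return str(uuid.uuid4())


    text = "Here is some example text."
    first = sketch_hash_id(text, 0, 1, "example.txt")
    again = sketch_hash_id(text, 0, 1, "example.txt")
    moved = sketch_hash_id(text, 1, 1, "example.txt")

    print(first == again)  # same text, position, page, and file
    print(first == moved)  # same text at a different position on the page
    print(sketch_uuid_id())

Both helpers return strings, in line with the third principle. Only the hash variant is reproducible across runs, which is why a UUID is the better fit when IDs must be unique across many documents.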

For creating elements independently of partitioning functions, refer to the ``Element`` class documentation in the source code (``unstructured/documents/elements.py``).


Wrapping it all up
******************

To conclude, the basic workflow for reading in a document and converting it to JSON in ``unstructured``
looks like the following:

.. code:: python

    from unstructured.partition.auto import partition
    from unstructured.staging.base import elements_to_json

    input_filename = "example-docs/example-10k.html"
    output_filename = "outputs.json"

    elements = partition(filename=input_filename)
    elements_to_json(elements, filename=output_filename)