.. role:: raw-html(raw)
    :format: html

.. _metadata-label:

Metadata
========

The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
For example, a user may want only the document elements from a given page number
or from an e-mail with a given subject line.

Metadata is tracked at the element level. You can extract the metadata for a given document element
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
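
As a rough sketch of downstream filtering, the snippet below works on plain dictionaries shaped
like the output of ``element.to_dict()``. The sample records and the ``filter_by_page`` helper are
hypothetical and only illustrate the idea.

.. code:: python

    # Hypothetical sample data shaped like element.to_dict() output.
    elements = [
        {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
        {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {"page_number": 2}},
        {"type": "NarrativeText", "text": "No page info.", "metadata": {}},
    ]


    def filter_by_page(element_dicts, page_number):
        """Keep only the elements whose metadata reports the given page number."""
        return [
            el for el in element_dicts
            if el.get("metadata", {}).get("page_number") == page_number
        ]


    page_two = filter_by_page(elements, 2)
    print([el["text"] for el in page_two])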


######################
Common Metadata Fields
######################

All document types return the following metadata fields when the information is available from
the source file:

+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Metadata Field Name         | Short Description                                        | Details                                                                                                                                                                                                                                                                                     |
+=============================+==========================================================+=============================================================================================================================================================================================================================================================================================+
| filename                    | Filename                                                 |                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| file_directory              | File Directory                                           |                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| last_modified               | Last Modified Date                                       |                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| filetype                    | File Type                                                |                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| coordinates                 | XY Bounding Box Coordinates                              | See notes below for further details about the bounding box.                                                                                                                                                                                                                                 |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| parent_id                   | Element Hierarchy (Parent ID)                            | `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a `NarrativeText` element may have a `Title` element as a parent (a "sub-title"), which in turn may have another `Title` element as its parent (a "title").             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category_depth              | Element Depth relative to                                | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies.                                                                                |
|                             | other elements of the same category                      | Category depth may be set using native document hierarchies, e.g.  reflecting \<H1>, \<H2>, or `\<H3>` tags within an HTML document or the indentation level of a bulleted list item in a Word document.                                                                                    |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| text_as_html                | HTML representation of extracted tables                  |  Only applicable to table elements.                                                                                                                                                                                                                                                         |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| languages                   | Document Languages                                       | At document level or element level. List is ordered by probability of being the primary language of the text.                                                                                                                                                                               |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| emphasized_text_contents    | Emphasized text (bold or italic) in the original document|                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| emphasized_text_tags        | Tags on text that is emphasized in the original document |                                                                                                                                                                                                                                                                                             |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| is_continuation             | True if element is a continuation of a previous element  | Only relevant for chunking, if an element was divided into two due to ``max_characters``.                                                                                                                                                                                                   |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| detection_class_prob        | Detection model class probabilities                      | From unstructured-inference, hi-res strategy.                                                                                                                                                                                                                                               |
+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
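
One way to use ``parent_id`` is to rebuild a simple outline of a document. The sketch below assumes
that each serialized element carries an ``element_id`` and that ``parent_id`` refers to such an ID.
The records and the ``outline`` helper are hypothetical.

.. code:: python

    from collections import defaultdict

    # Hypothetical records; the ID values are made up for illustration.
    elements = [
        {"element_id": "a", "type": "Title", "text": "Guide", "metadata": {}},
        {"element_id": "b", "type": "Title", "text": "Setup", "metadata": {"parent_id": "a"}},
        {"element_id": "c", "type": "NarrativeText", "text": "Install it.",
         "metadata": {"parent_id": "b"}},
    ]

    # Group elements by the ID of their parent (None for top-level elements).
    children = defaultdict(list)
    for el in elements:
        children[el["metadata"].get("parent_id")].append(el)


    def outline(parent_id=None, depth=0):
        """Return indented lines for every element below ``parent_id``."""
        lines = []
        for el in children.get(parent_id, []):
            lines.append("  " * depth + el["text"])
            lines.extend(outline(el["element_id"], depth + 1))
        return lines


    print("\n".join(outline()))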

:raw-html:`<br />`
Notes on common metadata fields:

Coordinates
-----------

Some document types support location data for the elements, usually in the form of bounding boxes.
If it exists, an element's location data is available with ``element.metadata.coordinates``.

The ``coordinates`` property of an ``ElementMetadata`` stores:

* ``points``: These specify the corners of the bounding box, starting from the top left corner and
  proceeding counter-clockwise. The points represent pixels. The origin is in the top left, and
  the ``y`` coordinate increases in the downward direction.
* ``system``: The coordinate system associated with the points. A typical example is
  ``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
  name, orientation, layout width, and layout height.

Information about the element’s coordinates (including the coordinate system name, coordinate points,
the layout width, and the layout height) can be accessed with ``element.to_dict()["metadata"]["coordinates"]``.
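
Because the points list the four corners of the box, its width and height can be derived with
plain arithmetic. The following is a small sketch that assumes the point ordering and the
downward ``y`` axis described above; the variable names are illustrative.

.. code:: python

    # Corner points, starting at the top left corner and proceeding counter-clockwise.
    points = ((10, 10), (10, 100), (200, 100), (200, 10))

    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    left, right = min(xs), max(xs)
    # y grows downward, so the top edge has the smallest y value.
    top, bottom = min(ys), max(ys)

    width = right - left
    height = bottom - top
    print(width, height)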

The coordinates of an element can be changed to a new coordinate system by using the
``Element.convert_coordinates_to_new_system`` method. If the ``in_place`` flag is ``True``, the
coordinate system and points of the element are updated in place and the new coordinates are
returned. If the ``in_place`` flag is ``False``, only the altered coordinates are returned.

.. code:: python

    from unstructured.documents.elements import Element
    from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem

    # Corner points of the bounding box, starting at the top left corner.
    coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
    # Pixel-based coordinate system for an 850 x 1100 layout.
    coordinate_system = PixelSpace(width=850, height=1100)

    element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
    # Inspect the original points and coordinate system.
    print(element.metadata.coordinates.to_dict())
    print(element.metadata.coordinates.system.orientation)
    print(element.metadata.coordinates.system.width)
    print(element.metadata.coordinates.system.height)

    # Convert in place; the element's metadata is updated.
    element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
    # Should now be in terms of new coordinate system
    print(element.metadata.coordinates.to_dict())
    print(element.metadata.coordinates.system.orientation)
    print(element.metadata.coordinates.system.width)
    print(element.metadata.coordinates.system.height)
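
To make the effect of such a conversion more concrete, the following is a hand-rolled sketch of
the arithmetic, not the library's implementation. It assumes that the target system is a unit
square and that its ``y`` axis points upward, which is why ``y`` is flipped by default.
``to_unit_square`` is a hypothetical helper.

.. code:: python

    def to_unit_square(points, width, height, flip_y=True):
        """Scale pixel points into a 1 x 1 square, optionally flipping the y axis."""
        converted = []
        for x, y in points:
            new_x = x / width
            new_y = y / height
            if flip_y:
                # Assumption: the target system measures y upward from the bottom.
                new_y = 1 - new_y
            converted.append((new_x, new_y))
        return tuple(converted)


    # Hypothetical 200 x 100 pixel layout, chosen so the results are easy to check.
    print(to_unit_square(((50, 25), (50, 75)), width=200, height=100))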

###########################################
Additional Metadata Fields by Document Type
###########################################

+-------------------------+---------------------+--------------------------------------------------------+
| Field Name              | Applicable Doc Types| Short Description                                      |
+=========================+=====================+========================================================+
| page_number             | DOCX, PDF, PPT, XLSX| Page Number                                            |
+-------------------------+---------------------+--------------------------------------------------------+
| page_name               | XLSX                | Sheet Name in Excel document                           |
+-------------------------+---------------------+--------------------------------------------------------+
| sent_from               | EML                 | Email Sender                                           |
+-------------------------+---------------------+--------------------------------------------------------+
| sent_to                 | EML                 | Email Recipient                                        |
+-------------------------+---------------------+--------------------------------------------------------+
| subject                 | EML                 | Email Subject                                          |
+-------------------------+---------------------+--------------------------------------------------------+
| attached_to_filename    | MSG                 | Filename that the attachment is attached to            |
+-------------------------+---------------------+--------------------------------------------------------+
| header_footer_type      | Word Doc            | Pages a header or footer applies to: "primary",        |
|                         |                     | "even_only", and "first_page"                          |
+-------------------------+---------------------+--------------------------------------------------------+
| link_urls               | HTML                | The URL associated with a link in a document.          |
+-------------------------+---------------------+--------------------------------------------------------+
| link_texts              | HTML                | The text associated with a link in a document.         |
+-------------------------+---------------------+--------------------------------------------------------+
| links                   | HTML                | List of {"text": "<text>", "url": "<url>"} items.      |
|                         |                     | Note: this field will be removed in the near future    |
|                         |                     | in favor of the above link_urls and link_texts.        |
+-------------------------+---------------------+--------------------------------------------------------+
| section                 | EPUB                | Book section title corresponding to table of contents  |
+-------------------------+---------------------+--------------------------------------------------------+
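
The ``link_texts`` and ``link_urls`` fields can be paired up to recover the shape of the older
``links`` field. The sketch below assumes that the two lists are aligned by position, which the
table above does not state explicitly; the sample metadata is hypothetical.

.. code:: python

    # Hypothetical metadata dictionary for an HTML element.
    metadata = {
        "link_texts": ["Docs", "Source"],
        "link_urls": ["https://example.com/docs", "https://example.com/src"],
    }

    # Assumption: the n-th text belongs to the n-th URL.
    links = [
        {"text": text, "url": url}
        for text, url in zip(metadata.get("link_texts", []), metadata.get("link_urls", []))
    ]
    print(links)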

:raw-html:`<br />`
Notes on additional metadata by document type:

Email
-----

Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata.
``sent_from`` is a list of strings because the `RFC 822 <https://www.rfc-editor.org/rfc/rfc822>`_
spec for emails allows a message to have more than one sender address.
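
Since ``sent_from`` is a list, a filter on the sender has to check every entry. A minimal sketch,
using hypothetical records shaped like ``element.to_dict()`` output and a hypothetical ``sent_by``
helper:

.. code:: python

    # Hypothetical email element records.
    elements = [
        {
            "text": "Please find the invoice attached.",
            "metadata": {
                "subject": "Invoice for March",
                "sent_from": ["Ann Example <ann@example.com>"],
                "sent_to": ["bob@example.com"],
            },
        },
        {
            "text": "Lunch on Friday?",
            "metadata": {
                "subject": "Lunch",
                "sent_from": ["Cy Example <cy@example.com>", "dee@example.com"],
                "sent_to": ["bob@example.com"],
            },
        },
    ]


    def sent_by(element_dicts, address):
        """Return elements where any ``sent_from`` entry mentions ``address``."""
        return [
            el for el in element_dicts
            if any(address in sender for sender in el["metadata"].get("sent_from", []))
        ]


    print([el["metadata"]["subject"] for el in sent_by(elements, "dee@example.com")])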


Microsoft Excel Documents
--------------------------

For Excel documents, ``ElementMetadata`` will contain a ``page_name`` field, which corresponds
to the sheet name in the Excel document.
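
A common use of ``page_name`` is grouping elements by sheet. The sketch below does this for
hypothetical records shaped like ``element.to_dict()`` output.

.. code:: python

    from collections import defaultdict

    # Hypothetical records from a workbook with two sheets.
    elements = [
        {"text": "Q1 totals", "metadata": {"page_name": "Summary"}},
        {"text": "Jan 10 20", "metadata": {"page_name": "Raw Data"}},
        {"text": "Q2 totals", "metadata": {"page_name": "Summary"}},
    ]

    # Collect the text of each element under its sheet name.
    by_sheet = defaultdict(list)
    for el in elements:
        by_sheet[el["metadata"].get("page_name")].append(el["text"])

    print(dict(by_sheet))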


Microsoft Word Documents
-------------------------

Headers and footers in Word documents include a ``header_footer_type`` indicating which pages
a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.


##############################
Data Connector Metadata Fields
##############################

Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.

Common Data Connector Metadata Fields
-------------------------------------

- Data Source metadata (on JSON output):
    - url
    - version
    - date created
    - date modified
    - date processed
    - record locator
- The record locator is specific to each connector.
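
As an illustration, connector metadata can be read from the JSON output with the standard
library. The key names below (``data_source``, ``record_locator``, ``remote_file_path``, and so on)
are assumptions based on the field names listed above, and the record itself is hypothetical.

.. code:: python

    import json

    # Hypothetical JSON output for one element; key names are assumptions.
    raw = """
    {
      "type": "NarrativeText",
      "text": "Hello from S3.",
      "metadata": {
        "data_source": {
          "url": "s3://example-bucket/docs/hello.txt",
          "date_processed": "2024-01-01T00:00:00",
          "record_locator": {
            "protocol": "s3",
            "remote_file_path": "example-bucket/docs/hello.txt"
          }
        }
      }
    }
    """

    element = json.loads(raw)
    # These fields only exist for connector-processed documents, so read them defensively.
    data_source = element["metadata"].get("data_source", {})
    record_locator = data_source.get("record_locator", {})
    print(record_locator.get("protocol"), record_locator.get("remote_file_path"))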

Additional Metadata Fields by Connector Type (via record locator)
-----------------------------------------------------------------

- airtable
    - base id
    - table id
    - view id
- azure (from fsspec)
    - protocol
    - remote file path
- box (from fsspec)
    - protocol
    - remote file path
- confluence
    - url
    - page id
- discord
    - channel
- dropbox (from fsspec)
    - protocol
    - remote file path
- elasticsearch
    - url
    - index name
    - document id
- fsspec
    - protocol
    - remote file path
- google drive
    - drive id
    - file id
- gcs (from fsspec)
    - protocol
    - remote file path
- jira
    - base url
    - issue key
- onedrive
    - user pname
    - server relative path
- outlook
    - message id
    - user email
- s3 (from fsspec)
    - protocol
    - remote file path
- sharepoint
    - server path
    - site url
- wikipedia
    - page title
    - page url


##########################
Advanced Metadata Options
##########################

Extract Metadata with Regexes
------------------------------

``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg.
Here is an example of how to extract regex metadata:


.. code:: python

  from unstructured.partition.text import partition_text

  text = "SPEAKER 1: It is my turn to speak now!"
  elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})
  elements[0].metadata.regex_metadata

The result will look like:


.. code:: python

  {'speaker':
    [
      {
        'text': 'SPEAKER 1:',
        'start': 0,
        'end': 10,
      }
    ]
  }
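
Judging from the example output, ``start`` and ``end`` behave like Python slice offsets into the
element text. The sketch below reproduces the same shape with the standard ``re`` module; it is
not the library's implementation, only a way to see how the offsets relate to the text.

.. code:: python

    import re

    text = "SPEAKER 1: It is my turn to speak now!"
    pattern = r"SPEAKER \d{1,3}:"

    # Record the matched text together with its start and end offsets.
    matches = [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, text)
    ]
    print(matches)

    # Slicing the text with the offsets recovers the matched string.
    first = matches[0]
    print(text[first["start"]:first["end"]])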
 								| detection_class_prob        | Detection model class probabilities                      | From unstructured-inference, hi-res strategy.                                                                                                                                                                                                                                               |
 								+-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 								:raw-html:`<br />`
 								Notes on common metadata fields:
 								Coordinates
 								-----------
 								Some document types support location data for the elements, usually in the form of bounding boxes.
 								If it exists, an element's location data is available with ``element.metadata.coordinates``.
 								The ``coordinates`` property of an ``ElementMetadata`` stores:

 								* ``points``: The corners of the bounding box, starting from the top left corner and
 								  proceeding counter-clockwise. The points represent pixels; the origin is at the top left, and
 								  the ``y`` coordinate increases in the downward direction.
 								* ``system``: The coordinate system associated with the points. A typical example is
 								  ``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
 								  name, orientation, layout width, and layout height.
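
 								As a minimal sketch of how ``points`` can be read, the snippet below derives the top left corner,
 								width, and height of a bounding box from its four corners. It assumes the ordering described above
 								and pixel units. ``bounding_box`` is a hypothetical helper written for this illustration; it is not
 								part of ``unstructured``.

 								.. code:: python

 									from typing import Sequence, Tuple

 									Point = Tuple[float, float]


 									def bounding_box(points: Sequence[Point]) -> Tuple[float, float, float, float]:
 									    """Return (x, y, width, height) for a box given by its corner points.

 									    Hypothetical helper for illustration. Taking the min and max on each
 									    axis means the result does not depend on the order of the corners.
 									    """
 									    xs = [x for x, _ in points]
 									    ys = [y for _, y in points]
 									    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


 									# Corners listed from the top left, proceeding counter-clockwise.
 									# y increases downward, so the second point is the bottom left corner.
 									points = ((10, 10), (10, 100), (200, 100), (200, 10))
 									x, y, width, height = bounding_box(points)
 									print(x, y, width, height)
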
 								Information about the element's coordinates (including the coordinate system name, coordinate points,
 								the layout width, and the layout height) can be accessed with ``element.to_dict()["metadata"]["coordinates"]``.
 								The coordinates of an element can be changed to a new coordinate system by using the
 								``Element.convert_coordinates_to_new_system`` method. If the ``in_place`` flag is ``True``, the
 								coordinate system and points of the element are updated in place and the new coordinates are
 								returned. If the ``in_place`` flag is ``False``, only the altered coordinates are returned.
 								.. code:: python

 									from unstructured.documents.elements import Element
 									from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem

 									# Bounding box corners in pixels, starting from the top left and proceeding counter-clockwise
 									coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
 									coordinate_system = PixelSpace(width=850, height=1100)

 									element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
 									print(element.metadata.coordinates.to_dict())
 									print(element.metadata.coordinates.system.orientation)
 									print(element.metadata.coordinates.system.width)
 									print(element.metadata.coordinates.system.height)

 									# Convert in place: the element's metadata is updated and the new coordinates are returned
 									element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)

 									# Should now be in terms of new coordinate system
 									print(element.metadata.coordinates.to_dict())
 									print(element.metadata.coordinates.system.orientation)
 									print(element.metadata.coordinates.system.width)
 									print(element.metadata.coordinates.system.height)
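
 								The arithmetic behind such a conversion can be sketched in plain Python. The snippet assumes the
 								target system is a unit square whose origin is at the bottom left with ``y`` increasing upward;
 								that orientation is an assumption made for this illustration, and ``pixel_to_relative`` is a
 								hypothetical helper, not the library's implementation.

 								.. code:: python

 									def pixel_to_relative(point, width, height):
 									    """Map a pixel-space point onto a unit square.

 									    Assumption for illustration: the source origin is at the top left
 									    (y grows downward) and the target origin is at the bottom left
 									    (y grows upward), so the y axis is flipped as well as scaled.
 									    """
 									    x, y = point
 									    return x / width, 1 - y / height


 									# Convert the four corners of a box on an 850 x 1100 pixel page.
 									corners = ((10, 10), (10, 100), (200, 100), (200, 10))
 									for corner in corners:
 									    print(pixel_to_relative(corner, width=850, height=1100))
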
 								###########################################
 								Additional Metadata Fields by Document Type
 								###########################################
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| Field Name              | Applicable Doc Types| Short Description                                      |
 								+=========================+=====================+========================================================+
 								| page_number             | DOCX, PDF, PPT, XLSX| Page Number                                            |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| page_name               | XLSX                | Sheet Name in Excel document                           |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| sent_from               | EML                 | Email Sender                                           |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| sent_to                 | EML                 | Email Recipient                                        |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| subject                 | EML                 | Email Subject                                          |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| attached_to_filename    | MSG                 | Filename that the attachment is attached to            |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| header_footer_type      | Word Doc            | Pages a header or footer applies to: "primary",        |
 								|                         |                     | "even_only", and "first_page"                          |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| link_urls               | HTML                | The URL associated with a link in a document.          |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| link_texts              | HTML                | The text associated with a link in a document.         |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| links                   | HTML                | List of {"text": "<the text>", "url": "<the url>"}     |
 								|                         |                     | items. Note: this field will be removed in the near    |
 								|                         |                     | future in favor of link_urls and link_texts above.     |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								| section                 | EPUB                | Book section title corresponding to table of contents  |
 								+-------------------------+---------------------+--------------------------------------------------------+
 								:raw-html:`<br />`
 								Notes on additional metadata by document type:
 								Email
 								-----
 								Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata.
 								``sent_from`` is a list of strings because the `RFC 822 <https://www.rfc-editor.org/rfc/rfc822>`_
 								specification allows a message to have more than one sender address.
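
 								A minimal sketch of filtering on these fields follows. It works on plain dictionaries shaped like
 								the output of ``element.to_dict()``; the sample values and the ``with_subject`` helper are
 								hypothetical and exist only for this illustration.

 								.. code:: python

 									# Plain dictionaries shaped like the output of ``element.to_dict()``.
 									# The values are made up for illustration.
 									elements = [
 									    {
 									        "text": "Quarterly numbers attached.",
 									        "metadata": {"subject": "Q3 report", "sent_from": ["alice@example.com"]},
 									    },
 									    {
 									        "text": "Lunch on Friday?",
 									        "metadata": {"subject": "Lunch", "sent_from": ["bob@example.com"]},
 									    },
 									]


 									def with_subject(elements, subject):
 									    """Keep the elements whose ``subject`` metadata matches exactly."""
 									    return [e for e in elements if e["metadata"].get("subject") == subject]


 									matches = with_subject(elements, "Q3 report")
 									print([e["text"] for e in matches])
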
 								Microsoft Excel Documents
 								--------------------------
 								For Excel documents, ``ElementMetadata`` contains a ``page_name`` field, which corresponds
 								to the sheet name in the Excel document.
 								Microsoft Word Documents
 								-------------------------
 								Headers and footers in Word documents include a ``header_footer_type`` indicating which pages
 								the header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.
 								##############################
 								Data Connector Metadata Fields
 								##############################
 								Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.
 								Common Data Connector Metadata Fields
 								-------------------------------------
 								- Data Source metadata (on json output):
 								    - url
 								    - version
 								    - date created
 								    - date modified
 								    - date processed
 								    - record locator
 								- Record locator is specific to each connector
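
 								As a rough sketch, the record locator could be read from serialized elements as shown below.
 								The key names ``data_source`` and ``record_locator``, the sample values, and the
 								``record_locator`` helper are assumptions made for this illustration; check the JSON produced by
 								your connector for the exact names.

 								.. code:: python

 									# Dictionaries shaped like elements in the JSON output.
 									# Key names and values are assumptions for illustration.
 									elements = [
 									    {
 									        "text": "Processed through a connector",
 									        "metadata": {
 									            "data_source": {
 									                "url": "s3://example-bucket/docs/a.pdf",
 									                "record_locator": {
 									                    "protocol": "s3",
 									                    "remote_file_path": "example-bucket/docs/a.pdf",
 									                },
 									            }
 									        },
 									    },
 									    {"text": "Processed from a local file", "metadata": {}},
 									]


 									def record_locator(element):
 									    """Return the record locator, or None when no connector metadata exists."""
 									    data_source = element.get("metadata", {}).get("data_source", {})
 									    return data_source.get("record_locator")


 									for element in elements:
 									    print(element["text"], record_locator(element))
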
 								Additional Metadata Fields by Connector Type (via record locator)
 								-----------------------------------------------------------------
 								- airtable
 								    - base id
 								    - table id
 								    - view id
 								- azure (from fsspec)
 								    - protocol
 								    - remote file path
 								- box (from fsspec)
 								    - protocol
 								    - remote file path
 								- confluence
 								    - url
 								    - page id
 								- discord
 								    - channel
 								- dropbox (from fsspec)
 								    - protocol
 								    - remote file path
 								- elasticsearch
 								    - url
 								    - index name
 								    - document id
 								- fsspec
 								    - protocol
 								    - remote file path
 								- google drive
 								    - drive id
 								    - file id
 								- gcs (from fsspec)
 								    - protocol
 								    - remote file path
 								- jira
 								    - base url
 								    - issue key
 								- onedrive
 								    - user pname
 								    - server relative path
 								- outlook
 								    - message id
 								    - user email
 								- s3 (from fsspec)
 								    - protocol
 								    - remote file path
 								- sharepoint
 								    - server path
 								    - site url
 								- wikipedia
 								    - page title
 								    - page url
 								##########################
 								Advanced Metadata Options
 								##########################
 								Extract Metadata with Regexes
 								------------------------------
 								``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg.
 								Here is an example of how to extract regex metadata:
 								.. code:: python
 								  from unstructured.partition.text import partition_text
 								  text = "SPEAKER 1: It is my turn to speak now!"
 								  elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})
 								  elements[0].metadata.regex_metadata
 								The result will look like:
 								.. code:: python
 								  {'speaker':
 								    [
 								      {
 								        'text': 'SPEAKER 1:',
 								        'start': 0,
 								        'end': 10,
 								      }
 								    ]
 								  }
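
 								The shape of this result mirrors what Python's ``re`` module reports for each match. The sketch
 								below reproduces that shape with ``re.finditer``; ``find_matches`` is a hypothetical helper for
 								illustration and is not how ``unstructured`` is implemented.

 								.. code:: python

 									import re


 									def find_matches(text, patterns):
 									    """Collect the text, start, and end of every match for each named pattern."""
 									    return {
 									        name: [
 									            {"text": m.group(), "start": m.start(), "end": m.end()}
 									            for m in re.finditer(pattern, text)
 									        ]
 									        for name, pattern in patterns.items()
 									    }


 									text = "SPEAKER 1: It is my turn to speak now!"
 									print(find_matches(text, {"speaker": r"SPEAKER \d{1,3}:"}))
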