.. role:: raw-html(raw) :format: html .. _metadata-label: Metadata ======== The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents. Tracking metadata enables users to filter document elements downstream based on element metadata of interest. For example, a user may be interested in selected document elements from a given page number or an e-mail with a given subject line. Metadata is tracked at the element level. You can extract the metadata for a given document element with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``. ###################### Common Metadata Fields ###################### All document types return the following metadata fields when the information is available from the source file: +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Metadata Field Name | Short Description | Details | +=============================+==========================================================+=============================================================================================================================================================================================================================================================================================+ | filename | Filename | | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | file_directory | File Directory | | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | last_modified | Last Modified Date | | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | filetype | File Type | | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | coordinates | XY Bounding Box Coordinates | See notes below for further details about the bounding box. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | parent_id | Element Hierarchy (Parent ID) | `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a `NarrativeText` element may have a `Title` element as a parent (a "sub-title"), which in turn may have another `Title` element as its parent (a "title). | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | category_depth | Element Depth relative to | Category depth is the depth of an element relative to other elements of the same category. It's set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. | | | other elements of the same category | Category depth may be set using native document hierarchies, e.g. reflecting \

, \

, or `\

` tags within an HTML document or the indentation level of a bulleted list item in a Word document. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | text_as_html | HTML representation of extracted tables | Only applicable to table elements. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | languages | Document Languages | At document level or element level. List is ordered by probability of being the primary language of the text. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | emphasized_text_contents | Emphasized text (bold or italic) in the original document| | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | emphasized_text_tags | Tags on text that is emphasized in the original document | | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | is_continuation | True if element is a continuation of a previous element | Only relevant for chunking, if an element was divided into two due to ``max_characters``. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | detection_class_prob | Detection model class probabilities | From unstructured-inference, hi-res strategy. | +-----------------------------+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ :raw-html:`
` Notes on common metadata fields: Coordinates ----------- Some document types support location data for the elements, usually in the form of bounding boxes. If it exists, an element's location data is available with ``element.metadata.coordinates``. The ``coordinates`` property of an ``ElementMetadata`` stores: * ``points`` : These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left and the ``y`` coordinate increases in the downward direction. * ``system``: The points have an associated coordinate system. A typical example of a coordinate system is ``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height. Information about the element’s coordinates (including the coordinate system name, coordinate points, the layout width, and the layout height) can be accessed with `element.to_dict()["metadata"]["coordinates"]`. The coordinates of an element can be changed to a new coordinate system by using the ``Element.convert_coordinates_to_new_system`` method. If the ``in_place`` flag is ``True``, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the ``in_place`` flag is ``False``, only the altered coordinates are returned. .. code:: python from unstructured.documents.elements import Element from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem coordinates = ((10, 10), (10, 100), (200, 100), (200, 10)) coordinate_system = PixelSpace(width=850, height=1100) element = Element(coordinates=coordinates, coordinate_system=coordinate_system) print(element.metadata.coordinates.to_dict()) print(element.metadata.coordinates.system.orientation) print(element.metadata.coordinates.system.width) print(element.metadata.coordinates.system.height) element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True) # Should now be in terms of new coordinate system print(element.metadata.coordinates.to_dict()) print(element.metadata.coordinates.system.orientation) print(element.metadata.coordinates.system.width) print(element.metadata.coordinates.system.height) ########################################### Additional Metadata Fields by Document Type ########################################### +-------------------------+---------------------+--------------------------------------------------------+ | Field Name | Applicable Doc Types| Short Description | +=========================+=====================+========================================================+ | page_number | DOCX,PDF, PPT,XLSX | Page Number | +-------------------------+---------------------+--------------------------------------------------------+ | page_name | XLSX | Sheet Name in Excel document | +-------------------------+---------------------+--------------------------------------------------------+ | sent_from | EML | Email Sender | +-------------------------+---------------------+--------------------------------------------------------+ | sent_to | EML | Email Recipient | +-------------------------+---------------------+--------------------------------------------------------+ | subject | EML | Email Subject | +-------------------------+---------------------+--------------------------------------------------------+ | attached_to_filename | MSG | filename that attachment file is attached to | +-------------------------+---------------------+--------------------------------------------------------+ | header_footer_type | Word Doc | Pages a header or footer applies to: "primary", | | | | "even_only", and "first_page" | +-------------------------+---------------------+--------------------------------------------------------+ | link_urls | HTML | The url associated with a link in a document. | +-------------------------+---------------------+--------------------------------------------------------+ | link_texts | HTML | The text associated with a link in a document. | +-------------------------+---------------------+--------------------------------------------------------+ | links | HTML | List of {”text”: “, “url”: } items. | | | | Note: this element will be removed in the near future | | | | in favor of the above link_urls and link_texts. | +-------------------------+---------------------+--------------------------------------------------------+ | section | EPUB | Book section title corresponding to table of contents | +-------------------------+---------------------+--------------------------------------------------------+ :raw-html:`
` Notes on additional metadata by document type: Email ----- Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata. ``sent_from`` is a list of strings because the `RFC 822 `_ spec for emails allows for multiple sent from email addresses. Microsoft Excel Documents -------------------------- For Excel documents, ``ElementMetadata`` will contain a ``page_name`` element, which corresponds to the sheet name in the Excel document. Microsoft Word Documents ------------------------- Headers and footers in Word documents include a ``header_footer_type`` indicating which page a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``. ############################## Data Connector Metadata Fields ############################## Documents processed through unstructured-ingest connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector. Common Data Connector Metadata Fields ------------------------------------- - Data Source metadata (on json output): - url - version - date created - date modified - date processed - record locator - Record locator is specific to each connector Additional Metadata Fields by Connector Type (via record locator) ----------------------------------------------------------------- - airtable - base id - table id - view id - azure (from fsspec) - protocol - remote file path - box (from fsspec) - protocol - remote file path - confluence - url - page id - discord - channel - dropbox (from fsspec) - protocol - remote file path - elasticsearch - url - index name - document id - fsspec - protocol - remote file path - google drive - drive id - file id - gcs (from fsspec) - protocol - remote file path - jira - base url - issue key - onedrive - user pname - server relative path - outlook - message id - user email - s3 (from fsspec) - protocol - remote file path - sharepoint - server path - site url - wikipedia - page title - page url ########################## Advanced Metadata Options ########################## Extract Metadata with Regexes ------------------------------ ``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg. Here is an example of how to extract regex metadata: .. code:: python from unstructured.partition.text import partition_text text = "SPEAKER 1: It is my turn to speak now!" elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"}) elements[0].metadata.regex_metadata The result will look like: .. code:: python {'speaker': [ { 'text': 'SPEAKER 1:', 'start': 0, 'end': 10, } ] }