2023-06-16 10:10:56 -04:00
|
|
|
|
Metadata
|
|
|
|
|
========
|
|
|
|
|
|
|
|
|
|
The ``unstructured`` package tracks a variety of metadata about Elements extracted from documents.
|
|
|
|
|
Tracking metadata enables users to filter document elements downstream based on element metadata of interest.
|
|
|
|
|
For example, a user may be interested in selected document elements from a given page number
|
|
|
|
|
or an e-mail with a given subject line.
|
|
|
|
|
|
|
|
|
|
Metadata is tracked at the element level. You can extract the metadata for a given document element
|
|
|
|
|
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
|
|
|
|
|
All document types return the following metadata fields when the information is available from
|
|
|
|
|
the source file:
|
|
|
|
|
|
|
|
|
|
* ``filename``
|
|
|
|
|
* ``file_directory``
|
|
|
|
|
* ``date``
|
|
|
|
|
* ``filetype``
|
|
|
|
|
* ``page_number``
|
|
|
|
|
|
|
|
|
|
|
2023-07-05 11:25:11 -07:00
|
|
|
|
####################
|
|
|
|
|
Element coordinates
|
|
|
|
|
####################
|
|
|
|
|
|
|
|
|
|
Some document types support location data for the elements, usually in the form of bounding boxes.
|
|
|
|
|
If it exists, an element's location data is available with ``element.metadata.coordinates``.
|
|
|
|
|
|
|
|
|
|
The ``coordinates`` property of an ``ElementMetadata`` stores:
|
2023-09-19 15:32:46 -07:00
|
|
|
|
|
2023-07-05 11:25:11 -07:00
|
|
|
|
* points: These specify the corners of the bounding box starting from the top left corner and
|
2023-09-19 15:32:46 -07:00
|
|
|
|
proceeding counter-clockwise. The points represent pixels, the origin is in the top left and
|
|
|
|
|
the ``y`` coordinate increases in the downward direction.
|
2023-07-05 11:25:11 -07:00
|
|
|
|
* system: The points have an associated coordinate system. A typical example of a coordinate system is
|
2023-09-19 15:32:46 -07:00
|
|
|
|
``PixelSpace``, which is used for representing the coordinates of images. The coordinate system has a
|
|
|
|
|
name, orientation, layout width, and layout height.
|
2023-07-05 11:25:11 -07:00
|
|
|
|
|
|
|
|
|
Information about the element’s coordinates (including the coordinate system name, coordinate points,
|
|
|
|
|
the layout width, and the layout height) can be accessed with `element.to_dict()["metadata"]["coordinates"]`.
|
|
|
|
|
|
|
|
|
|
The coordinates of an element can be changed to a new coordinate system by using the
|
|
|
|
|
``Element.convert_coordinates_to_new_system`` method. If the ``in_place`` flag is ``True``, the
|
|
|
|
|
coordinate system and points of the element are updated in place and the new coordinates are
|
|
|
|
|
returned. If the ``in_place`` flag is ``False``, only the altered coordinates are returned.
|
|
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
|
|
from unstructured.documents.elements import Element
|
|
|
|
|
from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem
|
|
|
|
|
|
|
|
|
|
coordinates = ((10, 10), (10, 100), (200, 100), (200, 10))
|
|
|
|
|
coordinate_system = PixelSpace(width=850, height=1100)
|
|
|
|
|
element = Element(coordinates=coordinates, coordinate_system=coordinate_system)
|
|
|
|
|
print(element.metadata.coordinates.to_dict())
|
|
|
|
|
print(element.metadata.coordinates.system.orientation)
|
|
|
|
|
print(element.metadata.coordinates.system.width)
|
|
|
|
|
print(element.metadata.coordinates.system.height)
|
|
|
|
|
element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True)
|
|
|
|
|
# Should now be in terms of new coordinate system
|
|
|
|
|
print(element.metadata.coordinates.to_dict())
|
|
|
|
|
print(element.metadata.coordinates.system.orientation)
|
|
|
|
|
print(element.metadata.coordinates.system.width)
|
|
|
|
|
print(element.metadata.coordinates.system.height)
|
|
|
|
|
|
2023-06-16 10:10:56 -04:00
|
|
|
|
Email
|
|
|
|
|
-----
|
|
|
|
|
|
|
|
|
|
Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata.
|
|
|
|
|
``sent_from`` is a list of strings because the `RFC 822 <https://www.rfc-editor.org/rfc/rfc822>`_
|
|
|
|
|
spec for emails allows for multiple sent from email addresses.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Microsoft Excel Documents
|
|
|
|
|
--------------------------
|
|
|
|
|
|
|
|
|
|
For Excel documents, ``ElementMetadata`` will contain a ``page_name`` element, which corresponds
|
|
|
|
|
to the sheet name in the Excel document.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Microsoft Word Documents
|
|
|
|
|
-------------------------
|
|
|
|
|
|
|
|
|
|
Headers and footers in Word documents include a ``header_footer_type`` indicating which page
|
|
|
|
|
a header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Webpages
|
|
|
|
|
---------
|
|
|
|
|
|
|
|
|
|
Elements from webpages will include a ``url`` metadata field, corresponding to the URL for the webpage.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
##########################
|
|
|
|
|
Advanced Metadata Options
|
2023-08-21 10:27:32 -07:00
|
|
|
|
##########################
|
2023-06-16 10:10:56 -04:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Extract Metadata with Regexes
|
|
|
|
|
------------------------------
|
|
|
|
|
|
|
|
|
|
``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg.
|
|
|
|
|
Here is an example of how to extract regex metadata:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
|
|
from unstructured.partition.text import partition_text
|
|
|
|
|
|
|
|
|
|
text = "SPEAKER 1: It is my turn to speak now!"
|
|
|
|
|
elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})
|
|
|
|
|
elements[0].metadata.regex_metadata
|
|
|
|
|
|
|
|
|
|
The result will look like:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
|
|
{'speaker':
|
|
|
|
|
[
|
|
|
|
|
{
|
|
|
|
|
'text': 'SPEAKER 1:',
|
|
|
|
|
'start': 0,
|
|
|
|
|
'end': 10,
|
|
|
|
|
}
|
|
|
|
|
]
|
|
|
|
|
}
|