We highly recommend installing ``libmagic``. Without it, you may observe different
file detection behavior.

##################
Document elements
##################

When we partition a document, the output is a list of document ``Element`` objects.
These element objects represent different components of the source document. Currently, the ``unstructured`` library supports the following element types:

* ``Element``
* ``Text``
* ``FigureCaption``
* ``NarrativeText``
* ``ListItem``
* ``Title``
* ``Address``
* ``PageBreak``
* ``CheckBox``
* ``Image``

Other element types that we will add in the future include tables and figures.
Different partitioning functions use different methods for determining the element type and extracting the associated content.
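
To see which element types a partitioning function produced, you can tally the class names in its output.
The sketch below is a minimal illustration, not the library's own tooling. It uses hypothetical stand-in classes and sample text so that it runs without ``unstructured`` installed. With the library available, the same ``Counter`` call should apply to the list returned by a partitioning function.

.. code:: python

    from collections import Counter


    # Stand-in classes for illustration only (an assumption of this sketch).
    # The real classes live in unstructured.documents.elements.
    class Element:
        def __init__(self, text=""):
            self.text = text

        def __str__(self):
            return self.text


    class Title(Element):
        pass


    class NarrativeText(Element):
        pass


    class ListItem(Element):
        pass


    # Hypothetical partition output, made up for this example.
    elements = [
        Title("Risk Factors"),
        NarrativeText("Our business depends on key suppliers. A disruption could hurt results."),
        ListItem("Supply chain disruption"),
        ListItem("Currency fluctuations"),
    ]

    # Tally elements by their class name.
    type_counts = Counter(type(element).__name__ for element in elements)

    for name, count in type_counts.most_common():
        print(f"{name}: {count}")

A tally like this is a quick way to check how much of a document is titles or list items before deciding which element types to keep.
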
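Document elements have a ``str`` representation. You can print them using the snippet below.

.. code:: python

    from unstructured.partition.auto import partition

    elements = partition(filename="example-10k.html")

    # Print the first five elements, separated by blank lines
    for element in elements[:5]:
        print(element)
        print("\n")
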
One helpful aspect of document elements is that they allow you to cut a document down to the elements that you need for your particular use case.
For example, if you're training a summarization model, you may only want to include narrative text for model training.
You'll notice that the output above includes a lot of titles and other content that may not be suitable for a summarization model.
The following code shows how you can limit your output to only narrative text with more than two sentences. As you can see, the output now only contains narrative text.

.. code:: python

    from unstructured.documents.elements import NarrativeText
    from unstructured.partition.text_type import sentence_count

    for element in elements[:100]:
        if isinstance(element, NarrativeText) and sentence_count(element.text) > 2: