Bricks
======

The goal of this page is to introduce you to the concept of bricks. Bricks are functions that live in ``unstructured`` and are the primary public API for the library. There are three types of bricks in ``unstructured``, corresponding to the different stages of document pre-processing: partitioning, cleaning, and staging.

After reading this section, you should understand the following:

* How to extract content from a document using partitioning bricks.
* How to remove unwanted content from document elements using cleaning bricks.
* How to prepare data for downstream use cases using staging bricks.

############
Partitioning
############

Partitioning bricks in ``unstructured`` allow users to extract structured content from a raw unstructured document. These functions break a document down into elements such as ``Title``, ``NarrativeText``, and ``ListItem``, enabling users to decide what content they'd like to keep for their particular application. If you're training a summarization model, for example, you may only be interested in ``NarrativeText``.

The easiest way to partition documents in ``unstructured`` is to use the ``partition`` brick. If you call the ``partition`` brick, ``unstructured`` will use ``libmagic`` to automatically determine the file type and invoke the appropriate partition function. In cases where ``libmagic`` is not available, filetype detection will fall back to using the file extension.

As shown in the examples below, the ``partition`` function accepts both filenames and file-like objects as input. ``partition`` also has some optional kwargs. For example, if you set ``include_page_breaks=True``, the output will include ``PageBreak`` elements if the filetype supports it. Additionally, you can bypass the filetype detection logic with the optional ``content_type`` argument, which may be specified with either the ``filename`` or the file-like object, ``file``. You can find a full listing of optional kwargs in the documentation below.

.. code:: python

  import os

  from unstructured.partition.auto import partition

  # Assumed location of the sample documents; adjust to your checkout.
  EXAMPLE_DOCS_DIRECTORY = "example-docs"

  filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
  elements = partition(filename=filename, content_type="application/pdf")
  print("\n\n".join([str(el) for el in elements][:10]))

.. code:: python

  import os

  from unstructured.partition.auto import partition

  # Assumed location of the sample documents; adjust to your checkout.
  EXAMPLE_DOCS_DIRECTORY = "example-docs"

  filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
  with open(filename, "rb") as f:
      elements = partition(file=f, include_page_breaks=True)
  print("\n\n".join([str(el) for el in elements][5:15]))

The ``unstructured`` library also includes partitioning bricks targeted at specific document types. The ``partition`` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partitioning brick instead of ``partition``:

* If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly, or passing in the ``content_type``, will make your program run faster.
* Fewer dependencies. You don't need to install ``libmagic`` for filetype detection if you're only using document-specific bricks.
* Additional features. The API for ``partition`` is the least common denominator for all document types. Certain document-specific bricks include extra features that you may want to take advantage of. For example, ``partition_html`` allows you to pass in a URL so you don't have to store the ``.html`` file locally. See the documentation below to learn about the options available in each partitioning brick.

The following example partitions a document directly from a URL using ``partition_html``.

.. code:: python

  from unstructured.partition.html import partition_html

  url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
  elements = partition_html(url=url)
  print("\n\n".join([str(el) for el in elements]))

``partition``
--------------

The ``partition`` brick is the simplest way to partition a document in ``unstructured``. If you call the ``partition`` function, ``unstructured`` will attempt to detect the file type and route it to the appropriate partitioning brick. All partitioning bricks called within ``partition`` are called using the default kwargs. Use the document-type specific bricks if you need to apply non-default settings.

``partition`` currently supports ``.docx``, ``.doc``, ``.odt``, ``.pptx``, ``.ppt``, ``.xlsx``, ``.csv``, ``.tsv``, ``.eml``, ``.msg``, ``.rtf``, ``.epub``, ``.html``, ``.xml``, ``.pdf``, ``.png``, ``.jpg``, and ``.txt`` files.

If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``, ``.png``, and ``.jpg``.

The ``strategy`` kwarg controls the strategy for partitioning documents. Generally available strategies are ``"fast"`` for faster processing and ``"hi_res"`` for more accurate processing.

.. code:: python

  import docx

  from unstructured.partition.auto import partition

  # Build a small Word document to partition.
  document = docx.Document()
  document.add_paragraph("Important Analysis", style="Heading 1")
  document.add_paragraph("Here is my first thought.", style="Body Text")
  document.add_paragraph("Here is my second thought.", style="Normal")
  document.save("mydoc.docx")

  # Partition from a filename ...
  elements = partition(filename="mydoc.docx")

  # ... or from a file-like object.
  with open("mydoc.docx", "rb") as f:
      elements = partition(file=f)

.. code:: python

  from unstructured.partition.auto import partition

  elements = partition(filename="example-docs/layout-parser-paper-fast.pdf")

The ``partition`` function also accepts a ``url`` kwarg for remotely hosted documents.
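
One way to picture how the MIME type of a remote document is chosen: an explicit ``content_type`` takes precedence, and otherwise the ``Content-Type`` header of the HTTP response is used. The sketch below is illustrative only and is not the library's implementation. ``resolve_content_type`` is a hypothetical helper, and dropping header parameters such as ``charset`` is an assumption made for the example.

.. code:: python

  from typing import Mapping, Optional


  def resolve_content_type(
      headers: Mapping[str, str], content_type: Optional[str] = None
  ) -> Optional[str]:
      """Hypothetical helper: pick the MIME type for a fetched document."""
      # An explicit content_type always wins over the response header.
      if content_type is not None:
          return content_type
      # This sketch looks up the exact key "Content-Type"; real HTTP
      # header names are case-insensitive.
      header_value = headers.get("Content-Type")
      if header_value is None:
          return None
      # Assumption: keep only the media type and drop parameters
      # such as "; charset=utf-8".
      return header_value.split(";")[0].strip().lower()


  response_headers = {"Content-Type": "text/plain; charset=utf-8"}
  detected = resolve_content_type(response_headers)
  forced = resolve_content_type(response_headers, content_type="text/markdown")
  print(detected, forced)

Here ``detected`` comes from the response header, while ``forced`` reflects the caller's override.
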
If you want to force ``partition`` to treat the document as a particular MIME type, use the ``content_type`` kwarg in conjunction with ``url``. Otherwise, ``partition`` will use the information from the ``Content-Type`` header in the HTTP response. The ``ssl_verify`` kwarg controls whether or not SSL verification is enabled for the HTTP request. By default it is on. Use ``ssl_verify=False`` to disable SSL verification in the request.

.. code:: python

  from unstructured.partition.auto import partition

  url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md"

  # Rely on the Content-Type header of the response ...
  elements = partition(url=url)

  # ... or force a particular MIME type.
  elements = partition(url=url, content_type="text/markdown")

For more information about the ``partition`` brick, you can check the `source code here `_.

``partition_csv``
------------------

The ``partition_csv`` function pre-processes CSV files. The output is a single ``Table`` element. The ``text_as_html`` attribute in the element metadata will contain an HTML representation of the table.

Examples:

.. code:: python

  from unstructured.partition.csv import partition_csv

  elements = partition_csv(filename="example-docs/stanley-cups.csv")
  print(elements[0].metadata.text_as_html)

For more information about the ``partition_csv`` brick, you can check the `source code here `_.

``partition_doc``
------------------

The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents saved in the ``.doc`` format. This partition brick uses a combination of the styling information in the document and the structure of the text to determine the type of a text element. ``partition_doc`` can take a filename or file-like object as input. ``partition_doc`` uses ``libreoffice`` to convert the file to ``.docx`` and then calls ``partition_docx``. Ensure you have ``libreoffice`` installed before using ``partition_doc``.

Examples:

.. code:: python

  from unstructured.partition.doc import partition_doc

  elements = partition_doc(filename="example-docs/fake.doc")

For more information about the ``partition_doc`` brick, you can check the `source code here `_.

``partition_docx``
------------------

The ``partition_docx`` partitioning brick pre-processes Microsoft Word documents saved in the ``.docx`` format. This partition brick uses a combination of the styling information in the document and the structure of the text to determine the type of a text element. ``partition_docx`` can take a filename or file-like object as input, as shown in the two examples below.

Examples:

.. code:: python

  import docx

  from unstructured.partition.docx import partition_docx

  # Build a small Word document to partition.
  document = docx.Document()
  document.add_paragraph("Important Analysis", style="Heading 1")
  document.add_paragraph("Here is my first thought.", style="Body Text")
  document.add_paragraph("Here is my second thought.", style="Normal")
  document.save("mydoc.docx")

  # Partition from a filename ...
  elements = partition_docx(filename="mydoc.docx")

  # ... or from a file-like object.
  with open("mydoc.docx", "rb") as f:
      elements = partition_docx(file=f)

In Word documents, headers and footers are specified per section. In the output, the ``Header`` elements will appear at the beginning of a section and ``Footer`` elements will appear at the end. Microsoft Word headers and footers have a ``header_footer_type`` metadata field indicating where the header or footer applies. Valid values are ``"primary"``, ``"first_page"``, and ``"even_page"``.

``partition_docx`` will include page numbers in the document metadata when page breaks are present in the document. The function will detect user-inserted page breaks and page breaks inserted by the Word document renderer. Some (but not all) Word document renderers insert page breaks when you save the document. If your Word document renderer does not do that, you may not see page numbers in the output even if you see them visually when you open the document.
If that is the case, you can try saving the document with a different renderer.

For more information about the ``partition_docx`` brick, you can check the `source code here `_.

``partition_email``
---------------------

The ``partition_email`` function partitions ``.eml`` documents and works with exports from email clients such as Microsoft Outlook and Gmail. ``partition_email`` takes a filename, file-like object, or raw text as input and produces a list of document ``Element`` objects as output. The ``content_source`` kwarg can be set to ``text/html`` (default) or ``text/plain`` to process the HTML or plain text version of the email, respectively. For ``partition_email`` to also return the header information (e.g. sender, recipient, attachment, etc.), ``include_headers`` must be set to ``True``. If ``include_headers`` is ``True``, the function returns a tuple with the body elements first and the header elements second.

Examples:

.. code:: python

  from unstructured.partition.email import partition_email

  # From a filename
  elements = partition_email(filename="example-docs/fake-email.eml")

  # From a file-like object
  with open("example-docs/fake-email.eml", "r") as f:
      elements = partition_email(file=f)

  # From raw text
  with open("example-docs/fake-email.eml", "r") as f:
      text = f.read()
  elements = partition_email(text=text)

  # Process the plain text version of the email instead of the HTML version
  with open("example-docs/fake-email.eml", "r") as f:
      text = f.read()
  elements = partition_email(text=text, content_source="text/plain")

  # Also return the header information
  with open("example-docs/fake-email.eml", "r") as f:
      text = f.read()
  elements = partition_email(text=text, include_headers=True)

``partition_email`` includes a ``max_partition`` parameter that indicates the maximum character length for a document element. This parameter only applies if ``"text/plain"`` is selected as the ``content_source``. The default value is ``1500``, which roughly corresponds to the average character length for a paragraph. You can disable ``max_partition`` by setting it to ``None``.

You can optionally partition e-mail attachments by setting ``process_attachments=True``.
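
Conceptually, attachment processing hands each attachment of a message to a caller-supplied partitioning function. The sketch below illustrates that pattern using only the Python standard library. It is not how ``unstructured`` implements the feature, and ``build_message``, ``partition_attachments``, and ``line_partitioner`` are hypothetical names introduced for the example.

.. code:: python

  import io
  from email import message_from_bytes, policy
  from email.message import EmailMessage


  def build_message() -> bytes:
      """Create a small e-mail with one attachment for the demonstration."""
      msg = EmailMessage()
      msg["From"] = "sender@example.com"
      msg["To"] = "recipient@example.com"
      msg["Subject"] = "Report"
      msg.set_content("See the attached notes.")
      msg.add_attachment(
          b"first line\nsecond line\n",
          maintype="application",
          subtype="octet-stream",
          filename="notes.txt",
      )
      return msg.as_bytes()


  def partition_attachments(raw, attachment_partitioner):
      """Hypothetical helper: apply a partitioner to every attachment."""
      msg = message_from_bytes(raw, policy=policy.default)
      results = {}
      for part in msg.iter_attachments():
          # Decode the transfer encoding (base64 here) back to raw bytes.
          payload = part.get_payload(decode=True)
          results[part.get_filename()] = attachment_partitioner(
              file=io.BytesIO(payload)
          )
      return results


  def line_partitioner(file):
      """Stand-in partitioner: one "element" per non-empty line."""
      text = file.read().decode("utf-8")
      return [line for line in text.splitlines() if line.strip()]


  results = partition_attachments(build_message(), line_partitioner)
  print(results)

Because the partitioner is passed in as a callable, the same loop works with any function that accepts a file-like object.
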
If you set ``process_attachments=True``, you'll also need to pass in a partitioning function to ``attachment_partitioner``. The following is an example of what the workflow looks like:

.. code:: python

  from unstructured.partition.auto import partition
  from unstructured.partition.email import partition_email

  filename = "example-docs/eml/fake-email-attachment.eml"
  elements = partition_email(
      filename=filename,
      process_attachments=True,
      attachment_partitioner=partition,
  )

For more information about the ``partition_email`` brick, you can check the `source code here `_.

``partition_epub``
---------------------

The ``partition_epub`` function processes e-books in EPUB3 format. The function first converts the document to HTML using ``pandoc`` and then calls ``partition_html``. You'll need `pandoc `_ installed on your system to use ``partition_epub``.

Examples:

.. code:: python

  from unstructured.partition.epub import partition_epub

  elements = partition_epub(filename="example-docs/winter-sports.epub")

For more information about the ``partition_epub`` brick, you can check the `source code here `_.

``partition_html``
---------------------

The ``partition_html`` function partitions an HTML document and returns a list of document ``Element`` objects. ``partition_html`` can take a filename, file-like object, string, or URL as input.

The following three invocations of ``partition_html`` are essentially equivalent:

.. code:: python

  from unstructured.partition.html import partition_html

  # From a filename
  elements = partition_html(filename="example-docs/example-10k.html")

  # From a file-like object
  with open("example-docs/example-10k.html", "r") as f:
      elements = partition_html(file=f)

  # From a string
  with open("example-docs/example-10k.html", "r") as f:
      text = f.read()
  elements = partition_html(text=text)

The following illustrates fetching a URL and partitioning the response content. The ``ssl_verify`` kwarg controls whether or not SSL verification is enabled for the HTTP request. By default it is on.
Use ``ssl_verify=False`` to disable SSL verification in the request.

.. code:: python

  from unstructured.partition.html import partition_html

  elements = partition_html(url="https://python.org/")

  # you can also provide custom headers:
  elements = partition_html(
      url="https://python.org/",
      headers={"User-Agent": "YourScriptName/1.0 ..."},
  )

  # and turn off SSL verification
  elements = partition_html(url="https://python.org/", ssl_verify=False)

If you want to ignore content in ``
`` and ``