diff --git a/docs/requirements.txt b/docs/requirements.txt
index 3140b57e8..99f23b7e8 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -69,6 +69,8 @@ sphinx-basic-ng==1.0.0b2
     # via furo
 sphinx-rtd-theme==1.2.2
     # via -r requirements/build.in
+sphinx-tabs
+    # to enable tabbed code blocks
 sphinxcontrib-applehelp==1.0.4
     # via sphinx
 sphinxcontrib-devhelp==1.0.2
diff --git a/docs/source/api.rst b/docs/source/api.rst
index 54230d735..ddcf32d99 100644
--- a/docs/source/api.rst
+++ b/docs/source/api.rst
@@ -5,16 +5,50 @@ Try our hosted API! It's freely available to use with any of the file types list
 
 Now you can get started with this quick example:
 
-.. code:: shell
+.. tabs::
 
-  curl -X 'POST' \
-  'https://api.unstructured.io/general/v0/general' \
-  -H 'accept: application/json' \
-  -H 'Content-Type: multipart/form-data' \
-  -H 'unstructured-api-key: ' \
-  -F 'files=@example-docs/family-day.eml' \
-  | jq -C . | less -R
+   .. tab:: Shell
 
+      .. code:: shell
+
+         curl -X 'POST' \
+         'https://api.unstructured.io/general/v0/general' \
+         -H 'accept: application/json' \
+         -H 'Content-Type: multipart/form-data' \
+         -H 'unstructured-api-key: ' \
+         -F 'files=@sample-docs/family-day.eml' \
+         | jq -C . | less -R
+
+   .. tab:: Python
+
+      .. code:: python
+
+         # Define the URL
+         url = 'https://api.unstructured.io/general/v0/general'
+
+         # Define the headers
+         headers = {
+             'accept': 'application/json',
+             'unstructured-api-key': '',
+         }
+
+         # Define the form data
+         data = {
+             'strategy': 'auto',
+         }
+
+         # Define the file data
+         file_path = "/Path/To/File"
+         file_data = {'files': open(file_path, 'rb')}
+
+         # Make the POST request
+         response = requests.post(url, headers=headers, data=data, files=file_data)
+
+         # Close the file
+         file_data['files'].close()
+
+         # Parse the JSON response
+         json_response = response.json()
 Below, you will find a more comprehensive overview of the API capabilities.
For detailed information on request and response schemas, refer to the `API documentation `_. @@ -227,7 +261,7 @@ If you are self-hosting the API or running it locally, it's strongly suggested t Using Docker Images ==================== -The following instructions are intended to help you get up and running using Docker to interact with ``unstructured-api``. See `here `_ if you don't already have docker installed on your machine. +The following instructions are intended to help you get up and running using Docker to interact with ``unstructured-api``. See `docker `_ if you don't already have docker installed on your machine. NOTE: Multi-platform images are built to support both x86_64 and Apple silicon hardware. Docker pull should download the corresponding image for your architecture, but you can specify with ``--platform`` (e.g. ``--platform linux/amd64``) if needed. diff --git a/docs/source/bricks.rst b/docs/source/bricks.rst index 554b8e2b9..4263a298e 100644 --- a/docs/source/bricks.rst +++ b/docs/source/bricks.rst @@ -1,2475 +1,19 @@ Bricks ====== -The goal of this page is to introduce you to the concept of bricks. Bricks are functions that live in ``unstructured`` and are the primary public API for the library. -There are three types of bricks in ``unstructured``, corresponding to the different stages of document pre-processing: partitioning, cleaning, and staging. +There are several types of bricks in ``unstructured``, corresponding to the different stages of document pre-processing: partitioning, cleaning, chunking and staging. After reading this section, you should understand the following: -* How to extract content from a document using partitioning bricks. +* How to partition a document into json or csv. * How to remove unwanted content from document elements using cleaning bricks. +* How to extract content from a document using the extraction bricks. * How to prepare data for downstream use cases using staging bricks +.. 
toctree:: + :maxdepth: 1 - -############ -Partitioning -############ - - -Partitioning bricks in ``unstructured`` allow users to extract structured content from a raw unstructured document. -These functions break a document down into elements such as ``Title``, ``NarrativeText``, and ``ListItem``, -enabling users to decide what content they'd like to keep for their particular application. -If you're training a summarization model, for example, you may only be interested in ``NarrativeText``. - -The easiest way to partition documents in unstructured is to use the ``partition`` brick. -If you call the ``partition`` brick, ``unstructured`` will use ``libmagic`` to automatically determine the file type and invoke the appropriate partition function. -In cases where ``libmagic`` is not available, filetype detection will fall back to using the file extension. - -The following table shows the document types the `unstructured` library currently supports. `partition` will recognize each of these document types and route the document to the appropriate partitioning function. If you already know your document type, you can use the partitioning function listed in the table directly. 
- -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Document Type | Partition Function | Strategies | Table Support | Options | -+==============================================+=====================+========================================+================+==================================================================================================================+ -| CSV Files (`.csv`) | `partition_csv` | N/A | Yes | None | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding; Max Partition; Process Attachments | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding; Max Partition; Process Attachments | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| EPubs (`.epub`) | `partition_epub` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Excel Documents (`.xlsx`/`.xls`) | `partition_xlsx` | N/A | Yes | None | 
-+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| HTML Pages (`.html`) | `partition_html` | N/A | No | Encoding; Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Images (`.png`/`.jpg`) | `partition_image` | `"auto"`, `"hi_res"`, `"ocr_only"` | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Markdown (`.md`) | `partition_md` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Org Mode (`.org`) | `partition_org` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Open Office Documents (`.odt`) | `partition_odt` | N/A | Yes | None | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| PDFs (`.pdf`) | `partition_pdf` | `"auto"`, 
`"fast"`, `"hi_res"`, `"ocr_only"` | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Plain Text (`.txt`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Power Points (`.ppt`) | `partition_ppt` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Power Points (`.pptx`) | `partition_pptx` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| ReStructured Text (`.rst`) | `partition_rst` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Rich Text Files (`.rtf`) | `partition_rtf` | N/A | Yes | Include Page Breaks | 
-+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| TSV Files (`.tsv`) | `partition_tsv` | N/A | Yes | None | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Word Documents (`.doc`) | `partition_doc` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; Max Partition; XML Keep Tags | -+----------------------------------------------+---------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ - - -As shown in the examples below, the ``partition`` function accepts both filenames and file-like objects as input. -``partition`` also has some optional kwargs. -For example, if you set ``include_page_breaks=True``, the output will include ``PageBreak`` elements if the filetype supports it. 
-Additionally you can bypass the filetype detection logic with the optional ``content_type`` argument which may be specified with either the ``filename`` or file-like object, ``file``. -You can find a full listing of optional kwargs in the documentation below. - -.. code:: python - - from unstructured.partition.auto import partition - - - filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf") - elements = partition(filename=filename, content_type="application/pdf") - print("\n\n".join([str(el) for el in elements][:10])) - - -.. code:: python - - from unstructured.partition.auto import partition - - - filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf") - with open(filename, "rb") as f: - elements = partition(file=f, include_page_breaks=True) - print("\n\n".join([str(el) for el in elements][5:15])) - - -The ``unstructured`` library also includes partitioning bricks targeted at specific document types. -The ``partition`` brick uses these document-specific partitioning bricks under the hood. -There are a few reasons you may want to use a document-specific partitioning brick instead of ``partition``: - -* If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly, or passing in the ``content_type`` will make your program run faster. -* Fewer dependencies. You don't need to install ``libmagic`` for filetype detection if you're only using document-specific bricks. -* Additional features. The API for partition is the least common denominator for all document types. Certain document-specific brick include extra features that you may want to take advantage of. For example, ``partition_html`` allows you to pass in a URL so you don't have to store the ``.html`` file locally. See the documentation below learn about the options available in each partitioning brick. 
- - -Below we see an example of how to partition a document directly with the URL using the partition_html function. - -.. code:: python - - from unstructured.partition.html import partition_html - - url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html" - elements = partition_html(url=url) - print("\n\n".join([str(el) for el in elements])) - - -``partition`` --------------- - -The ``partition`` brick is the simplest way to partition a document in ``unstructured``. -If you call the ``partition`` function, ``unstructured`` will attempt to detect the -file type and route it to the appropriate partitioning brick. All partitioning bricks -called within ``partition`` are called using the default kwargs. Use the document-type -specific bricks if you need to apply non-default settings. -``partition`` currently supports ``.docx``, ``.doc``, ``.odt``, ``.pptx``, ``.ppt``, ``.xlsx``, ``.csv``, ``.tsv``, ``.eml``, ``.msg``, ``.rtf``, ``.epub``, ``.html``, ``.xml``, ``.pdf``, -``.png``, ``.jpg``, and ``.txt`` files. -If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``, -``.png``, and ``.jpg``. -The ``strategy`` kwarg controls the strategy for partitioning documents. Generally available strategies are `"fast"` for -faster processing and `"hi_res"` for more accurate processing. - - -.. code:: python - - import docx - - from unstructured.partition.auto import partition - - document = docx.Document() - document.add_paragraph("Important Analysis", style="Heading 1") - document.add_paragraph("Here is my first thought.", style="Body Text") - document.add_paragraph("Here is my second thought.", style="Normal") - document.save("mydoc.docx") - - elements = partition(filename="mydoc.docx") - - with open("mydoc.docx", "rb") as f: - elements = partition(file=f) - - -.. 
code:: python - - from unstructured.partition.auto import partition - - elements = partition(filename="example-docs/layout-parser-paper-fast.pdf") - - -The ``partition`` function also accepts a ``url`` kwarg for remotely hosted documents. If you want -to force ``partition`` to treat the document as a particular MIME type, use the ``content_type`` -kwarg in conjunction with ``url``. Otherwise, ``partition`` will use the information from -the ``Content-Type`` header in the HTTP response. The ``ssl_verify`` kwarg controls whether -or not SSL verification is enabled for the HTTP request. By default it is on. Use ``ssl_verify=False`` -to disable SSL verification in the request. - - -.. code:: python - - from unstructured.partition.auto import partition - - url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md" - elements = partition(url=url) - elements = partition(url=url, content_type="text/markdown") - -For more information about the ``partition`` brick, you can check the `source code here `_. - - -``partition_csv`` ------------------- - -The ``partition_csv`` function pre-processes CSV files. The output is a single -``Table`` element. The ``text_as_html`` attribute in the element metadata will -contain an HTML representation of the table. - -Examples: - -.. code:: python - - from unstructured.partition.csv import partition_csv - - elements = partition_csv(filename="example-docs/stanley-cups.csv") - print(elements[0].metadata.text_as_html) - -For more information about the ``partition_csv`` brick, you can check the `source code here `_. - - -``partition_doc`` ------------------- - -The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents -saved in the ``.doc`` format. This partition brick uses a combination of the styling -information in the document and the structure of the text to determine the type -of a text element. The ``partition_doc`` can take a filename or file-like object -as input. 
-``partition_doc`` uses ``libreoffice`` to convert the file to ``.docx`` and then -calls ``partition_docx``. Ensure you have ``libreoffice`` installed -before using ``partition_doc``. - -Examples: - -.. code:: python - - from unstructured.partition.doc import partition_doc - - elements = partition_doc(filename="example-docs/fake.doc") - -For more information about the ``partition_doc`` brick, you can check the `source code here `_. - - -``partition_docx`` ------------------- - -The ``partition_docx`` partitioning brick pre-processes Microsoft Word documents -saved in the ``.docx`` format. This partition brick uses a combination of the styling -information in the document and the structure of the text to determine the type -of a text element. The ``partition_docx`` can take a filename or file-like object -as input, as shown in the two examples below. - -Examples: - -.. code:: python - - import docx - - from unstructured.partition.docx import partition_docx - - document = docx.Document() - document.add_paragraph("Important Analysis", style="Heading 1") - document.add_paragraph("Here is my first thought.", style="Body Text") - document.add_paragraph("Here is my second thought.", style="Normal") - document.save("mydoc.docx") - - elements = partition_docx(filename="mydoc.docx") - - with open("mydoc.docx", "rb") as f: - elements = partition_docx(file=f) - -In Word documents, headers and footers are specified per section. In the output, -the ``Header`` elements will appear at the beginning of a section and ``Footer`` -elements will appear at the end. MSFT Word headers and footers have a ``header_footer_type`` -metadata field indicating where the header or footer applies. Valid values are -``"primary"``, ``"first_page"`` and ``"even_page"``. - -``partition_docx`` will include page numbers in the document metadata when page breaks -are present in the document. The function will detect user inserted page breaks -and page breaks inserted by the Word document renderer. 
Some (but not all) Word document renderers -insert page breaks when you save the document. If your Word document renderer does not do that, -you may not see page numbers in the output even if you see them visually when you open the -document. If that is the case, you can try saving the document with a different renderer. - -For more information about the ``partition_docx`` brick, you can check the `source code here `_. - - -``partition_email`` ---------------------- - -The ``partition_email`` function partitions ``.eml`` documents and works with exports -from email clients such as Microsoft Outlook and Gmail. The ``partition_email`` -takes a filename, file-like object, or raw text as input and produces a list of -document ``Element`` objects as output. Also ``content_source`` can be set to ``text/html`` -(default) or ``text/plain`` to process the html or plain text version of the email, respectively. -In order for ``partition_email`` to also return the header information (e.g. sender, recipient, -attachment, etc.), ``include_headers`` must be set to ``True``. Returns tuple with body elements -first and header elements second, if ``include_headers`` is True. - -Examples: - -.. code:: python - - from unstructured.partition.email import partition_email - - elements = partition_email(filename="example-docs/fake-email.eml") - - with open("example-docs/fake-email.eml", "r") as f: - elements = partition_email(file=f) - - with open("example-docs/fake-email.eml", "r") as f: - text = f.read() - elements = partition_email(text=text) - - with open("example-docs/fake-email.eml", "r") as f: - text = f.read() - elements = partition_email(text=text, content_source="text/plain") - - with open("example-docs/fake-email.eml", "r") as f: - text = f.read() - elements = partition_email(text=text, include_headers=True) - - -``partition_email`` includes a ``max_partition`` parameter that indicates the maximum character -length for a document element. 
-This parameter only applies if ``"text/plain"`` is selected as the ``content_source``. -The default value is ``1500``, which roughly corresponds to -the average character length for a paragraph. -You can disable ``max_partition`` by setting it to ``None``. - - -You can optionally partition e-mail attachments by setting ``process_attachments=True``. -If you set ``process_attachments=True``, you'll also need to pass in a partitioning -function to ``attachment_partitioner``. The following is an example of what the -workflow looks like: - -.. code:: python - - from unstructured.partition.auto import partition - from unstructured.partition.email import partition_email - - filename = "example-docs/eml/fake-email-attachment.eml" - elements = partition_email( - filename=filename, process_attachments=True, attachment_partitioner=partition - ) - -For more information about the ``partition_email`` brick, you can check the `source code here `_. - - -``partition_epub`` ---------------------- - -The ``partition_epub`` function processes e-books in EPUB3 format. The function -first converts the document to HTML using ``ebooklib`` and then calls ``partition_html``. -You'll need `ebooklib `_ installed on your system -to use ``partition_epub``. - - -Examples: - -.. code:: python - - from unstructured.partition.epub import partition_epub - - elements = partition_epub(filename="example-docs/winter-sports.epub") - -For more information about the ``partition_epub`` brick, you can check the `source code here `_. - - -``partition_html`` ---------------------- - -The ``partition_html`` function partitions an HTML document and returns a list -of document ``Element`` objects. ``partition_html`` can take a filename, file-like -object, string, or url as input. - -The following three invocations of partition_html() are essentially equivalent: - - -.. 
code:: python - - from unstructured.partition.html import partition_html - - elements = partition_html(filename="example-docs/example-10k.html") - - with open("example-docs/example-10k.html", "r") as f: - elements = partition_html(file=f) - - with open("example-docs/example-10k.html", "r") as f: - text = f.read() - elements = partition_html(text=text) - - - -The following illustrates fetching a url and partitioning the response content. -The ``ssl_verify`` kwarg controls whether -or not SSL verification is enabled for the HTTP request. By default it is on. Use ``ssl_verify=False`` -to disable SSL verification in the request. - -.. code:: python - - from unstructured.partition.html import partition_html - - elements = partition_html(url="https://python.org/") - - # you can also provide custom headers: - - elements = partition_html(url="https://python.org/", - headers={"User-Agent": "YourScriptName/1.0 ..."}) - - # and turn off SSL verification - - elements = partition_html(url="https://python.org/", ssl_verify=False) - - -If you want to ignore content in ``
`` and ``