diff --git a/docs/source/bricks/partition.rst b/docs/source/bricks/partition.rst
index dc3d80106..e3cea47ec 100644
--- a/docs/source/bricks/partition.rst
+++ b/docs/source/bricks/partition.rst
@@ -13,53 +13,54 @@ The easiest way to partition documents in unstructured is to use the ``partition
 If you call the ``partition`` brick, ``unstructured`` will use ``libmagic`` to automatically determine the file type and invoke the appropriate partition function. In cases where ``libmagic`` is not available, filetype detection will fall back to using the file extension.
-The following table shows the document types the ``unstructured`` library currently supports. ``partition`` will recognize each of these document types and route the document
+The following table shows the document types the ``unstructured`` library currently supports. ``partition`` will recognize each of these document types and route the document
 to the appropriate partitioning function. If you already know your document type, you can use the partitioning function listed in the table directly.
-+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Document Type | Partition Function | Strategies | Table Support | Options | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| CSV Files (`.csv`) | `partition_csv` | N/A | Yes | None | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding; Max Partition; Process Attachments | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding; Max Partition; Process Attachments | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| EPubs (`.epub`) | `partition_epub` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Excel Documents (`.xlsx`/`.xls`) | 
`partition_xlsx` | N/A | Yes | None | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| HTML Pages (`.html`) | `partition_html` | N/A | No | Encoding; Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Images (`.png`/`.jpg`) | `partition_image` | "auto", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Markdown (`.md`) | `partition_md` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Org Mode (`.org`) | `partition_org` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Open Office Documents (`.odt`) | `partition_odt` | N/A | Yes | None | 
-+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| PDFs (`.pdf`) | `partition_pdf` | "auto", "fast", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Plain Text (`.txt`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Power Points (`.ppt`) | `partition_ppt` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Power Points (`.pptx`) | `partition_pptx` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| ReStructured Text (`.rst`) | `partition_rst` | N/A | Yes | Include Page Breaks | 
-+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Rich Text Files (`.rtf`) | `partition_rtf` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| TSV Files (`.tsv`) | `partition_tsv` | N/A | Yes | None | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Word Documents (`.doc`) | `partition_doc` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | Include Page Breaks | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ -| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; Max Partition; XML Keep Tags | -+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ - 
++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Document Type | Partition Function | Strategies | Table Support | Options | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| CSV Files (`.csv`) | `partition_csv` | N/A | Yes | None | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding; Max Partition; Process Attachments | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding; Max Partition; Process Attachments | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| EPubs (`.epub`) | `partition_epub` | N/A | Yes | Include Page Breaks | 
++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Excel Documents (`.xlsx`/`.xls`) | `partition_xlsx` | N/A | Yes | None | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| HTML Pages (`.html`/`.htm`) | `partition_html` | N/A | No | Encoding; Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Images (`.png`/`.jpg`/`.jpeg`/`.tiff`) | `partition_image` | "auto", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Markdown (`.md`) | `partition_md` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Org Mode 
(`.org`) | `partition_org` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Open Office Documents (`.odt`) | `partition_odt` | N/A | Yes | None | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| PDFs (`.pdf`) | `partition_pdf` | "auto", "fast", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Plain Text (`.txt`/`.text`/`.log`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| PowerPoints (`.ppt`) | `partition_ppt` | N/A | Yes | Include Page Breaks | 
++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| PowerPoints (`.pptx`) | `partition_pptx` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| ReStructured Text (`.rst`) | `partition_rst` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Rich Text Files (`.rtf`) | `partition_rtf` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| TSV Files (`.tsv`) | `partition_tsv` | N/A | Yes | None | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Word Documents (`.doc`) | `partition_doc` | N/A | Yes | Include Page Breaks | 
++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | Include Page Breaks | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; Max Partition; XML Keep Tags | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ +| Code Files (`.js`/`.py`/`.java`/ `.cpp`/`.cc`/`.cxx`/`.c`/`.cs`/ `.php`/`.rb`/`.swift`/`.ts`/`.go`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper | ++-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+ As shown in the examples below, the ``partition`` function accepts both filenames and file-like objects as input. ``partition`` also has some optional kwargs.
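For readers of this diff, the extension fallback and the routing in the updated table can be pictured as a lookup from file suffix to partition function name. The sketch below is only an illustration under stated assumptions: ``PARTITIONER_EXTENSIONS`` and ``route_by_extension`` are hypothetical names, the mapping is copied from the rows added in this hunk, and none of it is the detection logic that ``unstructured`` actually uses (which prefers ``libmagic`` when it is available).

```python
from pathlib import Path
from typing import Dict, Optional, Tuple

# Hypothetical table: partition function name -> file extensions, copied from
# the rows of the updated table in this diff. Illustrative only; this is not
# the mapping used inside ``unstructured``.
PARTITIONER_EXTENSIONS: Dict[str, Tuple[str, ...]] = {
    "partition_csv": (".csv",),
    "partition_eml": (".eml",),
    "partition_msg": (".msg",),
    "partition_epub": (".epub",),
    "partition_xlsx": (".xlsx", ".xls"),
    "partition_html": (".html", ".htm"),
    "partition_image": (".png", ".jpg", ".jpeg", ".tiff"),
    "partition_md": (".md",),
    "partition_org": (".org",),
    "partition_odt": (".odt",),
    "partition_pdf": (".pdf",),
    "partition_ppt": (".ppt",),
    "partition_pptx": (".pptx",),
    "partition_rst": (".rst",),
    "partition_rtf": (".rtf",),
    "partition_tsv": (".tsv",),
    "partition_doc": (".doc",),
    "partition_docx": (".docx",),
    "partition_xml": (".xml",),
    # Plain text and the newly added code file extensions share one function.
    "partition_text": (
        ".txt", ".text", ".log",
        ".js", ".py", ".java", ".cpp", ".cc", ".cxx", ".c", ".cs",
        ".php", ".rb", ".swift", ".ts", ".go",
    ),
}

# Invert the table so a suffix can be looked up directly.
EXTENSION_TO_PARTITIONER: Dict[str, str] = {
    ext: name
    for name, extensions in PARTITIONER_EXTENSIONS.items()
    for ext in extensions
}


def route_by_extension(filename: str) -> Optional[str]:
    """Return the partition function name for a filename, or None if unknown.

    Only the final suffix is considered, and the comparison ignores case.
    """
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_PARTITIONER.get(suffix)


if __name__ == "__main__":
    for name in ("report.PDF", "main.go", "notes.log", "archive.tar.gz"):
        print(name, "->", route_by_extension(name))
```

Keeping the table keyed by function name mirrors the layout of the documentation table, so a new row such as the code files entry translates into one added tuple of extensions.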