chore: updating table docs with file extensions (#1702)

gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691

Adds file extensions to the supported document types table where applicable, taken from this
[list](f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200)).

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Amanda Cameron 2023-10-14 14:14:52 -07:00 committed by GitHub
parent cf31c9a2c4
commit d0c84d605c

@@ -16,50 +16,51 @@ In cases where ``libmagic`` is not available, filetype detection will fall back
The following table shows the document types the ``unstructured`` library currently supports. ``partition`` will recognize each of these document types and route the document
to the appropriate partitioning function. If you already know your document type, you can use the partitioning function listed in the table directly.

+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Document Type                          | Partition Function | Strategies                           | Table Support | Options                                                                                      |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| CSV Files (`.csv`)                     | `partition_csv`    | N/A                                  | Yes           | None                                                                                         |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| E-mails (`.eml`)                       | `partition_eml`    | N/A                                  | No            | Encoding; Max Partition; Process Attachments                                                 |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| E-mails (`.msg`)                       | `partition_msg`    | N/A                                  | No            | Encoding; Max Partition; Process Attachments                                                 |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| EPubs (`.epub`)                        | `partition_epub`   | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Excel Documents (`.xlsx`/`.xls`)       | `partition_xlsx`   | N/A                                  | Yes           | None                                                                                         |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| HTML Pages (`.html`/`.htm`)            | `partition_html`   | N/A                                  | No            | Encoding; Include Page Breaks                                                                |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Images (`.png`/`.jpg`/`.jpeg`/`.tiff`) | `partition_image`  | "auto", "hi_res", "ocr_only"         | Yes           | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy                |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Markdown (`.md`)                       | `partition_md`     | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Org Mode (`.org`)                      | `partition_org`    | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Open Office Documents (`.odt`)         | `partition_odt`    | N/A                                  | Yes           | None                                                                                         |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| PDFs (`.pdf`)                          | `partition_pdf`    | "auto", "fast", "hi_res", "ocr_only" | Yes           | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Plain Text (`.txt`/`.text`/`.log`)     | `partition_text`   | N/A                                  | No            | Encoding; Max Partition; Paragraph Grouper                                                   |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| PowerPoints (`.ppt`)                   | `partition_ppt`    | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| PowerPoints (`.pptx`)                  | `partition_pptx`   | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| ReStructured Text (`.rst`)             | `partition_rst`    | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Rich Text Files (`.rtf`)               | `partition_rtf`    | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| TSV Files (`.tsv`)                     | `partition_tsv`    | N/A                                  | Yes           | None                                                                                         |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Word Documents (`.doc`)                | `partition_doc`    | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Word Documents (`.docx`)               | `partition_docx`   | N/A                                  | Yes           | Include Page Breaks                                                                          |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| XML Documents (`.xml`)                 | `partition_xml`    | N/A                                  | No            | Encoding; Max Partition; XML Keep Tags                                                       |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+
| Code Files (`.js`/`.py`/`.java`/       | `partition_text`   | N/A                                  | No            | Encoding; Max Partition; Paragraph Grouper                                                   |
| `.cpp`/`.cc`/`.cxx`/`.c`/`.cs`/        |                    |                                      |               |                                                                                              |
| `.php`/`.rb`/`.swift`/`.ts`/`.go`)     |                    |                                      |               |                                                                                              |
+----------------------------------------+--------------------+--------------------------------------+---------------+----------------------------------------------------------------------------------------------+

As shown in the examples below, the ``partition`` function accepts both filenames and file-like objects as input.
``partition`` also has some optional kwargs.
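
To make the routing in the table concrete, here is a minimal sketch that looks up the partition function name listed for a given file extension. The dictionary is transcribed from the table; the names `PARTITION_FUNCTION_BY_EXTENSION`, `CODE_EXTENSIONS`, and `partition_function_for` are hypothetical and are not part of the ``unstructured`` API. It also assumes extension-only matching, whereas the library may inspect file contents (for example via ``libmagic``) before falling back to the extension.

```python
from pathlib import Path
from typing import Optional

# Extension -> partition function name, transcribed from the table.
# Illustrative only: this is NOT the mapping the library uses internally.
PARTITION_FUNCTION_BY_EXTENSION = {
    ".csv": "partition_csv",
    ".eml": "partition_eml",
    ".msg": "partition_msg",
    ".epub": "partition_epub",
    ".xlsx": "partition_xlsx",
    ".xls": "partition_xlsx",
    ".html": "partition_html",
    ".htm": "partition_html",
    ".png": "partition_image",
    ".jpg": "partition_image",
    ".jpeg": "partition_image",
    ".tiff": "partition_image",
    ".md": "partition_md",
    ".org": "partition_org",
    ".odt": "partition_odt",
    ".pdf": "partition_pdf",
    ".txt": "partition_text",
    ".text": "partition_text",
    ".log": "partition_text",
    ".ppt": "partition_ppt",
    ".pptx": "partition_pptx",
    ".rst": "partition_rst",
    ".rtf": "partition_rtf",
    ".tsv": "partition_tsv",
    ".doc": "partition_doc",
    ".docx": "partition_docx",
    ".xml": "partition_xml",
}

# Code files are listed in the table as handled by the plain-text partitioner.
CODE_EXTENSIONS = (
    ".js", ".py", ".java", ".cpp", ".cc", ".cxx", ".c",
    ".cs", ".php", ".rb", ".swift", ".ts", ".go",
)
PARTITION_FUNCTION_BY_EXTENSION.update(
    {ext: "partition_text" for ext in CODE_EXTENSIONS}
)


def partition_function_for(filename: str) -> Optional[str]:
    """Return the partition function name the table lists for this file's
    extension, or None if the extension does not appear in the table."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return PARTITION_FUNCTION_BY_EXTENSION.get(suffix)
```

A lookup like this is only useful when you already know the document type from its name; for anything else, letting ``partition`` detect the type is the safer choice.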