mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-19 06:09:32 +00:00
chore: updating table docs with file extensions (#1702)
gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691
Adding filetype extensions from this
[list](f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200)
)
where applicable.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
This commit is contained in:
parent
cf31c9a2c4
commit
d0c84d605c
@ -16,50 +16,51 @@ In cases where ``libmagic`` is not available, filetype detection will fall back
|
||||
The following table shows the document types the ``unstructured`` library currently supports. ``partition`` will recognize each of these document types and route the document
|
||||
to the appropriate partitioning function. If you already know your document type, you can use the partitioning function listed in the table directly.
|
||||
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Document Type | Partition Function | Strategies | Table Support | Options |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| CSV Files (`.csv`) | `partition_csv` | N/A | Yes | None |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| E-mails (`.eml`) | `partition_eml` | N/A | No | Encoding; Max Partition; Process Attachments |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| E-mails (`.msg`) | `partition_msg` | N/A | No | Encoding; Max Partition; Process Attachments |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| EPubs (`.epub`) | `partition_epub` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Excel Documents (`.xlsx`/`.xls`) | `partition_xlsx` | N/A | Yes | None |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| HTML Pages (`.html`) | `partition_html` | N/A | No | Encoding; Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Images (`.png`/`.jpg`) | `partition_image` | "auto", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| HTML Pages (`.html`/`.htm`) | `partition_html` | N/A | No | Encoding; Include Page Breaks |
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Images (`.png`/`.jpg`/`.jpeg`/`.tiff`) | `partition_image` | "auto", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages, Strategy |
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Markdown (`.md`) | `partition_md` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Org Mode (`.org`) | `partition_org` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Open Office Documents (`.odt`) | `partition_odt` | N/A | Yes | None |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| PDFs (`.pdf`) | `partition_pdf` | "auto", "fast", "hi_res", "ocr_only" | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages, Strategy |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Plain Text (`.txt`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Plain Text (`.txt`/`.text`/`.log`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper |
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| PowerPoints (`.ppt`) | `partition_ppt` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| PowerPoints (`.pptx`) | `partition_pptx` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| ReStructured Text (`.rst`) | `partition_rst` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Rich Text Files (`.rtf`) | `partition_rtf` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| TSV Files (`.tsv`) | `partition_tsv` | N/A | Yes | None |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Word Documents (`.doc`) | `partition_doc` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Word Documents (`.docx`) | `partition_docx` | N/A | Yes | Include Page Breaks |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| XML Documents (`.xml`) | `partition_xml` | N/A | No | Encoding; Max Partition; XML Keep Tags |
|
||||
+----------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
| Code Files (`.js`/`.py`/`.java`/ `.cpp`/`.cc`/`.cxx`/`.c`/`.cs`/ `.php`/`.rb`/`.swift`/`.ts`/`.go`) | `partition_text` | N/A | No | Encoding; Max Partition; Paragraph Grouper |
|
||||
+-----------------------------------------------------------------------------------------------------+--------------------------------+----------------------------------------+----------------+------------------------------------------------------------------------------------------------------------------+
|
||||
|
||||
As shown in the examples below, the ``partition`` function accepts both filenames and file-like objects as input.
|
||||
``partition`` also has some optional kwargs.
|
||||
|
Loading…
x
Reference in New Issue
Block a user