mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-29 03:50:46 +00:00

### Description

This refactors the current ingest CLI process to support better granularity in how the steps are run.

* Both multiprocessing and async are now supported. Given that many of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here.
* The destination step is broken up into a stager step and an upload step. This allows steps that require manipulation of the data between formats, such as converting the elements JSON into a CSV format to upload for tabular destinations, to be pulled out of the step that does the actual upload.
* The process of writing the content to a local destination is now pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the Python client.
* Moved the uncompress support into a pipeline step, since it can apply to any concrete files that have been downloaded, regardless of where they came from.
* Leverage the last modified date to mark files for reprocessing, even if the file already exists locally.

### Callouts

Retry configs haven't been moved over yet. This is an open question: the intent was for them to wrap potential connection errors, but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? That would bloat the input params even more.

### Testing

* If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:

  ```shell
  PYTHONPATH=. python unstructured/ingest/v2/main.py --help
  ```

* If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current CLI:

  ```shell
  PYTHONPATH=. python unstructured/ingest/main.py --help
  Usage: main.py [OPTIONS] COMMAND [ARGS]...

  Options:
    --help  Show this message and exit.

  Commands:
    airtable
    azure
    biomed
    box
    confluence
    delta-table
    discord
    dropbox
    elasticsearch
    fsspec
    gcs
    github
    gitlab
    google-drive
    hubspot
    jira
    local          v2
    mongodb
    notion
    onedrive
    opensearch
    outlook
    reddit
    s3             v2
    salesforce
    sftp
    sharepoint
    slack
    wikipedia
  ```

You can run any of the local or s3-specific ingest tests and these should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
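The async handling of IO-bound steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual v2 ingest API: `fetch_file` and `run_download_step` are hypothetical names standing in for a connector's download logic.

```python
import asyncio

# Hypothetical names for illustration. The point: IO-bound steps
# (downloads, uploads) can run concurrently on one event loop instead
# of paying the cost of one worker process per file.
async def fetch_file(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for an awaited network call
    return f"downloaded:{name}"

async def run_download_step(names: list[str]) -> list[str]:
    # Schedule every download at once and collect results in order.
    return await asyncio.gather(*(fetch_file(n) for n in names))

results = asyncio.run(run_download_step(["a.pdf", "b.pdf"]))
```

CPU-bound steps (partitioning, chunking without an API) would still favor multiprocessing; the refactor's value is being able to choose per step.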
503 lines
14 KiB
JSON
[
  {
    "type": "Title",
    "element_id": "5d45a28d875e403c7294a15f22a0162f",
    "text": "LayoutParser: A Unified Toolkit for DL-Based DIA 5",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "FigureCaption",
    "element_id": "d9d53799fbfc3f90096f9dc9d45ff667",
    "text": "Table 1: Current layout detection models in the LayoutParser model zoo",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "Table",
    "element_id": "dddac446da6c93dc1449ecb5d997c423",
    "text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century \u2018TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
    "metadata": {
      "text_as_html": "<table><thead><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></thead><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></table>",
      "table_as_cells": [
        { "x": 0, "y": 0, "w": 1, "h": 1, "content": "Dataset" },
        { "x": 0, "y": 1, "w": 1, "h": 1, "content": "PubLayNet [33]" },
        { "x": 0, "y": 2, "w": 1, "h": 1, "content": "PRImA [3]" },
        { "x": 0, "y": 3, "w": 1, "h": 1, "content": "Newspaper [17]" },
        { "x": 0, "y": 4, "w": 1, "h": 1, "content": "TableBank [18]" },
        { "x": 0, "y": 5, "w": 1, "h": 1, "content": "HIDataset [31]" },
        { "x": 1, "y": 0, "w": 1, "h": 1, "content": "| Base Model!|" },
        { "x": 1, "y": 1, "w": 1, "h": 1, "content": "P/M" },
        { "x": 1, "y": 2, "w": 1, "h": 1, "content": "M" },
        { "x": 1, "y": 3, "w": 1, "h": 1, "content": "P" },
        { "x": 1, "y": 4, "w": 1, "h": 1, "content": "P" },
        { "x": 1, "y": 5, "w": 1, "h": 1, "content": "P/M" },
        { "x": 2, "y": 0, "w": 1, "h": 1, "content": "Large Model" },
        { "x": 2, "y": 1, "w": 1, "h": 1, "content": "M" },
        { "x": 2, "y": 2, "w": 1, "h": 1, "content": "" },
        { "x": 2, "y": 3, "w": 1, "h": 1, "content": "" },
        { "x": 2, "y": 4, "w": 1, "h": 1, "content": "" },
        { "x": 2, "y": 5, "w": 1, "h": 1, "content": "" },
        { "x": 3, "y": 0, "w": 1, "h": 1, "content": "| Notes" },
        { "x": 3, "y": 1, "w": 1, "h": 1, "content": "Layouts of modern scientific documents" },
        { "x": 3, "y": 2, "w": 1, "h": 1, "content": "Layouts of scanned modern magazines and scientific reports" },
        { "x": 3, "y": 3, "w": 1, "h": 1, "content": "Layouts of scanned US newspapers from the 20th century" },
        { "x": 3, "y": 4, "w": 1, "h": 1, "content": "Table region on modern scientific and business document" },
        { "x": 3, "y": 5, "w": 1, "h": 1, "content": "Layouts of history Japanese documents" }
      ],
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "UncategorizedText",
    "element_id": "e5314387378c7a98911d71c145c45327",
    "text": "2",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "FigureCaption",
    "element_id": "e262996994d01c45f0d6ef28cb8afa93",
    "text": "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For \u201cbase model\u201d and \u201clarge model\u201d, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months.",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "2298258fe84201e839939d70c168141b",
    "text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and stomge of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component.",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "Title",
    "element_id": "24d2473c4975fedd3f5cfd3026249837",
    "text": "3.1 Layout Detection Models",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "008c0a590378dccd98ae7a5c49905eda",
    "text": "In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art models like Faster R-CNN [28] and Mask R-CNN [12] are used. This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron2 [35], provides a minimal API that can perform layout detection with only four lines of code in Python:",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "ListItem",
    "element_id": "b98aac79b1c1af144f6ed563e6510fd4",
    "text": "import layoutparser as lp",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "Title",
    "element_id": "44691a14713d40ea25a0401490ed7b5e",
    "text": "wwe",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "ListItem",
    "element_id": "e14922762abe8a044371efcab13bdcc9",
    "text": "image = cv2.imread(\"image_file\") # load images",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "ListItem",
    "element_id": "986e6a00c43302413ca0ad4badd5bca8",
    "text": "model = lp. Detectron2LayoutModel (",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "ListItem",
    "element_id": "d50233678a0d15373eb47ab537d3c11e",
    "text": "ea \"lp: //PubLayNet/faster_rcnn_R_50_FPN_3x/config\")",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "ListItem",
    "element_id": "11dccdd53ee27c94e976b875d2d6e40d",
    "text": "layout = model.detect (image)",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "bb86a9374cb6126db4088d1092557d09",
    "text": "LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are ap- plied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, it is important to select models trained on a dataset similar to the test samples. A semantic syntax is used for initializing the model weights in Layout Parser, using both the dataset name and model name 1p://<dataset-name>/<model-architecture-name>.",
    "metadata": {
      "filetype": "image/jpeg",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "data_source": {
        "record_locator": {
          "path": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
        },
        "permissions_data": [
          {
            "mode": 33188
          }
        ]
      }
    }
  }
]