2023-10-12 12:14:53 -04:00
[
{
2024-05-21 13:01:49 -04:00
"type" : "Title" ,
2024-04-24 09:05:20 +02:00
"element_id" : "5d45a28d875e403c7294a15f22a0162f" ,
"text" : "LayoutParser: A Unified Toolkit for DL-Based DIA 5" ,
"metadata" : {
"filetype" : "image/jpeg" ,
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011
`languages` was missing from the metadata when partitioning PDFs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.
### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image
elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages
elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```
For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```
On this branch, `languages` is included in the metadata regardless of
strategy.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 10:47:05 -06:00
"languages" : [
"eng"
] ,
"page_number" : 1 ,
2023-10-23 11:51:52 -04:00
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "FigureCaption" ,
"element_id" : "d9d53799fbfc3f90096f9dc9d45ff667" ,
"text" : "Table 1: Current layout detection models in the LayoutParser model zoo" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "Table" ,
"element_id" : "dddac446da6c93dc1449ecb5d997c423" ,
"text" : "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century \u2018TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents" ,
"metadata" : {
"text_as_html" : "<table><thead><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></thead><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></table>" ,
2024-04-25 13:14:48 +02:00
"table_as_cells" : [
{
"x" : 0 ,
"y" : 0 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Dataset"
} ,
{
"x" : 0 ,
"y" : 1 ,
"w" : 1 ,
"h" : 1 ,
"content" : "PubLayNet [33]"
} ,
{
"x" : 0 ,
"y" : 2 ,
"w" : 1 ,
"h" : 1 ,
"content" : "PRImA [3]"
} ,
{
"x" : 0 ,
"y" : 3 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Newspaper [17]"
} ,
{
"x" : 0 ,
"y" : 4 ,
"w" : 1 ,
"h" : 1 ,
"content" : "TableBank [18]"
} ,
{
"x" : 0 ,
"y" : 5 ,
"w" : 1 ,
"h" : 1 ,
"content" : "HIDataset [31]"
} ,
{
"x" : 1 ,
"y" : 0 ,
"w" : 1 ,
"h" : 1 ,
"content" : "| Base Model!|"
} ,
{
"x" : 1 ,
"y" : 1 ,
"w" : 1 ,
"h" : 1 ,
"content" : "P/M"
} ,
{
"x" : 1 ,
"y" : 2 ,
"w" : 1 ,
"h" : 1 ,
"content" : "M"
} ,
{
"x" : 1 ,
"y" : 3 ,
"w" : 1 ,
"h" : 1 ,
"content" : "P"
} ,
{
"x" : 1 ,
"y" : 4 ,
"w" : 1 ,
"h" : 1 ,
"content" : "P"
} ,
{
"x" : 1 ,
"y" : 5 ,
"w" : 1 ,
"h" : 1 ,
"content" : "P/M"
} ,
{
"x" : 2 ,
"y" : 0 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Large Model"
} ,
{
"x" : 2 ,
"y" : 1 ,
"w" : 1 ,
"h" : 1 ,
"content" : "M"
} ,
{
"x" : 2 ,
"y" : 2 ,
"w" : 1 ,
"h" : 1 ,
"content" : ""
} ,
{
"x" : 2 ,
"y" : 3 ,
"w" : 1 ,
"h" : 1 ,
"content" : ""
} ,
{
"x" : 2 ,
"y" : 4 ,
"w" : 1 ,
"h" : 1 ,
"content" : ""
} ,
{
"x" : 2 ,
"y" : 5 ,
"w" : 1 ,
"h" : 1 ,
"content" : ""
} ,
{
"x" : 3 ,
"y" : 0 ,
"w" : 1 ,
"h" : 1 ,
"content" : "| Notes"
} ,
{
"x" : 3 ,
"y" : 1 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Layouts of modern scientific documents"
} ,
{
"x" : 3 ,
"y" : 2 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Layouts of scanned modern magazines and scientific reports"
} ,
{
"x" : 3 ,
"y" : 3 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Layouts of scanned US newspapers from the 20th century"
} ,
{
"x" : 3 ,
"y" : 4 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Table region on modern scientific and business document"
} ,
{
"x" : 3 ,
"y" : 5 ,
"w" : 1 ,
"h" : 1 ,
"content" : "Layouts of history Japanese documents"
}
] ,
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "UncategorizedText" ,
"element_id" : "e5314387378c7a98911d71c145c45327" ,
"text" : "2" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "FigureCaption" ,
"element_id" : "e262996994d01c45f0d6ef28cb8afa93" ,
"text" : "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For \u201cbase model\u201d and \u201clarge model\u201d, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months." ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "NarrativeText" ,
"element_id" : "2298258fe84201e839939d70c168141b" ,
"text" : "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility functions for the visualization and stomge of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training. We now provide detailed descriptions for each component." ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "Title" ,
"element_id" : "24d2473c4975fedd3f5cfd3026249837" ,
"text" : "3.1 Layout Detection Models" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "NarrativeText" ,
"element_id" : "008c0a590378dccd98ae7a5c49905eda" ,
"text" : "In LayoutParser, a layout model takes a document image as an input and generates a list of rectangular boxes for the target content regions. Different from traditional methods, it relies on deep convolutional neural networks rather than manually curated rules to identify content regions. It is formulated as an object detection problem and state-of-the-art models like Faster R-CNN [28] and Mask R-CNN [12] are used. This yields prediction results of high accuracy and makes it possible to build a concise, generalized interface for layout detection. LayoutParser, built upon Detectron2 [35], provides a minimal API that can perform layout detection with only four lines of code in Python:" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "ListItem" ,
"element_id" : "b98aac79b1c1af144f6ed563e6510fd4" ,
"text" : "import layoutparser as lp" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "Title" ,
"element_id" : "44691a14713d40ea25a0401490ed7b5e" ,
"text" : "wwe" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "ListItem" ,
"element_id" : "e14922762abe8a044371efcab13bdcc9" ,
"text" : "image = cv2.imread(\"image_file\") # load images" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "ListItem" ,
"element_id" : "986e6a00c43302413ca0ad4badd5bca8" ,
"text" : "model = lp. Detectron2LayoutModel (" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "ListItem" ,
"element_id" : "d50233678a0d15373eb47ab537d3c11e" ,
"text" : "ea \"lp: //PubLayNet/faster_rcnn_R_50_FPN_3x/config\")" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "ListItem" ,
"element_id" : "11dccdd53ee27c94e976b875d2d6e40d" ,
"text" : "layout = model.detect(image)" ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
} ,
{
"type" : "NarrativeText" ,
"element_id" : "bb86a9374cb6126db4088d1092557d09" ,
"text" : "LayoutParser provides a wealth of pre-trained model weights using various datasets covering different languages, time periods, and document types. Due to domain shift [7], the prediction performance can notably drop when models are applied to target samples that are significantly different from the training dataset. As document structures and layouts vary greatly in different domains, it is important to select models trained on a dataset similar to the test samples. A semantic syntax is used for initializing the model weights in LayoutParser, using both the dataset name and model name: lp://<dataset-name>/<model-architecture-name>." ,
"metadata" : {
"filetype" : "image/jpeg" ,
"languages" : [
"eng"
] ,
"page_number" : 1 ,
"data_source" : {
"record_locator" : {
"path" : "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper-with-table.jpg"
} ,
"permissions_data" : [
{
"mode" : 33188
}
]
}
}
}
]