Canonicalize JSON produced for ingest tests so that incidental changes in the
_form_ of the JSON objects (keys moving around) that do not change the
_content_ of those objects do not trigger an ingest-test failure.
### Summary
Closes #2011
`languages` was missing from the metadata when partitioning PDFs via the
`hi_res` and `fast` strategies, and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.
### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```python
from unstructured.partition.image import partition_image

filename = "example-docs/english-and-korean.png"
elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages
elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```
For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```python
from unstructured.partition.pdf import partition_pdf

filename = "example-docs/korean-text-with-tables.pdf"
elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```
On this branch, `languages` is included in the metadata regardless of
strategy.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements nested inside elements from other detection origins
such as YoloX or detectron.
The function returns the cleaned document.
Also, the ingest-test fixtures were updated to reflect the new standard
output.
The best way to check that this function is working properly is to check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`.
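For reference, the core idea reduces to a bounding-box containment check. Below is a minimal, self-contained sketch of that idea; the names and attributes are hypothetical stand-ins, not the actual implementation:
```python
# Sketch: drop pdfminer regions contained in a region from another detector.
from typing import List, NamedTuple


class Region(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    source: str  # e.g. "pdfminer", "yolox", "detectron"


def is_inside(inner: Region, outer: Region) -> bool:
    """True when `inner` lies fully within `outer`."""
    return (
        inner.x1 >= outer.x1
        and inner.y1 >= outer.y1
        and inner.x2 <= outer.x2
        and inner.y2 <= outer.y2
    )


def drop_nested_pdfminer(elements: List[Region]) -> List[Region]:
    detected = [el for el in elements if el.source != "pdfminer"]
    return [
        el
        for el in elements
        if el.source != "pdfminer" or not any(is_inside(el, d) for d in detected)
    ]
```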
---------
Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
### Summary
A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and noted the
table regression if we only do one OCR pass for the entire doc.
**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable used when comparing bounding boxes
of `ocr_region` with `table_element`
* added more tests for tables and noted the table regression in tests for
`partition_pdf`
### Test
* to verify we stopped passing the `extract_tables` parameter to inference, run
the test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this
branch and you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means it called the table OCR in the
inference repo. This branch removes the warning.
### Description
* The priority here was to fix deserialization of ingest docs; previously
the source metadata wasn't being persisted.
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* As part of serialization, docs were being forced to fetch source metadata
(if they hadn't already) to add it to the generated dict/JSON. This shouldn't
happen when the underlying variable `_source_metadata` is `None`; now the
doc can be serialized without any calls being made (see the sketch below).
* Serialization was moved to the `to_dict` method to make it more
universal.
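A rough sketch of the lazy behavior described above; class and attribute names other than `_source_metadata` are hypothetical:
```python
# Sketch of lazy source-metadata serialization: `to_dict` only includes
# metadata that was already fetched; it never triggers a fetch itself.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class IngestDocSketch:
    path: str
    _source_metadata: Optional[Dict[str, Any]] = field(default=None)

    def fetch_source_metadata(self) -> Dict[str, Any]:
        """Populate metadata on demand (stand-in for a remote call)."""
        if self._source_metadata is None:
            self._source_metadata = {"url": self.path}
        return self._source_metadata

    def to_dict(self) -> Dict[str, Any]:
        # No fetch here: serializing a doc must not make remote calls.
        d: Dict[str, Any] = {"path": self.path}
        if self._source_metadata is not None:
            d["source_metadata"] = self._source_metadata
        return d


doc = IngestDocSketch("s3://bucket/file.pdf")
assert "source_metadata" not in doc.to_dict()  # no fetch was triggered
```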
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Some `OCR` elements with only spaces in the text have full-page width in
the bounding box, which causes the `xycut` sorting to not work as
expected. Now the logic to parse OCR results removes any elements with
only spaces (more than one space).
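The fix boils down to filtering whitespace-only OCR regions before sorting; a minimal sketch, assuming each OCR result exposes a `text` field:
```python
# Drop OCR regions whose text is only whitespace so their (often
# page-wide) bounding boxes cannot distort xy-cut sorting.
def drop_blank_ocr_regions(regions):
    return [r for r in regions if r["text"].strip()]


regions = [
    {"text": "Header", "bbox": (10, 10, 120, 30)},
    {"text": "   ", "bbox": (0, 40, 612, 60)},  # full-width, spaces only
]
assert drop_blank_ocr_regions(regions) == [regions[0]]
```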
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
PR to support schema changes introduced in [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.
Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured: this class is no longer a subclass of `Rectangle`; instead
`LayoutElement` has a `bbox` property that captures the location
information and a `from_coords` method that allows construction of a
`LayoutElement` directly from coordinates (illustrated after this list).
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing.
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.
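For illustration, constructing an element under the new schema might look like the following; the import path and the exact `from_coords` signature are assumptions, as only the `bbox` property and `from_coords` method are named above:
```python
# Hypothetical sketch of the structural change: location data moves off the
# element onto a `bbox` property, with `from_coords` as a convenience
# constructor. Import path and exact signature are assumptions.
from unstructured_inference.inference.layoutelement import LayoutElement

el = LayoutElement.from_coords(10, 20, 110, 220, text="Title")
print(el.bbox)  # a Rectangle-like object; the element itself is no longer one
```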
#### Testing:
The following demonstrates that the new version of chipper is inferring
hierarchy.
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)
```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
```
---------
Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
### Description
Add a new parameter that maps to the `skip_infer_table_types` partition
arg. It applies to the partition config, which is set on all connectors.
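A usage sketch of the partition arg this new parameter maps to (the filetype list shown is arbitrary):
```python
from unstructured.partition.auto import partition

# Skip table-structure inference for the listed filetypes.
elements = partition(filename="example-docs/layout-parser-paper-fast.pdf",
                     skip_infer_table_types=["pdf"])
```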
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
## Summary
Second part of the OCR refactor to move it from the inference repo to the
unstructured repo; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/231. This
PR adds the OCR processing logic for entire-page OCR and supports two OCR
modes, "entire_page" and "individual_blocks".
The updated workflow for `hi_res` partition (sketched in code below):
* pass the document as data/filename to the inference repo to get
`inferred_layout` (DocumentLayout)
* pass the document as data/filename to the OCR module, which first opens the
document (creating a temp file/dir as needed) and splits it into pages
(converting PDF pages to image pages for PDF files)
* if ocr mode is `"entire_page"`:
  * OCR the entire image
  * merge the OCR layout with the inferred page layout
* if ocr mode is `"individual_blocks"`:
  * from the inferred page layout, find elements with no extracted text and
crop the page image to each element's bbox
  * replace each empty-text element with the text obtained from OCRing the
cropped image
* return all merged PageLayouts to form a DocumentLayout for later processing
This PR also bumps `unstructured-inference==0.7.2` since the branch relies
on the OCR refactor from unstructured-inference.
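A runnable sketch of the two OCR modes described in the workflow above; every type and helper here is a simplified stand-in, not the OCR module's actual API:
```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    bbox: Tuple[int, int, int, int]
    text: str = ""


@dataclass
class PageLayout:
    elements: List[Element] = field(default_factory=list)


def ocr_image(image) -> str:
    """Stand-in for a real OCR call (e.g. tesseract) returning page text."""
    return "ocr text"


def ocr_page(image, inferred: PageLayout, ocr_mode: str) -> PageLayout:
    if ocr_mode == "entire_page":
        # One OCR pass over the whole page, then merge with inferred layout.
        page_text = ocr_image(image)
        for el in inferred.elements:
            if not el.text:
                el.text = page_text  # real merge matches regions by bbox overlap
        return inferred
    if ocr_mode == "individual_blocks":
        # OCR only the blocks the layout model found no text for.
        for el in inferred.elements:
            if not el.text:
                cropped = image  # real code crops `image` to el.bbox first
                el.text = ocr_image(cropped)
        return inferred
    raise ValueError(f"unsupported ocr_mode: {ocr_mode}")
```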
## Test
```
from unstructured.partition.auto import partition
entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entire_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
latest output:
```
# entire_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- bump `unstructured-inference` to `0.6.6`
- specify the default model name for element detection to be
`detectron2_onnx` to keep current behavior (usage sketch below)
- NOTE: the updated inference package would by default use yolox as the
element detection model; this will be evaluated and enabled in a
separate PR
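To pin the current default explicitly, the `model_name` parameter (also used in the chipper example earlier in this log) can be passed through `partition_pdf`:
```python
from unstructured.partition.pdf import partition_pdf

# Pin the current default detection model explicitly.
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf",
                         strategy="hi_res", model_name="detectron2_onnx")
```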
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Closes GH Issue #1233.
### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf):
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
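For context, the shrink step reduces each box around its unchanged center by a fixed factor; a minimal standalone sketch:
```python
# Shrink a bbox (x1, y1, x2, y2) around its center before xy-cut sorting,
# so slight overlaps between neighboring boxes don't merge cut regions.
def shrink_bbox(bbox, factor=0.9):
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


print(shrink_bbox((0, 0, 100, 40)))  # (5.0, 2.0, 95.0, 38.0)
```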
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements that would otherwise be ignored due to garbage
text when `el.metadata.image_path` is set
### Testing
```python
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
strategy = "hi_res"  # image extraction applies to the hi_res strategy

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
```
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.
---------
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
**Summary**
Adds logic to combine broken numbered lists for the PDF `fast` strategy.
**Details**
Previously the numbered list items in the `layout-parser-paper-fast.pdf`
file were read as:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
Now it reads:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
The added logic leverages `ElementType` and `coordinates` to determine
whether the following line is part of the previously detected
`ListItem` or not.
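A simplified illustration of that check; the regex and alignment tolerance below are hypothetical stand-ins for the real `ElementType`/`coordinates` logic:
```python
import re

# Heuristic sketch: a line continues the previous ListItem when it does not
# start its own numbering and is left-aligned with the list item's x origin.
NUMBERED = re.compile(r"^\d+\.\s")


def continues_list_item(prev_x1: float, line_text: str, line_x1: float,
                        tolerance: float = 2.0) -> bool:
    starts_new_item = bool(NUMBERED.match(line_text))
    aligned = abs(line_x1 - prev_x1) <= tolerance
    return not starts_new_item and aligned


print(continues_list_item(72.0, "recognition, and other DIA tasks", 72.5))  # True
print(continues_list_item(72.0, "2. A rich repository of models", 72.0))   # False
```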
**Test**
Add a test that checks the element count is less than in the original
version with the broken numbered list. The test also checks that the
first detected numbered list item ends with the previously broken line.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway, either because it
was embedded in the PDF or because we found it using OCR.
This commit is to allow for attempting to categorize these uncategorized
elements using our text-based classification function,
`element_from_text`.
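For example, `element_from_text` maps raw text to a typed element based on text patterns; the import path below is an assumption about where the function lives:
```python
# `element_from_text` infers an element type from text patterns; the import
# path is an assumption about where the function lived at this point.
from unstructured.partition.text import element_from_text

print(element_from_text("1. An off-the-shelf toolkit"))  # expected: a ListItem
print(element_from_text("Just an ordinary sentence."))   # expected: NarrativeText
```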
Bumps `unstructured-inference==0.5.23` to pull in @christinestraub's fix:
https://github.com/Unstructured-IO/unstructured-inference/pull/198, so
embedded images in PDFs are now included in partition results (`hi_res`).
From the perspective of elements with clean text, this is not a big win
as a lot of the images have OCR garbage. However, it is important to
preserve image elements for other downstream use cases, so overall this
is a step forward.
The issue was that for blocks detected in an image such as the full page at
https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png,
many `ListItem`s would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:
https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280
Test Instructions:
1. run the following snippet:
```
import json
import os
from datetime import datetime
from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)
print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
d1 = datetime.now()
elements = partition(filename=filename, strategy=strategy)
elems_as_dicts = json.loads(elements_to_json(elements, indent=2))
# strip out metadata for the sake of more readable results
for element_dict in elems_as_dicts:
del element_dict["metadata"]
    json_filename = f"{output_filename_part}-{strategy}.json"
with open(json_filename, "w") as jsonf:
jsonf.write(json.dumps(elems_as_dicts, indent=2))
d2 = datetime.now()
print(f"num elements for {strategy}: {len(elements)}")
print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.
2. Open the json file that was written to your `output_dir`, named
`IRS-form-1987.png-hi_res.json`, and witness the new element:
```
  {
    "type": "ListItem",
    "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
    "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes."
  },
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
(see the sketch after this list)
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
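To try both orderings from Python, a usage sketch follows; note the `sort_mode` parameter name is an assumption based on this description, not confirmed here:
```python
# Assumed interface: a `sort_mode` kwarg switching between the existing
# "basic" ordering and the new "xy-cut" ordering.
from unstructured.partition.pdf import partition_pdf

elements_xy = partition_pdf("example-docs/multi-column-2p.pdf",
                            strategy="fast", sort_mode="xy-cut")
elements_basic = partition_pdf("example-docs/multi-column-2p.pdf",
                               strategy="fast", sort_mode="basic")
```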
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
Set `ocr_mode` to `individual_blocks` for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
Bump to unstructured-inference==0.5.13, which fixes extracted image elements
being included in the layout merge; this addresses the issue where an
entire-page image in a PDF was not passed to the layout model when using `hi_res`.
* add param
* expected test
* add option (to do doc nit)
* test with api for now
* typo
* test with api key
* use local only
* encoding -> partition-encoding
* changelog and version
* Update ingest test fixtures (#1055)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
* ignore coordinates
* no whitespace lol
* Update ingest test fixtures (#1061)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>