unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-18 13:03:01 +00:00

Author	SHA1	Message	Date
Klaijan	5ba3b9c2c6	chore: get eval metrics from ingest in (#2097 ) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-17 18:22:36 +00:00
Steve Canny	ee62ed7b54	rfctr(html): clean types and docs in prep for HTML table parsing fixes (#2104 ) There are a cluster of bugs in the HTML parsing code, particularly surrounding table behaviors but also inclusion of style elements, etc. Clean up typing and docstrings in that neighborhood as a way to familiarize myself with that part of the code-base.	2023-11-17 17:56:38 +00:00
Yuming Long	ef8ac7257d	Chore: Import tables_agent from inference (#2087 ) Related to https://github.com/Unstructured-IO/unstructured/issues/2028 Import `tables_agent` from inference and don't init in unst again copy the same logic from [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L89) (currently deprecated).	2023-11-16 22:14:24 -08:00
Yuming Long	97a25b0094	Chore: move hi res initialization `initialize.py` file out of ingest (#2096 ) Move Hi_res model initialization file out of ingest to `partition` dir --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-16 21:53:25 -08:00
Christine Straub	9c66eab8a9	Fix: handle pdf text extraction errors (#2101 ) Closes #2084. ### Summary Certain pdfs throw unexpected errors when being opened by `pdfminer`, causing `partition_pdf()` to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy. ### Testing PDF: [NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13383215/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf) ``` elements = partition_pdf( filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf", ) ```	2023-11-16 21:42:36 -08:00
Steve Canny	a589a494f6	docx: improve page break fidelity (#1631 ) Page breaks can and often do occur within a paragraph. Currently, the full text of such a paragraph is attributed to the page (number) the paragraph starts on. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements. This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support). That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey-patch that we had maintained up to now. Remove that monkey-patch.	2023-11-17 00:09:14 +00:00
Roman Isecke	b8af2f18bb	add mongo db destination connector (#2068 ) ### Description This adds the basic implementation of pushing the generated JSON output of partition to MongoDB. None of this code provisions the MongoDB instance, so things like adding a search index around the embedding content must be done by the user. Any sort of schema validation would also have to take place via user-specific configuration on the database. This update makes no assumptions about the configuration of the database itself.	2023-11-16 22:40:22 +00:00
Roman Isecke	ead2a7f1eb	drop cloud cli deps (#2088 ) ### Description To avoid requiring additional dependencies on cloud-related CLIs (i.e. gcloud and az), use Python and the existing dependencies already used to run our code to interact with those providers for overhead work associated with destination ingest tests.	2023-11-16 20:13:46 +00:00
Steve Canny	7a741c9ae6	fix(chunk): #1985 mis-splits of Table chunks (#2076 ) Closes #1985 Summary. Due to an interaction of coding errors, HTML text in `TableChunk` splits of a `Table` element were repeating the entire HTML for the table in each chunk. Technical Summary. This behavior was fixed but not published in the last chunking PR of a series. Finish up that PR and submit it all here. This PR extracts chunking to the particular Section type (each has their own distinct chunking behavior).	2023-11-16 16:22:50 +00:00
Steve Canny	41fc55bc12	fix(docx): tabulate output is non-deterministic (#2090 ) The test for nested tables added a few PRs ago indirectly relies on the padding added to table-HTML by `tabulate`. The length of that padding turns out to be non-deterministic, perhaps related to M1 vs. Intel hardware. Remove padding from tabulate output in the test so only actual content is compared.	2023-11-16 07:52:16 +00:00
cragwolfe	5fa40850f4	feat: convenience script to post files to the API (#2083 ) Usage: ./unstructured-get-json.sh [options] <file>" Options: --api-key KEY Specify the API key for authentication. Set the env var $UNST_API_KEY to skip providing this option. --hi-res hi_res strategy: Enable high-resolution processing, with layout segmentation and OCR --fast fast strategy: No OCR, just extract embedded text --ocr-only ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation. --tables Enable table extraction: tables are represented as html in metadata --coordinates Include coordinates in the output --trace Enable trace logging for debugging, useful to cut and paste the executed curl call --verbose Enable verbose logging including printing first 8 elements to stdout --s3 Write the resulting output to s3 (like a pastebin) --help Display this help and exit. Arguments: <file> File to send to the API. The script requires a <file>, the document to post to the Unstructured API. The .json result is written to ~/tmp/unst-outputs/ -- this path is echoed and copied to your clipboard.	2023-11-15 22:58:28 -08:00
cragwolfe	abe4e8191a	chore: ingest-script cleanup, better skip condition (#2094 ) When testing ingest tests, one often wants to keep the .json output or generated metrics files around for inspection after the fact. This updates the bash condition to actually honor the comment that mentions # export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1 Test Instructions Run: export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1 ./test_unstructured_ingest/src/s3.sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction and witness test directories/files do not get cleaned up. E.g., `test_unstructured_ingest/metrics-tmp/`. One can also add a `set -x` at the top of test_unstructured_ingest/cleanup.sh to see what is getting skipped (it's a lot!).	2023-11-15 22:28:04 -08:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
Klaijan	777a428071	chore: for ingest-test metrics, also check subdirs (#2079 ) - Copy script only went through one layer of subdirectory so it did not find the match between manifest file and structured output. Now edited to search all subdirectories. - `set -e` causes the script to exit at any exit rather than `exit 0`, fix all scripts that need to run the copy script to be `set +e` right before the check diff, then back to `set -e` after - Edit the default evaluation metrics output from `metrics` to `metrics-tmp` to account for diff check - Add a script that checks the differences between old eval metric output (metrics) and new eval metrics output (metrics-tmp)	2023-11-15 21:02:43 -08:00
Yao You	f1ad901f57	chore: add more parametrization to ingestion test (#2086 ) - allow the overwrite destination to be set to the `OUTPUT_ROOT` instead of default to script dir. ## test run ```bash OVERWRITE_FIXTURES=true OUTPUT_ROOT=/tmp ./test_unstructured_ingest/src/s3.sh ``` with this change we should find new files generated under `/tmp/expected-structured-output/s3`. Without this change there will be no such new files.	2023-11-15 22:32:41 +00:00
Steve Canny	252405c780	Dynamic ElementMetadata implementation (#2043 ) ### Executive Summary The structure of element metadata is currently static, meaning only predefined fields can appear in the metadata. We would like the flexibility for end-users, at their own discretion, to define and use additional metadata fields that make sense for their particular use-case. ### Concepts A key concept for dynamic metadata is _known field_. A known-field is one of those explicitly defined on `ElementMetadata`. Each of these has a type and can be specified when _constructing_ a new `ElementMetadata` instance. This is in contrast to an _end-user defined_ (or _ad-hoc_) metadata field, one not known at "compile" time and added at the discretion of an end-user to suit the purposes of their application. An ad-hoc field can only be added by _assignment_ on an already constructed instance. ### End-user ad-hoc metadata field behaviors An ad-hoc field can be added to an `ElementMetadata` instance by assignment: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 ``` A field added in this way can be accessed by name: ```python >>> metadata.coefficient 0.536 ``` and that field will appear in the JSON/dict for that instance: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 >>> metadata.to_dict() {"coefficient": 0.536} ``` However, accessing a "user-defined" value that has _not_ been assigned on that instance raises `AttributeError`: ```python >>> metadata.coeffcient # -- misspelled "coefficient" -- AttributeError: 'ElementMetadata' object has no attribute 'coeffcient' ``` This makes "tagging" a metadata item with a value very convenient, but entails the proviso that if an end-user wants to add a metadata field to _some_ elements and not others (sparse population), AND they want to access that field by name on ANY element and receive `None` where it has not been assigned, they will need to use an expression like this: ```python 
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None ``` ### Implementation Notes - ad-hoc metadata fields are discarded during consolidation (for chunking) because we don't have a consolidation strategy defined for those. We could consider using a default consolidation strategy like `FIRST` or possibly allow a user to register a strategy (although that gets hairy in non-private and multiple-memory-space situations.) - ad-hoc metadata fields cannot start with an underscore. - We have no way to distinguish an ad-hoc field from any "noise" fields that might appear in a JSON/dict loaded using `.from_dict()`, so unlike the original (which only loaded known-fields), we'll rehydrate anything that we find there. - No real type-safety is possible on ad-hoc fields but the type-checker does not complain because the type of all ad-hoc fields is `Any` (which is the best available behavior in my view). - We may want to consider whether end-users should be able to add ad-hoc fields to "sub" metadata objects too, like `DataSourceMetadata` and conceivably `CoordinatesMetadata` (although I'm not immediately seeing a use-case for the second one).	2023-11-15 13:22:15 -08:00
cragwolfe	d7a280402f	build: larger images for docker publish (#2082 ) Build and publish docker images on larger runner to work around the space issue here: https://github.com/Unstructured-IO/unstructured/actions/runs/6871101034/job/18689403845 .	2023-11-15 14:46:53 +00:00
Steve Canny	b8a8de33f4	fix(ingest): canonicalize ingest JSON (#2080 ) Canonicalize JSON produced for ingest tests such that incidental changes in the _form_ of the JSON objects (keys moving around) that do not change the _content_ of those JSON objects do not trigger an ingest-test failure.	2023-11-15 00:52:58 -08:00
Austin Walker	2931cb38e8	fix: handle KeyError: 'N' for certain pdfs (#2072 ) Closes #2059. We've found some pdfs that throw an error in pdfminer. These files use a ICCBased color profile but do not include an expected value `N`. As a workaround, we can wrap pdfminer and drop any colorspace info, since we don't need to render the document. To verify, try to partition the document in the linked issue. ``` elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-15 01:59:05 +00:00
Trevor Bossert	f8528a0e2c	Update base image to include CUDA 11.8 (#2053 ) This adds Nvidia GPU support with CUDA to container images.	2023-11-14 16:14:01 -08:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
Ahmet Melek	68686e292e	fix: check existence of variable res before iteration (#2063 ) Fixes a bug where `TypeError: 'NoneType' object is not iterable` is raised because variable `res` is returned as `None`. Checks the existence of `res` before iteration.	2023-11-14 16:07:54 +00:00
Yuming Long	6c9990b013	Chore: specify default language parameter to paddle with `DEFAULT_PADDLE_LANG` (#2065 ) Close: https://github.com/Unstructured-IO/unstructured-api/issues/247 ### Summary User now can specify default [paddle lang code](https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations) with env `DEFAULT_PADDLE_LANG` before we have the language mapping for paddle ### Test * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .` * check out to this branch * run paddle on intel chip: ``` pip install paddlepaddle pip install "unstructured.PaddleOCR" export OCR_AGENT=paddle export DEFAULT_PADDLE_LANG=ch make run-web-app ``` * curl: ``` curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -F 'files=@sample-docs/english-and-korean.png' \| jq -C . \| less -R ``` * expected to see `INFO Loading paddle with CPU on language=ch...` in log info	2023-11-13 22:05:37 +00:00
Trevor Bossert	22aedc4d6f	Remove ssh-keyscan and files (#2057 ) This was legacy and is no longer needed. It also has the effect of incorrect owner for known_hosts of notebook-user Relates to: #2056	2023-11-13 18:50:06 +00:00
Yao You	36c4441e2b	ci: parametrize ingest test checking scripts (#2062 ) - parametrize the output folder paths and expected output folder paths in comparison scripts - now allow user to use env `OUTPUT_ROOT` to control where the output and expected output is - currently assumes output from test and expected output are in the same directory; this may need separation later ## test run ```bash OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh ``` and it should show files changed but not able to show diff since there is no expected output content at `OUTPUT_ROOT`. Then run ```bash cp -R test_unstructured_ingest/expected-* /tmp/ OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh ``` we can see (due to CI and local instance producing different results) actual line by line diff	2023-11-13 18:42:19 +00:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00
Christine Straub	b11c546757	Fix: partition pdf overflow error (#2054 ) Closes #2050. ### Summary - set zoom to `1` if zoom is less than `0` when parsing Tesseract OCR data - update `determine_pdf_auto_strategy` to return the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true ### Testing PDF: [getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf) Run the following code in both the `main` branch and the `current` branch. ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="getty_62-62.pdf", extract_images_in_pdf=True, infer_table_structure=True, chunking_strategy="by_title", max_characters=4000, new_after_n_chars=3800, combine_text_under_n_chars=2000, image_output_dir_path=path, ) ``` 0.10.30	2023-11-10 11:01:46 -08:00
John	f8c180a59e	Jj/2027 float no attr strip (#2048 ) Closes #2027 Tables or pages that contain only numbers are returned as floats in a pandas.DataFrame when the image or page is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats. This update converts those floats if needed and otherwise strips the text. Testing (note: the document used for testing is new, so you will have to copy it to the main branch in order to see that this snippet raises an AttributeError on the main branch, but works on this branch) ``` from unstructured.partition.pdf import partition_pdf filename = "example-docs/all-number-table.pdf" partition_pdf(filename, strategy="ocr_only") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-10 05:14:06 +00:00
cragwolfe	fa27408c4f	chore: fix Makefile ingest targets (#2051 ) Fixes the Makefile `ingest-` targets that were broken in https://github.com/Unstructured-IO/unstructured/pull/1799/files. Test Instructions for maketarget in $(grep .PHONY Makefile \| grep install-ingest \| perl -p -e 's/.PHONY://' \| tr -d '\n'); do echo $maketarget; make $maketarget done	2023-11-09 21:55:27 -08:00
cragwolfe	69952f66ed	fix(build): update ingest script loc in Dockerfile (#2052 ) Fixes docker-smoke-test.sh to reference the new location for the wikipedia ingest script, which was moved in https://github.com/Unstructured-IO/unstructured/pull/1951 . This fix should allow the docker image build to complete on merges to main. Reference to recent failed job: https://github.com/Unstructured-IO/unstructured/actions/runs/6819416096/job/18546724401	2023-11-09 21:55:07 -08:00
Klaijan	049b0f3fa8	chore: update metrics-json-manifest (#2047 ) Update `metrics-json-manifest.txt` master file for ingest evaluation. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-10 00:24:59 +00:00
Steve Canny	d06bcc41bb	fix(docx): improve page-break detection (#2036 ) Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.	2023-11-09 20:34:30 +00:00
Christine Straub	3fe480799a	Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961 ) Closes #1875. ### Summary - add functionality to do a second OCR on cropped table images - use `IMAGE_CROP_PAD` env for `individual_blocks` mode ### Testing The test function [`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425) in `test_pdf.py` should pass. ### NOTE: I've tried to experiment with values for scaling ENVs on the following PRs but found that changes to the values for scaling ENVs affect the entire page OCR output(OCR regression) so switched to doing a second OCR for tables. - https://github.com/Unstructured-IO/unstructured/pull/1998/files - https://github.com/Unstructured-IO/unstructured/pull/2004/files - https://github.com/Unstructured-IO/unstructured/pull/2016/files - https://github.com/Unstructured-IO/unstructured/pull/2029/files --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-09 18:29:55 +00:00
Christine Straub	bb58c1bb0b	Refactor: element type (#2035 ) ### Summary - add constants for element type - replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the `ElementType` constants - replace element type strings using the constants ### Testing CI should pass.	2023-11-08 21:52:55 -08:00
Steve Canny	c688216b38	fix: remove `.max_characters` from ElementMetadata (#2032 ) This metadata field is assumedly vestigial and is unused by any code in the repo. `max_characters` is an optional argument to `chunk_by_title()` and has meaning in that context, but is not written to the metadata. Remove this unused field.	2023-11-08 19:56:31 +00:00
Steve Canny	0e2c21e5a2	fix: handle sectionless-docx in the general case (#1829 ) A DOCX document that has no sections can still contain one or more tables. Such files are never created by Word but Word can open them just fine. These can be and are generated by other applications. Use the newly-added `Document.iter_inner_content()` method added upstream in `python-docx` to capture both paragraphs and tables from a section-less DOCX document. This generalizes the fix for MS Teams chat-transcripts (an example of sectionless-docx) implemented in #1825.	2023-11-08 19:05:19 +00:00
shreyanid	67fa7ad867	feat: rework aggregate metrics by doctype calculation (#1982 ) ### Summary Previously, the holistic evaluation script was a copy of the ingest evaluation function with some modifications to aggregate the data by doctype. This refactor instead takes the result of the `measure_text_edit_distance` function (used by ingest) and aggregates the results by doctype. This pattern can also be followed by future aggregations we may want to perform. ### Test Confirm the doctype aggregation functionality of the `aggregate_cct_data_by_doctype` function by calling it on the ingest metrics result sheet: (from the top level unstructured folder) ``` python -c 'from unstructured.metrics.doctype_aggregation import *; aggregate_cct_data_by_doctype("./test_unstructured_ingest/metrics")' ``` The aggregated result will be written to the same metrics folder. <img width="680" alt="Screenshot 2023-11-03 at 2 56 20 PM" src="https://github.com/Unstructured-IO/unstructured/assets/42684285/7250191b-bdf7-4e9f-99ca-ddbe7ee74ac5">	2023-11-08 01:00:02 -08:00
ryannikolaidis	d5fd21f0fd	fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023 ) Closes #1064 When using the `--partition-by-api` flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. ## Changes * parse and pass relevant partition arguments to the api in unstructured-ingest * bonus: leverage an existing `partition.api` function to call out to the api rather than including duplicative request logic in unstructured ingest * bonus: --pdf-infer-table-structure is now a flag not an arg (it defaults false anyways, this is more succinct and consistent with similar parameters) * bonus: adds `hi_res_model_name` so a user can specify the model to leverage when using a hi_res strategy. ## Testing * update against_api.sh source test script to specify a partition argument and validates that the response from the api respected the argument * manually ran a request and validated that it was processed with chipper as specified (not sure if we want to bake a chipper request into the ci tests) (validated that the response leveraged the chipper model): ``` PYTHONPATH=. ./unstructured/ingest/main.py \ local \ --output-dir /tmp/ingest-requests/chipper \ --verbose \ --reprocess \ --strategy hi_res \ --partition-by-api \ --hi-res-model-name chipper \ --api-key "$API_KEY" \ --input-path 'example-docs/layout-parser-paper-with-table.pdf' ```	2023-11-08 04:47:02 +00:00
Roman Isecke	03f62faf9b	feat: add connection check method to all source and destination connectors (#2000 ) ### Description Add a `check_connection` method to each connector to easily be able to check it without running the full ingest process. As part of this PR, some refactoring done to allow clients to be shared and populated across the `check_connection` method and the `initialize` method, allowing for the `check_connection` method to be called without having to rely on the `initialize` one to be called first. * bonus: fix the changelog --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-11-08 03:11:39 +00:00
qued	92ddf3a337	feat: enable request timeout (#2013 ) Courtesy @cdpierse. Adds a test to PR #1529 in accordance with feedback. Description from original PR: In python the default behaviour of `requests.get` without a `timeout` being set is to hang indefinitely. We have a production use case where the desired behaviour would be to raise a timeout error rather than have the application just hang. This PR adds a new optional keyword parameter `request_timeout` to `partition` which is passed to `file_and_type_from_url` in the case where we are fetching from a URL. This is then passed to `requests.get` --------- Co-authored-by: Charles Pierse <charlespierse@gmail.com>	2023-11-08 00:44:58 +00:00
Steve Canny	80fe07b89f	fix: #1952 support nested docx tables (#2020 ) In DOCX, like HTML, a table cell can itself contain a table. This is not uncommon and is typically used for formatting purposes. When a DOCX table is nested, create nested HTML tables to reflect that structure and create a plain-text table which captures all the text in nested tables, formatting it as a reasonable facsimile of a table. This implements the solution described and spiked in PR #1952. --------- Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>	2023-11-08 00:37:21 +00:00
ryannikolaidis	0e94dd5d65	fix: ingest destination test failure with missing output (#2031 ) Intermittently the various destination test will fail with: ``` {noformat}--- Cleanup done --- gs://utic-test-ingest-fixtures-output/1699377964/example-docs/ deleting gs://utic-test-ingest-fixtures-output/1699377964 Removing objects: ERROR: (gcloud.storage.rm) The following URLs matched no objects or files: -gs://utic-test-ingest-fixtures-output/1699377964 Last ran script: gcs.sh Error: Process completed with exit code 1.{noformat} ``` Reference trace [here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020) After some investigation it looks like this error is due to collisions that occur because we’re assuming 1s date accuracy is sufficient when generating (and deleting) "unique" test destination location names. The likelihood is actually pretty high given that we run these tests against a test matrix. Instead we should just use a uuid for these unique destinations. ## Changes - Use uuidgen instead of `date +%s` for unique destinations	2023-11-07 23:14:01 +00:00
qued	04fcdb91fe	chore: Update readme slack links (#2030 ) Updated slack links in the README that were using an old shortened URL.	2023-11-07 13:02:43 -08:00
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981 ) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00
Yuming Long	ad14321016	Chore: don't pass empty language code to tesseract CLI (#1996 ) Summary: Close: https://github.com/Unstructured-IO/unstructured/issues/1920 * stop passing in empty string from `languages` to tesseract, which will result in passing empty string to language config `-l` for the tesseract CLI * also stop passing in duplicate language code from `languages` to tesseract OCR * if we failed to convert any iso languages from the `languages` parameter, proceed OCR with `eng` as default ### Test * First confirm the tesseract error `Estimating resolution as X` before this: * on the `unstructured-api` repo with main branch, run `make run-web-app` * curl to test error from empty string, or just any wrong input like `-F 'languages="eng,de"'`: ``` curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \ -F 'languages=""' \ -F 'strategy=hi_res' \ -F 'pdf_infer_table_structure=True' \ \| jq -C . \| less -R ``` * after this change: * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .` * check out to this branch * run `make run-web-app` again in api repo * the curl command return output and see warning in log --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> 0.10.29	2023-11-06 19:30:12 -06:00
Yao You	38ab35dcb6	fix: make pip compile (#2015 ) - add missing make file in ingest folder	2023-11-06 16:26:12 -06:00
qued	ad09a869b5	fix: update slack link to link shortener (#2010 ) Per @tabossert we're now using a link shortener behind which we can rotate the link to keep it current. That way we (🤞 ) never have to update this here again. #### Testing: Links should work. No more links should exist in the documentation except this one.	2023-11-06 15:47:18 +00:00
Ahmet Melek	ca78dc737a	feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918 ) Closes #1782 This PR: - Extends ingest pipeline so that it is possible to select an embedding provider from a range of providers - Modifies the ingest embedding test to be a diff test, since the embedding vectors are reproducible after supporting multiple providers Additional info on the chosen provider for the test: - Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic even when there's no seed set - Took 6.84s to pass a unit test with the provider (without cache, including model download) - `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it zero cost For all these reasons, testing embedding modules with the Huggingface model seems to be making sense --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-11-06 12:26:12 +00:00
Trevor Bossert	24d5877bd6	Bump base image with latest security fixes (#2009 ) This includes latest version and security updates available from upstream	2023-11-05 19:29:29 +00:00
Matt Robinson	e5bcd36475	docs: update slack links (#1990 ) ### Summary A user in the [Community Slack](https://unstructuredw-kbe4326.slack.com/archives/C043YA29U0J/p1698933003702919) reported having difficulty signing up for Slack using the links from the documentation. Updated the links to use the invite link that worked for him, which came from [this blog post](https://medium.com/unstructured-io/setting-up-a-private-retrieval-augmented-generation-rag-system-with-local-vector-database-d42f34692ca7).	2023-11-05 11:26:34 -08:00
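
The JSON canonicalization described in commit b8a8de33f4 (#2080) above can be illustrated with a minimal Python sketch. It assumes that "canonical" means re-serializing with sorted keys and fixed indentation; the helper name `canonicalize_json` and the sample objects are hypothetical and are not taken from the repository.

```python
import json


def canonicalize_json(text: str) -> str:
    """Re-serialize JSON text with sorted keys and fixed indentation.

    Hypothetical helper: two documents that differ only in key order
    produce identical output, so a textual diff ignores that difference.
    """
    data = json.loads(text)
    return json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Same content, different key order (illustrative data only).
doc_a = '{"type": "Title", "text": "Hello", "metadata": {"page_number": 1, "filename": "a.pdf"}}'
doc_b = '{"metadata": {"filename": "a.pdf", "page_number": 1}, "text": "Hello", "type": "Title"}'

print(doc_a == doc_b)
print(canonicalize_json(doc_a) == canonicalize_json(doc_b))
```

Sorting keys removes ordering noise only; a changed value still produces a different canonical string, so a real content change would still show up in a diff.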


1418 Commits
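
The language-code handling described in commit ad14321016 (#1996) in the log above might look roughly like the following sketch. It covers only the steps named in the commit message (drop empty codes, drop duplicates, fall back to `eng`); the function name `build_tesseract_lang_arg` is hypothetical, the ISO-to-Tesseract conversion step is omitted, and the inputs are assumed to already be Tesseract codes. Joining codes with `+` follows the Tesseract CLI convention for the `-l` option.

```python
from typing import Iterable


def build_tesseract_lang_arg(languages: Iterable[str], default: str = "eng") -> str:
    """Build a value for Tesseract's `-l` option from a list of codes.

    Hypothetical helper, not the repository's implementation.
    """
    cleaned = []
    for code in languages:
        code = code.strip()
        # Skip empty strings and codes that were already seen.
        if code and code not in cleaned:
            cleaned.append(code)
    # Fall back to the default when nothing usable remains.
    return "+".join(cleaned) if cleaned else default


print(build_tesseract_lang_arg(["eng", "", "kor", "eng"]))
print(build_tesseract_lang_arg([""]))
```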