unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-15 01:47:27 +00:00

Author	SHA1	Message	Date
John	6d7fe3ab02	fix: default to None for the languages metadata field (#1743 ) ### Summary Closes #1714 Changes the default value for `languages` to `None` for elements that don't have text or the language can't be detected. ### Testing ``` from unstructured.partition.auto import partition filename = "example-docs/handbook-1p.docx" elements = partition(filename=filename, detect_language_per_element=True) # PageBreak elements don't have text and will be collected here none_langs = [element for element in elements if element.metadata.languages is None] none_langs[0].text ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-14 22:46:24 +00:00
Amanda Cameron	d0c84d605c	chore: updating table docs with file extensions (#1702 ) gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691 Adding filetype extensions from this [list](`f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200)`) where applicable. --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-10-14 14:14:52 -07:00
qued	cf31c9a2c4	fix: use nx to avoid recursion limit (#1761 ) Fixes recursion limit error that was being raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an excel sheet. However this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library which avoids Python recursion issues. * Updated `_get_connected_components` to use `networkx` graph methods rather than implementing our own algorithm for finding contiguous groups of cells within a sheet. * Added a test and example doc that replicates the `RecursionError` prior to the change. * Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d. #### Testing: The following run from a Python terminal should raise a `RecursionError` on `main` and succeed on this branch: ```python import sys from unstructured.partition.xlsx import partition_xlsx old_recursion_limit = sys.getrecursionlimit() try: sys.setrecursionlimit(1000) filename = "example-docs/more-than-1k-cells.xlsx" partition_xlsx(filename=filename) finally: sys.setrecursionlimit(old_recursion_limit) ``` Note: the recursion limit is different in different contexts. Checking my own system, the default in a notebook seems to be 3000, but in a terminal it's 1000. The documented Python default recursion limit is 1000.	2023-10-14 19:38:21 +00:00
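The recursion-limit problem described in the commit above can be illustrated without the library. The sketch below is not the `_get_connected_components` implementation and does not use `networkx`. It is a plain iterative flood fill, with a hypothetical `connected_cells` helper, that shows why an explicit stack sidesteps Python's recursion depth limit.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]


def connected_cells(filled: Iterable[Cell]) -> List[Set[Cell]]:
    """Group filled (row, col) cells into 4-connected components (hypothetical helper)."""
    remaining = set(filled)
    components: List[Set[Cell]] = []
    while remaining:
        # Start a new component from any cell that has not been visited yet.
        stack = [remaining.pop()]
        component: Set[Cell] = set()
        while stack:
            row, col = stack.pop()
            component.add((row, col))
            for neighbor in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
                if neighbor in remaining:
                    # Mark as visited by removing it, then explore it later.
                    remaining.remove(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# One column of 5,000 contiguous cells (well past a recursion limit of 1000)
# plus one isolated cell.
column = [(row, 0) for row in range(5000)]
groups = connected_cells(column + [(0, 5)])
print(sorted(len(group) for group in groups))
```

Because the pending cells live in a list rather than on the call stack, the size of a contiguous block is limited only by memory.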
cragwolfe	3f32c6702a	feat: bump unstructured-inference=0.7.5 for faster chipper (#1756 ) Improved inference speed for Chipper V2: API requests with 'hi_res_model_name=chipper' now get ~2-3x faster responses.	2023-10-14 13:03:59 -07:00
Minwoo Byeon (Dylan)	3331c5c6c0	Remove the temporary files when the conversion is finished. (#1696 ) Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-13 18:51:44 -05:00
qued	95728ead0f	fix: zero divide in under_non_alpha_ratio (#1753 ) The function `under_non_alpha_ratio` in `unstructured.partition.text_type` was producing a divide-by-zero error. After investigation I found this was a possibility when the function was passed a string of all spaces. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-13 21:20:01 +00:00
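A guard against the all-spaces case described above might look like the following. This is only a sketch, not the code in `unstructured.partition.text_type`: the function name `alpha_ratio_below`, its signature, and the choice to return `False` for whitespace-only input are assumptions made for illustration.

```python
def alpha_ratio_below(text: str, threshold: float = 0.5) -> bool:
    """Return True when alphabetic characters make up less than `threshold`
    of the non-space characters (hypothetical stand-in for the real function)."""
    non_space = [char for char in text if not char.isspace()]
    if not non_space:
        # An empty or all-whitespace string has nothing to measure, so
        # return False instead of dividing by zero.
        return False
    alpha_count = sum(char.isalpha() for char in non_space)
    return alpha_count / len(non_space) < threshold


print(alpha_ratio_below("     "))
print(alpha_ratio_below("--- 123 ---"))
print(alpha_ratio_below("plain words"))
```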
M Bharat lal	21df17f7fa	fix: consider all the required lines instead of first line to detect file type as CSV (#1728 ) The current CSV detection logic in file_utils/filetype.py does not compare comma counts across all of the sampled lines. It only compares the first line against itself, so the check always returns true: ``` lines = lines[: len(lines)] if len(lines) < 10 else lines[:10] header_count = _count_commas(lines[0]) if any("," not in line for line in lines): return False return all(_count_commas(line) == header_count for line in lines[:1]) ``` Fixed by comparing every line after the first against the header count, as shown below: ``` lines = lines[: len(lines)] if len(lines) < 10 else lines[:10] header_count = _count_commas(lines[0]) if any("," not in line for line in lines): return False return all(_count_commas(line) == header_count for line in lines[1:]) ```	2023-10-13 13:36:05 -07:00
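The effect of the `lines[:1]` versus `lines[1:]` slice can be seen in a small standalone version of the check. This is a sketch under assumptions: `looks_like_csv` is a hypothetical name, and `str.count(",")` stands in for the library's `_count_commas`, whose implementation is not shown in the commit.

```python
from typing import List


def looks_like_csv(lines: List[str], max_lines: int = 10) -> bool:
    """Heuristic CSV check over the first `max_lines` lines (hypothetical helper)."""
    sample = lines[:max_lines]
    if not sample or any("," not in line for line in sample):
        return False
    header_count = sample[0].count(",")
    # Compare every line after the header with the header's comma count.
    return all(line.count(",") == header_count for line in sample[1:])


print(looks_like_csv(["a,b,c", "1,2,3", "4,5,6"]))
print(looks_like_csv(["a,b,c", "1,2", "4,5,6"]))
```

With the earlier `[:1]` slice, the second call would also have passed, because only the header was compared with itself.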
Christine Straub	ef391e1a3e	feat: less precision in json floats (#1718 ) Closes #1340. ### Summary - add functionality to limit precision when serializing to JSON ### Testing ``` elements = partition(raw_doc.<extension>) output_json = elements_to_json(elements) print(output_json) ```	2023-10-13 11:06:36 -07:00
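One way to limit float precision during JSON serialization is to round values before dumping them. The sketch below is an illustration only, not how `elements_to_json` does it: `round_floats`, the `ndigits` parameter, and the sample `element` dictionary are all assumptions.

```python
import json
from typing import Any


def round_floats(value: Any, ndigits: int = 1) -> Any:
    """Recursively round every float in a JSON-like structure (hypothetical helper)."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {key: round_floats(item, ndigits) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_floats(item, ndigits) for item in value]
    return value


# Made-up element data, used only to exercise the helper.
element = {"text": "Hello", "coordinates": {"points": [[12.3456789, 7.000001], [98.7654321, 45.5]]}}
print(json.dumps(round_floats(element)))
```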
Austin Walker	ad1b93dbaa	chore: cut the 0.10.22 release (#1749 ) 0.10.22	2023-10-13 17:17:21 +00:00
ryannikolaidis	d9a0bd741a	fix: build test failures (#1748 ) * Fix missing HF_TOKEN when running containerized test for the build process * Fix pytest args when running specific test ## Testing Example run of the HF_TOKEN assigned for the containerized test in the build process: https://github.com/Unstructured-IO/unstructured/actions/runs/6504556437/job/17666669155 Example run of the pytest args working for the arm test (ran in a new workflow for testing on push): https://github.com/Unstructured-IO/unstructured/actions/runs/6504213010	2023-10-13 01:08:27 -07:00
Steve Canny	4b84d596c2	docx: add hyperlink metadata (#1746 )	2023-10-13 06:26:14 +00:00
qued	8100f1e7e2	chore: process chipper hierarchy (#1634 ) PR to support schema changes introduced from [PR 232](https://github.com/Unstructured-IO/unstructured-inference/pull/232) in `unstructured-inference`. Specifically what needs to be supported is: * Change to the way `LayoutElement` from `unstructured-inference` is structured, specifically that this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates. * Removal of `LocationlessLayoutElement` since chipper now exports bounding boxes, and if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional. * Getting hierarchy data directly from the inference elements rather than in post-processing * Don't try to reorder elements received from chipper v2, as they should already be ordered. #### Testing: The following demonstrates that the new version of chipper is inferring hierarchy. ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper") children = [el for el in elements if el.metadata.parent_id is not None] print(children) ``` Also verify that running the traditional `hi_res` gives different results: ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res") ``` --------- Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-10-13 01:28:46 +00:00
Ahmet Melek	94836cfad4	feat: add file-based access permissions for SharePoint ingest (#1628 ) This PR: - defines rbac_data as a SourceMetadata field, - manages connections to an external api for obtaining rbac data with ConnectorRBAC class, - serializes rbac data and saves it to the disk, - matches the rbac_data in the disk to each IngestDoc, using a common field, - forwards rbac data to Elements, via the partition() function To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the relevant rbac & connector credentials --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-10-13 00:38:08 +00:00
shreyanid	3ec3673d34	feat: staging function to extract element text into one string (#1741 ) ### Summary In order to enable larger scale testing of the new text extraction metrics, create a helper function to get the clean, concatenated text (CCT) from partitioned elements. ### Test Partition any file, then pass the resulting elements into the new `elements_to_text` function. Can test getting the output as string or as text file. ``` from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_text elements = partition(filename="example-docs/chevron-page.pdf", strategy="hi_res") elements_text = elements_to_text(elements, "output-text-file.txt") print(elements_text) ```	2023-10-12 23:59:16 +00:00
ryannikolaidis	40523061ca	fix: _add_embeddings_to_elements bug resulting in duplicated elements (#1719 ) Currently when the OpenAIEmbeddingEncoder adds embeddings to Elements in `_add_embeddings_to_elements` it overwrites each Element's `to_dict` method, mistakenly resulting in each Element having identical values with the exception of the actual embedding value. This was due to the way it leverages a nested `new_to_dict` method to overwrite. Instead, this updates the original definition of Element itself to accommodate the `embeddings` field when available. This also adds a test to validate that values are not duplicated.	2023-10-12 21:47:32 +00:00
Roman Isecke	ebf0722dcc	roman/ingest continue on error (#1736 ) ### Description Add flag to raise an error on failure but default to only log it and continue with other docs	2023-10-12 21:33:10 +00:00
ryannikolaidis	d22044a44c	fix: unstructured-ingest embedding KeyError (#1727 ) Currently adding the embedding flag to any unstructured-ingest call results in this failure: ``` 2023-10-11 22:42:14,177 MainProcess ERROR 'b8a98c5d963a9dd75847a8f110cbf7c9' multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(args, kwds)) File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash] File "<string>", line 2, in __getitem__ File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod raise convert_to_error(kind, result) KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9' """ ``` This is because the run method for the embedding node is not adding the IngestDoc to the context map. This PR adds that logic and adds a test to validate that the embeddings option works as expected. NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719 goes in, the expected results include the duplicate element bug, however currently this does at least prove that embeddings are generated and the function doesn't error.	2023-10-12 20:27:30 +00:00
Steve Canny	d726963e42	serde tests round-trip through JSON (#1681 ) Each partitioner has a test like `test_partition_x_with_json()`. What these do is serialize the elements produced by the partitioner to JSON, then read them back in from JSON and compare the before and after elements. Because our element equality (`Element.__eq__()`) is shallow, this doesn't tell us a lot, but if we take it one more step, like `List[Element] -> JSON -> List[Element] -> JSON` and then compare the JSON, it gives us some confidence that the serialized elements can be "re-hydrated" without losing any information. This actually showed up a few problems, all in the serialization/deserialization (serde) code that all elements share.	2023-10-12 19:47:55 +00:00
Yuming Long	cb247d8cc4	doc: update comment for ingest test pdf-fast-reprocess (#1733 )	2023-10-12 17:33:25 +00:00
Roman Isecke	22e568cf64	roman/bugfix fix default language ingest option (#1729 ) ### Description Set language to None by default. Update ingest test to use local file used in language unit tests to validate. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-12 17:31:23 +00:00
Roman Isecke	9b5d5e0f9e	roman/cli infer table arg (#1685 ) ### Description Add new parameter to map to `skip_infer_table_types` partition arg. Applies to partition config which is set on all connectors. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-12 16:14:53 +00:00
Roman Isecke	35852bba83	roman/bugfix ingest pipeline reformat dir (#1717 ) Closes #1724 ### Description Add file needed to make that dir discoverable	2023-10-12 15:44:30 +00:00
Trevor Bossert	9864086bc8	Drop patch version of python in Scarf anonymous analytics (#1725 )	2023-10-12 01:36:47 +00:00
Trevor Bossert	569561e59b	Add more params for scarf (#1720 ) Allow to slice on further metrics	2023-10-12 00:09:19 +00:00
Inscore	8ab40c20c1	fix: correct PDF list item parsing (#1693 ) The current implementation removes elements from the beginning of the element list and duplicates the list items --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: yuming <305248291@qq.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-10-11 20:38:36 +00:00
Trevor Bossert	6acd06987b	Remove extra index url from docs (#1711 ) It’s no longer required to specify the extra index url as we utilize a different method of gathering install anonymous analytics.	2023-10-11 19:34:49 +00:00
Trevor Bossert	f0a63e2712	Add basic call to scarf to get anonymous analytics (#1705 ) There is a built in option to not send data by setting an env var, SCARF_NO_ANALYTICS=true. DoD: - When importing or running unstructured package it will make a get call to scarf - When env variable is set to not track, call is not made 0.10.21	2023-10-11 09:15:36 -07:00
John	9500d04791	detect document language across all partitioners (#1627 ) ### Summary Closes #1534 and #1535 Detects document language using `langdetect` package. Creates new kwargs for user to set the document language (`languages`) or detect the language at the element level instead of the default document level (`detect_language_per_element`) --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Austin Walker <austin@unstructured.io> 0.10.20	2023-10-11 01:47:56 +00:00
Klaijan	ee75ce25e2	feat: element type frequency (#1688 ) Executive Summary Add function that returns frequency of given element types and depth. --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-11 00:36:44 +00:00
rvztz	7fd61e3a7f	Adds data source properties to git connectors (#1280 ) Adds data source properties to git connectors: - date_created - date_modified - version - record_locator These properties are instantiated when supported by the connector. Separates the logic between fetching the file from source and `get_file`. Retrieves file metadata when any of the properties are called. Adds logic to check if file exists in the remote source. For connectors that don't directly support it, adds exception handling to check any issues while retrieving the file. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rvztz <rvztz@users.noreply.github.com>	2023-10-10 22:56:56 +00:00
shreyanid	9d228c7ecb	feat: calculate metric for percent of text missing (#1701 ) ### Summary Missing text is a particularly important metric of quality for the Unstructured library because information from the document is not being captured and therefore not usable by downstream applications. Add function to calculate the percent of text missing relative to the source transcription. Function takes 2 text strings (output and source) as input, and returns the percentage of text missing as a decimal. ### Technical Details - The 2 input strings are both assumed to already contain clean and concatenated text (CCT) - Implementation compares the bags of words (frequency counts for each word present in the text) of each input text - Duplicated/extra text is not penalized - Value is limited to the range [0, 1] ### Test - Several edge cases are covered in the test function (missing text, duplicated text, spaced out words, etc). - Can test other cases or text inputs by calling the function with 2 CCT strings as "output" and "source"	2023-10-10 20:54:49 +00:00
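The bag-of-words comparison described in the commit above can be sketched with `collections.Counter`. This is an illustration, not the library's function: the name `percent_missing`, whitespace tokenization, and the empty-source behavior are assumptions.

```python
from collections import Counter


def percent_missing(output: str, source: str) -> float:
    """Fraction of source words absent from the output (hypothetical helper).

    Both strings are assumed to be clean, concatenated text (CCT).
    """
    output_bag = Counter(output.split())
    source_bag = Counter(source.split())
    total = sum(source_bag.values())
    if total == 0:
        # Nothing to miss when the source is empty (assumed behavior).
        return 0.0
    # Counter subtraction keeps only positive counts, so duplicated or
    # extra words in the output are not penalized.
    missing = sum((source_bag - output_bag).values())
    return min(1.0, max(0.0, missing / total))


print(percent_missing("the cat sat", "the cat sat on the mat"))
print(percent_missing("the the the cat sat on the mat mat", "the cat sat on the mat"))
```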
Yuming Long	e597ec7a0f	Fix: skip empty annotation bbox (#1665 ) Address: https://github.com/Unstructured-IO/unstructured/issues/1663 ## Summary To measure the overlap between an element bbox and an annotation bbox, we find the intersection of the two bboxes and divide it by the size of the annotation bbox. This causes a zero division error if the size of the annotation bbox is 0. * this PR fixes the zero division error in function `check_annotations_within_element` * also fixes the error `TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'` by no longer inserting empty words with a None bbox into the list of words in function `get_word_bounding_box_from_element` ## Test Reproduce with the code and document the user mentioned; you should see no error: ``` from unstructured.partition.auto import partition elements = partition( filename="./IZSAM8.2_221012.pdf", strategy="fast", ) ```	2023-10-10 20:48:44 +00:00
Mallori Harrell	a5d7ae4611	Feat: Bag of words for testing metric (#1650 ) This PR adds the `bag_of_words` function to count the frequency of words for evaluation. Testing ```Python from unstructured.cleaners.core import bag_of_words string = "The dog loved the cat, but the cat loved the cow." print(bag_of_words(string)) ``` --------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local> Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com> Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-10-10 18:46:01 +00:00
Roman Isecke	b38a6b3022	feat: add Notion connector retry strategy (#1492 ) ### Description Adds a retry strategy to the Notion HTTP calls, leveraging a generic backoff library with some tweaks to pass in values from the CLI.	2023-10-10 17:41:18 +00:00
Klaijan	1d80beaaf2	fix: initialize uri before try-except (#1690 ) Fix github issue https://github.com/Unstructured-IO/unstructured/issues/1686	2023-10-10 17:29:10 +00:00
ryannikolaidis	3e101d3e4f	build(test): skip full python matrix for most ingest tests (#1687 ) We’re probably unfairly (to the test) making a large volume of new connections and requests to test services when all of our ingest tests run across the full python test matrix and when a lot of PRs are firing at once. Let's limit the full matrix run to a select few, but still have all ingest tests run on python v3.10. This is done by checking the version and skipping in ingest-test.sh. Bonus: Bumps ingest test fixture workflow to use 3.10. This technically shouldn't make a difference, but since we're making 3.10 the default of the matrix strategy, it probably makes sense to use 3.10 for the ingest fixture generation as well for consistency. ## Testing - [example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537900978?pr=1687) running all tests in 3.10 - [example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537899999?pr=1687) skipping/running the expected tests in 3.8	2023-10-10 16:39:34 +00:00
Dev Khant	f09b87da23	Doc : replace link `upstream connectors` with `source connectors` (#1683 ) Fixes #1502 Here I have replaced `stream_connectors.html` with `source_connectors.html`.	2023-10-09 21:37:51 -07:00
Amanda Cameron	f98d5e65ca	chore: adding max_characters to other element type chunking (#1673 ) This PR adds the `max_characters` (hard max) param to non-table element chunking. Additionally updates the `num_characters` metadata to `max_characters` to make it clearer which param we're referencing. To test: ``` from unstructured.partition.html import partition_html filename = "example-docs/example-10k-1p.html" chunk_elements = partition_html( filename, chunking_strategy="by_title", combine_text_under_n_chars=0, new_after_n_chars=50, max_characters=100, ) for chunk in chunk_elements: print(len(chunk.text)) # previously we were only respecting the "soft max" (default of 500) for elements other than tables # now we should see that all the elements have text fields under 100 chars. ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-09 19:42:36 +00:00
David Potter	8b93217a33	built(test): exclude version metadata from google drive test (#1682 )	2023-10-07 19:34:32 -07:00
cragwolfe	46cb1b642a	chore: don't cleanup ingest test outputs (non-CI) (#1680 ) When running test-ingest test fixtures locally (but not in CI), keep output .json's and other workdir artifacts around for the convenience of debugging. Test Instructions Run bash -x ./test_unstructured_ingest/test-ingest-azure.sh and witness output .json's are visible. Yay! Now, to instead clean up output .json's and workdir, run: UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1 bash -x ./test_unstructured_ingest/test-ingest-azure.sh and witness the files have been cleaned up. Yay!	2023-10-07 02:18:37 +00:00
Klaijan	33edbf84f5	feat: add calculate edit distance feature (#1656 ) Executive Summary Adds a function to calculate the edit distance (Levenshtein distance) between two strings. The function can return either: 1. score (similarity = 1 - distance/source_len) 2. distance (raw Levenshtein distance) Technical details - The `weights` param defaults to (2,1,1) for (insertion, deletion, substitution), meaning that we penalize the insertions needed to turn the output (target) into the source (reference). In other words, missing extraction is penalized more heavily. - The function takes 2 strings on the assumption that both strings are already clean and concatenated (CCT) Important Note! The test case needs to be updated to use CCT once the function is ready. For now it only tests the "functionality" of edit distance, not the edit distance with CCT as it is intended to be used. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:21:14 +00:00
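The weighted distance and the similarity score from the commit above can be sketched with a standard dynamic-programming table. This is an illustration under assumptions, not the library's implementation: the function names, the reading of "insertion" as a source character missing from the output, and the clamping of the score at 0 are all choices made here.

```python
from typing import Tuple


def weighted_edit_distance(output: str, source: str, weights: Tuple[int, int, int] = (2, 1, 1)) -> int:
    """Cost of editing `output` into `source` with (insertion, deletion, substitution) weights."""
    insertion, deletion, substitution = weights
    # previous[j] holds the cost of turning the processed output prefix into source[:j].
    previous = [j * insertion for j in range(len(source) + 1)]
    for i, out_char in enumerate(output, start=1):
        current = [i * deletion]
        for j, src_char in enumerate(source, start=1):
            current.append(min(
                previous[j] + deletion,        # drop an extra output character
                current[j - 1] + insertion,    # add a character missing from the output
                previous[j - 1] + (0 if out_char == src_char else substitution),
            ))
        previous = current
    return previous[-1]


def similarity_score(output: str, source: str) -> float:
    """similarity = 1 - distance/source_len, clamped at 0 (the clamp is an assumption)."""
    if not source:
        return 1.0 if not output else 0.0
    return max(0.0, 1 - weighted_edit_distance(output, source) / len(source))


print(weighted_edit_distance("cat", "cart"))
print(weighted_edit_distance("cart", "cat"))
print(similarity_score("cat", "cart"))
```

With the default weights, a character missing from the output costs twice as much as an extra one, which matches the stated intent of penalizing missed extraction more heavily.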
Trevor Bossert	ce206f1f85	add extra-index-url for scarf anonymous tracking (#1668 ) This adds extra-index-url to our docs to allow for anonymous install analytics to help us understand and improve our product. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:16:38 +00:00
Jack Retterer	7e310ecac2	Update Getting Started Guide in Documentation (#1667 ) - Fixed typo that stated "infer_table_structured" instead of "infer_table_structure" Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-07 01:12:52 +00:00
Yuming Long	dcd6d0ff67	Refactor: support entire page OCR with `ocr_mode` and `ocr_languages` (#1579 ) ## Summary Second part of the OCR refactor to move it from the inference repo to the unstructured repo; the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/231. This PR adds OCR processing logic for entire page OCR and supports two OCR modes, "entire_page" or "individual_blocks". The updated workflow for `hi_res` partition: * pass the document as data/filename to the inference repo to get `inferred_layout` (DocumentLayout) * pass the document as data/filename to the OCR module, which first opens the document (creating a temp file/dir as needed) and splits the document by pages (converting PDF pages to image pages for a PDF file) * if ocr mode is `"entire_page"` * OCR the entire image * merge the OCR layout with the inferred page layout * if ocr mode is `"individual_blocks"` * from the inferred page layout, find elements with no extracted text and crop the entire image by the bboxes of those elements * replace each empty text element with the text obtained from OCR of the cropped image * return all merged PageLayouts and form a DocumentLayout for later processing This PR also bumps `unstructured-inference==0.7.2` since the branch relies on the OCR refactor from unstructured-inference. ## Test ``` from unstructured.partition.auto import partition entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res") individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res") print([el.text for el in entire_page_ocr_mode_elements]) print([el.text for el in individual_blocks_ocr_mode_elements]) ``` latest output: ``` # entire_page ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. 
Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM\|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고'] # individual_blocks ['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM\| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email'] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-06 22:54:49 +00:00
Roman Isecke	2e1404e02c	refactor: unstructured ingest as a pipeline (#1551 ) ### Description As we add more and more steps to the pipeline (i.e. chunking, embedding, table manipulation), it would help to separate the responsibility of each of these into its own process, running each in parallel and using json files to share data across them. This will also help guarantee data is serializable if this code was used in an actual pipeline. Following is a flow diagram of the proposed changes. As part of this change: * A parent pipeline class will be responsible for running each `node`, which can optionally be run via multiprocessing if it supports it, or not. Possible nodes at this moment: * Doc factory: creates all the ingest docs via the source connector * Source: reads/downloads all of the content to process to the local filesystem to the location set by the `download_dir` parameter. * Partition: runs partition on all of the downloaded content in json format. * Any number of reformat nodes that modify the partitioned content. This can include chunking, embedding, etc. * Write: push the final json into the destination via the destination connector * This pipeline relies on the information of the ingest docs to be available via their serialization. An optimization was introduced with the `IngestDocJsonMixin` which adds in all the `@property` fields to the serialized json already being created via the `DataClassJsonMixin` * For all intermediate steps (partitioning, reformatting), the content is saved to a dedicated location on the local filesystem. Right now it's set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`. * Minor changes: made sense to move some of the config parameters between the read and partition configs when I explicitly divided the responsibility to download vs partition the content in the pipeline. * The pipeline class only makes the doc factory, source and partition nodes required, keeping with the logic that has been supported so far. 
All reformatting nodes and write node are optional. * Long term, there should also be some changes to the base configs supported by the CLI to support pipeline specific configs, but for now what exists was used to minimize changes in this PR. * Final step to copy the final output to the location designated by the `_output_filename` value of the ingest doc. * Hashing occurs at each step by hashing the parameters of that step (i.e. partition configs) along with the previous step via the filename used. This allows each step to be the same _if_ all the parameters for it have not changed and the content so far is the same. * The only data that is shared and written to across processes is the dictionary of ingest json data. This dict is created using the `multiprocessing.managers.DictProxy` to make sure any interaction with it is behind a lock. ### Minor refactors included: * Utility methods added to extract configs from the click options * Utility method to add common options to click commands. * All writers moved to using the class approach which extracts a lot of the common code so there's less copy-paste when new runners are added. * Use `@property` for source metadata on base ingest doc to add logic to call `update_source_metadata` if it's still `None` at the time it's fetched. ### Additional bug fixes included * Fsspec connectors were not serializable due to the `ingest_doc_cls`. This was removed from the fields captured by the `@dataclass` decorator and added in a `__post_init__` method. * Various reddit connector params were missing. This doesn't have an explicit ingest test at the moment so was never caught. * Fsspec connector had the parent `update_source_metadata` misnamed as `update_source_metadata_metadata` so it was never being called. ### Flow Diagram ![ingest_pipeline](https://github.com/Unstructured-IO/unstructured/assets/136338424/be485606-cfe0-4931-8b81-c2bf569cf1e2)	2023-10-06 18:49:29 +00:00
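The per-step hashing described in the commit above can be sketched as follows. This is a minimal illustration, not the pipeline's code: `step_hash`, its parameters, the truncation to 32 characters, and the choice of SHA-256 over a sorted JSON payload are all assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def step_hash(step_params: Dict[str, Any], previous_filename: str) -> str:
    """Derive a stable identifier from a step's parameters and the previous
    step's output filename (hypothetical helper)."""
    # sort_keys makes the payload independent of dictionary ordering.
    payload = json.dumps({"params": step_params, "previous": previous_filename}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Illustrative values only; the real configs and filenames differ.
first = step_hash({"strategy": "fast"}, "abc.json")
print(first == step_hash({"strategy": "fast"}, "abc.json"))
print(first == step_hash({"strategy": "hi_res"}, "abc.json"))
```

Because the previous step's filename is itself derived from a hash, a change anywhere upstream produces a new identifier for every later step, while unchanged steps keep the same one.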
qued	b9fa20ab46	fix: isolate metadata imports to doctype (#1671 ) In a different PR, some no-extras tests started failing with import errors when something innocuous was imported from `unstructured.file_utils.metadata`. This turned out to be because of the top-level, doctype-specific imports in that file. Importing a general metadata object shouldn't require installation of modules like `PIL`, `docx`, and `openpyxl`. To fix, I moved these functions to be imported inside the functions that use them, and added the `requires_dependencies` decorator to the functions. #### Testing: You should be able to run something like: ```python from unstructured.file_utils.metadata import Metadata ``` Without `openpyxl` installed.	2023-10-06 18:49:03 +00:00
John	6b7fe4469f	add docs with multiple languages for testing (#1591 ) ### Summary Closes #1536 Adds .txt documents for testing language detection.	2023-10-06 18:41:40 +00:00
Christine Straub	5d14a2aea0	feat: shrink bboxes by top left (#1633 ) Closes #1573. ### Summary - update `shrink_bbox()` to keep top left rather than center ### Evaluation Run the following command for this [PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf). ``` PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-10-06 05:16:11 +00:00
ryannikolaidis	1e32da6389	build: fix merge queue issues (#1654 ) Closes #1482 There are two known issues when attempting to merge PRs via merge queue: - CodeQL fails with: ``` Error: ref 'refs/heads/gh-readonly-queue/main/pr-968-499f37f64b27c66d4fc68446dbea519860d06cf7' not found in this repository ``` - CI.changelog fails with: ``` Get current git ref Error: The process '/usr/bin/git' failed with exit code 128 ``` The error with CodeQL is a known and still [open issue](https://github.com/github/codeql-action/issues/1572). We don't currently enforce branch protection for CodeQL, so probably our best compromise is to simply not run this on the merge queue event. There could be a narrow margin where some issue is introduced via merge, but we'll still see issues on individual branches and on pushes to main, so this is probably acceptable. The changelog job now has a checkout step prior to paths-filter which guarantees the git ref exists before attempting to execute the filter action. ## Testing Prior to this change, I was able to validate both the [CodeQL](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414128010) and [changelog](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414128007/job/17414065768) test errors. With these changes, validated that the merge queue was able to [successfully run](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414511843/job/17415024319) the changelog CI job.	2023-10-05 21:58:39 +00:00
Benjamin Torres	e0201e9a11	feat/add sources from unstructured inference (#1538 ) This PR adds support for the `source` property from `unstructured_inference`, allowing the user to see the origin of the data under the `detection_origin` field when the environment variable UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is set. To try this feature, you can use this code: ``` from unstructured.partition.pdf import partition_pdf_or_image yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox') sources = [e.detection_origin for e in yolox_elements] print(sources) ``` This will print 'yolox' as the source for all the elements.	2023-10-05 20:26:47 +00:00

1109 Commits