**Summary**
Prepare auto-partitioning for pluggable partitioners.
Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.
**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.
In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when an argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to the partitioner functions because they already do
precisely this.
So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, **kwargs) ->
list[Element]`, they can be dispatched to without customization.
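A minimal sketch of the pattern this enables (the `partition_foo` name
and the registry are hypothetical, not the actual implementation):
```python3
from typing import IO, Optional

from unstructured.documents.elements import Element


def partition_foo(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    # Self-selects the arguments it is sensitive to, applies its own
    # defaults, and simply ignores anything else in kwargs.
    languages = kwargs.get("languages") or ["eng"]
    ...


# Dispatch no longer needs to know any partitioner's parameter list:
# elements = registry[detected_filetype](filename=filename, file=file, **kwargs)
```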
I noticed the ipv4 regex is wrong (it only captures one- or two-digit
octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it.
If you wish I can break out the ipv4 test to its own case, so we don't
interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction
test.
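For reference, a corrected pattern might look like the following
(illustrative only, not necessarily the exact pattern committed):
```python3
import re

# Match full 0-255 octets rather than only one- or two-digit octets.
IPV4_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_PATTERN = rf"{IPV4_OCTET}(?:\.{IPV4_OCTET}){{3}}"

assert re.fullmatch(IPV4_PATTERN, "192.168.100.254")  # three-digit octets now match
assert re.fullmatch(IPV4_PATTERN, "999.1.1.1") is None
```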
Side note: The comment at `unstructured/nlp/patterns.py#95` includes a
bad ipv4 address example (the last octet is wrongly left-padded with a
zero). I left it as is because I'm not sure whether the intention is to
include "non-conventional" ipv4 addresses, like octal or hexadecimal
octets.
**Summary**
Relax the table-segregation rule applied during chunking such that
`Table` and `Text`-subtype elements can be combined into a single chunk
when the chunking window allows.
**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.
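For illustration, a chunk produced with a large window can now hold a
table plus its surrounding narrative (the document path here is
hypothetical):
```python3
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition("example-docs/report-with-table.pdf")  # hypothetical file
chunks = chunk_by_title(elements, max_characters=2000)
# Before this change, a chunk containing a Table contained nothing else.
# Now a Table and adjacent Text-subtype elements can share a chunk when
# they fit together within the 2000-character window.
```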
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
- per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under metrics/evaluate.py that
incorrectly retrieves the file extension from paths like '*.pdf.txt'
before the conversion to the cct file (see the screenshot below)
- the current status is shown in the top example
- the correct version is shown in the bottom example of the
screenshot.

- in addition, I also observed that the doctypes returned are not
aligned: some return with the dot ('.*') and some without
- therefore, I aligned them all to output the same version, which is
'.*' (a minimal sketch of the extension fix appears below)
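A minimal sketch of the fix described above, assuming the doubled
suffix produced by the cct conversion:
```python3
from pathlib import Path

path = Path("example.pdf.txt")  # cct conversion appends '.txt'
wrong_doctype = path.suffix             # '.txt' -- current behavior
right_doctype = Path(path.stem).suffix  # '.pdf' -- desired behavior
```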
Script to render HTML from unstructured elements.
NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs that have
non-empty `metadata.text_as_html`.
TODO: It was noted that the `unstructured_elements_to_ontology` func
always returns a single page.
This script uses helper functions to handle multiple pages. I am not
sure whether this was intended or it is a bug; if it is a bug it would
require a bit longer debugging, so to make the script usable fast I used
workarounds.
Usage: test with any outputs with non-empty `metadata.text_as_html`.
Example files attached.
[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)
[Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)
This PR uses a weighted average (weighted by the number of actual
tables) instead of an unweighted average for table metrics.
- for pages with ground truth tables, the weight is proportional to the
number of ground truth tables on that page
- pages that have no ground truth tables but have predicted tables
(false positives) are assigned one table's worth of weight for the whole
page when calculating the mean value of `table_level_acc` (see the
sketch after this list)
- pages with false positive tables do not contribute to table structural
or table content metrics
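A minimal sketch of the weighting scheme just described (variable names
are illustrative, not the actual code in metrics/evaluate.py):
```python3
page_scores = [0.9, 0.5, 0.0]  # per-page table_level_acc
gt_tables = [2, 1, 0]          # ground-truth tables per page; last page is FP-only

# A false-positive-only page gets one table's worth of weight.
weights = [n if n > 0 else 1 for n in gt_tables]
weighted_acc = sum(s * w for s, w in zip(page_scores, weights)) / sum(weights)
# (0.9*2 + 0.5*1 + 0.0*1) / 4 = 0.575
```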
## Test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
This simplest solution doesn't drop HTML from metadata when merging
Elements from HTML input. We still need to address how to handle nested
elements, whether we want `LayoutElements` in the metadata of Composite
Elements, and a unit test showing the current behavior.
Note: metadata still contains `orig_elements` which has all the
metadata.
This PR adds support for link extraction in the pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.
### Summary
- Added functionality to support link extraction in the hi_res flow
- Enhanced the word extraction functionality used for link extraction in
both the `fast` and `hi_res` flows, resulting in more correct
`start_index` and `text` in the `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test
### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
- the "value" attribute from the `<input/>` tag will be taken into
account and processed as "text" in the ontology
- tables will now be parsed without any ids and classes; we have
different reasons for that: for example, embeddings with ids and classes
can lose some semantic value, and more tokens mean a more expensive LLM
call
- cleaned `to_html`, created `to_text` for `OntologyElement`
This ticket ensures that the CCT metric will not be sensitive to
differences in whitespace (including newlines).
All whitespace in the strings is changed to a single space `" "` in both
GT and PRED before the metric is computed.
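The normalization amounts to something like this (illustrative sketch,
not the exact code):
```python3
import re


def normalize_whitespace(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", text)


assert normalize_whitespace("a\n b\t\tc") == "a b c"
```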
Additional changes in CHANGELOG due to auto-formatting.
> This is a POC change; not everything works correctly and code
quality could be improved significantly
This ticket adds parsing of HTML to unstructured elements and back. How
does it work?
HTML has a tree structure, while Unstructured Elements form a list.
The HTML structure is traversed in DFS order, creating Elements and
adding them to the list, so the reading order from the HTML is
preserved. To be able to compose the tree again, all elements have IDs,
and `metadata.parent_id` is leveraged
How is the HTML preserved if there are 'layout' elements without text,
or deeply nested HTML that is just text from the point of view of an
Unstructured Element?
Each element is parsed back to HTML using the `metadata.text_as_html`
field. For layout elements only the html_tag is there; for long text
elements there is everything required to recreate the HTML. You can see
examples in the unit tests or the .json file I attached.
Pros of the solution:
- Nothing had to be changed in the element types

Cons:
- There are elements without text, which may be confusing (they could be
replaced by some special type)
Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`
Known bugs (they are minor):
- sometimes the html tag is changed incorrectly
- `metadata.category_depth` and `metadata.page_number` are not set
- a page break is not added between pages
How to test. Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)
    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()
    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))
    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached example doc before and after running these scripts
[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
This PR:
- adds parameters to control the retry-mechanism behaviour for
`partition_via_api` (an illustrative call follows this list):
```
retries_initial_interval: Optional[int] = None,
retries_max_interval: Optional[int] = None,
retries_exponent: Optional[float] = None,
retries_max_elapsed_time: Optional[int] = None,
retries_connection_errors: Optional[bool] = None,
```
- adds tests that check they are applied according to the defaults
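An illustrative call using the new parameters (the numeric values are
made up, and their units follow the underlying SDK's retry
configuration):
```python3
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/pdf/embedded-link.pdf",
    retries_initial_interval=500,
    retries_max_interval=60_000,
    retries_exponent=1.5,
    retries_max_elapsed_time=900_000,
    retries_connection_errors=True,
)
```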
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in
the cache.
A few document ingest results changed:
- two places for `biomed-api` (actually processed locally on the runner)
are due to very small changes in the numerical results of the bounding
box areas: one results in a duplicated page number/header and another
results in deduplication of a word in a sentence that starts on a new
line (yes, the two cases go in opposite directions)
- the layout parser paper now outputs the code lines, with the page
number inside the code box, as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR rounds the floating-point numbers associated with coordinates in
`pdfminer_processing.py`. This helps eliminate machine-precision-caused
randomness in bounding box overlap detection. Currently the rounding is
set to the nearest machine precision obtained from `np.finfo(float)`
(i.e., `np.float64`), which yields resolution = `1e-15`.
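A sketch of the effect (not the exact code in `pdfminer_processing.py`):
```python3
import numpy as np

DECIMALS = 15  # np.finfo(float).resolution == 1e-15

x = np.float64(0.1) + np.float64(0.2)  # 0.30000000000000004: machine-precision noise
assert x != 0.3
assert np.round(x, DECIMALS) == 0.3  # rounding removes the noise before comparison
```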
## Future work
We should reduce the rounding to only 6 digits after the decimal point,
since the data type `float32` has a resolution of only `1e-6`. However,
that would break tests. A follow-up is required to tune the threshold
values in `pdfminer_processing.py` so that they work with `1e-6`
resolution.