unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-02 19:13:13 +00:00

Author	SHA1	Message	Date
Steve Canny	9ece0b5ad2	fix: improve false-positive Title elements on Chinese text (#3836 ) Summary Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements. Fixes #3084 --------- Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2024-12-18 01:16:42 +00:00
Ribhu Lahiri	9a9bf4c4f5	Added contributing from archived repo (#3616 ) Added `CONTRIBUTING.md` from the archived repo as mentioned in the issue: https://github.com/Unstructured-IO/unstructured/issues/3540 Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>	2024-12-17 02:53:17 +00:00
Steve Canny	b5ff79d8db	fix: refine filetype detection (#3828 ) Summary Fixes a bug where a CSV file with asserted content-type `application/vnd.ms-excel` was incorrectly identified as an XLS file and failed partitioning. Additional Context The `content_type` argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the `.csv -> XLS` mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping. XLS files as well as seven other common binary file types can be efficiently detected 100% of the time (at least 99.999%) using code we already have in the file detector. - Promote this direct-inspection strategy to be tried first. - When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type. - When one of those types is NOT detected, clear the asserted `content_type` when it matches any of those types. This prevents the problem seen in the bug where the asserted content type was used to determine the file-type. - The remaining content_type, guess MIME-type, and filename-extension mapping strategies are tried, in that order, only when direct inspection fails. This is largely the same as it was before. - Fix #3781 while we were in the neighborhood. - Fix #3596 as well, essentially an earlier report of #3781.	2024-12-17 00:56:21 +00:00
Steve Canny	10f0d54ac2	build: remove ruff version upper bound (#3829 ) Summary Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.	2024-12-16 23:01:22 +00:00
Steve Canny	b092fb7f47	fix: add .grype.yaml (#3834 ) Summary CVE-2024-11053 https://curl.se/docs/CVE-2024-11053.html (severity Low) was published on Dec 11, 2024 and began failing CI builds on open-core on Dec 13, 2024 when it appeared in `grype` apparently misclassified as a critical vulnerability. The severity reported on the CVE is "Low" so it should not fail builds. Add a `.grype.yaml` file to ignore this CVE until grype is updated.	2024-12-16 19:39:55 +00:00
Steve Canny	3b718ec89a	rfctr: prep for pluggable partitioners (#3806 ) Summary Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature in `auto/partition()` such that a custom or override partitioner can be registered without requiring code changes. Additional Context The central job of `auto/partition()` is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. `partition_pdf()` or `partition_docx()`. In the existing code, each partitioner function is called with parameters "hand-picked" from the available parameters passed to the `partition()` function. This is unnecessary and couples those partitioners tightly with the dispatch function. The desired state is that all available arguments are passed as `kwargs` and the partitioner function "self-selects" the arguments it will be sensitive to, applies its own appropriate default values when the argument is omitted, and simply ignores any arguments it doesn't use. Note that achieving this requires no changes to partitioner functions because they already do precisely this. So the job is to pass all arguments (other than `filename` and `file`) to the partitioner as `kwargs`. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because as long as they have the signature `partition_x(filename, file, kwargs) -> list[Element]` then they can be dispatched to without customization.	2024-12-10 20:44:34 +00:00
Steve Canny	b981d7197f	release: prepare release 0.16.11 (#3819 ) Release only, no code changes. 0.16.11	2024-12-09 23:48:00 +00:00
Magnus F	1e2da6df46	fix: ipv4 address regex (#3808 ) I noticed the ipv4 regex is wrong (it only captures one- or two-digit octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it. If you wish, I can break out the ipv4 test into its own case so we don't interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction test. Side note: The comment at `unstructured/nlp/patterns.py#95` includes a bad ipv4 address example (the last octet is wrongly left-padded with a zero). I left it as it is because I'm not sure if the intention is to include "non-conventional" ipv4 addresses, like octal or hexadecimal octets.	2024-12-09 14:19:13 -08:00
Steve Canny	4379d883a3	chunk: relax table segregation during chunking (#3812 ) Summary Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows. Additional Context Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-09 18:57:22 +00:00
Christine Straub	18d6c81c47	Update CHANGELOG.md (#3818 )	2024-12-09 14:47:53 +00:00
Christine Straub	2f06d5a2a2	test: fix lint error	2024-12-07 19:51:43 -08:00
Christine Straub	9076d56d9f	fix: resolve merge conflict error	2024-12-07 19:40:11 -08:00
Tracy Shen	59e6cff611	release 0.16.10 (#3816 ) release 0.16.10 so that competitor-eval can install and take advantage of the latest change in the metric calculation 0.16.10	2024-12-07 17:24:45 +00:00
Tracy Shen	8c58bc57db	fix doctype parsing error (#3811 ) - Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551), there is a bug in the `unstructured` lib under metrics/evaluate.py that incorrectly retrieves the file extension the document had before its conversion to a cct file from paths like '.pdf.txt' (see the screenshot below). - The top example in the screenshot shows the current behavior; the bottom example shows the correct version. ![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d) - In addition, the returned doctypes are not consistent: some are returned with the leading '.' and some without it. - Therefore, they are now aligned to a single output form, '.*'.	2024-12-06 23:55:01 +00:00
Christine Straub	3bca724624	feat: update standardize_quote()	2024-12-05 13:24:57 -08:00
Christine Straub	ef1c85ef0f	feat: enhance quote standardization tests with additional Unicode scenarios	2024-12-05 11:35:19 -08:00
Christine Straub	7d06c120dc	Merge branch 'main' into ML-593/quote-standardization	2024-12-05 10:27:26 -08:00
Christine Straub	a4be1d6100	fix: modify changelog.md	2024-12-05 10:17:08 -08:00
Christine Straub	f44862a5ea	fix: conflict error	2024-12-05 09:56:52 -08:00
Christine Straub	00a999153c	fix: conflict error	2024-12-05 09:54:36 -08:00
Christine Straub	5e60942960	test: fix lint errors	2024-12-05 09:40:50 -08:00
Marianna	4140f625d0	add script to render html from unstructured elements (#3799 ) Script to render HTML from unstructured elements. NOTE: This script is not intended to be used as a module. NOTE: This script is only intended to be used with outputs with non-empty `metadata.text_as_html`. TODO: It was noted that the `unstructured_elements_to_ontology` func always returns a single page. This script uses helper functions to handle multiple pages. I am not sure if this was intended or if it is a bug; if it is a bug, it would require a bit more debugging, so to make the script usable quickly I used workarounds. Usage: test with any outputs with non-empty `metadata.text_as_html`. Example files attached. [Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json) [Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)	2024-12-04 19:46:51 -08:00
Christine Straub	bb26603a30	test: enhance quote standardization tests with additional Unicode scenarios	2024-12-04 13:16:00 -08:00
Christine Straub	c0c3fd673f	test: enhance quote standardization tests with additional Unicode scenarios	2024-12-04 13:02:07 -08:00
Christine Straub	9038b88b8e	fix: ensure newline at end of file in standardize_quotes function	2024-12-04 12:48:15 -08:00
Christine Straub	c821f12d29	test: update string tests for consistent quote handling	2024-12-04 12:17:16 -08:00
Christine Straub	4e0f7cdbc0	Feat: enhance quote standardization with comprehensive Unicode coverage and update tests	2024-12-04 11:33:03 -08:00
Christine Straub	371cb7528d	Feat: add quote standardization and update edit distance calculation	2024-12-03 21:21:39 -08:00
Nathan Van Gheem	0fb814db61	Use native nltk download (#3796 ) This PR changes how we download NLTK data to use the native nltk downloader. We had moved to our own hosted NLTK dataset because of this CVE: https://nvd.nist.gov/vuln/detail/CVE-2024-39705 Ref: https://github.com/Unstructured-IO/unstructured/pull/3361 Latest versions of NLTK have fixed this issue: https://github.com/nltk/nltk/blob/develop/ChangeLog 0.16.9	2024-12-02 19:30:28 +00:00
cragwolfe	9445a2dd01	chore: fix CHANGELOG formatting (#3800 ) Fixes formatting in CHANGELOG.md where most of the page was bold and indented. (verify the branch version here: https://github.com/Unstructured-IO/unstructured/blob/crag/tables-tweak/CHANGELOG.md) Bonus tweak: u-table-inspect.sh is more robust to adding borders for visualizations	2024-11-26 15:38:42 -08:00
Pluto	0fe6ac60aa	Add backward compatibility for metric calculation (#3798 ) Co-authored-by: cragwolfe <crag@unstructured.io> 0.16.8	2024-11-26 18:14:16 +00:00
Pluto	e48d79eca1	image alt support (#3797 ) 0.16.7	2024-11-26 16:20:23 +00:00
ryannikolaidis	626f73af5b	chore: remove dev and release as 0.16.6 (#3793 ) 0.16.6	2024-11-21 23:59:38 +00:00
Yao You	3b9b01c502	Feat: weighted average table metrics (#3348 ) This PR uses an average weighted by the number of actual tables, instead of an unweighted average, for table metrics. - On pages with ground truth tables, the weight is proportional to the number of ground truth tables on that page. - Pages with no ground truth tables but with predicted tables (false positives) are assigned one table's worth of weight for the whole page when calculating the mean value of `table_level_acc`. - Pages with false positive tables do not contribute to table structural or table content metrics. ## test This PR updates the existing test for evaluating table metrics: - adds a second file with just 1 table vs. the existing file with 2 tables - tests that the weighted average is written to the report	2024-11-20 17:14:57 +00:00
Pluto	85ecdab077	Add text as html to orig elements chunks (#3779 ) This simplest solution doesn't drop HTML from metadata when merging Elements from HTML input. We still need to address how to handle nested elements, and if we want to have `LayoutElements` in the metadata of Composite Elements, a unit test showing the current behavior. Note: metadata still contains `orig_elements` which has all the metadata.	2024-11-20 13:27:17 +00:00
Pluto	e1babf0660	Define default HTML to ontology mapping (#3784 )	2024-11-20 13:01:28 +00:00
Pluto	ca27b8aa97	Set <table> to be ontology.Table not UncategorizedText (#3782 )	2024-11-15 14:30:48 +00:00
Yao You	a6aefee0cb	chore: remove dev and release as 0.16.5 (#3775 ) 0.16.5	2024-11-07 14:31:50 -06:00
Pluto	c2d17b1ca4	Fix extracting value from field (#3774 )	2024-11-07 18:21:39 +00:00
Pluto	66d1e5a5cb	Add max recursion limit and fix to_text() method (#3773 )	2024-11-07 15:08:16 +00:00
Christine Straub	df156ebe5a	feat: support pdf link extraction in hi_res strategy (#3753 ) This PR aims to add support for link extraction in pdf `hi_res` strategy. The `partition_pdf()` function now supports link extraction when using the `hi_res` strategy, allowing users to extract hyperlinks from PDF documents. ### Summary - Added functionalities to support link extraction in hi_res flow - Enhanced word extraction functionality used for link extraction in both `fast` and `hi_res` flows, resulted in more correct `start_index` and `text` in `links` metadata. - Updated ingest fixture update workflow to not skip Astra DB source test ### Testing ``` elements = partition_pdf( filename="example-docs/pdf/embedded-link.pdf", strategy="hi_res" ) assert len(elements[0].metadata.links) == 3 ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> 0.16.4	2024-10-31 16:52:27 +00:00
Pluto	1953b8699f	Ml 415/merge inline elements (#3749 )	2024-10-31 12:17:25 +00:00
Maksymilian Operlejn	eb1b294b73	ML-405/ML-427 - OntologyElement improvements (#3758 ) - the "value" attribute from <input/> tag will be taken into account and processed as "text" in ontology - the tables will now be parsed without any ids and classes - we have different reasons behind that, for example, embeddings with ids and classes can lose some semantic value. Also, more tokens = more expensive LLM call - cleaned to_html, created to_text for OntologyElement	2024-10-31 01:30:53 +00:00
ryannikolaidis	d0be1151a1	chore: pin unstructured-ingest (#3764 )	2024-10-30 23:25:47 +00:00
Tracy Shen	340a07f18b	[Merge] release to 0.16.3 (#3755 ) - bump version to 0.16.3 based on Pluto's fix on layout parsing - update unstructured-inference version to 0.8.1 in 0.16.3	2024-10-25 13:23:41 -07:00
Pluto	5a91f0cda9	Fix layout parsing (#3754 )	2024-10-25 14:42:06 +00:00
Pluto	2417f8ed84	Fix when parent id is none for first element in v2 notion: (#3752 )	2024-10-25 09:43:36 +00:00
Marianna	9835fe4d5b	set version 0.16.2 (#3748 ) 0.16.2	2024-10-24 16:29:43 +00:00
Marianna	aa5935b357	Ml 384/whitespaces in cct (#3747 ) This ticket ensures that CCT metric will not be sensitive to differences in whitespace (including newline). All whitespaces in string are changed to single space `" "` in both GT and PRED before the metric is computed. Additional changes in CHANGELOG due to auto-formatting.	2024-10-24 13:02:34 +00:00
Pawel Kmiecik	bdfcc14e3d	fix: fix partition_via_api retry mechanism when the default SDK's retry config is empty. (#3746 )	2024-10-24 09:37:22 +00:00
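
The whitespace handling described in commit aa5935b357 (#3747) above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `normalize_whitespace` is hypothetical, and it assumes that each run of whitespace (spaces, tabs, newlines) is collapsed into one space, which is one reading of the commit message.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space.

    Hypothetical helper for illustration only; not the function used in the
    `unstructured` metrics code.
    """
    return re.sub(r"\s+", " ", text)


# Ground-truth (GT) and predicted (PRED) strings that differ only in whitespace.
gt = "Total\n\nrevenue:\t42"
pred = "Total revenue:  42"

# After normalization both strings are identical, so a text metric computed
# on them would no longer be affected by the whitespace differences.
print(normalize_whitespace(gt) == normalize_whitespace(pred))
```

Applying the same normalization to both GT and PRED before computing the metric is what makes the comparison insensitive to whitespace; leading and trailing whitespace is reduced to a single space here rather than stripped.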


1653 Commits