unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-03 15:11:30 +00:00

Author	SHA1	Message	Date
Christine Straub	4ad01efe23	feat: improve reading order (#2219) Closes GH Issue #2208.	2023-12-07 23:21:10 -08:00
Klaijan	2c2d5b65ca	refactor: measure_text_edit_distance function for aggregation (#2108) - Refactor `metrics/evaluation.py` to accept `grouping` as a parameter. - Switch to `DataFrame` for easier analysis and aggregation.	2023-11-22 13:30:16 -08:00
Klaijan	433c3889dc	ci: reorganize eval output folders and add azure to matrix test (#2093) Summary The CI workflow for evaluation previously saved the metric outputs directly to the `metrics/` folder. They are now organized into subfolders, e.g. `metrics/text-extraction` and `metrics/element-type`, to keep the folder tidy. Additionally, the Azure connector is added to `full_python_matrix_tests` in this PR. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-21 20:04:30 +00:00
Klaijan	5ba3b9c2c6	chore: get eval metrics from ingest in (#2097) Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-17 18:22:36 +00:00
Klaijan	049b0f3fa8	chore: update metrics-json-manifest (#2047) Update the `metrics-json-manifest.txt` master file for ingest evaluation. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-10 00:24:59 +00:00
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981 ) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00
Klaijan	c471ea3cc7	chore: remove copy line from non-matrix connectors (#1976)	2023-11-04 10:58:56 -07:00
shreyanid	c24e6e056c	chore: add doctype to ingest evaluation functions (#1977 ) ### Summary To combine ingest and holistic metrics efforts, add the `doctype` field to the results from the functions in evaluate.py for use in subsequent aggregation functions. ### Test Run `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` and there will be a new doctype column with the file's doctype extension. <img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM" src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf"> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>	2023-11-02 19:15:53 +00:00
Klaijan	a06b151897	refactor: ci workflow refactor (#1907) Refactor the evaluation scripts, including `unstructured/ingest/evaluation.py` and `test_unstructured_ingest/evaluation-metrics.sh`, for more structured code and usage. - The script now uses only one Python script call with a parameter - Adds a function to build the strings for output_args (`--output_dir --output_list`) and source_args (`--source_dir --source_args`) - Now accepts the evaluation to call as a parameter; currently only `text-extraction` and `element-type` are accepted Example call: ```sh evaluation-metrics.sh text-extraction``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-11-01 15:58:23 +00:00
Roman Isecke	680cfbabd4	expand fsspec downstream connectors (#1777 ) ### Description Replacing PR [1383](https://github.com/Unstructured-IO/unstructured/pull/1383) --------- Co-authored-by: Trevor Bossert <alanboss@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-30 20:09:49 +00:00
Benjamin Torres	05c3cd1be2	feat: clean pdfminer elements inside tables (#1808) This PR introduces `clean_pdfminer_inner_elements`, which deletes pdfminer elements that lie inside elements from other detection origins such as YoloX or detectron. The function returns the cleaned document. The ingest-test fixtures were also updated to reflect the new standard output. The best way to check that this function works properly is to look at the new test `test_clean_pdfminer_inner_elements` in `test_unstructured/partition/utils/test_processing_elements.py` --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-10-30 07:10:51 +00:00
Klaijan	466255eec3	build: element type frequency evaluation metrics workflow in ci (#1862) Executive Summary Measures element type frequency accuracy of the current version of the code against the expected output. The performance is reported as a tsv file under `metrics`. Technical Details - The evaluation measures element type frequencies from `structured-output-eval` against `expected-structured-output` - `evaluation.py` has been edited to support function calling using `click.group()` and `command()` - `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh` scripts Outputs 2 tsv files are saved ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4) and the aggregated score is displayed. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-27 04:36:36 +00:00
Roman Isecke	135aa65906	update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814 ) ### Description * If the contents of a doc were updated by the process of reading/downloading it, this was not being persisted. To fix this, the data being passed around was updated to use a multiprocessing safe dict rather than the json string. Now that dict is updated after the `get_file` method is called. * Wikipedia connector was updated to use a static filename rather than one requiring a call to fetch data. * The read config param `re_download` was not being leveraged by the source node, this was fixed. * Added fix: chunking and embedding order reversed so chunking runs before embeddings --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-25 22:04:27 +00:00
Klaijan	6707cab250	build: text extraction evaluation metrics workflow added (#1757) Executive Summary This PR adds the evaluation metrics to our current workflow. It verifies the flow in which, when code is pushed, the code is evaluated against our gold standard and the results are written to a `.tsv` file. Technical Details - Adds evaluation metrics to the test-ingest workflow - Makes use of `structured-output` from `test-ingest` and compares it to the gold standard uploaded to S3, which is downloaded locally when the comparison is made. The folder currently in use is `s3://utic-dev-tech-fixtures/small-cct`. This directory is editable in the shell script. - With this PR, only one file from one connector is used for comparison. Misc - Few files overlap between test-ingest and the gold standard. More files will be added. Outputs 2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-10-23 21:39:22 +00:00
Roman Isecke	a2af72bb79	local connector metadata and deserialization fix (#1800 ) ### Description * Priority of this was to fix deserialization of ingest docs. Currently the source metadata wasn't being persisted * To help debug this, source metadata was added to the local ingest doc as well. * Unit test added to make sure the metadata itself was persisted. * As part of serialization, it was forcing docs to fetch source metadata if it hadn't already to add to the generated dict/json. This shouldn't have happened if the underlying variable `_source_metadata` was `None`. This way the doc can be serialized without any calls being made. * Serialization was moved to the `to_dict` method to make it more universal. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-23 15:51:52 +00:00
Klaijan	4a9d48859c	chore: add test_unstructured_ingest/metrics dummy dir (#1827) Update path for metrics output to `test_unstructured_ingest/metrics` and create a dummy folder `test_unstructured_ingest/metrics` to prevent ci error.	2023-10-20 14:54:30 -07:00

16 Commits
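
Several of the commits above concern a text edit-distance metric that gains a `doctype` field and is then aggregated by a `grouping` parameter. The idea can be sketched roughly as follows. This is a standard-library illustration only, not the repository's implementation (which, per the log, uses a `DataFrame`); the helper names `edit_distance` and `aggregate_by`, the row layout, and the sample strings are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def aggregate_by(rows, grouping):
    """Average the 'distance' of the rows for each value of `grouping`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[grouping]].append(row["distance"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical sample data: (filename, extracted text, gold-standard text).
samples = [
    ("a.pdf", "kitten", "sitting"),
    ("b.pdf", "flaw", "lawn"),
    ("c.html", "same", "same"),
]

# One row per file; the doctype is taken from the file extension.
rows = [
    {
        "filename": name,
        "doctype": name.rsplit(".", 1)[-1],
        "distance": edit_distance(extracted, gold),
    }
    for name, extracted, gold in samples
]

print(aggregate_by(rows, "doctype"))
```

Passing the grouping column as a parameter means the same per-file rows can be re-aggregated along another column, such as a connector name, without recomputing the distances.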