unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-30 12:30:22 +00:00

Author	SHA1	Message	Date
Roman Isecke	680cfbabd4	expand fsspec downstream connectors (#1777) ### Description Replacing PR [1383](https://github.com/Unstructured-IO/unstructured/pull/1383) --------- Co-authored-by: Trevor Bossert <alanboss@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-30 20:09:49 +00:00
Klaijan	466255eec3	build: element type frequency evaluation metrics workflow in ci (#1862) Executive Summary: Measures the element type frequency accuracy of the current code's output against the expected output. The performance is reported as a tsv file under `metrics`. Technical Details: - The evaluation measures element type frequencies from `structured-output-eval` against `expected-structured-output` - `evaluation.py` has been edited to support function calling using `click.group()` and `command()` - `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh` scripts Outputs: Two tsv files are saved ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4) and the aggregated score is displayed. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-27 04:36:36 +00:00
Roman Isecke	135aa65906	update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814) ### Description * If the contents of a doc were updated by the process of reading/downloading it, the update was not being persisted. To fix this, the data being passed around now uses a multiprocessing-safe dict rather than a JSON string. That dict is updated after the `get_file` method is called. * The Wikipedia connector was updated to use a static filename rather than one requiring a call to fetch data. * The read config param `re_download` was not being leveraged by the source node; this was fixed. * Added fix: the order of chunking and embedding was reversed so that chunking runs before embedding --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-25 22:04:27 +00:00
Klaijan	6707cab250	build: text extraction evaluation metrics workflow added (#1757) Executive Summary: This PR adds the evaluation metrics to our current workflow. It verifies the flow in which, when code is pushed, the code is evaluated against our gold standard and the results are written to a `.tsv` file. Technical Details: - Adds evaluation metrics to the test-ingest workflow - Makes use of `structured-output` from `test-ingest` and compares it to the gold standard uploaded to S3, which is downloaded locally when the comparison is made. The folder currently in use is `s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the shell script. - With this PR, only one file from one connector is used for the comparison. Misc: - Few files overlap between test-ingest and the gold standard. More files will be added. Outputs: Two `.tsv` files are saved under `test_unstructured_ingest/metrics/`. ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae) ![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-10-23 21:39:22 +00:00
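
Note on commit 135aa65906: the idea of sharing doc state between worker processes through a manager-backed dict can be illustrated with the standard library alone. The following is a minimal sketch, not the pipeline's actual code; `fetch_doc`, `run_pipeline`, and the record fields `downloaded` and `filename` are hypothetical names chosen for illustration. It assumes each worker rewrites its own entry after a download step, so the update is visible to the parent process.

```python
import multiprocessing as mp


def fetch_doc(doc_id, shared_docs):
    """Hypothetical download step: update the doc's record after 'fetching' it."""
    # Copy the record out, change it, then reassign it. Mutating a nested
    # plain dict in place would not be propagated through the manager proxy.
    record = dict(shared_docs[doc_id])
    record["downloaded"] = True
    record["filename"] = f"{doc_id}.txt"
    shared_docs[doc_id] = record


def run_pipeline(doc_ids):
    """Run fetch_doc for every doc in a small pool and return the final state."""
    with mp.Manager() as manager:
        # The manager-backed dict is shared by all worker processes.
        shared_docs = manager.dict({d: {"downloaded": False} for d in doc_ids})
        with mp.Pool(processes=2) as pool:
            pool.starmap(fetch_doc, [(d, shared_docs) for d in doc_ids])
        # Copy into a plain dict before the manager shuts down.
        return dict(shared_docs)


if __name__ == "__main__":
    print(run_pipeline(["doc-a", "doc-b"]))
```

Passing a JSON string to each worker would give every process its own copy, so changes made during download would be lost; the shared dict avoids that at the cost of a round trip to the manager process on each access.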

4 Commits
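
Note on commit 466255eec3: comparing element type frequencies between an output and an expected output can be sketched as below. This is an illustration only and does not reproduce `evaluation.py`. It assumes each element is a dict with a `type` key, and the scoring rule in `frequency_match_score` (matched counts divided by expected counts) is an assumed metric, not necessarily the one used in the repository.

```python
from collections import Counter


def element_type_frequency(elements):
    """Count how often each element type appears in a list of element dicts."""
    return Counter(el["type"] for el in elements)


def frequency_match_score(output_freq, expected_freq):
    """Assumed metric: share of expected elements whose type is matched in the output."""
    total = sum(expected_freq.values())
    if total == 0:
        # Nothing expected: perfect only if nothing was produced either.
        return 1.0 if not output_freq else 0.0
    matched = sum(min(output_freq.get(t, 0), n) for t, n in expected_freq.items())
    return matched / total


if __name__ == "__main__":
    output = [{"type": "Title"}, {"type": "NarrativeText"}, {"type": "NarrativeText"}]
    expected = output + [{"type": "Table"}]
    score = frequency_match_score(
        element_type_frequency(output), element_type_frequency(expected)
    )
    print(f"{score:.2f}")
```

In this sample the output is missing the one expected `Table` element, so three of the four expected elements are matched.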