16 Commits

Author SHA1 Message Date
Christine Straub
4ad01efe23
feat: improve reading order (#2219)
Closes GH Issue #2208.
2023-12-07 23:21:10 -08:00
Klaijan
2c2d5b65ca
refactor: measure_text_edit_distance function for aggregation (#2108)
- Refactor `metrics/evaluation.py` to accepts `grouping` as parameter. 
- Switch to `DataFrame` for easier analysis and aggregation.
2023-11-22 13:30:16 -08:00
Klaijan
433c3889dc
ci: reorganize eval output folders and add azure to matrix test (#2093)
**Summary**
The CI workflow for evaluation previously saved the metric outputs to
the `metrics/` folder. Currently structured in subfolders e.g.
`metrics/text-extraction` `metrics/element-type` for the folder clean up
purpose.

Additionally, Azure connector is also added to
`full_python_matrix_tests` in this PR.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-21 20:04:30 +00:00
Klaijan
5ba3b9c2c6
chore: get eval metrics from ingest in (#2097)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-17 18:22:36 +00:00
Klaijan
049b0f3fa8
chore: update metrics-json-manifest (#2047)
Update `metrics-json-manigest.txt` master file for ingest evaluation.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-10 00:24:59 +00:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Klaijan
c471ea3cc7
chore: remove copy line from non-matrix connectors (#1976) 2023-11-04 10:58:56 -07:00
shreyanid
c24e6e056c
chore: add doctype to ingest evaluation functions (#1977)
### Summary
To combine ingest and holistic metrics efforts, add the `doctype` field
to the results from the functions in evaluate.py for use in subsequent
aggregation functions.

### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction` and there will be a new doctype column with the file's
doctype extension.
<img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf">

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
2023-11-02 19:15:53 +00:00
Klaijan
a06b151897
refactor: ci workflow refactor (#1907)
Refactor the evaluation scripts including
`unstructured/ingest/evaluation.py`
`test_unstructured_ingest/evaluation-metrics.sh` for more structured
code and usage.
- The script is now only use one python script call with param
- Adds function to build string for output_args (`--output_dir
--output_list) and source_args (`--source_dir --source_args`)
- Now accepts evaluation to call as a param, currently only accepts
`text-extraction` and `element-type`

Example to call the function:
```sh evaluation-metrics.sh text-extraction```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-01 15:58:23 +00:00
Roman Isecke
680cfbabd4
expand fsspec downstream connectors (#1777)
### Description
Replacing PR
[1383](https://github.com/Unstructured-IO/unstructured/pull/1383)

---------

Co-authored-by: Trevor Bossert <alanboss@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-30 20:09:49 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements` , which deletes
pdfminer elements inside other detection origins such as YoloX or
detectron.
This function returns the clean document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
Klaijan
466255eec3
build: element type frequency evaluation metrics workflow in ci (#1862)
**Executive Summary**
Measured element type frequency accuracy from the current version of
code with the expected output. The performance is reported as tsv file
under `metrics`.

**Technical Details**
- The evaluation measures element type frequencies from
`structured-output-eval` against `expected-structured-output`
- `evaluation.py` has been edited to support function calling using
`click.group()` and `command()`
- `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh`
scripts

**Outputs**
2 tsv files is saved

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a)

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4)
9-0e05-41d4-b69f-841a2aa131ec)
and aggregated score is displayed.

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 04:36:36 +00:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
Klaijan
6707cab250
build: text extraction evaluation metrics workflow added (#1757)
**Executive Summary**
This PR adds the evaluation metrics to our current workflow. It verifies
the flow that when the code is pushed, the code will gets evaluate
against our gold standard and output into `.tsv` file.

**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Make use of `structured-output` from `test-ingest` and compare to the
gold-standard uploaded in s3, and download into local when make
comparison. The current folder in-use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is use to compare.

**Misc**
- Not many overlapped files between test-ingest and gold-standard. More
files will be added.

**Outputs**
2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`.


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae)


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-10-23 21:39:22 +00:00
Roman Isecke
a2af72bb79
local connector metadata and deserialization fix (#1800)
### Description
* Priority of this was to fix deserialization of ingest docs. Currently
the source metadata wasn't being persisted
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* As part of serialization, it was forcing docs to fetch source metadata
if it hadn't already to add to the generated dict/json. This shouldn't
have happened if the underlying variable `_source_metadata` was `None`.
This way the doc can be serialized without any calls being made.
* Serialization was moved to the `to_dict` method to make it more
universal.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-23 15:51:52 +00:00
Klaijan
4a9d48859c
chore: add test_unstructured_ingest/metrics dummy dir (#1827)
Update path for metrics output to `test_unstructured_ingest/metrics` and
create a dummy folder `test_unstructured_ingest/metrics` to prevent ci
error.
2023-10-20 14:54:30 -07:00