**Executive Summary**
This PR adds evaluation metrics to our current workflow. It verifies
the flow in which, when code is pushed, the structured output is evaluated
against our gold standard and the results are written to `.tsv` files.
**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Makes use of the `structured-output` from `test-ingest` and compares it to the
gold standard uploaded in s3, downloading the gold standard locally when making
the comparison. The folder currently in use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is used for the comparison.
**Misc**
- There are not many overlapping files between test-ingest and the gold standard yet. More
files will be added.
**Outputs**
Two `.tsv` files are saved under `test_unstructured_ingest/metrics/`.
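For illustration only, a rough sketch of what such a comparison-and-report step could look like in Python (the real comparison lives in the shell script; the file names and the exact-match metric here are placeholders):

```python
import csv
from pathlib import Path


def write_metrics_tsv(output_file: Path, gold_file: Path, metrics_dir: Path) -> None:
    """Compare one structured output against its gold standard and record a
    simple placeholder metric in a .tsv file."""
    exact_match = int(output_file.read_text() == gold_file.read_text())

    metrics_dir.mkdir(parents=True, exist_ok=True)
    with open(metrics_dir / "evaluation.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["filename", "exact_match"])
        writer.writerow([output_file.name, exact_match])
```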


---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Description
Given that many of the options associated with the `Click`-based CLI
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries, to prevent
new configs from overwriting already-added options.
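A rough sketch of what such a duplicate check can look like (the helper name here is illustrative, not the exact function added in this PR):

```python
from typing import List

import click


def add_options(cmd: click.Command, options: List[click.Option]) -> click.Command:
    """Attach dynamically generated options to a command, refusing duplicates."""
    existing = {param.name for param in cmd.params}
    for option in options:
        if option.name in existing:
            raise ValueError(
                f"duplicate option '{option.name}' on command '{cmd.name}'"
            )
        cmd.params.append(option)
        existing.add(option.name)
    return cmd
```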
### Issues found and their fixes:
* The duplicate api-key option set on the Notion command conflicted with the api key
used for the unstructured api. Added a notion prefix.
* The retry logic configs had duplicates in biomed. Removed them since retry is
not handled by the pipeline.
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
It looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```
Observed behavior looked like this:
```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids over-chunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.
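In rough form, the change looks like this (field names here are illustrative, not the library's actual boundary fields):

```python
# Before: break on any metadata field outside a known group (blacklisting),
# so a newly added field such as regex_metadata accidentally became a
# semantic boundary. After: break only on an explicit whitelist of fields.
SEMANTIC_BOUNDARY_FIELDS = {"section", "page_number"}  # illustrative whitelist


def crosses_semantic_boundary(prev_meta: dict, meta: dict) -> bool:
    """Start a new chunk only when a whitelisted boundary field changes."""
    return any(
        prev_meta.get(field) != meta.get(field)
        for field in SEMANTIC_BOUNDARY_FIELDS
    )
```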
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```
...should produce this regex_metadata on the single chunk produced:
```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
This is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
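Conceptually, the separate regex-metadata processing amounts to something like the following sketch (the helper name and dict-based representation are illustrative):

```python
from typing import Dict, List


def consolidate_regex_metadata(
    regex_metas: List[Dict[str, List[dict]]],
    texts: List[str],
    separator: str = "\n\n",
) -> Dict[str, List[dict]]:
    """Merge per-element regex_metadata dicts, shifting match offsets to their
    positions within the concatenated chunk text."""
    consolidated: Dict[str, List[dict]] = {}
    offset = 0
    for text, regex_meta in zip(texts, regex_metas):
        for name, matches in (regex_meta or {}).items():
            for match in matches:
                consolidated.setdefault(name, []).append(
                    {
                        "text": match["text"],
                        "start": match["start"] + offset,
                        "end": match["end"] + offset,
                    }
                )
        offset += len(text) + len(separator)
    return consolidated
```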
Currently adding the embedding flag to any unstructured-ingest call
results in this failure:
```
2023-10-11 22:42:14,177 MainProcess ERROR 'b8a98c5d963a9dd75847a8f110cbf7c9'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run
ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash]
File "<string>", line 2, in __getitem__
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
raise convert_to_error(kind, result)
KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9'
"""
```
This is because the run method for the embedding node is not adding the
IngestDoc to the context map. This PR adds that logic and adds a test to
validate that the embeddings option works as expected.
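Conceptually, the fix amounts to registering the serialized doc in the shared map inside the embedding node's run method; a rough sketch (the stand-in context object and hashing here are illustrative, not the pipeline's exact code):

```python
import hashlib
from types import SimpleNamespace

# Stand-in for the shared pipeline context; in the real pipeline the
# ingest_docs_map is a multiprocessing manager dict shared across processes.
pipeline_context = SimpleNamespace(ingest_docs_map={})


def run_embed_node(ingest_doc_json: str) -> str:
    """Sketch of the embedding node's run method after the fix."""
    doc_hash = hashlib.sha256(ingest_doc_json.encode()).hexdigest()[:32]
    # The missing piece: register the serialized doc so downstream nodes
    # (e.g. the copy step) can look it up by hash.
    pipeline_context.ingest_docs_map[doc_hash] = ingest_doc_json
    # ...embedding work would happen here...
    return ingest_doc_json
```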
NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719
goes in, the expected results include the duplicate element bug; however,
this currently at least proves that embeddings are generated and the
function doesn't error.
We're probably making an unfairly large volume of new connections and
requests to test services when all of our ingest tests run across the
full Python test matrix and a lot of PRs are firing at once. Let's limit
the full matrix run to a select few, but still have all ingest tests run
on Python 3.10. This is done by checking the version and skipping in
ingest-test.sh.
Bonus: Bumps ingest test fixture workflow to use 3.10. This technically
shouldn't make a difference, but since we're making 3.10 the default of
the matrix strategy, it probably makes sense to use 3.10 for the ingest
fixture generation as well for consistency.
## Testing
- [example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537900978?pr=1687) running all tests in 3.10
- [example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537899999?pr=1687) skipping/running the expected tests in 3.8
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it would help to separate the responsibility of each
of these into their own processes, running each in parallel and using json
files to share data across them. This will also help guarantee data is
serializable if this code is used in an actual pipeline. Following is a
flow diagram of the proposed changes. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it, or
not. Possible nodes at this moment:
* Doc factory: creates all the ingest docs via the source connector
* Source: reads/downloads all of the content to process to the local
filesystem to the location set by the `download_dir` parameter.
* Partition: runs partition on all of the downloaded content in json
format.
* Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
* Write: push the final json into the destination via the destination
connector
* This pipeline relies on the information in the ingest docs being
available via their serialization. An optimization was introduced with
the `IngestDocJsonMixin`, which adds all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`.
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: it made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility for downloading vs partitioning the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* A final step copies the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing that step's parameters
(i.e. partition configs) along with the previous step, via the filename
used. This allows each step's result to stay the same _if_ all of its
parameters have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.manager.DictProxy` to make sure any interaction with it
is behind a lock. A condensed sketch of this arrangement follows this list.
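Purely as an illustration of the arrangement described above (class, helper, and parameter names here are hypothetical, not the exact ones in this PR):

```python
import hashlib
import json
import multiprocessing
from typing import Callable, List


def step_hash(step_params: dict, previous_filename: str) -> str:
    """Hash a step's parameters together with the previous step's filename,
    so a step's output name only changes when its inputs change."""
    payload = json.dumps(step_params, sort_keys=True) + previous_filename
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class Pipeline:
    """Runs each node over the working set, optionally via multiprocessing."""

    def __init__(self, nodes: List[Callable[[str], str]], num_processes: int = 2):
        self.nodes = nodes
        self.num_processes = num_processes
        # Shared ingest-doc map: the DictProxy keeps every interaction behind a lock.
        self.manager = multiprocessing.Manager()
        self.ingest_docs_map = self.manager.dict()

    def run(self, filenames: List[str]) -> List[str]:
        for node in self.nodes:
            with multiprocessing.Pool(self.num_processes) as pool:
                filenames = pool.map(node, filenames)
        return filenames
```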
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.
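The last refactor above follows roughly this pattern (a sketch with illustrative attribute and method names):

```python
from typing import Optional


class BaseIngestDoc:
    """Sketch of lazily populating source metadata on first access."""

    def __init__(self):
        self._source_metadata: Optional[dict] = None

    def update_source_metadata(self) -> None:
        # In a real connector this would query the source system.
        self._source_metadata = {"date_created": "2023-10-01"}

    @property
    def source_metadata(self) -> dict:
        # Populate lazily if still None at the time it's fetched.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata
```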
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls` field.
This was removed from the fields captured by the `@dataclass` decorator
and set in a `__post_init__` method instead (a sketch of this pattern follows this list).
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
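A minimal sketch of the `__post_init__` pattern mentioned in the first bullet (class names are illustrative):

```python
from dataclasses import dataclass


class S3IngestDoc:  # stand-in for the connector's ingest doc class
    pass


@dataclass
class FsspecConnector:
    remote_url: str

    def __post_init__(self):
        # Assigned here rather than declared as a dataclass field, so it stays
        # out of the fields captured (and serialized) by the @dataclass decorator.
        self.ingest_doc_cls = S3IngestDoc
```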
### Flow Diagram

### Description
Exposes the endpoint url as an access kwarg when using the s3 filesystem
library via the fsspec abstraction. This allows any non-AWS data
providers that support the s3 protocol (e.g. MinIO) to be used with the s3
connector.
Closes out https://github.com/Unstructured-IO/unstructured/issues/950
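For example, pointing the s3 filesystem at a MinIO endpoint might look roughly like this (the credentials, endpoint, and bucket name are placeholders):

```python
import fsspec

# Access a MinIO (or any s3-compatible) server through the s3 protocol by
# passing endpoint_url via the client kwargs.
fs = fsspec.filesystem(
    "s3",
    key="minioadmin",  # placeholder credentials
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
print(fs.ls("my-bucket"))  # placeholder bucket
```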
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
- Resolves an issue where the deltalake writer occasionally results in a
SIGABRT on Linux even though the writer finished writing the table properly
- This was first observed in the ingest tests
- Putting the writer into its own process mitigates this problem by forcing
Python to wait for the deltalake rust backend to finish its tasks
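Roughly, the mitigation amounts to something like this sketch (assuming a pyarrow table as the data to write; not the connector's exact code):

```python
from multiprocessing import Process

import pyarrow as pa
from deltalake.writer import write_deltalake


def _write(table_uri: str, table: pa.Table) -> None:
    # Running the write in a child process forces Python to let the deltalake
    # rust backend finish its tasks before the process exits.
    write_deltalake(table_uri, table, mode="overwrite")


def write_in_subprocess(table_uri: str, table: pa.Table) -> None:
    proc = Process(target=_write, args=(table_uri, table))
    proc.start()
    proc.join()
```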
## Test
To test this it is best to set up an instance on a Linux system since the
problem has only been observed on Linux so far. Run
```bash
PYTHONPATH=. ./unstructured/ingest/main.py delta-table --num-processes 2 --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.date_created,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth --table-uri ../tables/delta/ --preserve-downloads --verbose delta-table --write-column json_data --mode overwrite --table-uri file:///tmp/delta
```
Without this fix we'd occasionally encounter `SIGABRT`.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
This connector:
- takes a Jira Cloud URL, user email and api token to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using the
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, perform the necessary setup and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.
I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, it would not work without those.
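For reference, reading such a table through the deltalake library with explicit credentials looks roughly like this (the table uri and key values are placeholders):

```python
from deltalake import DeltaTable

# Even for a public bucket, the delta-rs backend needed explicit credentials
# passed through storage_options.
dt = DeltaTable(
    "s3://my-bucket/path/to/delta-table",  # placeholder table uri
    storage_options={
        "AWS_ACCESS_KEY_ID": "<access-key>",
        "AWS_SECRET_ACCESS_KEY": "<secret-key>",
        "AWS_REGION": "us-east-2",
    },
)
print(dt.to_pyarrow_table().num_rows)
```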
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
* add the first version of airtable connector
* change imports as inline to fail gracefully in case of lacking dependency
* parse tables as csv rather than plain text
* add relevant logic to be able to use --airtable-list-of-paths
* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings
* fix ingest test names
* add scripts for the large table test
* remove large table test from diff test
* make base and table ids explicit
* add and remove comments
* use -ne instead of !=
* update code based on the recent ingest refactor, update changelog and version
* shellcheck fix
* update comments
* update check-num-rows-and-columns-output error message
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* update help comments
* update help comments
* update help comments
* update workflows to set auth tokens and to run make install
* add comments on create_scale_test_components
* separate component ids from the test script, add comments to document test component creation
* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id
* shellcheck fixes
* shellcheck fixes
* update docs
* update comment
* bump version
* add wrongly deleted file
* sort columns before saving to process
* Update ingest test fixtures (#1098)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add param
* expected test
* add option (to do doc nit)
* test with api for now
* typo
* test with api key
* use local only
* encoding -> partition-encoding
* changelog and version
* Update ingest test fixtures (#1055)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
* ignore coordinates
* no whitespace lol
* Update ingest test fixtures (#1061)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
* Add confluence connector and an example script
* add test script, add dependency installations
* add authentication secret variables for ci tests and actions
* add dependency installation commands for workflows
* add dependency installation commands for workflows
* Update ingest test fixtures (#907)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add ingest test fixtures update workflow for python 3.10, update example script with dummy values
* change workflow name to avoid confusion
* change workflow name to avoid confusion
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions
* Update ingest test fixtures (#911)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* revert back the test python version matrix
* recompile dependencies
* modifications for shellcheck
* update changelog and version
* changelog and version
* remove comments
* Update ingest test fixtures (#915)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add the option to state the number of spaces to be fetched
* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users
* add help message
* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test
* change test names
* rename connector arg
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* change arg name for connector
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* add comment to example
* change arg names
* add new tests to ingest test
* shellcheck remove redundant statement
* Update ingest test fixtures (#932)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* Update ingest test fixtures (#936)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* linting
* change file extensions to parse as html
* Update ingest test fixtures (#943)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* remove old fixtures
* update version to 0.8.2-dev3
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of the extracted elements
* Added PDF ingest tests with the fast strategy via the new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh
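An illustrative sketch of the deterministic ordering (the element structure here is hypothetical, not the library's actual API): sort extracted elements top-to-bottom, then left-to-right, by their coordinates.

```python
from typing import List, NamedTuple


class ExtractedElement(NamedTuple):
    text: str
    x: float
    y: float


def order_elements(elements: List[ExtractedElement]) -> List[ExtractedElement]:
    # Sort by y first (top-to-bottom), then x (left-to-right), so the output
    # ordering does not depend on extraction order across Python versions.
    return sorted(elements, key=lambda el: (el.y, el.x))
```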
Updated ingest tests procedure:
* Process files with the hi_res strategy and preserve downloads to files-ingest-download/<ingest_test_name> in the repo
* Reprocess all PDFs with the fast strategy from the local files-ingest-download directory; the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce the tests with ./scripts/ingest-test-fixtures-update.sh; expect no updates. Also, no secret tokens are needed since the relevant tests won't produce PDFs.
* Initial commit of discord connector
based off of initial work by @tnachen with modifications
https://github.com/tnachen/unstructured/tree/tnachen/discord_connector
* Add test file
change format of imports
* working version of the connector
More work to be done to tidy it up and add any additional options
* add to test fixtures update
* fix spacing
* tests working, switching to bot testing channel
* add additional channel
add reprocess to tests
* add try clause to allow for exit on error
Update changelog and bump version
* add updated expected output files
* add logic to check if --discord-period is an integer
Add more to option description
* fix lint error
* Update discord reqs
* PR feedback
* add newline
* another newline
---------
Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
* Add --partition-by-api and --partition-host args to ingest
* Fix error in make check
* Bump changelog
* Add a test ingest script
Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.
* Remove the content type workaround
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for "Some PDF's with scanned images return empty elements" #346.
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
favor of `--remote-url`.
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Add GitLab data connector for ingest.
Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.
Prevent code duplication for functionality between GitHub and GitLab ingest connectors.
Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.
These work for GitHub and GitLab.
The connector can process a Wikipedia page and output the HTML, the plain
text contents, and the summary. No API key is required.
Also adds a test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
- Creates ABC's for ingest connectors
- Updates the s3_connector classes to inherit from ABC's
- Moves s3 test script to its own file to establish a pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py
Note that there were no logic changes, this is essentially a refactoring PR.
Test instructions:
Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
* Many command line options added. The sample ingest project is now an easy-to-use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.