2023-02-14 12:27:45 -08:00
|
|
|
#!/usr/bin/env bash
|
|
|
|
|
2023-09-21 14:51:08 -04:00
|
|
|
set -eu -o pipefail
|
2023-02-14 12:27:45 -08:00
|
|
|
|
2023-02-21 10:15:33 -08:00
|
|
|
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
|
|
|
|
cd "$SCRIPT_DIR"/.. || exit 1
|
2023-02-16 08:45:50 -08:00
|
|
|
|
2023-04-11 00:11:50 -07:00
|
|
|
# NOTE(crag): sets number of tesseract threads to 1 which may help with more reproducible outputs
|
|
|
|
export OMP_THREAD_LIMIT=1
|
|
|
|
|
2023-10-10 09:39:34 -07:00
|
|
|
all_tests=(
|
2023-10-30 16:09:49 -04:00
|
|
|
'test-ingest-azure-dest.sh'
|
|
|
|
'test-ingest-box-dest.sh'
|
|
|
|
'test-ingest-dropbox-dest.sh'
|
|
|
|
'test-ingest-gcs-dest.sh'
|
|
|
|
'test-ingest-s3-dest.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
'test-ingest-s3.sh'
|
2023-10-03 14:31:28 -04:00
|
|
|
'test-ingest-s3-minio.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
'test-ingest-azure.sh'
|
2023-10-02 16:47:24 -04:00
|
|
|
'test-ingest-biomed-api.sh'
|
|
|
|
'test-ingest-biomed-path.sh'
|
2023-10-12 13:33:25 -04:00
|
|
|
# NOTE(yuming): The pdf-fast-reprocess test should be put after any tests that save downloaded files
|
2023-10-02 16:47:24 -04:00
|
|
|
'test-ingest-pdf-fast-reprocess.sh'
|
refactor: unstructured ingest as a pipeline (#1551)
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it would help seperate the responsibility of each
of these into their own processes, running each in parallel using json
files to share data across. This will also help guarantee data is
serializable if this code was used in an actual pipeline. Following is a
flow diagram of the proposed changes. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it, or
not. Possible nodes at this moment:
* Doc factory: creates all the ingest docs via the source connector
* Source: reads/downloads all of the content to process to the local
filesystem to the location set by the `download_dir` parameter.
* Partition: runs partition on all of the downloaded content in json
format.
* Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
* Write: push the final json into the destination via the destination
connector
* This pipeline relies on the information of the ingest docs to be
available via their serialization. An optimization was introduced with
the `IngestDocJsonMixin` which adds in all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility to download vs partition the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* Final step to copy the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(i.e. partition configs) along with the previous step via the filename
used. This allows each step to be the same _if_ all the parameters for
it have not changed and the content so far is the same.
* The only data that is shared and has writes to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.manager.DictProxy` to make sure any interaction with it
is behind a lock.
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method.
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.
### Flow Diagram

2023-10-06 14:49:29 -04:00
|
|
|
'test-ingest-salesforce.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
'test-ingest-box.sh'
|
|
|
|
'test-ingest-discord.sh'
|
2023-10-20 10:00:19 -04:00
|
|
|
'test-ingest-dropbox.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
'test-ingest-github.sh'
|
|
|
|
'test-ingest-gitlab.sh'
|
|
|
|
'test-ingest-google-drive.sh'
|
|
|
|
'test-ingest-wikipedia.sh'
|
|
|
|
'test-ingest-local.sh'
|
|
|
|
'test-ingest-slack.sh'
|
|
|
|
'test-ingest-against-api.sh'
|
|
|
|
'test-ingest-gcs.sh'
|
|
|
|
'test-ingest-onedrive.sh'
|
|
|
|
'test-ingest-outlook.sh'
|
|
|
|
'test-ingest-elasticsearch.sh'
|
|
|
|
'test-ingest-confluence-diff.sh'
|
|
|
|
'test-ingest-confluence-large.sh'
|
|
|
|
'test-ingest-airtable-diff.sh'
|
2023-09-22 19:00:01 -07:00
|
|
|
# NOTE(ryan): This test is disabled because it is triggering too many requests to the API
|
|
|
|
# 'test-ingest-airtable-large.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
'test-ingest-local-single-file.sh'
|
|
|
|
'test-ingest-local-single-file-with-encoding.sh'
|
|
|
|
'test-ingest-local-single-file-with-pdf-infer-table-structure.sh'
|
|
|
|
'test-ingest-notion.sh'
|
|
|
|
'test-ingest-delta-table.sh'
|
|
|
|
'test-ingest-jira.sh'
|
|
|
|
'test-ingest-sharepoint.sh'
|
fix: unstructured-ingest embedding KeyError (#1727)
Currently adding the embedding flag to any unstructured-ingest call
results in this failure:
```
2023-10-11 22:42:14,177 MainProcess ERROR 'b8a98c5d963a9dd75847a8f110cbf7c9'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run
ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash]
File "<string>", line 2, in __getitem__
File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
raise convert_to_error(kind, result)
KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9'
"""
```
This is because the run method for the embedding node is not adding the
IngestDoc to the context map. This PR adds that logic and adds a test to
validate that the embeddings option works as expected.
NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719
goes in, the expected results include the duplicate element bug, however
currently this does at least prove that embeddings are generated and the
function doesn't error.
2023-10-12 13:27:30 -07:00
|
|
|
'test-ingest-embed.sh'
|
2023-09-21 14:51:08 -04:00
|
|
|
)
|
|
|
|
|
2023-10-10 09:39:34 -07:00
|
|
|
full_python_matrix_tests=(
|
|
|
|
'test-ingest-sharepoint.sh'
|
|
|
|
'test-ingest-local.sh'
|
|
|
|
'test-ingest-local-single-file.sh'
|
|
|
|
'test-ingest-local-single-file-with-encoding.sh'
|
|
|
|
'test-ingest-local-single-file-with-pdf-infer-table-structure.sh'
|
|
|
|
'test-ingest-s3.sh'
|
|
|
|
'test-ingest-google-drive.sh'
|
|
|
|
'test-ingest-gcs.sh'
|
|
|
|
)
|
|
|
|
|
|
|
|
CURRENT_TEST="none"
|
2023-09-21 14:51:08 -04:00
|
|
|
|
|
|
|
function print_last_run() {
|
2023-10-10 09:39:34 -07:00
|
|
|
if [ "$CURRENT_TEST" != "none" ]; then
|
|
|
|
echo "Last ran script: $CURRENT_TEST"
|
2023-09-21 14:51:08 -04:00
|
|
|
fi
|
|
|
|
}
|
|
|
|
|
|
|
|
trap print_last_run EXIT
|
|
|
|
|
2023-10-10 09:39:34 -07:00
|
|
|
python_version=$(python --version 2>&1)
|
|
|
|
|
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
..should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 20:16:02 -07:00
|
|
|
tests_to_ignore=(
|
|
|
|
'test-ingest-notion.sh'
|
2023-10-23 17:39:22 -04:00
|
|
|
'test-ingest-dropbox.sh'
|
2023-10-27 14:07:00 +01:00
|
|
|
'test-ingest-sharepoint.sh'
|
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
..should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 20:16:02 -07:00
|
|
|
)
|
|
|
|
|
2023-10-10 09:39:34 -07:00
|
|
|
for test in "${all_tests[@]}"; do
|
|
|
|
CURRENT_TEST="$test"
|
|
|
|
# IF: python_version is not 3.10 (wildcarded to match any subminor version) AND the current test is not in full_python_matrix_tests
|
|
|
|
# Note: to test we expand the full_python_matrix_tests array to a string and then regex match the current test
|
|
|
|
if [[ "$python_version" != "Python 3.10"* ]] && [[ ! "${full_python_matrix_tests[*]}" =~ $test ]] ; then
|
|
|
|
echo "--------- SKIPPING SCRIPT $test ---------"
|
|
|
|
continue
|
|
|
|
fi
|
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.
1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.
2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.
**Technical Summary**
The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.
Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:
```python
>>> element.regex_metadata
{
"mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
"version": [
{"text": "current: v1.7.2", "start": 7, "end": 21},
{"text": "supersedes: v1.7.0", "start": 22, "end": 40},
],
}
```
*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.
* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.
Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.
**Over-chunking Behavior**
The over-chunking looked like this:
Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
chunks = chunk_by_title(elements)
assert chunks == [
CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
]
```
Observed behavior looked like this:
```python
chunks => [
CompositeElement('Lorem Ipsum')
CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]
```
The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.
**Dropping regex-metadata Behavior**
Chunking this section:
```python
elements: List[Element] = [
Title(
"Lorem Ipsum",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
),
),
Text(
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
metadata=ElementMetadata(
regex_metadata={
"dolor": [RegexMetadata(text="dolor", start=12, end=17)],
"ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
}
),
),
Text(
"In rhoncus ipsum sed lectus porta volutpat.",
metadata=ElementMetadata(
regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
),
),
]
```
..should produce this regex_metadata on the single produced chunk:
```python
assert chunk == CompositeElement(
"Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
" ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
"dolor": [RegexMetadata(text="dolor", start=25, end=30)],
"ipsum": [
RegexMetadata(text="Ipsum", start=6, end=11),
RegexMetadata(text="ipsum", start=19, end=24),
RegexMetadata(text="ipsum", start=81, end=86),
],
}
```
but instead produced this:
```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```
Which is the regex-metadata from the first element only.
The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 20:16:02 -07:00
|
|
|
if [[ "${tests_to_ignore[*]}" =~ $test ]]; then
|
2023-10-10 09:39:34 -07:00
|
|
|
echo "--------- RUNNING SCRIPT $test --- IGNORING FAILURES"
|
2023-09-22 00:19:21 -07:00
|
|
|
set +e
|
2023-10-10 09:39:34 -07:00
|
|
|
echo "Running ./test_unstructured_ingest/$test"
|
|
|
|
./test_unstructured_ingest/"$test"
|
2023-09-22 00:19:21 -07:00
|
|
|
set -e
|
2023-10-10 09:39:34 -07:00
|
|
|
echo "--------- FINISHED SCRIPT $test ---------"
|
2023-09-22 00:19:21 -07:00
|
|
|
else
|
2023-10-10 09:39:34 -07:00
|
|
|
echo "--------- RUNNING SCRIPT $test ---------"
|
|
|
|
echo "Running ./test_unstructured_ingest/$test"
|
|
|
|
./test_unstructured_ingest/"$test"
|
|
|
|
echo "--------- FINISHED SCRIPT $test ---------"
|
2023-09-22 00:19:21 -07:00
|
|
|
fi
|
2023-09-21 14:51:08 -04:00
|
|
|
done
|
2023-10-23 17:39:22 -04:00
|
|
|
|
2023-10-27 00:36:36 -04:00
|
|
|
all_eval=(
|
|
|
|
'text-extraction'
|
|
|
|
'element-type'
|
|
|
|
)
|
|
|
|
for eval in "${all_eval[@]}"; do
|
|
|
|
CURRENT_TEST="$eval"
|
|
|
|
echo "--------- RUNNING SCRIPT $eval ---------"
|
|
|
|
./test_unstructured_ingest/evaluation-metrics.sh "$eval"
|
|
|
|
echo "--------- FINISHED SCRIPT $eval ---------"
|
2023-10-27 14:07:00 +01:00
|
|
|
done
|