1 Commit

Author SHA1 Message Date
ryannikolaidis
d22044a44c
fix: unstructured-ingest embedding KeyError (#1727)
Currently, adding the embedding flag to any unstructured-ingest call
results in this failure:

```
2023-10-11 22:42:14,177 MainProcess ERROR    'b8a98c5d963a9dd75847a8f110cbf7c9'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run
    ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash]
  File "<string>", line 2, in __getitem__
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9'
"""
```

This is because the run method for the embedding node does not add the
IngestDoc to the context map, so the later lookup of the doc hash in
`ingest_docs_map` (in the copy step) raises a KeyError. This PR adds that
logic and adds a test to validate that the embeddings option works as
expected.
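
As a rough illustration of the failure mode (not the actual unstructured-ingest code), the sketch below uses a plain dict in place of the shared context map. The names `PipelineContext`, `embed_run_buggy`, `embed_run_fixed`, and `copy_run`, as well as the way the embedded doc's key is derived, are hypothetical and chosen only for this example; the one detail taken from the traceback is that the copy step looks a doc hash up in `ingest_docs_map`.

```python
import json


class PipelineContext:
    """Hypothetical stand-in for the shared pipeline context."""

    def __init__(self):
        # Maps a doc hash to the serialized IngestDoc (a JSON string here).
        self.ingest_docs_map = {}


def embed_run_buggy(context, doc_hash, doc_json):
    # Assumption: the embedding step hands a new key to downstream steps.
    # The bug: that key is never registered in the context map.
    return "embedded-" + doc_hash


def embed_run_fixed(context, doc_hash, doc_json):
    new_hash = "embedded-" + doc_hash
    # The fix: register the doc under the key downstream steps will use.
    context.ingest_docs_map[new_hash] = doc_json
    return new_hash


def copy_run(context, doc_hash):
    # Mirrors the failing lookup from the traceback: a missing key
    # raises KeyError.
    return json.loads(context.ingest_docs_map[doc_hash])
```

With the buggy variant, `copy_run` raises `KeyError` for the returned key; with the fixed variant, the same lookup returns the document.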

NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719
goes in, the expected results include the duplicate element bug. For now,
the test at least proves that embeddings are generated and the function
doesn't error.
2023-10-12 20:27:30 +00:00