ingest download-only fix (#1943)

### Description
Move the download-only check so that it runs after the source node has run.
Roman Isecke 2023-10-31 10:05:37 -04:00 committed by GitHub
parent 857195b6e6
commit 4f8cb04663
3 changed files with 4 additions and 3 deletions


@@ -1,4 +1,4 @@
-## 0.10.29-dev1
+## 0.10.29-dev2
 ### Enhancements
@@ -10,6 +10,7 @@
 ### Fixes
 * **Ingest session handler not being shared correctly** All ingest docs that leverage the session handler should only need to set it once per process. It was recreating it each time because the right values weren't being set nor available given how dataclasses work in python.
+* **Ingest download-only fix** Previously the download only flag was being checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. This check was moved after the source node to allow for the files to be downloaded first before exiting the pipeline.
 ## 0.10.28


@@ -1 +1 @@
-__version__ = "0.10.29-dev1"  # pragma: no cover
+__version__ = "0.10.29-dev2"  # pragma: no cover


@@ -59,10 +59,10 @@ class Pipeline(DataClassJsonMixin):
         )
         for doc in dict_docs:
             self.pipeline_context.ingest_docs_map[get_ingest_doc_hash(doc)] = doc
+        fetched_filenames = self.source_node(iterable=dict_docs)
         if self.source_node.read_config.download_only:
             logger.info("stopping pipeline after downloading files")
             return
-        fetched_filenames = self.source_node(iterable=dict_docs)
         if not fetched_filenames:
             logger.info("No files to run partition over")
             return
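
The effect of the reordering can be illustrated with a small stand-alone sketch. It is not the project's code: `ReadConfig`, `FakeSourceNode`, `run_old`, and `run_new` are hypothetical stand-ins that only mimic the order of the download-only check relative to the source node call, assuming the source node is what actually downloads the files.

```python
import logging
from dataclasses import dataclass, field
from typing import List

logger = logging.getLogger("pipeline_sketch")


@dataclass
class ReadConfig:
    # Hypothetical stand-in for the read config; only the flag matters here.
    download_only: bool = False


@dataclass
class FakeSourceNode:
    # Hypothetical source node: "downloads" each doc and records the result.
    read_config: ReadConfig
    downloaded: List[str] = field(default_factory=list)

    def __call__(self, iterable: List[str]) -> List[str]:
        for doc in iterable:
            self.downloaded.append(f"{doc}.downloaded")
        return list(self.downloaded)


def run_old(source_node: FakeSourceNode, docs: List[str]) -> str:
    # Old order: the flag is checked before the source node runs,
    # so the pipeline exits without downloading anything.
    if source_node.read_config.download_only:
        logger.info("stopping pipeline after downloading files")
        return "stopped"
    fetched = source_node(iterable=docs)
    return "partitioned" if fetched else "empty"


def run_new(source_node: FakeSourceNode, docs: List[str]) -> str:
    # New order: the source node runs first, then the flag is checked,
    # so the files exist on exit.
    fetched = source_node(iterable=docs)
    if source_node.read_config.download_only:
        logger.info("stopping pipeline after downloading files")
        return "stopped"
    return "partitioned" if fetched else "empty"


if __name__ == "__main__":
    old_node = FakeSourceNode(ReadConfig(download_only=True))
    new_node = FakeSourceNode(ReadConfig(download_only=True))
    print(run_old(old_node, ["a", "b"]), old_node.downloaded)
    print(run_new(new_node, ["a", "b"]), new_node.downloaded)
```

In this sketch both orderings stop early when `download_only` is set, but only the new ordering leaves downloaded files behind, which matches the behavior the changelog entry describes.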