haystack/rest_api/controller/file_upload.py

from typing import Optional, List

import json
import shutil
import uuid
from pathlib import Path

from fastapi import FastAPI, APIRouter, UploadFile, File, Form, HTTPException, Depends
from pydantic import BaseModel
from haystack import Pipeline
from haystack.nodes import BaseConverter, PreProcessor

from rest_api.utils import get_app, get_pipelines
from rest_api.config import FILE_UPLOAD_PATH
from rest_api.controller.utils import as_form


router = APIRouter()
app: FastAPI = get_app()
indexing_pipeline: Pipeline = get_pipelines().get("indexing_pipeline", None)


@as_form
class FileConverterParams(BaseModel):
    remove_numeric_tables: Optional[bool] = None
    valid_languages: Optional[List[str]] = None


@as_form
class PreprocessorParams(BaseModel):
    clean_whitespace: Optional[bool] = None
    clean_empty_lines: Optional[bool] = None
    clean_header_footer: Optional[bool] = None
    split_by: Optional[str] = None
    split_length: Optional[int] = None
    split_overlap: Optional[int] = None
    split_respect_sentence_boundary: Optional[bool] = None


class Response(BaseModel):
    file_id: str


@router.post("/file-upload")
def upload_file(
    files: List[UploadFile] = File(...),
    # JSON serialized string
    meta: Optional[str] = Form("null"),  # type: ignore
    fileconverter_params: FileConverterParams = Depends(FileConverterParams.as_form),  # type: ignore
    preprocessor_params: PreprocessorParams = Depends(PreprocessorParams.as_form),  # type: ignore
):
    """
    You can use this endpoint to upload one or more files for indexing
    (see https://haystack.deepset.ai/guides/rest-api#indexing-documents-in-the-haystack-rest-api-document-store).
    """
    if not indexing_pipeline:
        raise HTTPException(status_code=501, detail="Indexing Pipeline is not configured.")

    file_paths: list = []
    file_metas: list = []

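    try:
        meta_form = json.loads(meta) or {}  # type: ignore
    except json.JSONDecodeError as e:
        raise HTTPException(status_code=500, detail=f"The meta field must be a JSON serialized dict or null: {e}") from e
    if not isinstance(meta_form, dict):
        raise HTTPException(status_code=500, detail=f"The meta field must be a dict or None, not {type(meta_form)}")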

    for file in files:
        try:
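            # Keep only the final path component so a crafted filename cannot point outside the upload directory
            file_path = Path(FILE_UPLOAD_PATH) / f"{uuid.uuid4().hex}_{Path(file.filename).name}"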
            with file_path.open("wb") as buffer:
                shutil.copyfileobj(file.file, buffer)

            file_paths.append(file_path)
            # Copy the shared meta so each file keeps its own "name" entry
            file_metas.append({**meta_form, "name": file.filename})
        finally:
            file.file.close()

    # Find node names
    converters = indexing_pipeline.get_nodes_by_class(BaseConverter)
    preprocessors = indexing_pipeline.get_nodes_by_class(PreProcessor)

    params = {}
    for converter in converters:
        params[converter.name] = fileconverter_params.dict()
    for preprocessor in preprocessors:
        params[preprocessor.name] = preprocessor_params.dict()

    indexing_pipeline.run(file_paths=file_paths, meta=file_metas, params=params)