238 Commits

Author SHA1 Message Date
Sebastian Husch Lee
28ad78c73d
feat: Add XLSXToDocument converter (#8522)
* Add draft of the Excel To Document converter

* Add license header

* Add release note

* Use Union instead of pipe

* Add openpyxl as additional dep

* Fix zip issue

* few updates from Bijay

* Update deps

* Add markdown test

* Adding more example excels and expanding tests

* Added more tests

* Fix windows test by setting lineterminator

* Addressing PR comments

* PR comments

* Fix linting
2025-01-09 09:03:19 +01:00
Stefano Fiorucci
2bc58d2987
feat: support for tools in HuggingFaceAPIChatGenerator (#8661)
* message conversion function

* hfapi w tools

* right test file + hf_hub version

* release note

* feedback
2024-12-19 15:04:37 +01:00
Stefano Fiorucci
96b4a1d2fd
feat: Tool dataclass - unified abstraction to represent tools (#8652)
* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* port Tool from experimental

* release note

* docs upd

* Update tool.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-12-18 11:36:44 +00:00
Stefano Fiorucci
2a9a6401d2
chore: pin openai>=1.56.1 (#8632)
* pin openai>=1.56.1

* release note
2024-12-12 16:26:38 +01:00
David S. Batista
248dccbdd3
chore: fixing pylint issues (#8610)
* initial import

* fixing internal methods

* fixing some internal methods

* modify _preprocess

* fixed internal methods

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-12-09 16:53:37 +00:00
Stefano Fiorucci
de7099e560
ci: add job to check imports (#8594)
* try checking imports

* clarify error message

* better fmt

* do not show complete list of successfully imported packages

* refinements

* relnote

* add missing forward references

* better function name

* linting

* fix linting

* Update .github/utils/check_imports.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-11-29 14:00:59 +00:00
Stefano Fiorucci
f085959067
chore: declare requires-python<3.13 in pyproject (#8547)
* restrict to python<3.13

* try unpinning dulwich

* reintroduce dulwich pin
2024-11-15 09:28:39 +00:00
Silvano Cerza
ebb45d3d1e
Remove ddtrace version pin (#8529)
2024-11-11 11:21:10 +01:00
Stefano Fiorucci
c7b898994e
build: unpin numpy + use Python 3.9 in CI (#8492)
* try unpinning numpy

* try python 3.9

* release note
2024-10-28 12:15:17 +01:00
Silvano Cerza
0157459a7b
Pin ddtrace test dependency to fix tests (#8478)
2024-10-22 10:19:25 +00:00
Stefano Fiorucci
f6935d1456
ci: add pip to test dependencies (#8475)
* add pip to test dependencies

* trigger

* release note

* rm trigger
2024-10-22 08:35:30 +00:00
Stefano Fiorucci
7788bfe558
ci: upgrade Hatch to 1.13.0 and adopt uv as installer (#8313)
* try uv

* upgrade hatch

* rm unnecessary specification

* release note
2024-10-17 10:32:14 +02:00
Silvano Cerza
29672d4b42
feat: Add JSONConverter Component (#8397)
* Add JSONConverter Component

* Handle some corner cases

* Add JSONConverter to pydoc config

* Add a way to extract all non content fields as metadata

* Small fix in docstring

* Fix tests

* docstrings upd

* Update json.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-09-25 12:34:51 +02:00
Silvano Cerza
4b77ec1b6f
Fix codespell config (#8392)
2024-09-24 12:00:45 +02:00
Vladimir Blagojevic
badd0594cc
feat: Port NLTKDocumentSplitter from dC to Haystack (#8350)
* Port NLTKDocumentSplitter from dC to Haystack

* Improve pydocs

* Use haystack logging

* Add NLTKDocumentSplitter to __init__.py

* Use haystack logging, rename test classes

* Fixing _needs_join return

* Linting

* PR feedback

* More static methods

* Increase test coverage

* Compile pattern

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-09-17 13:59:19 +02:00
Silvano Cerza
da49e782e2
chore: Make arrow an optional dependency (#8345)
* Make arrow an optional dependency

* Fix imports
2024-09-09 16:09:51 +02:00
Mo Sriha
75955922b9
feat: Add current date in UTC to PromptBuilder (#8233)
* initial commit

* add unit tests

* add release notes

* update function name
2024-09-09 09:47:03 +02:00
Stefano Fiorucci
25d333bed3
update transformers (#8296)
2024-08-27 16:04:11 +00:00
Stefano Fiorucci
6b0ee4c193
chore: update test dependency and LazyImport block to make compatibility with sentence-transformers>=3.0.0 explicit (#8295)
* sentence-transformers-3 update test dep and lazyimport block

* clearer release note
2024-08-27 15:51:03 +00:00
Tobias Wochinger
5a3ea75196
docs: document Python 3.11 and 3.12 support (#8159)
* docs: add Python 3.11 and 3.12 to supported versions

* docs: add release notes
2024-08-02 14:46:20 +02:00
Tobias Wochinger
4dde6fbaec
build: unpin structlog (#8071)
2024-07-24 20:58:34 +02:00
Vladimir Blagojevic
a59de1d7b3
chore: Combined main unblock (#8045)
* Pin structlog to 24.2.0 due to unit test failures

* Remove object init parameter in huggingface_hub unit tests

* Use less restrictive structlog pin

* Add release note
2024-07-19 10:39:10 +02:00
Vladimir Blagojevic
b3b3f89302
feat: Add haystack-experimental dependency (#7921)
* Add haystack-experimental dependency

* Add reno note
2024-07-08 14:07:15 +02:00
Stefano Fiorucci
d80e01492b
update sentence transformers import error message (#7906)
2024-06-20 18:15:01 +02:00
Massimiliano Pippi
3a03fce71c
ci: Add code formatting checks (#7882)
* ruff settings

enable ruff format and re-format outdated files

feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe`  (#7879)

* adding param to explicitly state which cols to keep

* adding param to explicitly state which cols to keep

* adding param to explicitly state which cols to keep

* updating tests

* adding release notes

* Update haystack/evaluation/eval_run_result.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* updating docstring

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

add format-check

fail on format and linting failures

fix string formatting

reformat long lines

fix tests

fix typing

linter

pull from main

* reformat

* lint -> check

* lint -> check
2024-06-18 15:52:46 +00:00
Stefano Fiorucci
2413bb3f42
chore: pin numpy<2; tenacity!=8.4.0 (#7876)
* pin numpy<2

* reno

* pin tenacity too
2024-06-17 10:54:02 +02:00
Massimiliano Pippi
324bbc3868
chore: clean up default env and add a script to generate release notes. (#7858)
* clean up default env and add reno script

* update contributions guidelines

* use test script

* format

* re-add missing dep

* remove black in favour of ruff
2024-06-14 14:57:24 +02:00
Carlos Fernández
c1c339923f
feat: add DocxToDocument converter (#7838)
* first functioning DocxFileToDocument

* fix lazy import message

* add reno

* Add license header

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* change DocxFileToDocument to DocxToDocument

* Update library install to the maintained version

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* clean try-except to only take non-Haystack errors into account

* Add warning in docstring about the component ignoring page breaks, mark test as skip

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Make warnings lazy evaluated

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Solve f bug

* Get more metadata from docx files

* add 'python-docx' dependency and docs

* Change logging import

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Fix typo

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* remake metadata extraction for docx

* solve bug regarding _get_docx_metadata method

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Delete unused test

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-06-12 11:58:36 +02:00
Sebastian Husch Lee
2c2c7c9f56
feat: Add PPTXToDocument converter (#7808)
* Add first pass at PPTXToDocument converter

* Add test and update code

* Add doc string

* Update docstrings

* Add release notes

* remove unused imports, add to api docs, update pyproject.toml

* Add a new test

* Add dep so tests can run
2024-06-07 09:43:29 +00:00
Stefano Fiorucci
bde92fda67
upgrade transformers and reorganize extras (#7815)
2024-06-06 15:57:18 +02:00
Silvano Cerza
23011c215e
chore: Change trafilatura dependency to use lazy import (#7809)
* Change trafilatura dependency to use lazy import

* Add release notes
2024-06-05 18:04:24 +02:00
Silvano Cerza
fd838fc573
Update indexing and rag default templates to use InMemoryDocumentStore (#7782)
2024-06-04 12:57:33 +02:00
Silvano Cerza
3dcc21fd73
test: Pipeline run tests rework (#7748)
* Rework Pipeline.run() tests

* Remove test_linear_pipeline.py

* Add test for components execution order

* Add new pytest-bdd tests dependency

* Update README.md

* Add function to dynamically add integration marker

* Fix marking tests as integration
2024-05-28 15:42:47 +02:00
Stefano Fiorucci
7181f6b7e9
feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705)
* change HTML conversion backend to Trafilatura

* rm unused var
2024-05-17 10:38:47 +02:00
Guest400123064
cd66a80ba2
perf: enhanced InMemoryDocumentStore BM25 query efficiency with incremental indexing (#7549)
* incorporating better bm25 impl without breaking interface

* all three bm25 algos

* 1. setting algo post-init not allowed; 2. remove extra underscore for naming consistency; 3. remove unused import

* 1. rename attribute name for IDF computation 2. organize document statistics as a dataclass instead of tuple to improve readability

* fix score type initialization (int -> float) to pass mypy check

* release note included

* fixing linting issues and mypy

* fixing tests

* removing heapq import and cleaning up logging

* changing indexing order

* adding more tests

* increasing tests

* removing rank_bm25 from pyproject.toml

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-05-03 12:10:15 +00:00
Vladimir Blagojevic
5f813373eb
chore: Update huggingface_hub classes used after library upgrade (#7631)
* Update huggingface_hub classes used after library upgrade

* Fix chat tests

* Update lazy import guard and other references to huggingface_hub>=0.23.0

* In huggingface_hub 0.23.0 TextGenerationOutput property details is now optional

* More fixes

* Add reno note
2024-05-03 10:14:54 +02:00
Mo
2e35f13085
feat: add converter based on pdfminer (#7607)
* Initial commit pdfminer converter

* Revert back naming of argument all_text per pdfminer documentation

* Add the component decorator

* Add release notes

* Reformat code with black

* Remove LTPage and comments

* Update dependencies in pyproject.toml

* Added some tests and incorporated reference doc in docstring

* Added some tests and incorporated reference doc in docstring
2024-05-02 10:36:54 +02:00
David S. Batista
8d04e530da
test: end2end evaluation tests (#7601)
* initial import

* wip

* cleaning up tests

* fixing tests

* adding context relevance

* reverting some wrong changes due to PyCharm error in refactoring

* building eval pipeline only once

* handling mypy issues
2024-04-26 14:07:05 +00:00
David S. Batista
958f1eb3a3
doc: adding docstring linting based on ruff (#7463)
* wip: docstrings linting

* set ruff rules
2024-04-23 18:43:09 +02:00
Massimiliano Pippi
5d0ccfe7d4
fix hatch scripts (#7546)
2024-04-12 18:04:18 +02:00
Massimiliano Pippi
e90ffafb47
chore: forward hatch command args to pytest (#7537)
2024-04-11 21:30:34 +02:00
Massimiliano Pippi
2dca53f69b
chore: set linting parameters to the minimum (#7501)
* set line-length to the minimum

* add more defaults

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-04-09 08:56:16 +02:00
Stefano Fiorucci
e26ee0f1db
refactor!: make TGI generators compatible with huggingface_hub>=0.22.0 (#7425)
* progress

* progress

* better lazy imports

* fixes

* reno
2024-03-26 16:10:06 +01:00
Stefano Fiorucci
19d3f39e75
ci: pin huggingface_hub in tests dependencies (#7417)
* pin huggingface_hub in tests dependencies

* Update pyproject.toml
2024-03-25 18:52:02 +01:00
Stefano Fiorucci
e793c718b6
chore: Upgrade transformers to 4.38.2 in test environment (#7363)
* upgrade transformers to 4.38.2 in test environment

* add pyproject to files to check in test workflow
2024-03-15 10:06:28 +01:00
Stefano Fiorucci
abda78c122
unpin OpenAI and fix problem with mock (#7364)
2024-03-15 08:32:28 +01:00
Vladimir Blagojevic
5b4f9f1cda
Pin openai to latest working version (#7359)
2024-03-14 10:47:28 +01:00
Tobias Wochinger
655d4a1a8d
test: test for missing dependencies (#7278)
* tests: import test for missing libraries

* build: add missing dependencies

* refactor: use glob instead of tree walk

* test: extract constants + more documentation
2024-03-05 12:14:10 +01:00
Stefano Fiorucci
721691c036
replace flaky with pytest-rerunfailures (#7298)
2024-03-04 12:26:40 +01:00
Stefano Fiorucci
727794cb70
pin pytest (#7295)
2024-03-04 10:14:39 +01:00