haystack/requirements.txt
Divya Yeruva 6c3ec540a4
Add crawler to get texts from websites (#775)
* add fetch_data_from_url to extract data and store as files

* corrected a typo

* corrected variable name error

* correction of urlparse error

* type error

* added selenium, urllib to requirements

* removed urllib

* minor changes and added function to find in-page navigation links

* quick duplicate links fix

* quick type annotation fix

* created separate module for crawler

* type error fix

* type error fix

* import fix

* quick type error fix

* added return description

* updated include type to list

* refactor modules. Add Crawler class. rename params.

* add basic pipeline compatibility

* update docstrings

* fix mypy issues

* update args, docstrings, return filepaths

* fix mypy

* make urls optional in init

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-18 12:00:49 +01:00
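
The squashed commits above add a Selenium-based Crawler that extracts page text, stores it as files, and returns the file paths, with basic pipeline compatibility. The sketch below illustrates the intended usage; the import path and parameter names are assumptions inferred from the commit messages ("make urls optional in init", "return filepaths"), not a confirmed API, and the optional selenium/webdriver-manager packages listed at the bottom of the requirements must be installed for it to run.

    # Hedged usage sketch; names below are assumptions based on the commit log.
    from haystack.connector import Crawler

    crawler = Crawler(output_dir="crawled_files")   # urls may be omitted at init ...
    filepaths = crawler.crawl(
        urls=["https://haystack.deepset.ai"],       # ... and passed per crawl instead
        crawler_depth=1,                            # follow in-page navigation links one level deep
    )
    print(filepaths)                                # paths of the text files written by the crawler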

farm==0.6.2
--find-links=https://download.pytorch.org/whl/torch_stable.html
fastapi
uvicorn
gunicorn
pandas
sklearn
psycopg2-binary; sys_platform != 'win32' and sys_platform != 'cygwin'
elasticsearch>=7.7,<=7.10
elastic-apm
tox
coverage
langdetect # for PDF conversions
# optional: sentence-transformers
python-multipart
python-docx
sqlalchemy_utils
# for using FAISS with GPUs, install faiss-gpu
faiss-cpu>=1.6.3
tika
uvloop==0.14; sys_platform != 'win32' and sys_platform != 'cygwin'
httptools
nltk
more_itertools
networkx
# Refer to the Milvus version support matrix at https://github.com/milvus-io/pymilvus#install-pymilvus
pymilvus
# Optional: For crawling
#selenium
#webdriver-manager
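
Because selenium and webdriver-manager ship commented out, the crawler cannot assume they are importable. A minimal sketch, assuming the two packages have been installed by uncommenting the lines above, of how the optional imports might be guarded so users get an actionable error instead of a bare ImportError:

    try:
        from selenium import webdriver
        from webdriver_manager.chrome import ChromeDriverManager
    except ImportError as exc:
        raise ImportError(
            "Crawling is an optional feature: uncomment 'selenium' and "
            "'webdriver-manager' in requirements.txt and reinstall."
        ) from exc

    # Selenium 3-style construction (current at the time of this commit);
    # Selenium 4 expects a Service object instead of a bare executable path.
    driver = webdriver.Chrome(ChromeDriverManager().install())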
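
The "sys_platform != 'win32' and sys_platform != 'cygwin'" suffixes on psycopg2-binary and uvloop are PEP 508 environment markers: pip evaluates them against the running interpreter and skips the requirement when they are false. They can be checked programmatically with the packaging library, for example:

    from packaging.markers import Marker

    # pip evaluates this marker for the current platform; psycopg2-binary and uvloop
    # are installed only where it evaluates to True (i.e. not on Windows/Cygwin).
    marker = Marker("sys_platform != 'win32' and sys_platform != 'cygwin'")
    print(marker.evaluate())   # True on Linux/macOS, False on Windows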