10 Commits

Author SHA1 Message Date
Daniel Bichuetti
e0b0fe1bc3
feat!: Increase Crawler standardization regarding Pipelines (#4122)
* feat!(Crawler): Integrate Crawler in the Pipeline.

+Output Documents
+Optional file saving
+Optional Document meta about file path

* refactor: add Optional decl.

* chore: dummy commit

* chore: dummy commit

* refactor: improve overwrite flow

* refactor: change custom file path meta logic + add test

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update haystack/nodes/connector/crawler.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-02-22 17:34:19 +01:00
ZanSara
3ffdb0a9a3
chore: fix all EOF (#3852)
* fix all eof

* fix test

* fix test

* fix test

* typo

* fix sample

* fix sample

* add logs

* fix page_dynamic_result.txt
2023-01-16 12:34:50 +01:00
Stefano Fiorucci
be31178892
fix: make the crawler runnable and testable on Windows (#3830)
* fix crawler and try to run CI

* more compact expression

* try to fix

* improve naming regex

* revert regex

* make test_url compatible with Windows

* better conditional expression
2023-01-10 20:27:28 +01:00
Daniel Augustus Bichuetti Silva
77a513fe49
Fix crawler long file names (#2723)
* Changing the name that crawled page is saved to avoid long file names error on some file systems

* Custom naming function for saving crawled files

* Update Documentation & Code Style

* Remove bad characters on file name and prefix

* Add test for naming function

* Update Documentation & Code Style

* Fix expensive regex recalculation and linter warnings

* Check for exceptions on file dump

* Remove param_naming variable

* Fix file paths on Windows, Linux and Mac

* Update Documentation & Code Style

* Test using one of the docstrings examples

* Change default naming function
Update docstrings

* Applying formatting rules

* Update Documentation & Code Style

* Fix mypy incompatible assignment error

* Remove unused type declaration

* Fix typo

* Update tests for naming function

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-11 12:16:32 +02:00
Daniel Augustus Bichuetti Silva
e3b2ee956a
Improved crawler support for dynamically loaded pages (#2710)
* Improved crawler support for dynamically loaded pages

* Reduced scope of StaleElementReferenceException and removed deprecated code from WebDriver initialization

* Improvements on crawler testing code

* Code format and style applied on f028331948c170448613e86dfdfa222f7c2043fd

* Update Documentation & Code Style

* Remove unused imports/parameters

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-07-01 10:47:33 +02:00
Sara Zan
e8546e2124
Replace deprecated Selenium methods (#2724)
* Fix crawler.py

* Fix test_connector.py

* unused import

Co-authored-by: danielbichuetti <daniel.bichuetti@gmail.com>
2022-06-24 12:05:32 +02:00
Stefano Fiorucci
c178f60e3a
Make crawler extract also hidden text (#2642)
* make crawler extract also hidden text

* Update Documentation & Code Style

* try to adapt test for extract_hidden_text

* Update Documentation & Code Style

* fix test bug

* fix bug in test

* added test for hidden text

* Update Documentation & Code Style

* fix bug in test

* Update Documentation & Code Style

* fix test

* Update Documentation & Code Style

* fix other test bug

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-10 09:51:41 +02:00
Sara Zan
c17969e001
Fix failing Crawler test (#2640)
* Make tests insensitive to ordering of crawled pages

* fix docstring
2022-06-07 18:14:43 +02:00
Sara Zan
83648b9bc0
[CI refactoring] Rewrite Crawler tests (#2557)
* Rewrite crawler tests (very slow) and fix small crawler bug

* Update Documentation & Code Style

* compile the regex only once

* Factor out the html files & add content check to most tests

* Clarify that even starting URLs can be excluded

* Update Documentation & Code Style

* Change signature

* Fix failing test

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-06 17:52:37 +02:00
Sara Zan
ff4303c51b
[CI refactoring] Categorize tests into folders (#2554)
* Categorize tests into folders

* Fix linux_ci.yml and an import

* Wrong path
2022-05-17 09:55:53 +01:00