
Module crawler
Crawler Objects
class Crawler(BaseComponent)
Crawl texts from a website so that we can use them later in Haystack as a corpus for search, question answering, etc.
Example:
from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/docs/
docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                     filter_urls=[r"haystack\.deepset\.ai\/docs\/"])
__init__
__init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
Initialize the crawler with basic parameters for crawling (these can be overwritten later when calling crawl()).
Arguments:
- output_dir: Path for the directory to store files
- urls: List of http(s) address(es) (can also be supplied later when calling crawl())
- crawler_depth: How many sublinks to follow from the initial list of URLs. Current options: 0 = only the initial list of URLs; 1 = follow links found on the initial URLs (but no further)
- filter_urls: Optional list of regular expressions that the crawled URLs must comply with. All URLs not matching at least one of the regular expressions will be dropped.
- overwrite_existing_files: Whether to overwrite existing files in output_dir with new content
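Because urls can also be supplied later, a minimal setup only needs the output directory; the remaining parameters keep the defaults shown in the signature above (a sketch, with an illustrative directory name):

```python
from haystack.connector import Crawler

# urls are omitted here and passed to crawl() later.
# Defaults from the signature: crawler_depth=1, filter_urls=None (no filtering),
# overwrite_existing_files=True.
crawler = Crawler(output_dir="crawled_files")
```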
crawl
crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it, and save it (one JSON file per URL, including the text and basic metadata).
You can optionally use filter_urls to crawl only URLs that match a certain pattern.
All parameters here are optional and are only meant to overwrite the instance attributes at runtime.
If no parameters are provided to this method, the instance attributes passed during init will be used.
Arguments:
- output_dir: Path for the directory to store files
- urls: List of http addresses or single http address
- crawler_depth: How many sublinks to follow from the initial list of URLs. Current options: 0 = only the initial list of URLs; 1 = follow links found on the initial URLs (but no further)
- filter_urls: Optional list of regular expressions that the crawled URLs must comply with. All URLs not matching at least one of the regular expressions will be dropped.
- overwrite_existing_files: Whether to overwrite existing files in output_dir with new content
Returns:
List of paths where the crawled webpages were stored
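A minimal sketch of such a runtime override, reusing the crawler instance from the sketch above and then inspecting the stored JSON files; the "text" key is an assumption based on the description of the saved Documents:

```python
import json

# Arguments passed here override the values given to __init__ for this call only;
# anything left out falls back to the instance attributes.
paths = crawler.crawl(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    filter_urls=[r"haystack\.deepset\.ai\/docs\/"],
    crawler_depth=0,  # 0 = only the listed URLs, no sublinks
)

for path in paths:
    with open(path) as f:
        doc = json.load(f)  # one Document per crawled URL, saved as JSON
    print(path, len(doc.get("text", "")), "characters of text")
```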
run
run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False) -> Tuple[Dict, str]
Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
Arguments:
- output_dir: Path for the directory to store files
- urls: List of http addresses or single http address
- crawler_depth: How many sublinks to follow from the initial list of URLs. Current options: 0 = only the initial list of URLs; 1 = follow links found on the initial URLs (but no further)
- filter_urls: Optional list of regular expressions that the crawled URLs must comply with. All URLs not matching at least one of the regular expressions will be dropped.
- overwrite_existing_files: Whether to overwrite existing files in output_dir with new content
- return_documents: Whether to return the content of the crawled JSON files instead of only their paths
Returns:
Tuple({"paths": List of filepaths, ...}, Name of output edge)