<a id="crawler"></a>
|
|
|
|
# Module crawler
|
|
|
|
<a id="crawler.Crawler"></a>
|
|
|
|
## Crawler
|
|
|
|
```python
|
|
class Crawler(BaseComponent)
|
|
```
|
|
|
|
Crawl texts from a website so that we can use them later in Haystack as a corpus for search / question answering etc.

**Example:**

```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                     filter_urls=["haystack\.deepset\.ai\/overview\/"])
```
<a id="crawler.Crawler.__init__"></a>
|
|
|
|
#### Crawler.\_\_init\_\_
|
|
|
|
```python
|
|
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None, extract_hidden_text=True, loading_wait_time: Optional[int] = None)
|
|
```
|
|
|
|
Init object with basic params for crawling (can be overwritten later).
|
|
|
|
**Arguments**:
|
|
|
|
- `output_dir`: Path for the directory to store files
|
|
- `urls`: List of http(s) address(es) (can also be supplied later when calling crawl())
|
|
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
|
|
0: Only initial list of urls
|
|
1: Follow links found on the initial URLs (but no further)
|
|
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
|
|
All URLs not matching at least one of the regular expressions will be dropped.
|
|
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
|
|
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
|
|
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
|
|
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
|
|
In this case the id will be generated by using the content and the defined metadata.
|
|
- `extract_hidden_text`: Whether to extract the hidden text contained in page.
|
|
E.g. the text can be inside a span with style="display: none"
|
|
- `loading_wait_time`: Seconds to wait for page loading before scraping. Recommended when page relies on
|
|
dynamic DOM manipulations. Use carefully and only when needed. Crawler will have scraping speed impacted.
|
|
E.g. 2: Crawler will wait 2 seconds before scraping page
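
For instance, a crawler that scrapes only the start URLs themselves (depth 0), keeps files from earlier runs, and waits for dynamically rendered content could be configured roughly as below. This is a minimal sketch; the URLs, directory name, and parameter values are purely illustrative.

```python
from haystack.nodes.connector import Crawler

# Illustrative configuration using the parameters documented above:
# crawl only the given URLs (no sublinks), keep previously crawled files,
# and wait 2 seconds per page so dynamic content can load.
crawler = Crawler(
    output_dir="crawled_files",
    urls=["https://haystack.deepset.ai/overview/get-started"],
    crawler_depth=0,
    filter_urls=[r"haystack\.deepset\.ai"],
    overwrite_existing_files=False,
    loading_wait_time=2,
)
```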

<a id="crawler.Crawler.crawl"></a>

#### Crawler.crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = None, loading_wait_time: Optional[int] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it, and save it
(one JSON file per URL, including text and basic metadata).

You can optionally use `filter_urls` to crawl only URLs that match a certain pattern.
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during `__init__` will be used.

**Arguments**:

- `output_dir`: Path of the directory where the crawled files are stored.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  - `0`: Only the initial list of URLs.
  - `1`: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
  In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text from the page,
  e.g. text inside a span with `style="display: none"`.
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulations. Use carefully and only when needed, as it slows down crawling.
  E.g. `2`: the crawler waits 2 seconds before scraping each page.

**Returns**:

List of paths where the crawled webpages were stored.
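
Continuing the sketch from the `__init__` example above, the call below overrides a few instance attributes just for this run; everything not passed here falls back to the values given at construction time (the URL and directory name are illustrative):

```python
# Override selected settings for this call only; unset parameters
# fall back to the instance attributes set in __init__.
file_paths = crawler.crawl(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    output_dir="crawled_files_get_started",
    crawler_depth=1,
)

for path in file_paths:
    # each path points to one JSON file containing the page text and basic metadata
    print(path)
```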

<a id="crawler.Crawler.run"></a>

#### Crawler.run

```python
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = True, loading_wait_time: Optional[int] = None) -> Tuple[Dict[str, Union[List[Document], List[Path]]], str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path of the directory where the crawled files are stored.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  - `0`: Only the initial list of URLs.
  - `1`: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `return_documents`: Whether to return the content of the created JSON files as Documents instead of returning only their file paths.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
  In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text from the page,
  e.g. text inside a span with `style="display: none"`.
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulations. Use carefully and only when needed, as it slows down crawling.
  E.g. `2`: the crawler waits 2 seconds before scraping each page.

**Returns**:

Tuple({"paths": list of file paths, ...}, name of the output edge)