<a id="crawler"></a>

# Module crawler

<a id="crawler.Crawler"></a>

## Crawler

```python
class Crawler(BaseComponent)
```

Crawl texts from a website so that they can later be used in Haystack as a corpus for search, question answering, and similar tasks.

**Example:**

```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                     filter_urls=["haystack.deepset.ai/overview/"])
```
<a id="crawler.Crawler.__init__"></a>

#### Crawler.\_\_init\_\_

```python
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None, extract_hidden_text=True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None)
```

Initialize the object with basic parameters for crawling (they can be overwritten later when calling `crawl()`).

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http(s) address(es) (can also be supplied later when calling `crawl()`).
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs.
1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text from the page,
e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping it. Recommended when a page relies on
dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
E.g. 2: The crawler waits 2 seconds before scraping each page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 characters of the MD5 hash of the unprocessed page URL.
E.g. 1) `crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)`
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) `crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()`
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.
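For instance, the first variant can also be written as a named function and passed at construction time. The following is a minimal sketch; `sanitize_url` is an illustrative name, not part of the Haystack API:

```python
import re

from haystack.nodes.connector import Crawler

# Illustrative naming function: replace characters that are not allowed
# in file names on common file systems with underscores.
def sanitize_url(url: str, page_content: str) -> str:
    return re.sub("[<>:'/\\|?*\0 ]", "_", url)

crawler = Crawler(output_dir="crawled_files", crawler_naming_function=sanitize_url)
```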
<a id="crawler.Crawler.crawl"></a>

#### Crawler.crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = None, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON
file per URL, including text and basic metadata).
You can optionally use `filter_urls` to only crawl URLs that match a certain pattern.
All parameters here are optional and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during `__init__` will be used.

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs.
1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the ID is generated from the content and the defined metadata.
- `loading_wait_time`: Seconds to wait for the page to load before scraping it. Recommended when a page relies on
dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
E.g. 2: The crawler waits 2 seconds before scraping each page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 characters of the MD5 hash of the unprocessed page URL.
E.g. 1) `crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)`
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) `crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()`
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.

**Returns**:

List of paths where the crawled web pages are stored.
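As a rough sketch of how the returned paths can be used: assuming each saved JSON file contains a Haystack Document dictionary (which is how `run()` with `return_documents=True` reads them back), the crawled pages can be loaded again like this:

```python
import json
from pathlib import Path

from haystack.nodes.connector import Crawler
from haystack.schema import Document

crawler = Crawler(output_dir="crawled_files")
paths = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                      filter_urls=["haystack.deepset.ai/overview/"])

# Load each crawled page back as a Document
# (assumes the JSON files hold Document dictionaries).
documents = [Document.from_dict(json.loads(Path(p).read_text())) for p in paths]
```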
<a id="crawler.Crawler.run"></a>

#### Crawler.run

```python
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> Tuple[Dict[str, Union[List[Document], List[Path]]], str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs.
1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `return_documents`: Whether to return the content of the crawled JSON files as Documents instead of only their file paths.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text from the page,
e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping it. Recommended when a page relies on
dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
E.g. 2: The crawler waits 2 seconds before scraping each page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 characters of the MD5 hash of the unprocessed page URL.
E.g. 1) `crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)`
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) `crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()`
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.

**Returns**:

Tuple({"paths": List of file paths, ...}, Name of output edge)
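A minimal usage sketch follows; the node name `"Crawler"` and the `inputs=["File"]` root shown for the pipeline are illustrative choices for an indexing pipeline, not requirements of this module:

```python
from haystack.nodes.connector import Crawler
from haystack.pipelines import Pipeline

crawler = Crawler(output_dir="crawled_files", crawler_depth=0)

# Standalone call: returns a dict holding "documents" (when return_documents=True)
# or "paths", together with the name of the output edge.
output, edge_name = crawler.run(urls=["https://haystack.deepset.ai/overview/get-started"],
                                return_documents=True)

# As the first node of an indexing pipeline (illustrative node name and root):
pipeline = Pipeline()
pipeline.add_node(component=crawler, name="Crawler", inputs=["File"])
```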