# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...
        ...

asyncio.run(main())
```
## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...

asyncio.run(main())
```
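
For reference, the default location described above can be computed like this (a minimal sketch; `default_downloads` is just an illustrative name):

```python
from pathlib import Path

# Default directory used when downloads_path is not set: ~/.crawl4ai/downloads
default_downloads = Path.home() / ".crawl4ai" / "downloads"
print(default_downloads)
```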
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)

# Inside an active AsyncWebCrawler context (see the examples above):
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
```python
import os

if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"  File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    // Delay between clicks
                    await new Promise(r => setTimeout(r, 2000));
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors gracefully to deal with failed downloads or invalid paths; a sketch follows this list.
- **Security:** Scan downloaded files for potential security threats before use.
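
As a rough illustration of the error-handling and timing points above, here is a minimal sketch. The `safe_download` helper is hypothetical; it only reuses APIs shown earlier in this guide:

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def safe_download(url: str, download_path: str) -> list[str]:
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    try:
        async with AsyncWebCrawler(config=config) as crawler:
            result = await crawler.arun(url=url, config=CrawlerRunConfig(wait_for=10))
    except Exception as exc:
        # Navigation failures, timeouts, etc. surface here
        print(f"Crawl failed: {exc}")
        return []

    # Keep only paths that actually exist and are non-empty,
    # guarding against partial or failed downloads
    return [
        p for p in (result.downloaded_files or [])
        if os.path.isfile(p) and os.path.getsize(p) > 0
    ]
```

Even with a helper like this, apply the security note above and scan or validate the returned files before opening them.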