# Download Handling in Crawl4AI

This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.

## Enabling Downloads

By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler:  # Globally enable downloads
        ...  # your crawling logic

asyncio.run(main())
```

Or, enable it for a specific crawl:

```python
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        # ...
```

## Specifying Download Location

You can specify the download directory with the `downloads_path` parameter. If it is not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.

```python
import os

# ... inside your crawl function:

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)

# ...
```

If you set it globally, pass the path to the `AsyncWebCrawler` constructor:

```python
async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,  # or set it per call on arun()
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url)  # downloads are already enabled globally
        # ...
```
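
A quick usage sketch for the helper above (the URL and directory are placeholders, not part of the original guide):

```python
import asyncio

asyncio.run(crawl_with_downloads("https://example.com/files", "./my_downloads"))
```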

## Triggering Downloads

Downloads are typically triggered by user interactions on a web page, such as clicking a download button. You can simulate these actions with the `js_code` parameter, which injects JavaScript to be executed within the browser context. The `wait_for` parameter can also be crucial, allowing sufficient time for downloads to initiate before the crawler proceeds.

```python
result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)
```

## Accessing Downloaded Files

Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, each the absolute path to a downloaded file.

```python
import os

if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations on downloaded files, e.g., check the file size
        file_size = os.path.getsize(file_path)
        print(f"  File size: {file_size} bytes")
else:
    print("No files downloaded.")
```

## Example: Downloading Multiple Files

```python
import asyncio
import os
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def download_multiple_files(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
            js_code="""
                // Trigger multiple downloads (example)
                const downloadLinks = document.querySelectorAll('a[download]');  // Or a more specific selector
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000));  // Add a small delay between clicks if needed
                }
            """,
            wait_for=10  # Adjust to the expected time for all downloads to start
        )

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Example usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)  # Create the directory if it doesn't exist

asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```

## Important Considerations

- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
- **Waiting:** Use `wait_for` to manage crawl timing when downloads may not start immediately.
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or missing file paths, as in the sketch after this list.
- **Security:** Scan downloaded files for potential security threats before using them.
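
As a minimal sketch of the error-handling and security points above: it relies on the `downloaded_files` attribute described in this guide plus `CrawlResult`'s `success` and `error_message` fields, and `verify_downloads` itself is a hypothetical helper, not part of the Crawl4AI API.

```python
import hashlib
import os

def verify_downloads(result) -> list[str]:
    """Keep only downloaded paths that exist and are non-empty, logging problems."""
    verified = []
    if not result.success:  # CrawlResult reports overall success/failure
        print(f"Crawl failed: {result.error_message}")
        return verified
    for file_path in result.downloaded_files or []:
        try:
            if os.path.getsize(file_path) == 0:  # raises OSError if the path is missing
                print(f"Empty download skipped: {file_path}")
                continue
        except OSError as exc:
            print(f"Could not access {file_path}: {exc}")
            continue
        # A content hash is a cheap first step before handing the file to a scanner.
        with open(file_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        print(f"OK: {file_path} (sha256={digest[:12]}...)")
        verified.append(file_path)
    return verified
```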

This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.