# Download Handling in Crawl4AI

This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.

## Enabling Downloads

By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler:  # Globally enable downloads
        ...  # your crawling logic

asyncio.run(main())
```

Or, enable it for a specific crawl:

```python
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        # ...
```

## Specifying Download Location

You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.

```python
import os

# ... inside your crawl function:
downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)
# ...
```

If you are setting it globally, provide the path to the `AsyncWebCrawler` constructor:

```python
async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,         # downloads enabled for every call on this crawler
        downloads_path=download_path,  # or set it on arun
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url)
        # ...
```

## Triggering Downloads

Downloads are typically triggered by user interactions on a web page, such as clicking a download button. You can simulate these actions with the `js_code` parameter, which injects JavaScript to be executed within the browser context. The `wait_for` parameter can also be important for giving downloads enough time to start before the crawler proceeds.

```python
result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)
```
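If a page offers several candidate links, you can filter before clicking. The sketch below is a variation on the example above; the `amd64` substring filter and the five-second wait are illustrative assumptions that depend entirely on the target page.

```python
# Variation on the example above: click only the installer link whose URL
# matches a pattern, instead of the first ".exe" on the page.
result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        const links = Array.from(document.querySelectorAll('a[href$=".exe"]'));
        // "amd64" is an illustrative filter -- adjust it for your target page
        const target = links.find(link => link.href.includes("amd64"));
        if (target) {
            target.click();
        }
    """,
    wait_for=5  # Illustrative wait; slow pages or large files may need longer
)
```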
## Accessing Downloaded Files

Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, each one the absolute path to a downloaded file.

```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```

## Example: Downloading Multiple Files

```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def download_multiple_files(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
            js_code="""
                // Trigger multiple downloads (example)
                (async () => {
                    const downloadLinks = document.querySelectorAll('a[download]');  // Or a more specific selector
                    for (const link of downloadLinks) {
                        link.click();
                        await new Promise(r => setTimeout(r, 2000));  // Small delay between clicks
                    }
                })();
            """,
            wait_for=10  # Adjust to the time you expect all downloads to need to start
        )

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Example usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)  # Create the directory if it doesn't exist

asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```

## Important Considerations

- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
- **Waiting:** Use `wait_for` to manage the timing of the crawl when downloads do not start immediately.
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or incorrect file paths (see the sketch at the end of this guide).
- **Security:** Scan downloaded files for potential security threats before using them.

This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.
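To make the error-handling and security notes above concrete, here is a minimal, defensive sketch for validating downloaded files before further processing. The extension allow-list and size limit are illustrative assumptions, not Crawl4AI requirements; only the Python standard library is used.

```python
import os

# Illustrative assumptions -- tune these for your own use case.
ALLOWED_EXTENSIONS = {".exe", ".zip", ".pdf"}
MAX_SIZE_BYTES = 500 * 1024 * 1024  # 500 MB

def validate_downloads(downloaded_files):
    """Return the subset of paths that exist and pass basic sanity checks."""
    valid = []
    for file_path in downloaded_files or []:
        if not os.path.isfile(file_path):
            print(f"Skipping missing file: {file_path}")
            continue
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            print(f"Skipping unexpected file type: {file_path}")
            continue
        if os.path.getsize(file_path) > MAX_SIZE_BYTES:
            print(f"Skipping oversized file: {file_path}")
            continue
        valid.append(file_path)
    return valid

# Example usage after a crawl:
# safe_files = validate_downloads(result.downloaded_files)
```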