
# Download Handling in Crawl4AI

This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.

## Enabling Downloads

By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler:  # Enable downloads globally
        # ... your crawling logic ...
        pass

asyncio.run(main())
```

Or, enable it for a specific crawl:

```python
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        # ...
```

## Specifying Download Location

You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a `downloads` directory inside the `.crawl4ai` folder in your home directory.

```python
import os

# ... inside your crawl function:

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)

# ...
```

To set the location globally, provide the path to the `AsyncWebCrawler` constructor:

```python
async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,  # Can also be set per call on arun
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url)  # Downloads are already enabled globally
        # ...
```
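A quick usage sketch (the URL and directory below are placeholders; the directory is created up front, mirroring the full example later in this guide):

```python
import asyncio
import os

download_dir = "./my_downloads"           # Placeholder path
os.makedirs(download_dir, exist_ok=True)  # Make sure the directory exists
asyncio.run(crawl_with_downloads("https://example.com", download_dir))
```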

## Triggering Downloads

Downloads are typically triggered by user interactions on a web page (e.g., clicking a download button). You can simulate these actions with the `js_code` parameter, which injects JavaScript to be executed in the browser context. The `wait_for` parameter can also be crucial, allowing sufficient time for downloads to initiate before the crawler proceeds.

```python
result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)
```

## Accessing Downloaded Files

Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, each one the absolute path to a downloaded file.

```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"  File size: {file_size} bytes")
else:
    print("No files downloaded.")
```

## Example: Downloading Multiple Files

```python
import asyncio
import os
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def download_multiple_files(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
            js_code="""
            // Wrap in an async IIFE so `await` is valid inside the loop
            (async () => {
                // Trigger multiple downloads (example)
                const downloadLinks = document.querySelectorAll('a[download]'); // Or a more specific selector
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000)); // Small delay between clicks
                }
            })();
            """,
            wait_for=10  # Adjust to match the expected time for all downloads to start
        )

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Example usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)  # Create the directory if it doesn't exist

asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```

## Important Considerations

- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
- **Waiting:** Use `wait_for` to manage the timing of the crawl when a download may not start immediately.
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or invalid file paths, as in the sketch below this list.
- **Security:** Scan downloaded files for potential security threats before use.
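A minimal sketch of that kind of defensive handling (the extension allowlist is purely illustrative; adjust it to your use case):

```python
import os

ALLOWED_EXTENSIONS = {".exe", ".pdf", ".zip"}  # Hypothetical allowlist

for file_path in result.downloaded_files or []:
    try:
        if not os.path.exists(file_path):
            print(f"Missing file (download may have failed): {file_path}")
            continue
        if os.path.splitext(file_path)[1].lower() not in ALLOWED_EXTENSIONS:
            print(f"Skipping unexpected file type: {file_path}")
            continue
        print(f"OK: {file_path} ({os.path.getsize(file_path)} bytes)")
    except OSError as exc:
        print(f"Could not access {file_path}: {exc}")
```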

This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.