# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...
        ...

asyncio.run(main())
```
## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...

asyncio.run(main())
```
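
For reference, the default location described above can be computed like this (a minimal sketch; `default_downloads` is just an illustrative name):

```python
from pathlib import Path

# Default directory used when downloads_path is not set: ~/.crawl4ai/downloads
default_downloads = Path.home() / ".crawl4ai" / "downloads"
print(default_downloads)
```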
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)

# Inside an active AsyncWebCrawler context (see the examples above):
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
```python
import os

if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"  File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    // Delay between clicks
                    await new Promise(r => setTimeout(r, 2000));
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors gracefully to deal with failed downloads or invalid paths; a sketch follows this list.
- **Security:** Scan downloaded files for potential security threats before use.
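
As a rough illustration of the error-handling and timing points above, here is a minimal sketch. The `safe_download` helper is hypothetical; it only reuses APIs shown earlier in this guide:

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def safe_download(url: str, download_path: str) -> list[str]:
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    try:
        async with AsyncWebCrawler(config=config) as crawler:
            result = await crawler.arun(url=url, config=CrawlerRunConfig(wait_for=10))
    except Exception as exc:
        # Navigation failures, timeouts, etc. surface here
        print(f"Crawl failed: {exc}")
        return []

    # Keep only paths that actually exist and are non-empty,
    # guarding against partial or failed downloads
    return [
        p for p in (result.downloaded_files or [])
        if os.path.isfile(p) and os.path.getsize(p) > 0
    ]
```

Even with a helper like this, apply the security note above and scan or validate the returned files before opening them.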