# Prefix-Based Input Handling in Crawl4AI
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
## Table of Contents
- [Prefix-Based Input Handling in Crawl4AI](#prefix-based-input-handling-in-crawl4ai)
  - [Table of Contents](#table-of-contents)
    - [Crawling a Web URL](#crawling-a-web-url)
    - [Crawling a Local HTML File](#crawling-a-local-html-file)
    - [Crawling Raw HTML Content](#crawling-raw-html-content)
  - [Complete Example](#complete-example)
    - [**How It Works**](#how-it-works)
    - [**Running the Example**](#running-the-example)
  - [Conclusion](#conclusion)
---
### Crawling a Web URL
To crawl a live web page, provide the URL starting with `http://` or `https://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_web():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
```
### Crawling a Local HTML File
To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=file_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
```
### Crawling Raw HTML Content
To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=raw_html_url, bypass_cache=True)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
```
---
## Complete Example
Below is a comprehensive script that:
1. **Crawls the Wikipedia page for "Apple".**
2. **Saves the HTML content to a local file (`apple.html`).**
3. **Crawls the local HTML file and verifies the markdown length matches the original crawl.**
4. **Crawls the raw HTML content from the saved file and verifies consistency.**
```python
import os
import sys
import asyncio
from pathlib import Path

# Adjust the parent directory to include the crawl4ai module
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from crawl4ai import AsyncWebCrawler

async def main():
    # Define the URL to crawl
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"

    # Define the path to save the HTML file
    # Save the file in the same directory as the script
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler(verbose=True) as crawler:
        print("\n=== Step 1: Crawling the Wikipedia URL ===")

        # Crawl the Wikipedia URL
        result = await crawler.arun(url=wikipedia_url, bypass_cache=True)

        # Check if crawling was successful
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        # Save the HTML content to a local file
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        print(f"Saved HTML content to {html_file_path}")

        # Store the length of the generated markdown
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        print("=== Step 2: Crawling from the Local HTML File ===")

        # Construct the file URL with the 'file://' prefix
        file_url = f"file://{html_file_path.resolve()}"

        # Crawl the local HTML file
        local_result = await crawler.arun(url=file_url, bypass_cache=True)

        # Check if crawling was successful
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        # Store the length of the generated markdown from the local file
        local_crawl_length = len(local_result.markdown)
        print(f"Length of markdown from local file crawl: {local_crawl_length}")

        # Compare the lengths
        assert web_crawl_length == local_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and local file crawl.\n")

        print("=== Step 3: Crawling Using Raw HTML Content ===")

        # Read the HTML content from the saved file
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()

        # Prefix the raw HTML content with 'raw:'
        raw_html_url = f"raw:{raw_html_content}"

        # Crawl using the raw HTML content
        raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)

        # Check if crawling was successful
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        # Store the length of the generated markdown from raw HTML
        raw_crawl_length = len(raw_result.markdown)
        print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")

        # Compare the lengths
        assert web_crawl_length == raw_crawl_length, (
            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
        )
        print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")

        print("All tests passed successfully!")

    # Clean up by removing the saved HTML file
    if html_file_path.exists():
        os.remove(html_file_path)
        print(f"Removed the saved HTML file: {html_file_path}")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
```
### **How It Works**
1. **Step 1: Crawl the Web URL**
   - Crawls `https://en.wikipedia.org/wiki/apple`.
   - Saves the HTML content to `apple.html`.
   - Records the length of the generated markdown.
2. **Step 2: Crawl from the Local HTML File**
   - Uses the `file://` prefix to crawl `apple.html`.
   - Ensures the markdown length matches the original web crawl.
3. **Step 3: Crawl Using Raw HTML Content**
   - Reads the HTML from `apple.html`.
   - Prefixes it with `raw:` and crawls it.
   - Verifies the markdown length matches the previous results.
4. **Cleanup**
   - Deletes the `apple.html` file after testing.
### **Running the Example**
1. **Save the Script:**
   - Save the above code as `test_crawl4ai.py` in your project directory.
2. **Execute the Script:**
   - Run the script using:
     ```bash
     python test_crawl4ai.py
     ```
3. **Observe the Output:**
   - The script will print logs detailing each step.
   - Assertions ensure consistency across the different crawling methods.
   - Upon success, it confirms that all markdown lengths match.
---
## Conclusion
With the new prefix-based input handling in **Crawl4AI**, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings through a single, unified `url` parameter. This simplifies API usage and provides greater flexibility for diverse crawling scenarios.
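
As a quick illustration of that unified interface, here is a minimal sketch that feeds all three input styles through the same `arun(url=...)` call; only the prefix changes. The local file path is a placeholder, so substitute one that exists on your machine.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

# Three inputs, one API: the prefix alone tells the crawler how to load each one.
# The file:// path below is a placeholder for illustration only.
inputs = [
    "https://en.wikipedia.org/wiki/apple",                    # live web page
    "file:///path/to/apple.html",                             # local HTML file
    "raw:<html><body><h1>Hello, World!</h1></body></html>",   # inline raw HTML
]

async def crawl_all(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        for url in urls:
            result = await crawler.arun(url=url, bypass_cache=True)
            label = url[:60] + ("..." if len(url) > 60 else "")
            if result.success:
                print(f"{label} -> {len(result.markdown)} characters of markdown")
            else:
                print(f"{label} -> failed: {result.error_message}")

asyncio.run(crawl_all(inputs))
```

Because every input goes through the same code path, you can mix sources freely in a single crawl loop without branching on input type yourself.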