docs(api): improve formatting and readability of API documentation

Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files:
- arun.md
- arun_many.md
- async-webcrawler.md
- parameters.md

Changes include:
- Consistent list formatting and indentation
- Better spacing between sections
- Clearer separation of content blocks
- Fixed quotation marks and code block formatting
This commit is contained in:
UncleCode 2025-01-25 22:06:11 +08:00
parent 09ac7ed008
commit 54c84079c4
4 changed files with 103 additions and 80 deletions

arun.md

@ -1,6 +1,6 @@
# `arun()` Parameter Guide (New Approach)
In Crawl4AI's **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
@ -9,11 +9,11 @@ await crawler.arun(
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
---
## 1. Core Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
@ -23,7 +23,7 @@ async def main():
verbose=True, # Detailed logging
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
check_robots_txt=True, # Respect robots.txt rules
# ... other parameters
)
async with AsyncWebCrawler() as crawler:
@ -38,15 +38,16 @@ async def main():
```
**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
---
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use one of the built-in `CacheMode` values:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
@ -60,6 +61,7 @@ run_config = CrawlerRunConfig(
```
**Additional flags** (a short sketch follows the list):
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
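A quick sketch of the flag/enum equivalence (both forms come from the lists above; the enum is the preferred spelling):
```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Two equivalent ways to get CacheMode.BYPASS behavior for a single crawl
cfg_enum = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)   # preferred enum form
cfg_flag = CrawlerRunConfig(bypass_cache=True)             # legacy-style boolean flag
```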
@ -67,7 +69,7 @@ run_config = CrawlerRunConfig(
---
## 3. Content Processing & Selection
### 3.1 Text Processing
@ -111,7 +113,7 @@ run_config = CrawlerRunConfig(
---
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
@ -124,12 +126,13 @@ run_config = CrawlerRunConfig(
```
**Key Fields**:
- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs (see the sketch below).
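A minimal sketch combining these fields (the selector and values are illustrative):
```python
from crawl4ai import CrawlerRunConfig

# Wait until at least 10 ".item" elements exist, and pace arun_many() requests
# with randomized delays
run_config = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll('.item').length > 10",
    mean_delay=1.0,       # together with max_range, defines the random inter-request delay
    max_range=2.0,
    semaphore_count=5,    # cap concurrent crawls at 5
)
```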
### 4.2 JavaScript Execution
@ -144,7 +147,7 @@ run_config = CrawlerRunConfig(
)
```
- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I'm continuing in the same session with new JS steps, no new full navigation.” (See the sketch below.)
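A sketch of the “same session, new JS step” pattern; the URL and selector are placeholders, and `session_id` is covered in section 5:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def click_show_more(session_id: str = "my_session"):
    async with AsyncWebCrawler() as crawler:
        # First call: normal navigation, opened in a named session
        await crawler.arun(
            "https://example.com/list",                # placeholder URL
            config=CrawlerRunConfig(session_id=session_id),
        )
        # Follow-up call: same tab, run extra JS, no new full navigation
        follow_up = CrawlerRunConfig(
            session_id=session_id,
            js_code=["document.querySelector('.show-more')?.click();"],
            js_only=True,
        )
        return await crawler.arun("https://example.com/list", config=follow_up)
```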
### 4.3 Anti-Bot
@ -156,13 +159,13 @@ run_config = CrawlerRunConfig(
override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
---
## 5. Session Management
**`session_id`**:
```python
@ -174,7 +177,7 @@ If re-used in subsequent `arun()` calls, the same tab/page context is continued
---
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
@ -191,7 +194,7 @@ run_config = CrawlerRunConfig(
---
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
@ -205,7 +208,7 @@ The extracted data will appear in `result.extracted_content`.
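As a rough sketch, using the `JsonCssExtractionStrategy` import shown later in these docs; the schema fields and selectors are illustrative:
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema: the field names and selectors are examples only
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

run_config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
# After arun(), the structured data lands in result.extracted_content
```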
---
## 8. Comprehensive Example
Below is a snippet combining many parameters:
@ -274,32 +277,33 @@ if __name__ == "__main__":
```
**What we covered**:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
---
## 9. Best Practices
1. **Use `BrowserConfig`** for **global browser settings** (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs—especially if you're part of a large codebase with multiple crawls (see the sketch after this list).
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can't handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
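One way to keep parameters consistent (practice 3) is a shared base config plus `clone()` overrides. A sketch, assuming `clone()` copies the config and overrides only the fields you pass (as shown in the parameters reference):
```python
from crawl4ai import CrawlerRunConfig, CacheMode

# One base run config shared across the codebase...
base_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=15,
    verbose=True,
)

# ...and per-task variants derived from it instead of hand-copied parameters
listing_cfg = base_cfg.clone(css_selector=".main-content")
fresh_cfg = base_cfg.clone(cache_mode=CacheMode.BYPASS)
```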
---
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.
For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).
Happy crawling with your **structured, flexible** config approach!

arun_many.md

@ -1,6 +1,6 @@
# `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
@ -16,7 +16,7 @@ async def arun_many(
:param urls: A list of URLs (or tasks) to crawl.
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
...
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
"""
@ -24,22 +24,26 @@ async def arun_many(
## Differences from `arun()`
1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
   - The **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available (see the sketch after this list).
   - Ideal for processing large numbers of URLs without waiting for all to complete.
4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
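A minimal streaming sketch, assuming `stream=True` makes `arun_many()` return an async generator as described above; the URLs are placeholders:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders
    stream_cfg = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        # With stream=True, results arrive as each crawl finishes
        async for result in await crawler.arun_many(urls, config=stream_cfg):
            print(result.url, "OK" if result.success else result.error_message)

asyncio.run(main())
```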
### Basic Example (Batch Mode)
@ -93,19 +97,19 @@ results = await crawler.arun_many(
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
### Return Value
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
---
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
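A minimal sketch of passing a dispatcher explicitly; the import path for `MemoryAdaptiveDispatcher` is an assumption here, so confirm it against the dispatcher docs linked above:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # import path assumed

async def crawl_batch(urls: list[str]):
    dispatcher = MemoryAdaptiveDispatcher()  # default settings; tune via its constructor
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun_many(
            urls,
            config=CrawlerRunConfig(),
            dispatcher=dispatcher,
        )
```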
@ -113,12 +117,14 @@ For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers]
## Common Pitfalls
1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding (see the sketch below).
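For pitfall 3, a small sketch of the per-result check (attribute names as used elsewhere in this reference):
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_and_keep_successes(urls: list[str]):
    """Crawl a batch, log failures, and return only successful results."""
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        for r in results:
            if not r.success:
                print(f"Failed: {r.url} -> {r.error_message}")
        return [r for r in results if r.success]
```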
---
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.

async-webcrawler.md

@ -1,16 +1,20 @@
# AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
**Recommended usage**:
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want (see the minimal sketch below).
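A minimal end-to-end sketch of these four steps (the URL is a placeholder and the parameter values are illustrative):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)        # 1. global browser settings
    crawler_run_config = CrawlerRunConfig(verbose=True)  # per-crawl settings

    # 2 & 3. Instantiate and use as an async context manager
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # 4. One arun() call per page
        result = await crawler.arun("https://example.com", config=crawler_run_config)
        print(result.success)

asyncio.run(main())
```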
---
## 1. Constructor Overview
```python
class AsyncWebCrawler:
@ -37,7 +41,7 @@ class AsyncWebCrawler:
base_directory:
Folder for storing caches/logs (if relevant).
thread_safe:
If True, attempts some concurrency safeguards. Usually False.
**kwargs:
Additional legacy or debugging parameters.
"""
@ -58,11 +62,12 @@ crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
---
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
@ -90,7 +95,7 @@ Use this style if you have a **long-running** application or need full control o
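A minimal sketch of the manual lifecycle (assuming `start()` and `close()` are awaitable, as the async context manager implies; the URL is a placeholder):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    await crawler.start()                  # open the browser once
    try:
        result = await crawler.arun("https://example.com", config=CrawlerRunConfig())
        print(result.success)
        # ... further arun() calls reuse the same browser ...
    finally:
        await crawler.close()              # always release the browser

asyncio.run(main())
```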
---
## 3. Primary Method: `arun()`
```python
async def arun(
@ -130,7 +135,7 @@ For **backward** compatibility, `arun()` can still accept direct arguments like
---
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
@ -147,6 +152,7 @@ async def arun_many(
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
@ -192,30 +198,34 @@ async with AsyncWebCrawler(config=browser_cfg) as crawler:
### 4.3 Key Features
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
@ -232,7 +242,7 @@ For details, see [CrawlResult doc](./crawl-result.md).
---
## 6. Quick Example
Below is an example hooking it all together:
@ -243,14 +253,14 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
async def main():
# 1. Browser config
browser_cfg = BrowserConfig(
browser_type="firefox",
headless=False,
verbose=True
)
# 2. Run config
schema = {
"name": "Articles",
"baseSelector": "article.post",
@ -295,17 +305,18 @@ asyncio.run(main())
```
**Explanation**:
- We define a **`BrowserConfig`** with Firefox, headless disabled, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
---
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings about the browser's environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
@ -316,16 +327,17 @@ asyncio.run(main())
---
## 8. Summary
**AsyncWebCrawler** is your entry point to asynchronous crawling:
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)` (see the sketch below).
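A before/after sketch of that migration (the legacy call shape is paraphrased from the bullet above; the selector is illustrative):
```python
# Before (legacy style, everything passed to the crawler / arun() directly):
#   crawler = AsyncWebCrawler(browser_type="chromium")
#   result = await crawler.arun(url, css_selector=".main-content")

# After (settings split between the two config objects):
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_cfg = BrowserConfig(browser_type="chromium")
run_cfg = CrawlerRunConfig(css_selector=".main-content")

async def crawl(url: str):
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        return await crawler.arun(url, config=run_cfg)
```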
This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).

parameters.md

@ -294,3 +294,4 @@ stream_cfg = run_cfg.clone(
stream=True,
cache_mode=CacheMode.BYPASS
)
```