mirror of https://github.com/unclecode/crawl4ai.git synced 2025-12-02 22:09:14 +00:00

UncleCode ca3e33122e refactor(docs): reorganize documentation structure and update styles

Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized

2025-01-07 20:49:50 +08:00


Fit Markdown with Pruning & BM25

Fit Markdown is a filtered version of your page’s markdown that keeps only the most relevant content. By default, Crawl4AI converts the entire HTML into a broad raw_markdown. With fit markdown, a content filter algorithm (e.g., Pruning or BM25) removes or down-ranks low-value sections, such as repetitive sidebars, shallow text blocks, or irrelevant passages, leaving a concise textual “core.”

1. How “Fit Markdown” Works

1.1 The `content_filter`

In CrawlerRunConfig, you can specify a content_filter to shape how content is pruned or ranked before final markdown generation. A filter’s logic is applied before or during the HTML→Markdown process, producing:

result.markdown_v2.raw_markdown (unfiltered)
result.markdown_v2.fit_markdown (filtered or “fit” version)
result.markdown_v2.fit_html (the corresponding HTML snippet that produced fit_markdown)

Note: We’re currently storing the result in markdown_v2, but eventually we’ll unify it as result.markdown.

1.2 Common Filters

1. PruningContentFilter – Scores each node by text density, link density, and tag importance, discarding those below a threshold.
2. BM25ContentFilter – Focuses on textual relevance using BM25 ranking, especially useful if you have a specific user query (e.g., “machine learning” or “food nutrition”).

2. PruningContentFilter

Pruning discards less relevant nodes based on text density, link density, and tag importance. It’s a heuristic-based approach—if certain sections appear too “thin” or too “spammy,” they’re pruned.

2.1 Usage Example

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,           
        # "fixed" or "dynamic"
        threshold_type="dynamic",  
        # Ignore nodes with <5 words
        min_word_threshold=5      
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
    
    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com", 
            config=config
        )
        
        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
            print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

2.2 Key Parameters

min_word_threshold (int): If a block has fewer words than this, it’s pruned.
threshold_type (str):
- "fixed" → each node must exceed threshold (0–1).
- "dynamic" → node scoring adjusts according to tag type, text/link density, etc.
threshold (float, default ~0.48): The base or “anchor” cutoff.
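The difference between the two threshold types can be sketched in a few lines. This is an illustration only, not Crawl4AI's implementation: the `is_kept` function, the `tag_weight` and `link_density` inputs, and the 0.8 / 1.2 / 0.5 adjustment factors are all assumptions chosen to show the idea that a dynamic cutoff moves per node while a fixed cutoff does not.

```python
def is_kept(score: float, threshold: float, threshold_type: str = "fixed",
            tag_weight: float = 1.0, link_density: float = 0.0) -> bool:
    """Return True if a node with this score would survive pruning (toy model)."""
    if threshold_type == "fixed":
        # Fixed: the same cutoff applies to every node.
        return score >= threshold
    if threshold_type == "dynamic":
        cutoff = threshold
        if tag_weight > 1.0:
            cutoff *= 0.8  # important tag: be more lenient (hypothetical factor)
        elif tag_weight < 1.0:
            cutoff *= 1.2  # unimportant tag: be stricter (hypothetical factor)
        # Link-heavy nodes need a higher score to survive (hypothetical factor).
        cutoff *= 1.0 + 0.5 * link_density
        return score >= cutoff
    raise ValueError(f"unknown threshold_type: {threshold_type!r}")


# The same score of 0.45 fails a fixed cutoff of 0.48 but passes a dynamic
# cutoff once the node sits in an important tag.
fixed_result = is_kept(0.45, 0.48, "fixed")
dynamic_result = is_kept(0.45, 0.48, "dynamic", tag_weight=1.5)
```

In this toy model the threshold acts as an anchor: "dynamic" never ignores it, it only scales it per node.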

Algorithmic Factors:

Text density – Encourages blocks that have a higher ratio of text to overall content.
Link density – Penalizes sections that are mostly links.
Tag importance – e.g., an <article> or <p> might be more important than a <div>.
Structural context – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
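To make these factors concrete, here is a minimal sketch of how a pruning-style score might combine text density, link density, and tag importance. It is not the scoring code Crawl4AI ships; the `Block` dataclass, the `TAG_WEIGHTS` table, the `score_block` and `prune` helpers, and the 0.4 / 0.4 / 0.2 weights are assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical tag weights; a real filter would use its own table.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}


@dataclass
class Block:
    tag: str            # HTML tag name of the node
    text_len: int       # characters of visible text
    html_len: int       # characters of raw HTML for the node
    link_text_len: int  # characters of text that sit inside <a> elements


def score_block(block: Block) -> float:
    """Combine text density, link density, and tag importance into one score."""
    text_density = block.text_len / block.html_len if block.html_len else 0.0
    link_density = block.link_text_len / block.text_len if block.text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(block.tag, 0.5)
    # Reward dense text, penalize link-heavy blocks, nudge by tag importance.
    return 0.4 * text_density + 0.4 * (1.0 - link_density) + 0.2 * tag_weight


def prune(blocks, threshold=0.48):
    """Keep only the blocks whose score reaches the threshold."""
    return [b for b in blocks if score_block(b) >= threshold]


article = Block("p", text_len=400, html_len=500, link_text_len=20)
sidebar = Block("nav", text_len=100, html_len=600, link_text_len=95)
kept = prune([article, sidebar])  # only the article paragraph survives
```

The paragraph scores high because most of its HTML is text and few of its characters are links; the navigation block fails on all three factors at once.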

3. BM25ContentFilter

BM25 is a classical text ranking algorithm often used in search engines. If you have a user query or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.
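For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of the standard Okapi BM25 formula applied to text chunks. It uses naive whitespace tokenization and the common defaults `k1=1.5` and `b=0.75`; the `bm25_scores` name and the sample chunks are assumptions for illustration, and Crawl4AI's own tokenization and weighting may differ.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with the Okapi BM25 formula."""
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by chunk length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "Tips for startup fundraising and pitching investors",
    "Login | Sign up | Privacy policy",
    "Our fundraising round closed last week",
]
scores = bm25_scores("startup fundraising tips", chunks)
```

The first chunk matches all three query terms and ranks highest, the third matches one term, and the login bar matches none and scores zero.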

3.1 Usage Example

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # 1) A BM25 filter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="startup fundraising tips",
        # Adjust for stricter or looser results
        bm25_threshold=1.2  
    )

    # 2) Insert into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
    
    # 3) Pass to crawler config
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com", 
            config=config
        )
        if result.success:
            print("Fit Markdown (BM25 query-based):")
            print(result.markdown_v2.fit_markdown)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

3.2 Parameters

user_query (str, optional): E.g. "machine learning". If blank, the filter tries to glean a query from page metadata.
bm25_threshold (float, default 1.0):
- Higher → fewer chunks but more relevant.
- Lower → more inclusive.

In more advanced scenarios, you might see parameters like use_stemming, case_sensitive, or priority_tags to refine how text is tokenized or weighted.

4. Accessing the “Fit” Output

After the crawl, your “fit” content is found in result.markdown_v2.fit_markdown. In future versions, it will be result.markdown.fit_markdown. Meanwhile:

fit_md = result.markdown_v2.fit_markdown
fit_html = result.markdown_v2.fit_html

If the content filter is BM25, fit_markdown is limited to the chunks that scored well against the query, so it may omit large parts of the page. If it’s Pruning, the text is typically well-cleaned but not necessarily matched to a query.
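Because a strict filter can leave fit_markdown empty, it can help to read the output defensively. The sketch below uses a `SimpleNamespace` stand-in for the crawl result so it runs without a crawler; `best_markdown` and `reduction_ratio` are hypothetical helpers, and only the `markdown_v2.raw_markdown` and `markdown_v2.fit_markdown` attribute names come from this page.

```python
from types import SimpleNamespace


def best_markdown(result):
    """Prefer the fit markdown, falling back to raw markdown if it is empty."""
    md = result.markdown_v2
    fit = (md.fit_markdown or "").strip()
    return fit if fit else md.raw_markdown


def reduction_ratio(result):
    """Fraction of the raw markdown that the filter removed (0.0 to 1.0)."""
    raw_len = len(result.markdown_v2.raw_markdown)
    fit_len = len(result.markdown_v2.fit_markdown or "")
    return 1 - fit_len / raw_len if raw_len else 0.0


# Stand-in objects that mimic the shape of a crawl result.
filtered = SimpleNamespace(markdown_v2=SimpleNamespace(
    raw_markdown="# Title\n\nnav nav nav\n\nBody text.",
    fit_markdown="Body text.",
))
unfiltered = SimpleNamespace(markdown_v2=SimpleNamespace(
    raw_markdown="raw", fit_markdown=None,
))

text = best_markdown(filtered)
fallback = best_markdown(unfiltered)
```

Logging the reduction ratio per page is a cheap way to notice when a threshold is so aggressive that almost nothing survives.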

5. Code Patterns Recap

5.1 Pruning

prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type="fixed",
    min_word_threshold=10
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown

5.2 BM25

bm25_filter = BM25ContentFilter(
    user_query="health benefits fruit",
    bm25_threshold=1.2
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown

6. Combining with “word_count_threshold” & Exclusions

Remember you can also specify:

config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],
    exclude_external_links=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)

Thus, multi-level filtering occurs:

1. The crawler’s excluded_tags are removed from the HTML first.
2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
3. The final “fit” content is generated in result.markdown_v2.fit_markdown.
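The layering can be pictured as a small pipeline over text blocks. This is a conceptual sketch, not the crawler's internals: the dictionary block shape, the `run_pipeline` and `not_spammy` names, and the point at which the word-count check runs are assumptions.

```python
def run_pipeline(blocks, excluded_tags, word_count_threshold, content_filter):
    """Apply structural exclusion, a word-count floor, then a content filter."""
    # Stage 1: drop blocks that live in excluded tags.
    remaining = [b for b in blocks if b["tag"] not in excluded_tags]
    # Stage 2: drop blocks that are too short to be useful.
    remaining = [
        b for b in remaining
        if len(b["text"].split()) >= word_count_threshold
    ]
    # Stage 3: let the content filter decide which blocks are kept.
    return [b["text"] for b in remaining if content_filter(b)]


def not_spammy(block):
    """Hypothetical content filter: reject obvious click-bait repetition."""
    return "click here" not in block["text"]


blocks = [
    {"tag": "nav", "text": "Home About Contact and many more links here"},
    {"tag": "p", "text": "Short note"},
    {"tag": "p", "text": "Fit markdown keeps the dense, informative paragraphs of a page."},
    {"tag": "div", "text": "click here click here click here click here click"},
]

fit_blocks = run_pipeline(blocks, {"nav", "footer", "header"}, 5, not_spammy)
```

Each stage only sees what the previous one let through, so cheap structural exclusions reduce the work left for the more expensive content filter.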

7. Custom Filters

If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from RelevantContentFilter and implement filter_content(html). Then inject it into your markdown generator:

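from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html, min_word_threshold=None):
        # Parse the HTML, then keep only the blocks that pass your custom check.
        # The word-count condition below is a placeholder; swap in your own logic.
        min_words = min_word_threshold or 5
        soup = BeautifulSoup(html, "html.parser")
        paragraphs = soup.find_all("p")
        return [str(p) for p in paragraphs if len(p.get_text().split()) >= min_words]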

Steps:

1. Subclass RelevantContentFilter.
2. Implement filter_content(...).
3. Use it in your DefaultMarkdownGenerator(content_filter=MyCustomFilter(...)).
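As an example of site-specific logic you might place inside filter_content, here is a standard-library-only sketch that keeps paragraphs mentioning given keywords. It does not import Crawl4AI, so it can be tested in isolation; the `keyword_blocks` function and `_ParagraphCollector` class are hypothetical names, and returning a list of HTML strings is an assumption about what the markdown generator expects.

```python
from html import escape
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collect the visible text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def keyword_blocks(html, keywords, min_word_threshold=3):
    """Return <p> blocks that are long enough and mention any keyword."""
    collector = _ParagraphCollector()
    collector.feed(html)
    wanted = [k.lower() for k in keywords]
    return [
        f"<p>{escape(text)}</p>"
        for text in collector.paragraphs
        if len(text.split()) >= min_word_threshold
        and any(k in text.lower() for k in wanted)
    ]


html_doc = (
    "<nav><p>Home</p></nav>"
    "<p>Crawl4AI makes <b>pruning</b> pages simple.</p>"
    "<p>Unrelated footer text here.</p>"
)
blocks = keyword_blocks(html_doc, ["pruning"])
```

A subclass's filter_content(html, min_word_threshold) could delegate to a function like this, which keeps the filtering logic easy to unit test without running a crawl.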

8. Final Thoughts

Fit Markdown is a crucial feature for:

Summaries: Quickly get the important text from a cluttered page.
Search: Combine with BM25 to produce content relevant to a query.
AI Pipelines: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.

Key Points:

PruningContentFilter: Great if you just want the “meatiest” text without a user query.
BM25ContentFilter: Perfect for query-based extraction or searching.
Combine with excluded_tags, exclude_external_links, word_count_threshold to refine your final “fit” text.
Fit markdown ends up in result.markdown_v2.fit_markdown; future versions will expose it as result.markdown.fit_markdown.

With these tools, you can zero in on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!

Last Updated: 2025-01-01
