Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized
9.1 KiB
Fit Markdown with Pruning & BM25
Fit Markdown is a specialized filtered version of your page’s markdown, focusing on the most relevant content. By default, Crawl4AI converts the entire HTML into a broad raw_markdown. With fit markdown, we apply a content filter algorithm (e.g., Pruning or BM25) to remove or rank low-value sections—such as repetitive sidebars, shallow text blocks, or irrelevancies—leaving a concise textual “core.”
1. How “Fit Markdown” Works
1.1 The content_filter
In CrawlerRunConfig, you can specify a content_filter to shape how content is pruned or ranked before final markdown generation. A filter’s logic is applied before or during the HTML→Markdown process, producing:
result.markdown_v2.raw_markdown(unfiltered)result.markdown_v2.fit_markdown(filtered or “fit” version)result.markdown_v2.fit_html(the corresponding HTML snippet that producedfit_markdown)
Note
: We’re currently storing the result in
markdown_v2, but eventually we’ll unify it asresult.markdown.
1.2 Common Filters
1. PruningContentFilter – Scores each node by text density, link density, and tag importance, discarding those below a threshold.
2. BM25ContentFilter – Focuses on textual relevance using BM25 ranking, especially useful if you have a specific user query (e.g., “machine learning” or “food nutrition”).
2. PruningContentFilter
Pruning discards less relevant nodes based on text density, link density, and tag importance. It’s a heuristic-based approach—if certain sections appear too “thin” or too “spammy,” they’re pruned.
2.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
# Step 1: Create a pruning filter
prune_filter = PruningContentFilter(
# Lower → more content retained, higher → more content pruned
threshold=0.45,
# "fixed" or "dynamic"
threshold_type="dynamic",
# Ignore nodes with <5 words
min_word_threshold=5
)
# Step 2: Insert it into a Markdown Generator
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
# Step 3: Pass it to CrawlerRunConfig
config = CrawlerRunConfig(
markdown_generator=md_generator
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://news.ycombinator.com",
config=config
)
if result.success:
# 'fit_markdown' is your pruned content, focusing on "denser" text
print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
2.2 Key Parameters
min_word_threshold(int): If a block has fewer words than this, it’s pruned.threshold_type(str):"fixed"→ each node must exceedthreshold(0–1)."dynamic"→ node scoring adjusts according to tag type, text/link density, etc.
threshold(float, default ~0.48): The base or “anchor” cutoff.
Algorithmic Factors:
- Text density – Encourages blocks that have a higher ratio of text to overall content.
- Link density – Penalizes sections that are mostly links.
- Tag importance – e.g., an
<article>or<p>might be more important than a<div>. - Structural context – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
3. BM25ContentFilter
BM25 is a classical text ranking algorithm often used in search engines. If you have a user query or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.
3.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
# 1) A BM25 filter with a user query
bm25_filter = BM25ContentFilter(
user_query="startup fundraising tips",
# Adjust for stricter or looser results
bm25_threshold=1.2
)
# 2) Insert into a Markdown Generator
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
# 3) Pass to crawler config
config = CrawlerRunConfig(
markdown_generator=md_generator
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://news.ycombinator.com",
config=config
)
if result.success:
print("Fit Markdown (BM25 query-based):")
print(result.markdown_v2.fit_markdown)
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
3.2 Parameters
user_query(str, optional): E.g."machine learning". If blank, the filter tries to glean a query from page metadata.bm25_threshold(float, default 1.0):- Higher → fewer chunks but more relevant.
- Lower → more inclusive.
In more advanced scenarios, you might see parameters like
use_stemming,case_sensitive, orpriority_tagsto refine how text is tokenized or weighted.
4. Accessing the “Fit” Output
After the crawl, your “fit” content is found in result.markdown_v2.fit_markdown. In future versions, it will be result.markdown.fit_markdown. Meanwhile:
fit_md = result.markdown_v2.fit_markdown
fit_html = result.markdown_v2.fit_html
If the content filter is BM25, you might see additional logic or references in fit_markdown that highlight relevant segments. If it’s Pruning, the text is typically well-cleaned but not necessarily matched to a query.
5. Code Patterns Recap
5.1 Pruning
prune_filter = PruningContentFilter(
threshold=0.5,
threshold_type="fixed",
min_word_threshold=10
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
5.2 BM25
bm25_filter = BM25ContentFilter(
user_query="health benefits fruit",
bm25_threshold=1.2
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
6. Combining with “word_count_threshold” & Exclusions
Remember you can also specify:
config = CrawlerRunConfig(
word_count_threshold=10,
excluded_tags=["nav", "footer", "header"],
exclude_external_links=True,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.5)
)
)
Thus, multi-level filtering occurs:
- The crawler’s
excluded_tagsare removed from the HTML first. - The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
- The final “fit” content is generated in
result.markdown_v2.fit_markdown.
7. Custom Filters
If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from RelevantContentFilter and implement filter_content(html). Then inject it into your markdown generator:
from crawl4ai.content_filter_strategy import RelevantContentFilter
class MyCustomFilter(RelevantContentFilter):
def filter_content(self, html, min_word_threshold=None):
# parse HTML, implement custom logic
return [block for block in ... if ... some condition...]
Steps:
- Subclass
RelevantContentFilter. - Implement
filter_content(...). - Use it in your
DefaultMarkdownGenerator(content_filter=MyCustomFilter(...)).
8. Final Thoughts
Fit Markdown is a crucial feature for:
- Summaries: Quickly get the important text from a cluttered page.
- Search: Combine with BM25 to produce content relevant to a query.
- AI Pipelines: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.
Key Points:
- PruningContentFilter: Great if you just want the “meatiest” text without a user query.
- BM25ContentFilter: Perfect for query-based extraction or searching.
- Combine with
excluded_tags,exclude_external_links,word_count_thresholdto refine your final “fit” text. - Fit markdown ends up in
result.markdown_v2.fit_markdown; eventuallyresult.markdown.fit_markdownin future versions.
With these tools, you can zero in on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!
- Last Updated: 2025-01-01