{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚀 Crawl4AI v0.3.72 Release Announcement\n",
    "\n",
    "Welcome to the new release of **Crawl4AI v0.3.72**! This notebook highlights the latest features and demonstrates how they work in real time. Follow along to see each feature in action!\n",
    "\n",
    "### What’s New?\n",
    "- ✨ `Fit Markdown`: Extracts only the main content from articles and blogs\n",
    "- 🛡️ **Magic Mode**: Comprehensive anti-bot detection bypass\n",
    "- 🌐 **Multi-browser support**: Switch between Chromium, Firefox, and WebKit\n",
    "- 🔍 **Knowledge Graph Extraction**: Generate structured graphs of entities & relationships from any URL\n",
    "- 🤖 **Crawl4AI GPT Assistant**: Chat directly with our AI assistant for help, code generation, and faster learning (available [here](https://tinyurl.com/crawl4ai-gpt))\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📥 Setup\n",
    "To start, we'll install `Crawl4AI` along with Playwright and `nest_asyncio` to ensure compatibility with Colab’s asynchronous environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Crawl4AI and dependencies\n",
    "!pip install crawl4ai\n",
    "!playwright install\n",
    "!pip install nest_asyncio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import nest_asyncio and apply it to allow asyncio in Colab\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "print('Setup complete!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## ✨ Feature 1: `Fit Markdown`\n",
    "Extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from crawl4ai import AsyncWebCrawler\n",
    "\n",
    "async def fit_markdown_demo():\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        result = await crawler.arun(url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\")\n",
    "        print(result.fit_markdown)  # Shows main content in Markdown format\n",
    "\n",
    "# Run the demo (top-level await works in a notebook cell)\n",
    "await fit_markdown_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🛡️ Feature 2: Magic Mode\n",
    "Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def magic_mode_demo():\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/\",\n",
    "            magic=True  # Enables Magic Mode (anti-bot detection bypass)\n",
    "        )\n",
    "        print(result.markdown)  # Shows the full content in Markdown format\n",
    "\n",
    "# Run the demo\n",
    "await magic_mode_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🌐 Feature 3: Multi-Browser Support\n",
    "Crawl4AI now supports Chromium, Firefox, and WebKit. Here’s how to specify Firefox for a crawl.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def multi_browser_demo():\n",
    "    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:  # Using Firefox instead of default Chromium\n",
    "        result = await crawler.arun(url=\"https://crawl4ai.com\")\n",
    "        print(result.markdown)  # Shows content extracted using Firefox\n",
    "\n",
    "# Run the demo\n",
    "await multi_browser_demo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## ✨ Put them all together\n",
    "\n",
    "Let's combine all the features to extract the main content from a blog post, bypass anti-bot detection, and generate a knowledge graph from the extracted content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from crawl4ai.extraction_strategy import LLMExtractionStrategy\n",
    "from pydantic import BaseModel\n",
    "import json, os\n",
    "\n",
    "# Define classes for the knowledge graph structure\n",
    "class Landmark(BaseModel):\n",
    "    name: str\n",
    "    description: str\n",
    "    activities: list[str]  # E.g., visiting, sightseeing, relaxing\n",
    "\n",
    "class City(BaseModel):\n",
    "    name: str\n",
    "    description: str\n",
    "    landmarks: list[Landmark]\n",
    "    cultural_highlights: list[str]  # E.g., food, music, traditional crafts\n",
    "\n",
    "class TravelKnowledgeGraph(BaseModel):\n",
    "    cities: list[City]  # Central Mexican cities to visit\n",
    "\n",
    "async def combined_demo():\n",
    "    # Define the knowledge graph extraction strategy\n",
    "    strategy = LLMExtractionStrategy(\n",
    "        # provider=\"ollama/nemotron\",\n",
    "        provider='openai/gpt-4o-mini',  # Or any other provider, including Ollama and open source models\n",
    "        api_token=os.getenv('OPENAI_API_KEY'),  # In case of Ollama just pass \"no-token\"\n",
    "        schema=TravelKnowledgeGraph.schema(),\n",
    "        instruction=(\n",
    "            \"Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. \"\n",
    "            \"For each city, list main landmarks with descriptions and activities, as well as cultural highlights.\"\n",
    "        )\n",
    "    )\n",
    "\n",
    "    # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown\n",
    "    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\",\n",
    "            extraction_strategy=strategy,\n",
    "            bypass_cache=True,\n",
    "            magic=True\n",
    "        )\n",
    "\n",
    "        # Display main article content in Fit Markdown format\n",
    "        print(\"Extracted Main Content:\\n\", result.fit_markdown)\n",
    "\n",
    "        # Display extracted knowledge graph of cities, landmarks, and cultural highlights\n",
    "        if result.extracted_content:\n",
    "            travel_graph = json.loads(result.extracted_content)\n",
    "            print(\"\\nExtracted Knowledge Graph:\\n\", json.dumps(travel_graph, indent=2))\n",
    "\n",
    "# Run the combined demo\n",
    "await combined_demo()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 🤖 Crawl4AI GPT Assistant\n",
    "Chat with the Crawl4AI GPT Assistant for code generation, support, and learning Crawl4AI faster. Try it out [here](https://tinyurl.com/crawl4ai-gpt)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}