crawl4ai/docs/notebooks/Crawl4AI_v0.3.72_Release_Announcement.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🚀 Crawl4AI v0.3.72 Release Announcement\n",
"\n",
"Welcome to the new release of **Crawl4AI v0.3.72**! This notebook highlights the latest features and demonstrates how they work in real time. Follow along to see each feature in action!\n",
"\n",
"### What's New?\n",
"- ✨ `Fit Markdown`: Extracts only the main content from articles and blogs\n",
"- 🛡️ **Magic Mode**: Comprehensive anti-bot detection bypass\n",
"- 🌐 **Multi-browser support**: Switch between Chromium, Firefox, WebKit\n",
"- 🔍 **Knowledge Graph Extraction**: Generate structured graphs of entities & relationships from any URL\n",
"- 🤖 **Crawl4AI GPT Assistant**: Chat directly with our AI assistant for help, code generation, and faster learning (available [here](https://tinyurl.com/your-gpt-assistant-link))\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📥 Setup\n",
"To start, we'll install `Crawl4AI` along with Playwright and `nest_asyncio` to ensure compatibility with Colab's asynchronous environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install Crawl4AI and dependencies\n",
"!pip install crawl4ai\n",
"!playwright install\n",
"!pip install nest_asyncio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import nest_asyncio and apply it to allow asyncio in Colab\n",
"import nest_asyncio\n",
"nest_asyncio.apply()\n",
"\n",
"print('Setup complete!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## ✨ Feature 1: `Fit Markdown`\n",
"Extracts only the main content from articles and blog pages, removing sidebars, ads, and other distractions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from crawl4ai import AsyncWebCrawler\n",
"\n",
"async def fit_markdown_demo():\n",
"    async with AsyncWebCrawler() as crawler:\n",
"        result = await crawler.arun(url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\")\n",
"        print(result.fit_markdown)  # Shows main content in Markdown format\n",
"\n",
"# Run the demo\n",
"await fit_markdown_demo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🛡️ Feature 2: Magic Mode\n",
"Magic Mode bypasses anti-bot detection to make crawling more reliable on protected websites.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async def magic_mode_demo():\n",
"    async with AsyncWebCrawler() as crawler:\n",
"        result = await crawler.arun(\n",
"            url=\"https://www.reuters.com/markets/us/global-markets-view-usa-pix-2024-08-29/\",\n",
"            magic=True  # Enables Magic Mode (anti-bot detection bypass)\n",
"        )\n",
"        print(result.markdown)  # Shows the full content in Markdown format\n",
"\n",
"# Run the demo\n",
"await magic_mode_demo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🌐 Feature 3: Multi-Browser Support\n",
"Crawl4AI now supports Chromium, Firefox, and WebKit. Here's how to specify Firefox for a crawl.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async def multi_browser_demo():\n",
"    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:  # Using Firefox instead of default Chromium\n",
"        result = await crawler.arun(url=\"https://crawl4i.com\")\n",
"        print(result.markdown)  # Shows content extracted using Firefox\n",
"\n",
"# Run the demo\n",
"await multi_browser_demo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## ✨ Put them all together\n",
"\n",
"Let's combine all the features to extract the main content from a blog post, bypass anti-bot detection, and generate a knowledge graph from the extracted content."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from crawl4ai.extraction_strategy import LLMExtractionStrategy\n",
"from pydantic import BaseModel\n",
"import json, os\n",
"from typing import List\n",
"\n",
"# Define classes for the knowledge graph structure\n",
"class Landmark(BaseModel):\n",
"    name: str\n",
"    description: str\n",
"    activities: list[str]  # E.g., visiting, sightseeing, relaxing\n",
"\n",
"class City(BaseModel):\n",
"    name: str\n",
"    description: str\n",
"    landmarks: list[Landmark]\n",
"    cultural_highlights: list[str]  # E.g., food, music, traditional crafts\n",
"\n",
"class TravelKnowledgeGraph(BaseModel):\n",
"    cities: list[City]  # Central Mexican cities to visit\n",
"\n",
"async def combined_demo():\n",
"    # Define the knowledge graph extraction strategy\n",
"    strategy = LLMExtractionStrategy(\n",
"        # provider=\"ollama/nemotron\",\n",
"        provider='openai/gpt-4o-mini',  # Or any other provider, including Ollama and open source models\n",
"        api_token=os.getenv('OPENAI_API_KEY'),  # In case of Ollama just pass \"no-token\"\n",
"        schema=TravelKnowledgeGraph.schema(),\n",
"        instruction=(\n",
"            \"Extract cities, landmarks, and cultural highlights for places to visit in Central Mexico. \"\n",
"            \"For each city, list main landmarks with descriptions and activities, as well as cultural highlights.\"\n",
"        )\n",
"    )\n",
"\n",
"    # Set up the AsyncWebCrawler with multi-browser support, Magic Mode, and Fit Markdown\n",
"    async with AsyncWebCrawler(browser_type=\"firefox\") as crawler:\n",
"        result = await crawler.arun(\n",
"            url=\"https://janineintheworld.com/places-to-visit-in-central-mexico\",\n",
"            extraction_strategy=strategy,\n",
"            bypass_cache=True,\n",
"            magic=True\n",
"        )\n",
"\n",
"        # Display main article content in Fit Markdown format\n",
"        print(\"Extracted Main Content:\\n\", result.fit_markdown)\n",
"\n",
"        # Display extracted knowledge graph of cities, landmarks, and cultural highlights\n",
"        if result.extracted_content:\n",
"            travel_graph = json.loads(result.extracted_content)\n",
"            print(\"\\nExtracted Knowledge Graph:\\n\", json.dumps(travel_graph, indent=2))\n",
"\n",
"# Run the combined demo\n",
"await combined_demo()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🤖 Crawl4AI GPT Assistant\n",
"Chat with the Crawl4AI GPT Assistant for code generation, support, and learning Crawl4AI faster. Try it out [here](https://tinyurl.com/crawl4ai-gpt)!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}