{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Crawl4AI: Advanced Web Crawling and Data Extraction\n",
    "\n",
    "Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n",
    "\n",
    "- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n",
    "- Twitter: [@unclecode](https://twitter.com/unclecode)\n",
    "- Website: [https://crawl4ai.com](https://crawl4ai.com)\n",
    "\n",
    "Let's explore the powerful features of Crawl4AI!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "First, let's install Crawl4AI from GitHub:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
    "!pip install nest-asyncio\n",
    "!playwright install"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "import nest_asyncio\n",
    "from crawl4ai import AsyncWebCrawler\n",
    "from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n",
    "import json\n",
    "import time\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Let's start with a simple crawl example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def simple_crawl():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
    "        print(result.markdown[:500])  # Print first 500 characters\n",
    "\n",
    "await simple_crawl()"
   ]
  },
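  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `result` object returned by `arun` exposes more than just Markdown. The next cell is an optional sketch (not part of the original quick start) that peeks at a few commonly used fields; the attribute names are assumptions based on the Crawl4AI version installed above, and `getattr` is used so the cell still runs if a field is absent in your version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sketch: inspect a few fields on the crawl result object.\n",
    "# The attribute names below are assumptions; getattr() keeps the cell\n",
    "# runnable even if one of them is missing in your Crawl4AI version.\n",
    "async def inspect_result():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
    "        for attr in (\"success\", \"markdown\", \"cleaned_html\", \"links\", \"media\"):\n",
    "            value = getattr(result, attr, None)\n",
    "            size = len(value) if hasattr(value, \"__len__\") else value\n",
    "            print(f\"{attr}: {type(value).__name__} (size/value: {size})\")\n",
    "\n",
    "await inspect_result()"
   ]
  },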
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Features\n",
    "\n",
    "### Executing JavaScript and Using CSS Selectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def js_and_css():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"]\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            js_code=js_code,\n",
    "            css_selector=\"article.tease-card\",\n",
    "            bypass_cache=True\n",
    "        )\n",
    "        print(result.extracted_content[:500])  # Print first 500 characters\n",
    "\n",
    "await js_and_css()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a Proxy\n",
    "\n",
    "Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def use_proxy():\n",
    "    async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            bypass_cache=True\n",
    "        )\n",
    "        print(result.markdown[:500])  # Print first 500 characters\n",
    "\n",
    "# Uncomment the following line to run the proxy example\n",
    "# await use_proxy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting Structured Data with OpenAI\n",
    "\n",
    "Note: You'll need to set your OpenAI API key as an environment variable for this example to work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "class OpenAIModelFee(BaseModel):\n",
    "    model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n",
    "    input_fee: str = Field(..., description=\"Fee for input tokens for the OpenAI model.\")\n",
    "    output_fee: str = Field(..., description=\"Fee for output tokens for the OpenAI model.\")\n",
    "\n",
    "async def extract_openai_fees():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url='https://openai.com/api/pricing/',\n",
    "            word_count_threshold=1,\n",
    "            extraction_strategy=LLMExtractionStrategy(\n",
    "                provider=\"openai/gpt-4o\",\n",
    "                api_token=os.getenv('OPENAI_API_KEY'),\n",
    "                schema=OpenAIModelFee.schema(),\n",
    "                extraction_type=\"schema\",\n",
    "                instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens.\n",
    "                Do not miss any models in the entire content. One extracted model JSON format should look like this:\n",
    "                {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n",
    "            ),\n",
    "            bypass_cache=True,\n",
    "        )\n",
    "        print(result.extracted_content)\n",
    "\n",
    "# Uncomment the following line to run the OpenAI extraction example\n",
    "# await extract_openai_fees()"
   ]
  },
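  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`result.extracted_content` comes back as a JSON string. The optional sketch below (not part of the original example) shows one way to validate that string against the `OpenAIModelFee` model; it assumes the previous cell has been run so the model class exists, and the sample string is purely illustrative — in real use you would pass the string produced by `extract_openai_fees()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sketch: turn the extracted JSON string into validated Pydantic objects.\n",
    "# The sample string below is illustrative only; in real use, pass\n",
    "# result.extracted_content produced by extract_openai_fees().\n",
    "def parse_model_fees(extracted_content):\n",
    "    items = json.loads(extracted_content)\n",
    "    return [OpenAIModelFee(**item) for item in items]\n",
    "\n",
    "sample = '[{\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}]'\n",
    "for fee in parse_model_fees(sample):\n",
    "    print(fee.model_name, \"-\", fee.input_fee, \"/\", fee.output_fee)"
   ]
  },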
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Advanced Multi-Page Crawling with JavaScript Execution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "async def crawl_typescript_commits():\n",
    "    first_commit = \"\"\n",
    "    async def on_execution_started(page):\n",
    "        nonlocal first_commit\n",
    "        try:\n",
    "            while True:\n",
    "                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n",
    "                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n",
    "                commit = await commit.evaluate('(element) => element.textContent')\n",
    "                commit = re.sub(r'\\s+', '', commit)\n",
    "                if commit and commit != first_commit:\n",
    "                    first_commit = commit\n",
    "                    break\n",
    "                await asyncio.sleep(0.5)\n",
    "        except Exception as e:\n",
    "            print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n",
    "\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n",
    "\n",
    "        url = \"https://github.com/microsoft/TypeScript/commits/main\"\n",
    "        session_id = \"typescript_commits_session\"\n",
    "        all_commits = []\n",
    "\n",
    "        js_next_page = \"\"\"\n",
    "        const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n",
    "        if (button) button.click();\n",
    "        \"\"\"\n",
    "\n",
    "        for page in range(3):  # Crawl 3 pages\n",
    "            result = await crawler.arun(\n",
    "                url=url,\n",
    "                session_id=session_id,\n",
    "                css_selector=\"li.Box-sc-g0xbh4-0\",\n",
    "                js_code=js_next_page if page > 0 else None,\n",
    "                bypass_cache=True,\n",
    "                js_only=page > 0\n",
    "            )\n",
    "\n",
    "            assert result.success, f\"Failed to crawl page {page + 1}\"\n",
    "\n",
    "            soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n",
    "            commits = soup.select(\"li\")\n",
    "            all_commits.extend(commits)\n",
    "\n",
    "            print(f\"Page {page + 1}: Found {len(commits)} commits\")\n",
    "\n",
    "        await crawler.crawler_strategy.kill_session(session_id)\n",
    "        print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n",
    "\n",
    "await crawl_typescript_commits()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using JsonCssExtractionStrategy for Fast Structured Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def extract_news_teasers():\n",
    "    schema = {\n",
    "        \"name\": \"News Teaser Extractor\",\n",
    "        \"baseSelector\": \".wide-tease-item__wrapper\",\n",
    "        \"fields\": [\n",
    "            {\n",
    "                \"name\": \"category\",\n",
    "                \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"headline\",\n",
    "                \"selector\": \".wide-tease-item__headline\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"summary\",\n",
    "                \"selector\": \".wide-tease-item__description\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"time\",\n",
    "                \"selector\": \"[data-testid='wide-tease-date']\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"image\",\n",
    "                \"type\": \"nested\",\n",
    "                \"selector\": \"picture.teasePicture img\",\n",
    "                \"fields\": [\n",
    "                    {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n",
    "                    {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n",
    "                ],\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"link\",\n",
    "                \"selector\": \"a[href]\",\n",
    "                \"type\": \"attribute\",\n",
    "                \"attribute\": \"href\",\n",
    "            },\n",
    "        ],\n",
    "    }\n",
    "\n",
    "    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n",
    "\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            extraction_strategy=extraction_strategy,\n",
    "            bypass_cache=True,\n",
    "        )\n",
    "\n",
    "        assert result.success, \"Failed to crawl the page\"\n",
    "\n",
    "        news_teasers = json.loads(result.extracted_content)\n",
    "        print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n",
    "        print(json.dumps(news_teasers[0], indent=2))\n",
    "\n",
    "await extract_news_teasers()"
   ]
  },
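  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to keep the structured output around, the optional helper below (not part of the original example) writes the extracted records to disk using only the standard library. The file names and the tiny inline sample are placeholders; in real use you would pass the `result.extracted_content` string from the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional helper: persist structured output to disk (standard library only).\n",
    "# Pass it the JSON string from result.extracted_content; the file name is a placeholder.\n",
    "import json\n",
    "\n",
    "def save_extracted(content, path=\"news_teasers.json\"):\n",
    "    data = json.loads(content)\n",
    "    with open(path, \"w\", encoding=\"utf-8\") as f:\n",
    "        json.dump(data, f, ensure_ascii=False, indent=2)\n",
    "    print(f\"Wrote {len(data)} records to {path}\")\n",
    "\n",
    "# Tiny inline sample so the cell runs on its own; real use: save_extracted(result.extracted_content)\n",
    "save_extracted('[{\"headline\": \"Sample\", \"summary\": \"Illustrative record\"}]', \"sample_teasers.json\")"
   ]
  },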
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Speed Comparison\n",
    "\n",
    "Let's compare the speed of Crawl4AI with Firecrawl, a paid service. Note that we can't run Firecrawl in this Colab environment, so we'll simulate its performance based on previously recorded data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "async def speed_comparison():\n",
    "    # Simulated Firecrawl performance\n",
    "    print(\"Firecrawl (simulated):\")\n",
    "    print(\"Time taken: 7.02 seconds\")\n",
    "    print(\"Content length: 42074 characters\")\n",
    "    print(\"Images found: 49\")\n",
    "    print()\n",
    "\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        # Crawl4AI simple crawl\n",
    "        start = time.time()\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            word_count_threshold=0,\n",
    "            bypass_cache=True,\n",
    "            verbose=False\n",
    "        )\n",
    "        end = time.time()\n",
    "        print(\"Crawl4AI (simple crawl):\")\n",
    "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
    "        print(f\"Content length: {len(result.markdown)} characters\")\n",
    "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
    "        print()\n",
    "\n",
    "        # Crawl4AI with JavaScript execution\n",
    "        start = time.time()\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n",
    "            word_count_threshold=0,\n",
    "            bypass_cache=True,\n",
    "            verbose=False\n",
    "        )\n",
    "        end = time.time()\n",
    "        print(\"Crawl4AI (with JavaScript execution):\")\n",
    "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
    "        print(f\"Content length: {len(result.markdown)} characters\")\n",
    "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
    "\n",
    "await speed_comparison()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, Crawl4AI outperforms Firecrawl significantly:\n",
    "\n",
    "- Simple crawl: Crawl4AI is typically over 4 times faster than Firecrawl.\n",
    "- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n",
    "\n",
    "Please note that actual performance may vary depending on network conditions and the specific content being crawled."
   ]
  },
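  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To turn the timings above into a single number, the small optional cell below computes the speed-up ratio. The Crawl4AI time is a placeholder; substitute the value you actually measured when running the comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: compute the speed-up ratio from the timings printed above.\n",
    "# The Crawl4AI time below is a placeholder; substitute your own measurement.\n",
    "firecrawl_time = 7.02  # seconds, simulated value used above\n",
    "crawl4ai_time = 1.6    # seconds, placeholder -- replace with your measured time\n",
    "print(f\"Speed-up: {firecrawl_time / crawl4ai_time:.1f}x\")"
   ]
  },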
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this notebook, we've explored the powerful features of Crawl4AI, including:\n",
    "\n",
    "1. Basic crawling\n",
    "2. JavaScript execution and CSS selector usage\n",
    "3. Proxy support\n",
    "4. Structured data extraction with OpenAI\n",
    "5. Advanced multi-page crawling with JavaScript execution\n",
    "6. Fast structured output using JsonCssExtractionStrategy\n",
    "7. Speed comparison with other services\n",
    "\n",
    "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
    "\n",
    "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n",
    "\n",
    "Happy crawling!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}