# Quick Start Guide πŸš€

Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies. Let's dive in! 🌟

## Getting Started πŸ› οΈ

First, let's create an instance of `WebCrawler` and call the `warmup()` function. This might take a few seconds the first time you run Crawl4AI, as it loads the required model files.

```python
from crawl4ai import WebCrawler

def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler

crawler = create_crawler()
```

### Basic Usage

Simply provide a URL and let Crawl4AI do the magic!

```python
result = crawler.run(url="https://www.nbcnews.com/business")
print(f"Basic crawl result: {result}")
```

### Taking Screenshots πŸ“Έ

Let's take a screenshot of the page! The screenshot comes back as a base64-encoded string, so we decode it before writing it to disk.

```python
import base64

result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
print("Screenshot saved to 'screenshot.png'!")
```

### Understanding Caching 🧠

By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.

First crawl (caches the result):

```python
result = crawler.run(url="https://www.nbcnews.com/business")
print(f"First crawl result: {result}")
```

Force a fresh crawl, bypassing the cache:

```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
print(f"Second crawl result: {result}")
```

### Adding a Chunking Strategy 🧩

Let's add a chunking strategy: `RegexChunking`! This strategy splits the text based on a given regex pattern.
```python
from crawl4ai.chunking_strategy import RegexChunking

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
print(f"RegexChunking result: {result}")
```

You can also use `NlpSentenceChunking`, which splits the text into sentences using NLP techniques.

```python
from crawl4ai.chunking_strategy import NlpSentenceChunking

result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
print(f"NlpSentenceChunking result: {result}")
```

### Adding an Extraction Strategy 🧠

Let's get smarter with an extraction strategy: `CosineStrategy`! This strategy uses cosine similarity to extract semantically similar blocks of text.

```python
from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(
        word_count_threshold=10,
        max_dist=0.2,
        linkage_method="ward",
        top_k=3
    )
)
print(f"CosineStrategy result: {result}")
```

You can also pass other parameters like `semantic_filter` to extract specific content.

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(
        semantic_filter="inflation rent prices"
    )
)
print(f"CosineStrategy result with semantic filter: {result}")
```

### Using LLMExtractionStrategy πŸ€–

Time to bring in the big guns: `LLMExtractionStrategy`, first without instructions! This strategy uses a large language model to extract relevant information from the web page.

```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY')
    )
)
print(f"LLMExtractionStrategy (no instructions) result: {result}")
```

You can also provide specific instructions to guide the extraction.
```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
print(f"LLMExtractionStrategy (with instructions) result: {result}")
```

### Targeted Extraction 🎯

Let's use a CSS selector to extract only H2 tags!

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
print(f"CSS Selector (H2 tags) result: {result}")
```

### Interactive Extraction πŸ–±οΈ

Let's pass JavaScript code to click the 'Load More' button!

```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code
)
print(f"JavaScript Code (Load More button) result: {result}")
```

### Using Crawler Hooks πŸ”—

Let's see how we can customize the crawler using hooks! Here we register a hook that pauses for five seconds after each page load.

```python
import time

from crawl4ai.web_crawler import WebCrawler
from crawl4ai.crawler_strategy import LocalSeleniumCrawlerStrategy

def delay(driver):
    print("Delaying for 5 seconds...")
    time.sleep(5)
    print("Resuming...")

def create_crawler():
    crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
    crawler_strategy.set_hook('after_get_url', delay)
    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
    crawler.warmup()
    return crawler

crawler = create_crawler()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
```

Check [Hooks](examples/hooks_auth.md) for more examples.

## Congratulations! πŸŽ‰

You've made it through the Crawl4AI Quickstart Guide! Now go forth and crawl the web like a pro! πŸ•ΈοΈ
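P.S. Curious what `RegexChunking` from the chunking section is doing under the hood? Conceptually, it just splits text on regex patterns and keeps the non-empty pieces. Here's a minimal standard-library sketch of that idea (an illustration only, not Crawl4AI's actual implementation; the helper name `regex_chunk` is made up):

```python
import re

def regex_chunk(text, patterns=("\n\n",)):
    """Split text on each regex pattern in turn, dropping empty chunks."""
    chunks = [text]
    for pattern in patterns:
        chunks = [
            piece
            for chunk in chunks
            for piece in re.split(pattern, chunk)
            if piece.strip()
        ]
    return chunks

page_text = "Markets rallied today.\n\nRents kept climbing.\n\nThe Fed held rates steady."
print(regex_chunk(page_text))  # three paragraph-level chunks
```

Splitting on `"\n\n"` treats blank lines as paragraph boundaries, which is why it's a sensible default pattern for crawled page text.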