firecrawl/examples/blog-articles/mastering-the-crawl-endpoint/mastering-the-crawl-endpoint.ipynb


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mastering Firecrawl's Crawl Endpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Web scraping has become even more essential as businesses race to convert unprecedented amounts of online data into LLM-friendly formats. Firecrawl streamlines this process with powerful automation and scalability. \n",
"\n",
"In this tutorial, we will focus on its web crawling feature - the `/crawl` endpoint, which allows you to scrape websites in their entirety. It can recursively traverse website sub-pages, handle dynamic JavaScript-based content, bypass any web blockers and return a clean output for LLM consumption. \n",
"\n",
"In this guide, we'll explore both basic and advanced features of the endpoint, and learn solutions to common web crawling challenges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Understanding Web Crawling vs. Scraping\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The terms _web crawling_ and _web scraping_ are used almost interchangeably but there are distinctions. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What's the Difference?"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"source": [
"_Web scraping_ refers to extracting specific data from individual web pages like a Wikipedia article or a technical tutorial. It is primarily used when you need specific information from pages with _known URLs_.\n",
"\n",
"On the other hand, in _web crawling_, you systematically browse and discover web pages by following links. It focuses on website navigation and URL discovery. \n",
"\n",
"For example, let's say I want to build a chatbot that answers questions related to Stripe's documentation. I would need web crawling to discover and traverse all the pages in Stripe's documentation site, following links between different sections and guides. Then, web scraping would extract the actual content from each discovered page, cleaning and converting it into a format suitable for training the chatbot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How Firecrawl Combines Both"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl's `/crawl` endpoint combines both capabilities. It features:\n",
"1. URL analysis: Identifies links through sitemap or page traversal\n",
"2. Recursive traversal: Follows links to discover sub-pages\n",
"3. Content scraping: Extracts clean content from each page\n",
"4. Results compilation: Converts everything to structured data\n",
"\n",
"For example, when you pass the URL https://docs.stripe.com/api to the endpoint, it automatically discovers and crawls all documentation sub-pages. The endpoint returns the content in your preferred format - whether that's markdown, HTML, screenshots, links, or metadata. All you need to do is make a single API call and let Firecrawl handle the rest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web Crawling Basics with Firecrawl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl is a web scraping engine exposed as a REST API. This means you can use it from the command-line using cURL or using of its language SDKs for Python, Node, Go, or Rust. For the rest of the tutorial, we will focus on its Python SDK.\n",
"\n",
"To get started:\n",
"\n",
"1. Sign up at [firecrawl.dev](firecrawl.dev) and copy your API key from your account dashboard.\n",
"\n",
"2. Save the key as an environment variable with:\n",
"\n",
"```bash\n",
"export FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'\n",
"```\n",
"\n",
"or as a more permanent option, use a dot-env file:\n",
"\n",
"\n",
"```bash\n",
"$ touch .env\n",
"$ echo \"FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'\" >> .env\n",
"```\n",
"\n",
"Then, you can use the `load_dotenv` function of the `dotenv` library to load the variable into your environment."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from firecrawl import FirecrawlApp # pip install firecrawl-py\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"\n",
"app = FirecrawlApp()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once your API key is loaded, the `FirecrawlApp` class uses it to establish a connection with the Firecrawl API engine."
]
},
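{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer not to rely on environment variables, the constructor also accepts the key directly. Here is a minimal sketch (the placeholder string is not a real key):\n",
"\n",
"```python\n",
"from firecrawl import FirecrawlApp\n",
"\n",
"# Pass the key explicitly instead of relying on FIRECRAWL_API_KEY\n",
"app = FirecrawlApp(api_key=\"fc-YOUR-KEY-HERE\")\n",
"```"
]
},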
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACeQAAAeoCAYAAABOJ0KRAAAAAXNSR0IArs4c6QAAAGxlWElmTU0AKgAAAAgABAEaAAUAAAABAAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAIdpAAQAAAABAAAATgAAAAAAAACQAAAAAQAAAJAAAAABAAKgAgAEAAAAAQAACeSgAwAEAAAAAQAAB6gAAAAANv7RIAAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAQABJREFUeAHs3QecVfWd9/GfwAwDUxiG3gcRUEBEaYq99xaxRY0mZlM2pu1u1k3ypGyyKZv2ZBOflE1isCQaY4u9K4gKokgVKUodBphhCjMDQxl8/t+rZzj3zDnn3jtzB0b4/J/X7L339PM+xX1efPf3O6yqqup9YyCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAQJsEOrVpbVZGAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAIGEAIE8bgQEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEsiBAIC8LiGwCAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQJ53AMIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIZEGAQF4WENkEAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgTyuAcQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQyIIAgbwsILIJBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBAjkcQ8ggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgggkAUBAnlZQGQTCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCBDI4x5AAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAIAsCBPKygMgmEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEECCQxz2AAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAQBYECORlAZFNIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIEAgj3sAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAgSwIEMjLAiKbQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQIBAHvcAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAlkQIJCXBUQ2gQACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggACBPO4BBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBLIgQCAvC4hsAgEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAECedwDCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCGRBgEBeFhDZBAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIE8rgHEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEMiCAIG8LCCyCQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQI5HEPIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIJAFAQJ5WUBkEwgggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggQyOMeQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQCALAgTysoDIJhBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBAgkMc9gAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggEAWBAjkZQGRTSCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCBAII97AAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAIEsCBDIywIim0AAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEECAQB73AAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAJZECCQlwVENoEAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAgTzuAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQSyIEAgLwuIbAIBBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABAnncAwgggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAghkQYBAXhYQ2QQCCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACBPK4BxBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBDIgkCXLGyDTSCAAAIdSqBme5NtqNllZdW73Odu27Rtl3XqdJgVdu1kRXmdrXdBjo3pn2fDene1wzrUkXMwCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAh9lgQ4RyHvm7Vp7Y11DVh0PO+wwK+nexQYV59jA4lwb1CPH+hTmmMvkMNpB4IkltbZgQ/g1HNE7z66eVNIOe2WTCHwgUN+4155eVmsvrthmmxt22/vvpyfjXhM2qDDXLhjbw04fXWQ5nXlBpCfHUggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAJhAodVVVWlGV0JWz07075431orr9udnY2l2MqIkq5287Q+NqpfXoolmZ2JwOfuX
WOVDXtCVxlclGu/vHJo6DwmItAWgfqdTfaH2RX2ypr6tmwmsa4q6J0xojDxfsjpQjCvzaBsAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQOAQFOkSFvP3p/m7VTvvGYxustKcL5p3Qx44aQDBvf/qzr3iBXXv22u694RnZzq65al5up/gNHEJzn3+n1n7/WqXtjfDKlELbeW7ltkS478un9rNJw/Iz3QTLI4AAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCBziAodcIM+73muqd9q3nthg543uYZ8+qY83mU8EDqjAfz6+0ZZXNoYeQ4EL48244fDQeYfaxLvmVto/ltS0y2nv2L3Xfvxcud04ubddPL64XfbBRhFAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQOToFDvtzWU8trbcZrFQfn1eWsEDgIBe6c035hPD/XHfMq7YG3qv2T+I4AAggggAACCCCAAAII
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will crawl the https://books.toscrape.com/ website, which is built for web-scraping practice:\n",
"\n",
"![Books to Scrape practice website](attachment:image.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of writing dozens of lines of code with libraries like `beautifulsoup4` or `lxml` to parse HTML elements, handle pagination and data retrieval, Firecrawl's `crawl_url` endpoint lets you accomplish this in a single line:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"base_url = \"https://books.toscrape.com/\"\n",
"crawl_result = app.crawl_url(url=base_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a dictionary with the following keys:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data'])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crawl_result.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we are interested in the status of the crawl job:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'completed'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crawl_result['status']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it is completed, let's see how many pages were scraped:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1195"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crawl_result['total']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost 1200 pages (it took about 70 seconds on my machine; the speed vary based on your connection speed). Let's look at one of the elements of the `data` list:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"- [Home](../../../../index.html)\n",
"- [Books](../../books_1/index.html)\n",
"- Womens Fiction\n",
"\n",
"# Womens Fiction\n",
"\n",
"**17** results.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"**Warning!** This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.\n",
"\n",
"01. [![I Had a Nice Time And Other Lies...: How to find love & sh*t like that](../../../../media/cache/5f/72/5f72c8a0d5a7292e2929a354ec8a022f.jpg)](../../../i-had-a-nice-time-and-other-lies-how-to-find-love-sht-like-that_814/index.html)\n"
]
}
],
"source": [
"sample_page = crawl_result['data'][10]\n",
"markdown_content = sample_page['markdown']\n",
"\n",
"print(markdown_content[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The page corresponds to Women's Fiction page:"
]
},
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACggAAAeUCAYAAACJow65AAAAAXNSR0IArs4c6QAAAGxlWElmTU0AKgAAAAgABAEaAAUAAAABAAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAIdpAAQAAAABAAAATgAAAAAAAACQAAAAAQAAAJAAAAABAAKgAgAEAAAAAQAACgigAwAEAAAAAQAAB5QAAAAA5NitcgAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAQABJREFUeAHs3QecVfWd9/GfwAwDUxiG3gcRUEBEaYq99xaxRY0mZlM2pu1u1k3ypGyyKZv2ZBOflE1isCQaY4u9K4gKokgVKUodBphhCjMDQxl8/t+rZzj3zDnn3jtzB0b4/J/X7L339PM+xX1efPf3O6yqqup9YyCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAwEEl0OmgOhtOBgEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEgIEBLkREEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEDgIBQgIHoQXlVNCAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAgIAg9wACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACB6EAAcGD8KJySggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggQEOQeQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQOAgFCAgeBBeVE4JAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQKC3AMIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIHIQCBAQPwovKKSGAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCBAQJB7AAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAIGDUICA4EF4UTklBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBAgIcg8ggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgggcBAKEBA8CC8qp4QAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAUHuAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQOQgECggfhReWUEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEECAgyD2AAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAwEEoQEDwILyonBICCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACBAS5BxBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBA4CAUICB6EF5VTQgABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQICAIPcAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgehAAHBg/CickoIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIEBDkHkAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEDgIBQgIHgQXlROCQEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAECgtwDCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCByEAgQED8KLyikhgAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgggQECQewABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQACBg1CAgOBBeFE5JQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQICHIPIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIHAQChAQPAgvKqeEAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAFB7gEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEDkKBLgfhOXFKCCCAQKhAzfYm21Czy8qqd7nP3bZp2y7r1OkwK+zayYryOlvvghwb0z/PhvXuaoeFboGJCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgh8dAQ6dEDwmbdr7Y11DVnVPOyww6ykexcbVJxjA4tzbVCPHOtTmGMuI8RoB4EnltTagg3h13BE7zy7elJJO+yVTSLwgUB94157elmtvbhim21u2G3vv5+ejHtN2KDCXLtgbA87fXSR5XTmBZGeHEshgAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIdCSBw6qqqtKMzOz/w/7ifWutvG73ftnxiJKudvO0PjaqX95+2d+hspPP3bvGKhv2hJ7u4KJc++WVQ0PnMRGBtgjU72yyP8yusFfW1LdlM4l1VWHwjBGFifdDTheCgm0GZQMIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAAC+02gQ1cQ3G8KbkfvVu20bzy2wUp7uqDgCX3sqAEEBfenP/uKF9i1Z6/t3hue5e3smuHm5XaK38AhNPf5d2rt969V2t4Ir0wptJ3nVm5LhA2/fGo/mzQsP9NNsDwCCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAgdEgIBggH1N9U771hMb7LzRPezTJ/UJzOUnAgdG4D8f32jLKxtDd17gwoEzbjg8dN6hNvGuuZX2jyU17XLaO3bvtR8/V243Tu5tF48vbpd9sFEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBLIpQNmxCM2nltfajNcqIuYyGQEEOprAnXPaLxzoP9c75lXaA29V+yfxHQEEEEAAAQQQQ
AABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQACBDilAQDDmsjz2dq2t3bozZglmIYBARxBYWr7DHlnaPpUDw87vnvlbbVl5eEXHsOWZhgACCCCAAAIIIIAAAggggAACCCCAAAIIIIAAAggggAACCCCAAAIIIIDAgRCgxXAK9d/M2mL/ffmQFEsxGwEEDpTA+27HP3m2PK3dH3aY2VF98mzysALrU9jFCvM6W1XDHiuv3W1lNbvtjfX1trNJW0w9/vu5jfbH64Zbl05uowwEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBDqgwEc6IHjWyCI796geabM27N6bCAJtqN5pL62qM/1ONd6t2mm79uy13C4UW0xlxXwEDoTA66vr03qWLzu62K48tsS65kQ/y3v39rXnl9fZfa5CYHVjU+zp1O/aaw8vqLbpx5XELsdMBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQQQAABBBBAAAEEEEAAAQQOlMBHOiDY
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"![image.png](attachment:image.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl also includes page metadata in the element's dictionary as well:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'url': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',\n",
" 'title': '\\n Womens Fiction | \\n Books to Scrape - Sandbox\\n\\n',\n",
" 'robots': 'NOARCHIVE,NOCACHE',\n",
" 'created': '24th Jun 2016 09:29',\n",
" 'language': 'en-us',\n",
" 'viewport': 'width=device-width',\n",
" 'sourceURL': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',\n",
" 'statusCode': 200,\n",
" 'description': '\\n \\n',\n",
" 'ogLocaleAlternate': []}"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sample_page['metadata']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing we didn't mention is how Firecrawl handles pagination. If you scroll to the bottom of Books-to-Scrape, you will see that it has a \"next\" button:"
]
},
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABwwAAAOsCAYAAABXukIXAAAAAXNSR0IArs4c6QAAAGxlWElmTU0AKgAAAAgABAEaAAUAAAABAAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAIdpAAQAAAABAAAATgAAAAAAAACQAAAAAQAAAJAAAAABAAKgAgAEAAAAAQAABwygAwAEAAAAAQAAA6wAAAAAbrv0BwAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAQABJREFUeAHsvdm2ZMlxpucxDyfOlGNlTZiIAghwkcSiWhQp6kJa/SD9BnoWvYZudU+pW6SWmmoKZAMgBoIF1JDzmU/Mo77PdkRVLTZ7iQSrCpkVtjPjRMTevt3Nf99h5mbmZl7bcJQ8EoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEYC8RqO9lr7PTiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAiEAikwzAfhEQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgURgjxFIh+EeD352PRFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIh2E+A4lAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAonAHiOQDsM9HvzseiKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQDsN8BhKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBPUYgHYZ7PPjZ9UQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgHYb5DCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCe4xAOgz3ePCz64lAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAOgzzGUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgE9hiBdBju8eBn1xOBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBdBjmM5AIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAI7DEC6TDc48HPricCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQC6TDMZyARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCAR2GME0mG4x4OfXU8EEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgE0mGYz0AikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAisMcIpMNwjwc/u54IJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIpMMwn4FEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEYI8RSIfhHg9+dj0RSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSIdhPgOJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJwB4jkA7DPR787HoikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikA7DfAYSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgT1GIB2Gezz42fVEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIB2G+QwkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAnuMQDoM93jws+uJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQDoM8xlIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBPYYgXQY7vHgZ9cTgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgXQY5jOQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCOwxAukw3OPBz64nAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAukwzGcgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEdhjBNJhuMeDn11PBBKBRCARSAQSgUQgEUgEEoFEIBFIBBKBRCARSAQSgUQgEUgEEoFEIBFIBNJhmM9AIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIrDHCKTDcI8HP7ueCCQCiUAikAgkAolAIpAIJAKJQCKQCCQCiUAikAgkAolAIpAIJAKJQCKQCDQTgtcPgel0WobDYZlMJmW5XJbNZvP6dSIpTgRecwRqtVppNpul1+uVwWBQut3ua96jJD8R+OoikHLzqzu22bPPD4GUa58flllTIvCqIZBy8FUbkaTnVUQg5eCrOCpJUyLwr0Mg5d+/Dr+8ez8RSHm4n+P+2V7XcDalt+mziLzin8/OzsrNzc0rTmWSlwjsHwJHR0fl3r17+9fx7HEi8IojkHLzFR+gJO+VRSDl2is7NElYIvAvQiDl4L8IriycCHyCQMrBT6DID4nAa4lAyr/XctiS6FcQgZSHr+CgfMEkpcPwCwb486z+6dOnEVVonScnJ+Xg4KC02+2i5z+PRCAR+HIRcK3FfD4vo9GoXF1dReNGGz569OjLJSRbSwQSgf8qAik3/6vQ5IVE4L9AIOXafwFJnkgEXnsEUg6+9kOYHfgSEUg5+CWCnU0lAl8wAin/vmCAs/qvNAIpD7/Sw/vP6lw6DP9ZMP32C+1WxrRarfLgwYPS6XR++0QlBYlAIhAIzGaz8uLFi7JYLEquvMmHIhF4NRBIuflqjENS8XoikHLt9Ry3pDoR+CwCKQc/i0Z+TgT+ZQikHPyX4ZWlE4FXCYGUf6/SaCQtrzsCKQ9f9xH8zeiv/2a35V1fJgLm3N6lIU1n4ZeJfLaVCPzzENCB72/Tw9+qv9k8EoFE4LeHQMrN3x722fJXA4GUa1+Nccxe7C8CKQf3d+yz558PAikHPx8cs5ZE4MtGIOXfl414tvdVRyDl4Vd9hP/p/qXD8J/G5ZU6OxwOgx7TkPpDzSMRSARePQT8bfob9dj9Zl89KpOiRGA/ENj9BlNu7sd4Zy+/G
ARSrn0xuGaticCXgUDKwS8D5Wzjq45AysGv+ghn/76KCKT8+yqOavbpt41AysPf9gh8+e2nw/DLx/xf3OJkMol73LMwj0QgEXh1Edj9Rne/2VeX0qQsEfhqI7D7De5+k1/t3mbvEoEvDoHdb2j3m/riWsqaE4FE4PNEYPeb3f2GP8+6s65EYJ8Q2P2Gdr+pfep79jUReB0R2P1Wd7/d17EPSXMi8CoisPtN7X5jryKNSdPnh0A6DD8/LL+wmpbLZdTdbre/sDay4kQgEfjXI7D7je5+s//6GrOGRCAR+E0Q2P0Gd7/J36SOvCcRSARK2f2Gdr+pxCQRSAReDwR2v9ndb/j1oDqpTARePQR2v6Hdb+rVozApSgQSgc8isPut7n67n72WnxOBROA3R2D3m9r9xn7zmvLO1wGBdBi+BqO02WyCylqt9hpQmyQmAvuLwO43uvvN7i8S2fNE4LeLwO43uPtN/napydYTgdcXgd1vaPeben17kpQnAvuFwO43u/sN71fvs7eJwOeHwO43tPtN
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"![image.png](attachment:image.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to sub-pages like `books.toscrape.com/category`, Firecrawl first scrapes all sub-pages from the homepage. Later, if a sub-page includes links to already scraped pages, they are ignored."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Configuration Options\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl offers several types of parameters to configure how the endpoint crawls over websites. We will outline them here with their use-cases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scrape Options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On real-world projects, you will tweak this parameter the most frequently. It allows you to control how a webpage's contents are saved. Firecrawl allows the following formats:\n",
"- Markdown - the default\n",
"- HTML\n",
"- Raw HTML (simple copy/paste of the entire webpage)\n",
"- Links\n",
"- Screenshot\n",
"\n",
"Here is an example request to scrape the Stripe API in four formats:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Crawl the first 5 pages of the stripe API documentation\n",
"stripe_crawl_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/api\",\n",
" params={\n",
" \"limit\": 5, # Only scrape the first 5 pages including the base-url\n",
" \"scrapeOptions\": {\n",
" \"formats\": [\"markdown\", \"html\", \"links\", \"screenshot\"]\n",
" }\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you specify multiple formats, each webpage's data contains separate keys for each format's content:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['html', 'links', 'markdown', 'metadata', 'screenshot'])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stripe_crawl_result['data'][0].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The value of the `screenshot` key is a temporary link to a PNG file stored on Firecrawl's servers and expires within 24 hours. Here is what it looks like for Stripe's API documentation homepage:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAAAXNSR0IArs4c6QAAIABJREFUeJzs3Xd4FOUWBvAzM1uz6b0RQu+9E3oRkKYISJUqTWyASFEUpSgWRLkKimKhKUV6R6SX0JIQSiAkpCebttnsZuvM/WPCsqZBYgrg+3t8uLOz33zzbQncvHv2DCMIAgHAv3A7JrFejYCqXgUAAAAAAAAAAMATh63qBQAAAAAAAAAAAADAswkBNAAAAAAAAAAAAABUCATQAAAAAAAAAAAAAFAhEEADAAAAAAAAAAAAQIVAAA0AAAAAAAAAAAAAFQIBNAAAAAAAAAAAAABUCATQAAAAAAAAAAAAAFAhEEADAAAAAAAAAAAAQIVAAA0AAAAAAAAAAAAAFQIBNAAAAAAAAAAAAABUCATQAAAAAAAAAAAAAFAhEEADAAAAAAAAAAAAQIVAAA0AAAAAAAAAAAAAFQIBNAAAAAAAAAAAAABUCElVLwCgHPA8X8K9LIsPWgAAAAAAAAAAAKpAFQTQ6vSMw0f+irh+8/qNW3v/3FT5C4BnhiAIgiCIG7Y9tnsZhhE3xHiaYRjbHgAAAAAAAAAAAKgE5RxA52i1YeGRfr4+tWvVKG7MpctXV63+vnzPC/81BaJn8U+GYTiOE1NmQRB4nrfPo8WdiKEBAAAAAAAAAAAqTXkG0K+/Pe/CxctEtGjBnBICaJvqQdXK8ezw32GfPotBM8dxHMfZjxHDaCKyWq1Wq1XswsEwjC2qrrrlAwAAAAAAAAAA/FeUZwAdH58objAlttytW6fWZ58sbt60sYuLczmeHf4LClQ9izXOUqm0hECZ4ziWZc1mM8MwLMvaMmjE0AAAAAAAAAAAABWtjAH0zt37w6/fSFdn+Pp6VwsMaNy4QYtmTRg2P847cfJMYlKyg1I5ZtSw73/8lYgcVarGjer//OvmgAC/QQP63Y66ezvqLsswkyeO3b33YEpqGhENeP65P3ftC4+4wbLMkMEDevfqZn9Gg8Hw+9adZ8+HEpG/n8/kCWMDAvzK4xmAp4Z9+szzvJg+y2SyAjkyH7aGiNhm02x7xGroFHWmmEHbqqHFP1mGkUg4uUwql0mr4mEBAAAAAAAAAAA8s5gCTXIfKUernfjq63EPip1FdWrX3PjL2heHv5KYmGzb6e7mumvHxs7d+4s33dxcs7Ky69SuOWbUsA8++lTcefHMkcnT3gqPiCQilUql0+lshzesX3ftdyvlMhkRxcTGzXxzrjo9w/6kn3+yuEvnjkUu8sg97Z+3NUT0TgfvGq4y2/5vQtNvphvEbQlLfo7SFr4OvWo4SVjKNfFzjyUpJMyXvQNK9YRApSlQ+MzzPMuyUqmUiMikEVKuMEHdich6aiERcZ2XEpEQd5zxaUFyVyLK0eq0Or1YBC3+WaAIWiqROKmULFvqsujbMYn1auBtAwAAAAAAAAAAUFBJvTKKtHLVdwXSZ5sismy7PVlZ2UWOse2xT5+J6MatqM1btovb78xbVCB9JqKFHyzT5GiLXMm5hPypLiTqC9/r5yip4y6v4SrP0Ft2R2l2RWmKnASeKIXTZ7H5Rv696ZH8lVV8+PdExDWdzDWdLJZC81dWCRk3xDHOTioJx9ofXuANabZYcnRFvGEAAAAAAADgGWC1WtuG9O47YFhVL+SJc+V67oz3oka+fuPo6ayqXgv8F4Xf0g2bEbnyx/gnaiooR6VuwXH+4mVxY8TwIbPenH4/Lj708rXQS1eIiB5EeeNfGdmuTSupVEL/7I3Qo1vnNq1bFJjQVoJaLTBg2NDBLs5OK79ek52tIaKNW7aNf2Xk7r0Hxcjbzc3153Wr/Xx9Xntjbujlq0ajceeufePGjigw4X2NKUVnCXSSJmrNl5J0wxq6cv8saR1cz7Wpt4KIErXmpadTryTrX6rvUtrnASqTLSa2j6FZu6sOMv4dmZr92WpdiYjk+b3F2WrdeYZj/B+WyTuqHDKzcxiGETtBi+892wYRWSxWg8msQC8OAAAAAAB4ImlytKtWrz13PtSQZ6hbt/brMyY3btSAiNLS1J9+/nVYeKRMLuvaueOsN6eL9TonTp394cdf78cluLu5Pte7+9TJ4ySShznA2XMX35qzkIg2//Z9rZo1CpxLEITtO/d+87/v8/IMG35eU7dOLXH/1WvhX//vh3sxsV6eHiOGDxk6ZNDjrzP00tU1P/wcfS9GJpM1blj/jZlTgqsHVfBz9hDDMBPGjXJwUFbaGR/T7CV345KMbi6SNcvq2b6UezFM+9naONsYRweueqBixEDv+rUcdHp+/Jybchmz4auG9vMUOIRlydNN2rKx08sDvB1VHBXvtx0p6kzzi308awc/cU9OJVi/NWX/8fyqR5mU8fOWd2jpPLCXp0xaxReOiok3bD+gvn1Pn5NrcVByTeurRg7y8fWSPcahFeX7TUlxScYlcwr+dfGYzl/N2fdXRkKyUW+wujpLOrZyGTHQRy5jvD2kQ/p6VQ9Q/PsVluNUUI5KHUDzPC9unD1/sXatGr17dhv64sChLw60v6tGcFCrls2IyGgy2Q5c9vF7vXp0JaIDh44WOXOXTh1GDHuRiJycnGa98x4RaTQ5oZevHj56XBww5IUBfr4+RDRu7IjQy1eJ6My5C4UDaLHquXOQY0Ra3nW1ISw1r6Vv0X+BuiklRKQ1WUv7JECVEB4Q65dZ5h/1+2zTV4mIv7hCSDorRtJs27msez37MRIJJx5eOHq2MRoRQAMAAAAAwBPqw48/PXP2Qt06tdzcXC9cvPzW7IU7tv7i7OS0YNHS8IjItm1aZmVlb/9zj6Oj6rVpkyJv3Jo7/0O5XN6hXesbN2//8tsWuUw2eeJYcSqDwfDp519zHGu18oVPZLVa33h7/uWrYUrlP36hzs7WzJq7yGQydQppFx5xY8UX3wQG+rdv2/px1mk0mmbNfZ+IRo14KTMza9eeA7H343f88UvFPV0FsCw7fcqESjvdY4pPNsYlGRVyNktjiYzSNamnsr/X1VlSr6YDEaVlmCKjdB+tiv18YW0Xp5LCHNshFotwO0Z/8ERmarppwWvVSzgkU2NmGBo12Kf8HtbTJzhQ4e8jy86xRN/P27In7cK1nI9m1VDIS905oLykZ5o/XBljMPLtWji7u0rvxOrPXs65E5P3zUd1uSpbFF2O0Hp5lDEBDw3TfvFDvMqBbdfcRS5jLkdo9x7L0Ggtb4wP9PWSjRzkXS4rLMepoByVOoDu3rXTjp17iSguLmHJ8i++XfPjlMnjhrwwwH7Mw7YGdv0NHJRFp8APi1sfVFB36tjO0VGVm6sTz3I/LkHc/+P6DT+u32B/rO0uG6tAoUl6lqGWfkqFhLmuNlxM1BUIoJO1ZoWE0Risx2NziaihJz4VeaIVKH8W02erlVco8j+/5cPWCDnxXOPx5OApps9EJCSdJWMW6dKskb8yztXEaxJKJVKrlWcYhud5juOKLoK24gMJAAAAAAB4EuVotbduRdWrW/vndas5jntz9oJz50Nv3ozy9vYKj4hs0
bzp6q8+zcsz9Bs0fOfu/TOmTjxx6qybq8vrr03p36935I1bE159/dyFS7YAeu26X/V5eR3atTl99kLhcxmNpmyNZv0P36z+dp1YASY6evykTqebPHHslEmvXL0WPvW12X/u2lcggC5unTzPG43GXj26iinwmXMXExKTtNpcJydH+8P3Hzzy68Y/EhOSPDzchw99YdSIl8T9e/cd+nnDluTkVD9fb1vldUZG5pdff3fp8jWT0dS0ScM5s2ZWCwwgIovF8u3anw4cOpabq6tdq8bM6ZNbtWxmtVo7dOnr7uZ6cO/WEk5UpDXrfrVYrK9NHV+4jEkQhNVr1kul0mmTx5b+VaVTF7OJ6MU+npt3p50J1RQIoGsGKeZMqSZuf/1zwqmLmnNXNH27epQwof0hWRrL1AW3w27kmsyCTMpkZFnWb02+eVdnMgstGztOHuHvoORGzIwUBw+bETl+qG//Hh5HT2cd+DsjOc3k6S59eYB3SGsXInpnWXRsgmHu1KA1GxMH9fIc/Jznzbv6TbtSYxM
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image\n",
"\n",
"Image(stripe_crawl_result['data'][0]['screenshot'])"
]
},
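{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since these links expire within 24 hours, you may want to download the images right away. Below is a minimal sketch using the `requests` library; it assumes the crawl above was run with `screenshot` among the requested formats:\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"# Download each screenshot before its temporary link expires\n",
"for idx, page in enumerate(stripe_crawl_result[\"data\"]):\n",
"    screenshot_url = page.get(\"screenshot\")\n",
"    if screenshot_url:\n",
"        response = requests.get(screenshot_url, timeout=30)\n",
"        response.raise_for_status()\n",
"        with open(f\"screenshot_{idx}.png\", \"wb\") as f:\n",
"            f.write(response.content)\n",
"```"
]
},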
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that specifying more formats to transform the page's contents can significantly slow down the process. \n",
"\n",
"Another time-consuming operation can be scraping the entire page contents instead of just the elements you want. For such scenarios, Firecrawl allows you to control which elements of a webpage are scraped using the `onlyMainContent`, `includeTags`, and `excludeTags` parameters.\n",
"\n",
"Enabling `onlyMainContent` parameter (disabled by default) excludes navigation, headers and footers:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"stripe_crawl_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/api\",\n",
" params={\n",
" \"limit\": 5,\n",
" \"scrapeOptions\": {\n",
" \"formats\": [\"markdown\", \"html\"], \n",
" \"onlyMainContent\": True,\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`includeTags` and `excludeTags` accepts a list of whitelisted/blacklisted HTML tags, classes and IDs:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Crawl the first 5 pages of the stripe API documentation\n",
"stripe_crawl_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/api\",\n",
" params={\n",
" \"limit\": 5,\n",
" \"scrapeOptions\": {\n",
" \"formats\": [\"markdown\", \"html\"],\n",
" \"includeTags\": [\"code\", \"#page-header\"],\n",
" \"excludeTags\": [\"h1\", \"h2\", \".main-content\"],\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Crawling large websites can take a long time and when appropriate, these small tweaks can have a big impact on the runtime."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### URL Control"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart from scraping configurations, you have four options to specify URL patterns to include or exclude during crawling:\n",
"\n",
"- `includePaths` - targeting specific sections\n",
"- `excludePaths` - avoiding unwanted content\n",
"- `allowBackwardLinks` - handling cross-references\n",
"- `allowExternalLinks` - managing external content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a sample request that uses these parameters:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total pages crawled: 134\n"
]
}
],
"source": [
"# Example of URL control parameters\n",
"url_control_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/\",\n",
" params={\n",
" # Only crawl pages under the /payments path\n",
" \"includePaths\": [\"/payments/*\"],\n",
" # Skip the terminal and financial-connections sections\n",
" \"excludePaths\": [\"/terminal/*\", \"/financial-connections/*\"],\n",
" # Allow crawling links that point to already visited pages\n",
" \"allowBackwardLinks\": False,\n",
" # Don't follow links to external domains\n",
" \"allowExternalLinks\": False,\n",
" \"scrapeOptions\": {\n",
" \"formats\": [\"html\"]\n",
" }\n",
" }\n",
")\n",
"\n",
"# Print the total number of pages crawled\n",
"print(f\"Total pages crawled: {url_control_result['total']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we're crawling the Stripe documentation website with specific URL control parameters:\n",
"\n",
"- The crawler starts at \"https://docs.stripe.com/\" and only crawls pages under the `\"/payments/*\"` path\n",
"- It explicitly excludes the `\"/terminal/*\"` and `\"/financial-connections/*\"` sections\n",
"- By setting allowBackwardLinks to false, it won't revisit already crawled pages\n",
"- External links are ignored (`allowExternalLinks: false`)\n",
"- The scraping is configured to only capture HTML content\n",
"\n",
"This targeted approach helps focus the crawl on relevant content while avoiding unnecessary pages, making the crawl more efficient and focused on the specific documentation sections we need."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another critical parameter is `maxDepth`, which lets you control how many levels deep the crawler will traverse from the starting URL. For example, a `maxDepth` of 2 means it will crawl the initial page and pages linked from it, but won't go further.\n",
"\n",
"Here is another sample request on the Stripe API docs:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total pages crawled: 99\n"
]
}
],
"source": [
"# Example of URL control parameters\n",
"url_control_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/\",\n",
" params={\n",
" \"limit\": 100,\n",
" \"maxDepth\": 2,\n",
" \"allowBackwardLinks\": False,\n",
" \"allowExternalLinks\": False,\n",
" \"scrapeOptions\": {\"formats\": [\"html\"]},\n",
" },\n",
")\n",
"\n",
"# Print the total number of pages crawled\n",
"print(f\"Total pages crawled: {url_control_result['total']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: When a page has pagination (e.g. pages 2, 3, 4), these paginated pages are not counted as additional depth levels when using `maxDepth`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performance & Limits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `limit` parameter, which we've used in previous examples, is essential for controlling the scope of web crawling. It sets a maximum number of pages that will be scraped, which is particularly important when crawling large websites or when external links are enabled. Without this limit, the crawler could potentially traverse an endless chain of connected pages, consuming unnecessary resources and time.\n",
"\n",
"While the limit parameter helps control the breadth of crawling, you may also need to ensure the quality and completeness of each page crawled. To make sure all desired content is scraped, you can enable a waiting period to let pages fully load. For example, some websites use JavaScript to handle dynamic content, have iFrames for embedding content or heavy media elements like videos or GIFs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stripe_crawl_result = app.crawl_url(\n",
" url=\"https://docs.stripe.com/api\",\n",
" params={\n",
" \"limit\": 5,\n",
" \"scrapeOptions\": {\n",
" \"formats\": [\"markdown\", \"html\"],\n",
" \"waitFor\": 1000, # wait for a second for pages to load\n",
" \"timeout\": 10000, # timeout after 10 seconds\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above code also sets the `timeout` parameter to 10000 milliseconds (10 seconds), which ensures that if a page takes too long to load, the crawler will move on rather than getting stuck.\n",
"\n",
"Note: `waitFor` duration applies to all pages the crawler encounters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the while, it is important to keep the limits of your plan in mind:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Plan | /scrape (requests/min) | /crawl (requests/min) | /search (requests/min) |\n",
"| --- | --- | --- | --- |\n",
"| Free | 10 | 1 | 5 |\n",
"| Hobby | 20 | 3 | 10 |\n",
"| Standard | 100 | 10 | 50 |\n",
"| Growth | 1000 | 50 | 500 |"
]
},
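{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your workload is likely to hit these limits, a small retry wrapper can help. The sketch below is illustrative only: it treats any exception as retryable, whereas in practice you would inspect the error for a rate-limit (429) response before retrying:\n",
"\n",
"```python\n",
"import time\n",
"\n",
"\n",
"def crawl_with_retry(app, url, params=None, max_retries=3, wait_seconds=60):\n",
"    \"\"\"Retry a crawl a few times, pausing between attempts.\"\"\"\n",
"    for attempt in range(max_retries):\n",
"        try:\n",
"            return app.crawl_url(url=url, params=params)\n",
"        except Exception as exc:\n",
"            if attempt == max_retries - 1:\n",
"                raise\n",
"            print(f\"Attempt {attempt + 1} failed ({exc}), retrying in {wait_seconds}s...\")\n",
"            time.sleep(wait_seconds)\n",
"```"
]
},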
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Asynchronous Crawling in Firecrawl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even after following the tips and best practices from the previous section, the crawling process can be significantly long for large websites with thousands of pages. To handle this efficiently, Firecrawl provides asynchronous crawling capabilities that allow you to start a crawl and monitor its progress without blocking your application. This is particularly useful when building web applications or services that need to remain responsive while crawling is in progress."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Asynchronous programming in a nutshell"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's understand asynchronous programming with a real-world analogy:\n",
"\n",
"Asynchronous programming is like a restaurant server taking multiple orders at once. Instead of waiting at one table until the customers finish their meal before moving to the next table, they can take orders from multiple tables, submit them to the kitchen, and handle other tasks while the food is being prepared. \n",
"\n",
"In programming terms, this means your code can initiate multiple operations (like web requests or database queries) and continue executing other tasks while waiting for responses, rather than processing everything sequentially. \n",
"\n",
"This approach is particularly valuable in web crawling, where most of the time is spent waiting for network responses - instead of freezing the entire application while waiting for each page to load, async programming allows you to process multiple pages concurrently, dramatically improving efficiency."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `async_crawl_url` method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl offers an intuitive asynchronous crawling method via `async_crawl_url`:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'success': True, 'id': 'c4a6a749-3445-454e-bf5a-f3e1e6befad7', 'url': 'https://api.firecrawl.dev/v1/crawl/c4a6a749-3445-454e-bf5a-f3e1e6befad7'}\n"
]
}
],
"source": [
"app = FirecrawlApp()\n",
"\n",
"crawl_status = app.async_crawl_url(\"https://docs.stripe.com\")\n",
"\n",
"print(crawl_status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It accepts the same parameters and scrape options as `crawl_url` but returns a crawl status dictionary. \n",
"\n",
"We are mostly interested in the crawl job `id` and can use it to check the status of the process using `check_crawl_status`:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29\n"
]
}
],
"source": [
"checkpoint = app.check_crawl_status(crawl_status['id'])\n",
"\n",
"print(len(checkpoint['data']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`check_crawl_status` returns the same output as `crawl_url` but only includes the pages scraped so far. You can run it multiple times and see the number of scraped pages increasing. \n",
"\n",
"If you want to cancel the job, you can use `cancel_crawl` passing the job id:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'status': 'cancelled'}\n"
]
}
],
"source": [
"final_result = app.cancel_crawl(crawl_status['id'])\n",
"\n",
"print(final_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Benefits of asynchronous crawling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many advantages of using the `async_crawl_url` over `crawl_url`:\n",
"\n",
"- You can create multiple crawl jobs without waiting for each to complete.\n",
"- You can monitor progress and manage resources more effectively.\n",
"- Perfect for batch processing or parallel crawling tasks.\n",
"- Applications can remain responsive while crawling happens in background\n",
"- Users can monitor progress instead of waiting for completion\n",
"- Allows for implementing progress bars or status updates\n",
"- Easier to integrate with message queues or job schedulers\n",
"- Can be part of larger automated workflows\n",
"- Better suited for microservices architectures\n",
"\n",
"In practice, you almost always use asynchronous crawling for large websites. "
]
},
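{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, here is a rough sketch of starting several crawl jobs at once and polling them until they finish. The URLs are just examples, and the `failed` status value is an assumption; `completed` and `cancelled` appear earlier in this guide:\n",
"\n",
"```python\n",
"import time\n",
"\n",
"# Start several crawl jobs without waiting for any of them to finish\n",
"urls = [\"https://docs.stripe.com\", \"https://books.toscrape.com/\"]\n",
"jobs = {url: app.async_crawl_url(url, params={\"limit\": 20})[\"id\"] for url in urls}\n",
"\n",
"# Poll until every job reaches a terminal status\n",
"while jobs:\n",
"    for url, job_id in list(jobs.items()):\n",
"        status = app.check_crawl_status(job_id)\n",
"        print(f\"{url}: {status['status']} ({status.get('completed', 0)}/{status.get('total', '?')})\")\n",
"        if status[\"status\"] in (\"completed\", \"failed\", \"cancelled\"):\n",
"            jobs.pop(url)\n",
"    if jobs:\n",
"        time.sleep(10)\n",
"```"
]
},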
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving Crawled Content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"When crawling large websites, it's important to save the results persistently. Firecrawl provides the crawled data in a structured format that can be easily saved to various storage systems. Let's explore some common approaches."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Local file storage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simplest approach is saving to local files. Here's how to save crawled content in different formats:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"\n",
"def save_crawl_results(crawl_result, output_dir=\"firecrawl_output\"):\n",
" # Create output directory if it doesn't exist\n",
" Path(output_dir).mkdir(parents=True, exist_ok=True)\n",
"\n",
" # Save full results as JSON\n",
" with open(f\"{output_dir}/full_results.json\", \"w\") as f:\n",
" json.dump(crawl_result, f, indent=2)\n",
"\n",
" # Save just the markdown content in separate files\n",
" for idx, page in enumerate(crawl_result[\"data\"]):\n",
" # Create safe filename from URL\n",
" filename = (\n",
" page[\"metadata\"][\"url\"].split(\"/\")[-1].replace(\".html\", \"\") or f\"page_{idx}\"\n",
" )\n",
"\n",
" # Save markdown content\n",
" if \"markdown\" in page:\n",
" with open(f\"{output_dir}/{filename}.md\", \"w\") as f:\n",
" f.write(page[\"markdown\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is what the above function does:\n",
"1. Creates an output directory if it doesn't exist\n",
"2. Saves the complete crawl results as a JSON file with proper indentation\n",
"3. For each crawled page:\n",
" - Generates a filename based on the page URL\n",
" - Saves the markdown content to a separate .md file"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"app = FirecrawlApp()\n",
"\n",
"crawl_result = app.crawl_url(url=\"https://docs.stripe.com/api\", params={\"limit\": 10})\n",
"\n",
"save_crawl_results(crawl_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is a basic function that requires modifications for other scraping formats."
]
},
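{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if the crawl also requested the `html` format, a small companion function could persist those files alongside the markdown. This is a sketch, not part of the original helper:\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"\n",
"\n",
"def save_html_results(crawl_result, output_dir=\"firecrawl_output\"):\n",
"    # Persist raw HTML for pages that include the \"html\" format\n",
"    Path(output_dir).mkdir(parents=True, exist_ok=True)\n",
"    for idx, page in enumerate(crawl_result[\"data\"]):\n",
"        if \"html\" in page:\n",
"            with open(f\"{output_dir}/page_{idx}.html\", \"w\") as f:\n",
"                f.write(page[\"html\"])\n",
"```"
]
},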
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Database storage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more complex applications, you might want to store the results in a database. Here's an example using SQLite:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"\n",
"\n",
"def save_to_database(crawl_result, db_path=\"crawl_results.db\"):\n",
" conn = sqlite3.connect(db_path)\n",
" cursor = conn.cursor()\n",
"\n",
" # Create table if it doesn't exist\n",
" cursor.execute(\n",
" \"\"\"\n",
" CREATE TABLE IF NOT EXISTS pages (\n",
" url TEXT PRIMARY KEY,\n",
" title TEXT,\n",
" content TEXT,\n",
" metadata TEXT,\n",
" crawl_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n",
" )\n",
" \"\"\"\n",
" )\n",
"\n",
" # Insert pages\n",
" for page in crawl_result[\"data\"]:\n",
" cursor.execute(\n",
" \"INSERT OR REPLACE INTO pages (url, title, content, metadata) VALUES (?, ?, ?, ?)\",\n",
" (\n",
" page[\"metadata\"][\"url\"],\n",
" page[\"metadata\"][\"title\"],\n",
" page.get(\"markdown\", \"\"),\n",
" json.dumps(page[\"metadata\"]),\n",
" ),\n",
" )\n",
"\n",
" conn.commit()\n",
"\n",
" print(f\"Saved {len(crawl_result['data'])} pages to {db_path}\")\n",
" conn.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function creates a SQLite database with a `pages` table that stores the crawled data. For each page, it saves the URL (as primary key), title, content (in markdown format), and metadata (as JSON). The crawl date is automatically added as a timestamp. If a page with the same URL already exists, it will be replaced with the new data. This provides a persistent storage solution that can be easily queried later."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saved 9 pages to crawl_results.db\n"
]
}
],
"source": [
"save_to_database(crawl_result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's query the database to double-check:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('https://docs.stripe.com/api/errors', 'Errors | Stripe API Reference', '{\"url\": \"https://docs.stripe.com/api/errors\", \"title\": \"Errors | Stripe API Reference\", \"language\": \"en-US\", \"viewport\": \"width=device-width, initial-scale=1\", \"sourceURL\": \"https://docs.stripe.com/api/errors\", \"statusCode\": 200, \"description\": \"Complete reference documentation for the Stripe API. Includes code snippets and examples for our Python, Java, PHP, Node.js, Go, Ruby, and .NET libraries.\", \"ogLocaleAlternate\": []}')\n"
]
}
],
"source": [
"# Query the database\n",
"conn = sqlite3.connect(\"crawl_results.db\")\n",
"cursor = conn.cursor()\n",
"cursor.execute(\"SELECT url, title, metadata FROM pages\")\n",
"print(cursor.fetchone())\n",
"conn.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cloud storage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For production applications, you might want to store results in cloud storage. Here's an example using AWS S3:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"from datetime import datetime\n",
"\n",
"\n",
"def save_to_s3(crawl_result, bucket_name, prefix=\"crawls\"):\n",
" s3 = boto3.client(\"s3\")\n",
" timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
"\n",
" # Save full results\n",
" full_results_key = f\"{prefix}/{timestamp}/full_results.json\"\n",
" s3.put_object(\n",
" Bucket=bucket_name,\n",
" Key=full_results_key,\n",
" Body=json.dumps(crawl_result, indent=2),\n",
" )\n",
"\n",
" # Save individual pages\n",
" for idx, page in enumerate(crawl_result[\"data\"]):\n",
" if \"markdown\" in page:\n",
" page_key = f\"{prefix}/{timestamp}/pages/{idx}.md\"\n",
" s3.put_object(Bucket=bucket_name, Key=page_key, Body=page[\"markdown\"])\n",
" print(f\"Successfully saved {len(crawl_result['data'])} pages to {bucket_name}/{full_results_key}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is what the function does:\n",
"- Takes a crawl result dictionary, S3 bucket name, and optional prefix as input\n",
"- Creates a timestamped folder structure in S3 to organize the data\n",
"- Saves the full crawl results as a single JSON file\n",
"- For each crawled page that has markdown content, saves it as an individual `.md` file\n",
"- Uses boto3 to handle the AWS S3 interactions\n",
"- Preserves the hierarchical structure of the crawl data\n",
"\n",
"For this function to work, you must have `boto3` installed and your AWS credentials saved inside the `~/.aws/credentials` file with the following format:\n",
"\n",
"```bash\n",
"[default]\n",
"aws_access_key_id = your_access_key\n",
"aws_secret_access_key = your_secret_key\n",
"region = your_region\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, you can execute the function provided that you already have an S3 bucket to store the data:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Successfully saved 9 pages to sample-bucket-1801/stripe-api-docs/20241118_142945/full_results.json\n"
]
}
],
"source": [
"save_to_s3(crawl_result, \"sample-bucket-1801\", \"stripe-api-docs\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Incremental saving with async crawls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using async crawling, you might want to save results incrementally sa they come in:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"def save_incremental_results(app, crawl_id, output_dir=\"firecrawl_output\"):\n",
" Path(output_dir).mkdir(parents=True, exist_ok=True)\n",
" processed_urls = set()\n",
"\n",
" while True:\n",
" # Check current status\n",
" status = app.check_crawl_status(crawl_id)\n",
"\n",
" # Save new pages\n",
" for page in status[\"data\"]:\n",
" url = page[\"metadata\"][\"url\"]\n",
" if url not in processed_urls:\n",
" filename = f\"{output_dir}/{len(processed_urls)}.md\"\n",
" with open(filename, \"w\") as f:\n",
" f.write(page.get(\"markdown\", \"\"))\n",
" processed_urls.add(url)\n",
"\n",
" # Break if crawl is complete\n",
" if status[\"status\"] == \"completed\":\n",
" print(f\"Saved {len(processed_urls)} pages.\")\n",
" break\n",
"\n",
" time.sleep(5) # Wait before checking again"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is what the function does:\n",
"- Creates an output directory if it doesn't exist\n",
"- Maintains a set of processed URLs to avoid duplicates\n",
"- Continuously checks the crawl status until completion\n",
"- For each new page found, saves its markdown content to a numbered file\n",
"- Sleeps for 5 seconds between status checks to avoid excessive API calls\n",
"\n",
"Let's use it while the app crawls Books-to-Scrape website:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Start the crawl\n",
"crawl_status = app.async_crawl_url(url=\"https://books.toscrape.com/\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Save results incrementally\n",
"save_incremental_results(app, crawl_status[\"id\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saved 705 pages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawling Using the LangChain Integration\n",
"- Integration with popular frameworks\n",
"- Combining with other Firecrawl features\n",
"- Building automated workflows"
]
},
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACRQAAAQoCAYAAABxM8BNAAAAAXNSR0IArs4c6QAAAGxlWElmTU0AKgAAAAgABAEaAAUAAAABAAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAIdpAAQAAAABAAAATgAAAAAAAACQAAAAAQAAAJAAAAABAAKgAgAEAAAAAQAACRSgAwAEAAAAAQAABCgAAAAA37QWhwAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAQABJREFUeAHsfQngdsd097wRS0QoSSwRKpFE1K5N7S2xNEXtmib4iApBidbWSCxVYmlrLbp8BLULlQoSBFGJKu1XihIRSxIShIglKvK+3537POfemblnZs6ZOTP33ud/H/65s5w55/f7nblz73OfeZ9n247mpZbX5imQk9Zt2/h65MTjR/OPmCD2lBPMm4EEfpRT3BvPr/S8enIJpiSRolAuri6Gz1Ep4F3gpVBagRJra2gdKRFPVCPfXBcNsjgjKEC5thDcyJvMdIqUgj3ZPMlnvopHM0/05VLqWmxGN+l6/LfmvjHm+OmVffN2nmzq6OvTTEffqRCEHSMlJMSVQrWUHpTYJWyoecjVjYJ9qO04k4TP1V1Hx8Ed0rhFhMAicZ0+PYs6QrPtJ3G1PI1U8REQhuPqsU25iRYO2LkbnuldV0LB5ZHgAh2SnYZt8nqW4ooK4DRm62H4G5OHASNYpPMNz+ckrgXmTpAsoZOiRxJXQuy5muiZEbvH2hTNqDzCZwsv0zFted541lS+PK/p1jxdm6s9/UFEOqhmJDdHPB40aLW40tD0VpQ1tbdutOTmjBvADGaU09wYcyzNgYEgr8gNz9Y5D176aIQY0mT5nw23NeoO74DYNjVosphOq9LxSIQ14DpoSHQcGAaYS4XaORB76dqqCuiLXOgD503ShXtBT+CuT16xRxHM3MACkgB7GVJagVKremnci/90BSqsNyRwU8FBAmsaLSeNqcZky5IXvS2S8lI0l3sA2bPEzNNKW7NFNhbP21Rw8FD7rH3zdrNY+tintfs0095KPNjVfrkPmPUYiVeIK8V/KT0osaVtODnI1U0ae0l/m8i1Xf+QRTCJK+KnZD64vn3wkrhyg0vY+whI+DZ8mHrU20ikAciuoiYPg15SsZL0SdgkeaYAkNJmbB4U7qNzndpmovWzH7Fn0pQkbICNXulC91lzOBcoaaDykF35w9pScHNtqDy5fkP20prpWGPwCHGEvq3ElXuNScpZIEigC9KRdVxdMsa/YnB5JumcpZRnMBd44yY2ZDLcGqwsLAgxpMkj5LjNLJ4O1CDHYKfjKLGqsZcOs2woSkzO5IfpDUG1PjSuFWfyossALH3Suyhrx3PjL3UJBfTNnpvJ8W8AJZgtPhYFFgWmp0DOzfX02IyLyF25x0WzRPcpYOZpFvPfBOwjNaP2DaMjqnxoPhZ5uDtiMkJcKaKW0IMSt6RNriaS2MbWV06LepOcHAkxZPHV47G3i5ITINMXQrHzyOLajRqhECIhBMfVou5mIiESjRuXB8VzBXl7GEKbQlJ49iDyS5Kajc0lpkYyVyfXHc9khzGkdfoB/vJUMK435BzuY/RGou162AaLB5xj6oAmMbsp9lM5SmMvoVnPBc5sHHVoExw+Ythq4g9HG47Nbel55nqSHZ+iQzKXlGBCdFeYDQBGUSiE5UbCfbLOFhKBSgIZypCp8GPhcIjZVbsmoLyYixjHbOTZDghUK2wm0h8/LxuKCLmYrUnOpiK9SWhu31LExVtxI5ReM8TeixBzE1sIZzuvxwAumsCGQJGLiJ5h4Fhsto2h9mbGrLjebKaAC6spKLBcV2SzACu2rNeVtyVXcqqaeZqFriZgORlG87RhdER07DRp7i26sohnvxOJB9R+7/Ge3HPPfDgejzYPi1xN5sHSj1KWf60zacVHn085czKJe12K/sSteyhwknhGIxcyoBBihI5x30obibRswvKGM+FsMAkb472x/OGj0ltL61ObD0eJJO7IoClzTNVjUzhx+Lu2cK0NarGeD1vhKWpQB0M80M1omkWRyq8EGWnNbC7IouWQ4LxXk8bqQBlUbS6D7lEb4sry4HVcpR03MAq47MitcJeMIIO/07dDPoECV7b150Nzueagmkc4491465gZRLkhgLKRZzsYgjKxF3A/DKhb1oGWDUW4PJvTCptsUj7M1mNgvE+RFL/al/abOtbEEsNn2o5c1udc7GJRbQFYa1E73sgpyAtPSSAlQknR4ZwqGYPCcbFZFBBXYJnU4pIyHJo3qoxhW9J0CjN1yZfc1DPzOWldTaBy9Ef3tKG0snTtNIF7vixv+GDOA2ncg2xr7rlX+6G5LHvc2/bou0p83Fxbc+fAVHjDuZUzJ+eoRbduERMxO45cgogOFM5bbRORlklAWkRtpElgE5H2SskjEp3dVE2XipxABCluaC7AeezBLICZ0RGoAWSUP3RugSNcZ10d4Dq8BSRgrUegVw1dJHLg5rUGbl8MKe2GnNyz2oegufbQTbM2tPsRDHuGfIY2Y7QwpGLB2wHvzQoFKOTWWCdKRUi/l5vqHOomRkQys3vyXDpSfWGA2SBkFPsBaIluiQ4v0Djg5YkhhjzTUQhvpmsPc0+zEWzZUOTRaOOaYeMN9+GztoexrihcX3q86QvKuX5cXJR6SkyK34iNPvd8712N8zLiRTsJ5KXt9kXpXbPi9cO2dimUQIoypUR30w31UvEoXBebRQExBZaJLCYl01HoxpXpauPNpzJLl5zJTTUzp5PW1QQqR390TxtKK0vXTpMC72M4D6KzSDAH5557Ug/5mbCLmreadJOhaCiy8xI65+aeDL6SIZxjOVrNTRPONJ0bt27acEh2g/pCiPdW3ECklcmUtBcXKwltHHJdh/Lo2vrqRXn7ggbaJTgF3A+6pPh7cUsFGCAftwGj5dVgXKhVosM11tUArsFVQIwYxOUdggJahWym0MfhRME7Nm8aH+zMxtlR5nZpzjROOP7SrXQleyQtn5SBvQvRkgSUcI4kIgwpp3oNYx3GqdpCIAUmk+bhEQ3FvCYEvDxDm+a4hX9s2R6UVyCkGJMERxSsCW4DbAldTsBlQxFBs40y0Zt4uA+h3c0r3PEgIGwggjocuZh8fsBf6JiKPeST2afPQdjvoYc65yTdG3Ax9JjkokNnNA9LN4FU1MmJjgQwJ5NrqvtKxXVjLfVFAXEFlskrLinDIeV6wnC30aZTmKlLvmSnmJnTSWtrApWVYFRvG0orS9NOE7j/z/LWD6Y8hO6t65Z8517ph+J1WfKi+TTheZm29SZyhPMsZ+7ORZdurSJMs7lwQqlwiDoOYrznuJEoxsmRwKpmSGn5CVYmvJFI466iQVCgVWdOHgnuURMp7mNgRwllNEposQk6pEior6+au/l4FK69Kf7mNoaa95z7EAlNqDmh8qFgGp0z+70bfSUI6VmSt2R+KDmk2NBVw711nLSjXGd4iKqtHZ9g1ADRQJfPZcKQzhUNb2ceL7hgzItDfHRv4frpe6wSmGkeq
aEsh4UrUb2BUIPDKCKowr3IgOJNUW4eBOJMIg5TcEZcephlNiNBtzXg5zDPM5kvwwcK1E67sellgAUaKJgofsAfHCl+wXY51ldAL0ylVyFk8atPtEBEqm6byr+ApMVcbvw6tEyyYnNnJMfJt4cwFajrk48f+PH1T6S9NMzkPExEHwkYpTUOYfTrz0VV4ITgQggRnVLfxl8v+WJ3qSZqE3q4zI8+zgjfuVfy4fg4TOlRQZMp5Teej272eokCL6/BzDsgX3Gt/ESnrFE8wzavKXOxkQZqDNIcvv6NRCmzZ3uAQE6XjYXDz43KkNEdSqsX2kSkg+fwdsEX18ENuK5LcvCECDZL8SbxCAYLdgY5pHSWjEbSIgX0SGPs1WYIwuUL19uh5Wa2uPxDLGNahsZK9FFyw+FDwTQ6Z+L7tiEX2iqBaVqSs3R+hrx5LTSVIj4bJ1PjZSJO4UjjE/Ec6c7FqMfTcJqRiGUKdsqjOYIfbVKMB5EuxYyEEeGLNBnhwr2GYfEiid8aRTXUnkAcrKZwHnem
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl has integrations with popular open-source libraries like LangChain and other platforms.\n",
"\n",
"![Tools Firecrawl integrates with](attachment:image.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we will see how to use the LangChain integration to build a basic QA chatbot on the [LangChain Community Integrations](https://python.langchain.com/docs/integrations/providers/) website.\n",
"\n",
"Start by installing LangChain and its related libraries:\n",
"\n",
"```bash\n",
"$ pip install langchain langchain_community langchain_anthropic langchain_openai\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, add your `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` as variables to your `.env` file. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, import the `FireCrawlLoader` class from the document loaders module and initialize it:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"from langchain_community.document_loaders.firecrawl import FireCrawlLoader\n",
"\n",
"load_dotenv()\n",
"\n",
"loader = FireCrawlLoader(\n",
" url=\"https://python.langchain.com/docs/integrations/providers/\",\n",
" mode=\"crawl\",\n",
" params={\"limit\": 5, \"scrapeOptions\": {\"onlyMainContent\": True}},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class can read your Firecrawl API key automatically since we are loading the variables using `load_dotenv()`.\n",
"\n",
"To start the crawl, you can call the `load()` method of the loader object and the scraped contents will be turned into LangChain compatible documents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Start the crawl\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is chunking:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"# Add text splitting before creating the vector store\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
"\n",
"# Split the documents\n",
"split_docs = text_splitter.split_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, we split the documents into smaller chunks using the `RecursiveCharacterTextSplitter`. This helps make the text more manageable for processing and ensures better results when creating embeddings and performing retrieval. The chunk size of 1000 characters with 100 character overlap provides a good balance between context preservation and granularity."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"from langchain_chroma import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain_community.vectorstores.utils import filter_complex_metadata\n",
"\n",
"# Create embeddings for the documents\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# Create a vector store from the loaded documents\n",
"docs = filter_complex_metadata(docs)\n",
"vector_store = Chroma.from_documents(docs, embeddings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Moving on, we created a vector store using Chroma and OpenAI embeddings. The vector store allows us to perform semantic search and retrieval on our documents. We also filter out any complex metadata that might cause issues during storage.\n",
"\n",
"The final step is building the QA chain where we use Claude 3.5 Sonnet as the language model:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain_anthropic import ChatAnthropic\n",
"\n",
"# Initialize the language model\n",
"llm = ChatAnthropic(model=\"claude-3-5-sonnet-20240620\", streaming=True)\n",
"\n",
"# Create a QA chain\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=vector_store.as_retriever(),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can ask questions about our documents:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'query': 'What is the main topic of the website?', 'result': \"The main topic of the website is LangChain's integrations with Hugging Face. The page provides an overview of various LangChain components that can be used with Hugging Face models and services, including:\\n\\n1. Chat models\\n2. LLMs (Language Models)\\n3. Embedding models\\n4. Document loaders\\n5. Tools\\n\\nThe page focuses on showing how to use different Hugging Face functionalities within the LangChain framework, such as embedding models, language models, datasets, and other tools.\"}\n"
]
}
],
"source": [
"# Example question\n",
"query = \"What is the main topic of the website?\"\n",
"answer = qa_chain.invoke(query)\n",
"\n",
"print(answer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This section demonstrated a rough process for building a basic RAG pipeline for content scraped using Firecrawl. For this version, we only used 10 pages from the LangChain documentation and as the information increases, the pipeline would need more work. To scale this pipeline, we would need to consider factors like chunking strategy, embedding model selection, vector store optimization, and prompt engineering to handle larger document collections effectively."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Throughout this guide, we've explored Firecrawl's `/crawl` endpoint and its capabilities for web scraping at scale. From basic usage to advanced configurations, we covered URL control, performance optimization, and asynchronous operations. We also examined practical implementations including data storage solutions and integration with frameworks like LangChain.\n",
"\n",
"The endpoint's ability to handle JavaScript content, pagination, and various output formats makes it a versatile tool for modern web scraping needs. Whether building documentation chatbots or gathering training data, Firecrawl provides a robust foundation. By leveraging the configuration options and best practices discussed, you can build efficient and scalable web scraping solutions tailored to your specific requirements."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
}
},
"nbformat": 4,
"nbformat_minor": 2
}