{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Complete Guide To Web Scraping Using Firecrawl's Scrape Endpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Traditional web scraping offers unique challenges. Relevant information is often scattered across multiple pages containing complex elements like code blocks, iframes, and media. JavaScript-heavy websites and authentication requirements add additional complexity to the scraping process.\n",
"\n",
"Even after successfully scraping, the content requires specific formatting to be useful for downstream processes like data engineering or training AI and machine learning models. \n",
"\n",
"Firecrawl addresses these challenges by providing a specialized scraping solution. Its `/scrape` endpoint offers features like JavaScript rendering, automatic content extraction, bypassing blockers and flexible output formats that make it easier to collect high-quality information and training data at scale.\n",
"\n",
"In this guide, we'll explore how to effectively use Firecrawl's `/scrape` endpoint to extract structured data from static and dynamic websites. {prompt: finish this paragraph based on the table of contents I have}\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Firecrawl's scraping engine is exposed as a REST API, so you can use command-line tools like cURL to use it. However, for a more comfortable experience, better flexibility and control, I recommend using one of its SDKs for Python, Node, Rust or Go. This tutorial will focus on the Python version.\n",
"\n",
"To get started, please make sure to:\n",
"\n",
"1. Sign up at [firecrawl.dev]().\n",
"2. Choose a plan (the free one will work fine for this tutorial).\n",
"\n",
"Once you sign up, you will be given an API token or you can copy it from your [dashboard](https://www.firecrawl.dev/app). The best way to save your key is by using a `.env` file, ideal for the purposes of this article:\n",
"\n",
"```bash\n",
"$ touch .env\n",
"$ echo \"FIRECRAWL_API_KEY='YOUR_API_KEY'\" >> .env\n",
"```\n",
"\n",
"Now, let's install Firecrawl Python SDK, `python-dotenv` to read `.env` files, and {add other packages used in the rest of the tutorial}:\n",
"\n",
"\n",
"```bash\n",
"$ pip install firecrawl-py python-dotenv {other-packages-used}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Scraping Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scraping with Firecrawl starts by creating an instance of the `FirecrawlApp` class:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from firecrawl import FirecrawlApp\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"\n",
"app = FirecrawlApp()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you use the `load_dotenv()` function, the app can automatically use your loaded API key to establish a connection with the scraping engine. Then, scraping any URL takes a single line of code:"
]
},
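{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to be explicit instead of relying on environment variable discovery, you can also pass the key directly. Here is a minimal sketch, assuming the SDK's `api_key` constructor parameter:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"from dotenv import load_dotenv\n",
"from firecrawl import FirecrawlApp\n",
"\n",
"load_dotenv()\n",
"\n",
"# Read the key loaded from .env and pass it explicitly\n",
"app = FirecrawlApp(api_key=os.getenv(\"FIRECRAWL_API_KEY\"))\n",
"```"
]
},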
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://arxiv.org\"\n",
"data = app.scrape_url(url)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the response format returned by `scrape_url` method:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'arXiv.org e-Print archiveopen searchopen navigation menucontact arXivsubscribe to arXiv mailings',\n",
" 'language': 'en',\n",
" 'ogLocaleAlternate': [],\n",
" 'viewport': 'width=device-width, initial-scale=1',\n",
" 'msapplication-TileColor': '#da532c',\n",
" 'theme-color': '#ffffff',\n",
" 'sourceURL': 'https://arxiv.org',\n",
" 'url': 'https://arxiv.org/',\n",
" 'statusCode': 200}"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['metadata']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response `metadata` includes basic information like the page title, viewport settings and a status code. \n",
"\n",
"Now, let's look at the scraped contents, which is converted into `markdown` by default:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"arXiv is a free distribution service and an open-access archive for nearly 2.4 million\n",
"scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
"Materials on this site are not peer-reviewed by arXiv.\n",
"\n",
"\n",
"Subject search and browse:\n",
"\n",
"Physics\n",
"\n",
"Mathematics\n",
"\n",
"Quantitative Biology\n",
"\n",
"Computer Science\n",
"\n",
"Quantitative Finance\n",
"\n",
"Statistics\n",
"\n",
"Electrical Engineering and Systems Scienc"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Markdown\n",
"\n",
"Markdown(data['markdown'][:500])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response can include several other formats that we can request when scraping a URL. Let's try requesting multiple formats at once to see what additional data we can get back:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"data = app.scrape_url(\n",
" url, \n",
" params={\n",
" 'formats': [\n",
" 'html', \n",
" 'rawHtml', \n",
" 'links', \n",
" 'screenshot',\n",
" ]\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is what these formats scrape:\n",
"\n",
" - **HTML**: The raw HTML content of the page.\n",
" - **rawHtml**: The unprocessed HTML content, exactly as it appears on the page.\n",
" - **links**: A list of all the hyperlinks found on the page.\n",
" - **screenshot**: An image capture of the page as it appears in a browser.\n",
"\n",
"The HTML format is useful for developers who need to analyze or manipulate the raw structure of a webpage. The `rawHtml` format is ideal for cases where the exact original HTML content is required, such as for archival purposes or detailed comparison. The links format is beneficial for SEO specialists and web crawlers who need to extract and analyze all hyperlinks on a page. The screenshot format is perfect for visual documentation, quality assurance, and capturing the appearance of a webpage at a specific point in time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing more than one scraping format to `params` adds additional keys to the response:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['rawHtml', 'screenshot', 'metadata', 'html', 'links'])"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.keys()"
]
},
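{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each requested format appears under a key of the same name, so persisting the results is straightforward. Here is a minimal sketch that saves the `rawHtml` and `links` keys returned above (the file names are arbitrary):\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"\n",
"# Archive the unprocessed HTML exactly as it was served\n",
"Path(\"arxiv_raw.html\").write_text(data[\"rawHtml\"], encoding=\"utf-8\")\n",
"\n",
"# Save every hyperlink found on the page, one per line\n",
"Path(\"arxiv_links.txt\").write_text(\"\\n\".join(data[\"links\"]), encoding=\"utf-8\")\n",
"```"
]
},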
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's display the screenshot Firecrawl took of arXiv.org:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAAAXNSR0IArs4c6QAAIABJREFUeJzs3XlcTfn/B/DPvbfbor2IUkkoUROiFIr2LFPJNsg2Mg0SjXwNZhDZK8tMGPsSWQYVocVSlJJkIg1KWqR937v3/v74fOf87rRpxm0M39fzMY95nPs5n/M5n3PO7Xr0vu/eH5a2tjYBAAAAAAAAAAAAABA19seeAAAAAAAAAAAAAAB8nhCABgAAAAAAAAAAAIAugQA0AAAAAAAAAAAAAHQJBKABAAAAAAAAAAAAoEsgAA0AAAAAAAAAAAAAXQIBaAAAAAAAAAAAAADoEghAAwAAAAAAAAAAAECXQAAaAAAAAAAAAAAAALoEAtAAAAAAAAAAAAAA0CUQgAYAAAAAAAAAAACALoEANAAAAAAAAAAAAAB0CQSgAQAAAAAAAAAAAKBLIAANAAAAAAAAAAAAAF1C7J8/5eDBg42MjIyMjIYOHdq7d2/hXWlpaUlJSY8fP05ISCgoKPjn5wYAAAAAAAAAAAAAosLS1tb+Z840cODAWbNmTZgwQV5evjP909LSQkJCLl68WF5e3vWzAwAAAAAAAAAAAAAR+ycC0AMGDFi+fLm9vb1wY21t7cOHD1NTU/l8Pm2RkpIaPnz40KFDhbvV1NT88ssvR48era2t7ep5AgAAAAAAAAAAAIAIdW0AWkVFZeXKlS4uLkxLYmJibGxsfHz848eP2zykW7duI0eOHDly5OjRo3V1dWljWVlZQEBAUFBQ100VAAAAAAAAAAAAAESrCwPQY8eO9ff3pwU3+Hz+pUuXduzYUVJS0vkRDA0NN27caGBgQF/Gx8d7enr+pREAAAAAAAAAAAAA4GPpqgD0vHnzfvjhB7r99OnTNWvWPHv27G+Mw2Kxpk+f7u3traCgQAgpKCiYNm1abm6uqOcLAAAAAAAAAAAAACLWJQFod3d3b29vQkhDQ8OOHTuOHz/+gQMqKir+8MMPjo6OhJDS0tKpU6dmZWWJaLIAAAAAAAAAAAAA0CVEH4AeN27c4cOH6fqBM2fOfPr0qahGnjlzpo+PD4vFyszMdHZ2rq6uFtXIAAAAAAAAAAAAACByIg5AGxgYBAcHS0pKNjQ0bNiwYevWrSIcvKCgYMeOHX5+foSQ2NjYefPmiXBwAAAAAAAAAAAAABAtMRGOpaqqevToUUlJydra2tmzZ/P5fBEOTl25cqVv375Lly4dM2bM+vXrN27cKPJTAAAAAAAAAAAAAHyIATzeGCmpPiyWuKxst7y8jz2dD1Xbu3djdfUbPv9uXV0Gh/OXjhVlAHrdunVKSkqEEA8PjydPnhgYGBBCfvvtN2dnZxaLtWzZstaHpKen37x5k26Li4t/++23rfukpqbeunXr0KFDgwcPJoQEBAQYGRmZmprOmTPn4sWLf29tQwAAAAAAAAAAAICuME9KauSgQbpOTnL6+vIGBiwW62PP6EMJBIKK336rfPZsxMWLCS9fHq+t7fyxbFFNYtiwYfb29oSQCxcu3Llzp+Vp2OylS5e2aJw8eTI9hJKQkPjmm29a9Jk+fbqVlVWLxqVLl1ZVVRFCkAENAAAAAAAA8NkIDw93dHSk2ywW6/Hjx9OmTWP2JiQkmJiYiPykbDbbycmpvb3i4uIXL15MTEzs2bNn6716enoxMTGEkAMHDri4uIhkPiIcijF//vwdO3Z0puf9+/d1dHREe/au0OZT09HRiY2N/fXXX0V4IuZZ7NmzZ+bMmX/pWBkZmW3btt27dy8hIeHu3btM2mVKSkrv3r1FOMkOZPxZQkKCCAefNm3azz//3N5e5tZ1xVv6w3XdU2Cz2Rv79p327bc2589rzpyp8MUXn0H0mX4mKxgaas6caXvp0jQ3t/Xa2mx2ZwPLIgtAL1++nBBSXV29ffv2Njs0NzdH/1lhYWGLPnV1dXv27ElJSZkwYUJmZuaePXvevXsnEAhadCsvL9+3bx8hZOjQoWZmZqK6BAAAAAAAAAD4iGJjY5lf8/X19ZuampiXOjo63bp1S05OFvlJBwwY0EF0rG/fvlpaWiYmJgUFBSI/NfxtbT61kSNHpqamijbWuWrVqr997Jo1a8TExKysrExMTObOnTtz5sxJkyYRQqysrPLz80U4yY6Zm5v3+0NXfIXTHubWeXt7X7169R8770c3R1zcyNFRd8mSjz2RLqS7bNnw8eNnS0h0sr9oAtD6+vqjRo0ihBw+fLisrKx1BxaLJSYmdujPtLS0WnwDwGazt27deuTIkWfPnu3YsSMwMFBMTKzNaPqRI0dKS0sJIa2TpgEAAAAAAADgUyQcgDYzM7tw4QITLzMzM0tISGhqahozZsy1a9du3boVHBzcOivZzMwsMjLy9u3b7u7u0dHRurq6AwcODA8P37p1a3BwMCHE2dk5Ojo6Njb2/Pnzmpqa0tLShw8fHjJkyLlz5wghLQaXlZXdv3+/rKxseHi4iorK8+fPe/XqRU+UlJTUp0+fFmdfvXr1pk2b6LakpGRqaqqqqqpwh3Xr1sXExNy7d2/37t3i4uKEEDk5uQMHDiQmJkZFRY0bN064s7GxcUxMTK9evQIDA5ctW7Z///4LFy4cOXKEHti/f/9z585FRUWFh4ePHz+eEHLv3j06JWtr61evXsnLyxNCpk6dGhAQwIzZrVu37du3R0dH3717d+7cucxNi46OjoyMXLFiBdNz4cKFsbGx4eHhU6dOffToEW10c3OLioqKiYnZtm0bl8sVnq2amtqZM2eio6NjYmJokqK7u/vWrVvpXldXV39/fxo48vb2Pnbs2JUrVwICAui1pKamLlq06Pjx4+Hh4Uycx9bWNjw8PDIy8ty5c7q6uoSQefPmbdu27cqVK2vWrBF+atSoUaMWL15sZmZ2/Phxpqe3t3ebQ82ZM8fPz8/Pz+/SpUunTp2ysLAIDAyMiIhwdXUVvqj9+/d37979+vXr9MaqqKicO3cuISHh4MGDdOZ9+vQ5depUdHR0RESEsbFxi/dDv3794uPjGxoaCCFZWVlTp06NjIwkhERHR6uqqurq6l6/fn3lypUnTpyIjIw0NTXdvXv35cuX6Y1q8xHTQ5YvX3748OGIiAgbGxv6+MLDw+kTtLOzI50gKysbEBAQFRV18+ZNLy8vGp1r/fZu83Ti4uL+/v70J6hfv360v4GBQUhISExMTFRUlIWFRYtbt3PnzokTJ3b+itr06NGj+fPnHzx48M6dOxMnTly/fv2pU6cuX76sqKjY3hu740MIIQ4ODpGRkfHx8e7u7rSl9cdLi/fSe+nw+ab6+noeHp3p/EkbtGKF2cCB/Tq3BKBoAtCzZs0ihNTW1p44caLNDgKBoLm52fTP0tPTW2Q3i4uLGxkZOTs7r1ixwt7eXk5OTl1dvb3FDH/55RdCyOjRo9XV1UVyFQAAAAAAAADwESUmJiorK9Ngn6mpaWxsbFFREa0IYWpqeu/evR49euzevXvVqlWWlpZhYWFMqI5isVg7d+7cvn37uHHjuFyumpqaQCBoa
mrS0NBISEiYMWOGgoLCli1bFixYMGbMmNTU1OXLl9fU1Pj6+qakpEyfPr314FVVVUuWLCkuLnZwcGj9Z9ythYaGjh8/XkxMjBBiYWGRlpYmnOhqZWVlYWFhY2NjaWnZp08fWkFi5cqVhYWFxsbGK1eu3Lt3r4yMDO2soaGxa9cud3f3d+/eNTc3m5mZeXh4TJ06VUlJicap9+7de+HCBWtr6yVLlmzbtq1nz54PHjwYNmwYjVynpqbS7eHDh9+/f5+Zw/LlyyUlJW1sbCZOnOjq6mpsbMxisbZv3+7r62tjY5OTk6OiokLzvj08PKZNm+bo6GhtbU2jN5aWljNmzJg8ebKFhYWUlFSLdbzc3NwSEhKsrKwcHBy0tbWVlZXbvEX0Wtzc3JycnLp37+7s7EwI4fF4Kioq8+bNmzFjxqJFi7S0tHr27Lljxw4PDw8bG5vz58/TGHpTU5OlpeWKFSu2bNnCPDVm5Pv37x88ePDWrVvz5s1jeu7cubPNoXg83pgxYzZs2DB58uRevXo5OTk
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image\n",
"\n",
"Image(data['screenshot'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the screenshot is cropped to fit a certain viewport. For most pages, it is better to capture the entire screen by using the `screenshot@fullPage` format:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB4AAAAbSCAIAAABH8X/ZAAAAAXNSR0IArs4c6QAAIABJREFUeJzs3XlcTfn/B/DPvbfbor2IUkkoUROiFIr2LFPJNsg2Mg0SjXwNZhDZK8tMGPsSWQYVocVSlJJkIg1KWqR937v3/v74fOf87rRpxm0M39fzMY95nPs5n/M5n3PO7Xr0vu/eH5a2tjYBAAAAAAAAAAAAABA19seeAAAAAAAAAAAAAAB8nhCABgAAAAAAAAAAAIAugQA0AAAAAAAAAAAAAHQJBKABAAAAAAAAAAAAoEsgAA0AAAAAAAAAAAAAXQIBaAAAAAAAAAAAAADoEghAAwAAAAAAAAAAAECXQAAaAAAAAAAAAAAAALoEAtAAAAAAAAAAAAAA0CUQgAYAAAAAAAAAAACALoEANAAAAAAAAAAAAAB0CQSgAQAAAAAAAAAAAKBLIAANAAAAAAAAAAAAAF1C7J8/5eDBg42MjIyMjIYOHdq7d2/hXWlpaUlJSY8fP05ISCgoKPjn5wYAAAAAAAAAAAAAosLS1tb+Z840cODAWbNmTZgwQV5evjP909LSQkJCLl68WF5e3vWzAwAAAAAAAAAAAAAR+ycC0AMGDFi+fLm9vb1wY21t7cOHD1NTU/l8Pm2RkpIaPnz40KFDhbvV1NT88ssvR48era2t7ep5AgAAAAAAAAAAAIAIdW0AWkVFZeXKlS4uLkxLYmJibGxsfHz848eP2zykW7duI0eOHDly5OjRo3V1dWljWVlZQEBAUFBQ100VAAAAAAAAAAAAAESrCwPQY8eO9ff3pwU3+Hz+pUuXduzYUVJS0vkRDA0NN27caGBgQF/Gx8d7enr+pREAAAAAAAAAAAAA4GPpqgD0vHnzfvjhB7r99OnTNWvWPHv27G+Mw2Kxpk+f7u3traCgQAgpKCiYNm1abm6uqOcLAAAAAAAAAAAAACLWJQFod3d3b29vQkhDQ8OOHTuOHz/+gQMqKir+8MMPjo6OhJDS0tKpU6dmZWWJaLIAAAAAAAAAAAAA0CVEH4AeN27c4cOH6fqBM2fOfPr0qahGnjlzpo+PD4vFyszMdHZ2rq6uFtXIAAAAAAAAAAAAACByIg5AGxgYBAcHS0pKNjQ0bNiwYevWrSIcvKCgYMeOHX5+foSQ2NjYefPmiXBwAAAAAAAAAAAAABAtMRGOpaqqevToUUlJydra2tmzZ/P5fBEOTl25cqVv375Lly4dM2bM+vXrN27cKPJTAAAAAAAAAAAAAHyIATzeGCmpPiyWuKxst7y8jz2dD1Xbu3djdfUbPv9uXV0Gh/OXjhVlAHrdunVKSkqEEA8PjydPnhgYGBBCfvvtN2dnZxaLtWzZstaHpKen37x5k26Li4t/++23rfukpqbeunXr0KFDgwcPJoQEBAQYGRmZmprOmTPn4sWLf29tQwAAAAAAAAAAAICuME9KauSgQbpOTnL6+vIGBiwW62PP6EMJBIKK336rfPZsxMWLCS9fHq+t7fyxbFFNYtiwYfb29oSQCxcu3Llzp+Vp2OylS5e2aJw8eTI9hJKQkPjmm29a9Jk+fbqVlVWLxqVLl1ZVVRFCkAENAAAAAAAA8NkIDw93dHSk2ywW6/Hjx9OmTWP2JiQkmJiYiPykbDbbycmpvb3i4uIXL15MTEzs2bNn6716enoxMTGEkAMHDri4uIhkPiIcijF//vwdO3Z0puf9+/d1dHREe/au0OZT09HRiY2N/fXXX0V4IuZZ7NmzZ+bMmX/pWBkZmW3btt27dy8hIeHu3btM2mVKSkrv3r1FOMkOZPxZQkKCCAefNm3azz//3N5e5tZ1xVv6w3XdU2Cz2Rv79p327bc2589rzpyp8MUXn0H0mX4mKxgaas6caXvp0jQ3t/Xa2mx2ZwPLIgtAL1++nBBSXV29ffv2Njs0NzdH/1lhYWGLPnV1dXv27ElJSZkwYUJmZuaePXvevXsnEAhadCsvL9+3bx8hZOjQoWZmZqK6BAAAAAAAAAD4iGJjY5lf8/X19ZuampiXOjo63bp1S05OFvlJBwwY0EF0rG/fvlpaWiYmJgUFBSI/NfxtbT61kSNHpqamijbWuWrVqr997Jo1a8TExKysrExMTObOnTtz5sxJkyYRQqysrPLz80U4yY6Zm5v3+0NXfIXTHubWeXt7X7169R8770c3R1zcyNFRd8mSjz2RLqS7bNnw8eNnS0h0sr9oAtD6+vqjRo0ihBw+fLisrKx1BxaLJSYmdujPtLS0WnwDwGazt27deuTIkWfPnu3YsSMwMFBMTKzNaPqRI0dKS0sJIa2TpgEAAAAAAADgUyQcgDYzM7tw4QITLzMzM0tISGhqahozZsy1a9du3boVHBzcOivZzMwsMjLy9u3b7u7u0dHRurq6AwcODA8P37p1a3BwMCHE2dk5Ojo6Njb2/Pnzmpqa0tLShw8fHjJkyLlz5wghLQaXlZXdv3+/rKxseHi4iorK8+fPe/XqRU+UlJTUp0+fFmdfvXr1pk2b6LakpGRqaqqqqqpwh3Xr1sXExNy7d2/37t3i4uKEEDk5uQMHDiQmJkZFRY0bN064s7GxcUxMTK9evQIDA5ctW7Z///4LFy4cOXKEHti/f/9z585FRUWFh4ePHz+eEHLv3j06JWtr61evXsnLyxNCpk6dGhAQwIzZrVu37du3R0dH3717d+7cucxNi46OjoyMXLFiBdNz4cKFsbGx4eHhU6dOffToEW10c3OLioqKiYnZtm0bl8sVnq2amtqZM2eio6NjYmJokqK7u/vWrVvpXldXV39/fxo48vb2Pnbs2JUrVwICAui1pKamLlq06Pjx4+Hh4Uycx9bWNjw8PDIy8ty5c7q6uoSQefPmbdu27cqVK2vWrBF+atSoUaMWL15sZmZ2/Phxpqe3t3ebQ82ZM8fPz8/Pz+/SpUunTp2ysLAIDAyMiIhwdXUVvqj9+/d37979+vXr9MaqqKicO3cuISHh4MGDdOZ9+vQ5depUdHR0RESEsbFxi/dDv3794uPjGxoaCCFZWVlTp06NjIwkhERHR6uqqurq6l6/fn3lypUnTpyIjIw0NTXdvXv35cuX6Y1q8xHTQ5YvX3748OGIiAgbGxv6+MLDw+kTtLOzI50gKysbEBAQFRV18+ZNLy8vGp1r/fZu83Ti4uL+/v70J6hfv360v4GBQUhISExMTFRUlIWFRYtbt3PnzokTJ3b+itr06NGj+fPnHzx48M6dOxMnTly/fv2pU6cuX76sqKjY3hu740MIIQ4ODpGRkfHx8e7u7rSl9cdLi/fSe+nw+ab6+noeHp3p/EkbtGKF2cCB/Tq3BKBoAtCzZs0ihNTW1p44caLNDgKBoLm52fTP0tPTW2Q3i4uLGxkZOTs7r1ixwt7eXk5OTl1dvb3FDH/55RdCyOjRo9XV1UVyFQAAAAAAAADwESUmJiorK9Ngn6mpaWxsbFFREa0IYWpqeu/evR49euzevXvVqlWWlpZhYWFMqI5isVg7d+7cvn37uHHjuFyumpqaQCBoa
mrS0NBISEiYMWOGgoLCli1bFixYMGbMmNTU1OXLl9fU1Pj6+qakpEyfPr314FVVVUuWLCkuLnZwcGj9Z9ythYaGjh8/XkxMjBBiYWGRlpYmnOhqZWVlYWFhY2NjaWnZp08fWkFi5cqVhYWFxsbGK1eu3Lt3r4yMDO2soaGxa9cud3f3d+/eNTc3m5mZeXh4TJ06VUlJicap9+7de+HCBWtr6yVLlmzbtq1nz54PHjwYNmwYjVynpqbS7eHDh9+/f5+Zw/LlyyUlJW1sbCZOnOjq6mpsbMxisbZv3+7r62tjY5OTk6OiokLzvj08PKZNm+bo6GhtbU2jN5aWljNmzJg8ebKFhYWUlFSLdbzc3NwSEhKsrKwcHBy0tbWVlZXbvEX0Wtzc3JycnLp37+7s7EwI4fF4Kioq8+bNmzFjxqJFi7S0tHr27Lljxw4PDw8bG5vz58/TGHpTU5OlpeWKFSu2bNnCPDVm5Pv37x88ePDWrVvz5s1jeu7cubPNoXg83pgxYzZs2DB58uRevXo5OTk
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = app.scrape_url(\n",
" url,\n",
" params={\n",
" \"formats\": [\n",
" \"screenshot@fullPage\",\n",
" ]\n",
" }\n",
")\n",
"\n",
"Image(data['screenshot'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a bonus, the `/scrape` endpoint can handle PDF links as well:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"arXiv:2411.09833v1 \\[math.DG\\] 14 Nov 2024\n",
"EINSTEIN METRICS ON THE FULL FLAG F(N).\n",
"MIKHAIL R. GUZMAN\n",
"Abstract.LetM=G/Kbe a full flag manifold. In this work, we investigate theG-\n",
"stability of Einstein metrics onMand analyze their stability types, including coindices,\n",
"for several cases. We specifically focus onF(n) = SU(n)/T, emphasizingn= 5, where\n",
"we identify four new Einstein metrics in addition to known ones. Stability data, including\n",
"coindex and Hessian spectrum, confirms that these metrics on"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pdf_link = \"https://arxiv.org/pdf/2411.09833.pdf\"\n",
"data = app.scrape_url(pdf_link)\n",
"\n",
"Markdown(data['markdown'][:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Configuration Options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, `scrape_url` converts everything it sees on a webpage to one of the specified formats. To control this behavior, Firecrawl offers the following parameters:\n",
"- `onlyMainContent`\n",
"- `includeTags`\n",
"- `excludeTags`\n",
"\n",
"`onlyMainContent` excludes the navigation, footers, headers, etc. and is set to True by default. \n",
"\n",
"`includeTags` and `excludeTags` can be used to whitelist/blacklist certain HTML elements:"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"[Help](https://info.arxiv.org/help) \\| [Advanced Search](https://arxiv.org/search/advanced)\n",
"\n",
"arXiv is a free distribution service and an open-access archive for nearly 2.4 million\n",
"scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
"Materials on this site are not peer-reviewed by arXiv.\n",
"\n",
"\n",
"[arXiv Operational Status](https://status.arxiv.org)\n",
"\n",
"Get status notifications via\n",
"[email](https://subscribe.sorryapp.com/24846f03/email/new)\n",
"or [slack](https://subscribe.sorryapp.com/24846f03/slack/new)"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://arxiv.org\"\n",
"\n",
"data = app.scrape_url(url, params={\"includeTags\": [\"p\"], \"excludeTags\": [\"span\"]})\n",
"\n",
"Markdown(data['markdown'][:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`includeTags` and `excludeTags` also support referring to HTML elements by their `#id` or `.class-name`. "
]
},
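{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch combining all three parameters (the `#content` and `.sidebar` selectors are hypothetical placeholders for whatever you find in your target page's markup):\n",
"\n",
"```python\n",
"data = app.scrape_url(\n",
"    url,\n",
"    params={\n",
"        \"onlyMainContent\": False,  # keep navigation, headers and footers\n",
"        \"includeTags\": [\"p\", \"#content\"],\n",
"        \"excludeTags\": [\"span\", \".sidebar\"],\n",
"    },\n",
")\n",
"```"
]
},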
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Structured Data Extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scraping clean, LLM-ready data is the core philosophy of Firecrawl. However, certain web pages with their complex structures can interfere with this philosophy when scraped in their entirety. For this reason, Firecrawl offers two scraping methods for better structured outputs:\n",
"\n",
"1. Natural language extraction - Use prompts to extract specific information and have an LLM structure the response\n",
"2. Manual structured data extraction - Define JSON schemas to have an LLM scrape data in a predefined format\n",
"\n",
"In this section, we will cover both methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scraping using natural language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate natural language scraping, let's try extracting all news article links that may be related to the 2024 US presidential election from the New York Times:"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://nytimes.com\"\n",
"\n",
"data = app.scrape_url(\n",
" url,\n",
" params={\n",
" 'formats': ['markdown', 'extract', 'screenshot'],\n",
" 'extract': {\n",
" 'prompt': \"Return a list of links of news articles that may be about the 2024 US presidential election\"\n",
" }\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To enable this feature, you are required to pass the `extract` option to the list of `formats` and provide a prompt in a dictionary to a separate `extract` key.\n",
"\n",
"Once scraping finishes, the response will include a new `extract` key:"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'news_articles': [{'title': 'Harris Loss Has Democrats Fighting Over How to Talk About Transgender Rights',\n",
" 'link': 'https://www.nytimes.com/2024/11/20/us/politics/presidential-campaign-transgender-rights.html'},\n",
" {'title': 'As Democrats Question How to Win Back Latinos, Ruben Gallego Offers Answers',\n",
" 'link': 'https://www.nytimes.com/2024/11/20/us/politics/ruben-gallego-arizona-latino-voters-democrats.html'},\n",
" {'title': 'A Key to Trumps Win: Heavy Losses for Harris Across the Map',\n",
" 'link': 'https://www.nytimes.com/interactive/2024/11/19/us/politics/voter-turnout-election-trump-harris.html'},\n",
" {'title': 'Trump Promises Clean Water. Will He Clean Up Forever Chemicals?',\n",
" 'link': 'https://www.nytimes.com/2024/11/20/climate/trump-pfas-lead-clean-water.html'},\n",
" {'title': 'Two Apartment Buildings Were Planned. Only One Went Up. What Happened?',\n",
" 'link': 'https://www.nytimes.com/interactive/2024/11/19/nyregion/affordable-housing-nyc-rent.html'},\n",
" {'title': 'Trump Selects Mehmet Oz to Run Centers for Medicare and Medicaid Services',\n",
" 'link': 'https://www.nytimes.com/2024/11/19/us/politics/trump-dr-oz-medicare-medicaid.html'},\n",
" {'title': 'The Final Push for Ukraine?',\n",
" 'link': 'https://www.nytimes.com/2024/11/20/briefing/ukraine-russia-trump.html'}]}"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['extract']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Due to the nature of this scraping method, the returned output can have arbitrary structure as we can see above. It seems the above output has the following format:\n",
"\n",
"```python\n",
"{\n",
" \"news_articles\": [\n",
" {\"title\": \"article_title\", \"link\": \"article_url\"},\n",
" ...\n",
" ]\n",
"}\n",
"```"
]
},
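{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the key names are chosen by the LLM, downstream code should not blindly hard-code them. Here is a minimal defensive sketch (the `news_articles` key is what this particular run returned; another run might pick a different name):\n",
"\n",
"```python\n",
"articles = data[\"extract\"].get(\"news_articles\", [])\n",
"\n",
"for article in articles:\n",
"    # Fall back to placeholders in case a field is missing\n",
"    title = article.get(\"title\", \"<no title>\")\n",
"    link = article.get(\"link\", \"<no link>\")\n",
"    print(f\"{title}\\n  {link}\")\n",
"```"
]
},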
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This LLM-based extraction can have endless applications, from extracting specific data points from complex websites to analyzing sentiment across multiple news sources to gathering structured information from unstructured web content.\n",
"\n",
"To improve the accuracy of the extraction and give additional instructions, you have the option to include a system prompt to the underlying LLM:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"data = app.scrape_url(\n",
" url,\n",
" params={\n",
" 'formats': ['markdown', 'extract'],\n",
" 'extract': {\n",
" 'prompt': \"Find any mentions of specific dollar amounts or financial figures and return them with their context and article link.\",\n",
" 'systemPrompt': \"You are a helpful assistant that extracts numerical financial data\"\n",
" }\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, we are dictating that the LLM must act as an assistant that extracts numerical financial data. Let's look at its response:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'financial_data': [{'amount': 121200000,\n",
" 'context': 'René Magritte became the 16th artist whose work broke the nine-figure threshold at auction when his painting sold for $121.2 million.',\n",
" 'article_link': 'https://www.nytimes.com/2024/11/19/arts/design/magritte-surrealism-christies-auction.html'},\n",
" {'amount': 5000000,\n",
" 'context': 'Benjamin Netanyahu offers $5 million for each hostage freed in Gaza.',\n",
" 'article_link': 'https://www.nytimes.com/2024/11/19/world/middleeast/israel-5-million-dollars-hostage.html'}]}"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['extract']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output shows the LLM successfully extracted two financial data points from the articles.\n",
"\n",
"The LLM not only identified the specific amounts but also provided relevant context and source article links for each figure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scraping with a predefined schema"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While natural language scraping is powerful for exploration and prototyping, production systems typically require more structured and deterministic approaches. LLM responses can vary between runs of the same prompt, making the output format inconsistent and difficult to reliably parse in automated workflows.\n",
"\n",
"For this reason, Firecrawl allows you to pass a predefined schema to guide the LLM's output when transforming the scraped content. To facilitate this feature, Firecrawl uses Pydantic models. \n",
"\n",
"In the example below, instead of extracting the entire NY Times homepage, we will extract only news article links, their titles with some additional details:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"\n",
"class IndividualArticle(BaseModel):\n",
" title: str = Field(description=\"The title of the news article\")\n",
" subtitle: str = Field(description=\"The subtitle of the news article\")\n",
" url: str = Field(description=\"The URL of the news article\")\n",
" author: str = Field(description=\"The author of the news article\")\n",
" date: str = Field(description=\"The date the news article was published\")\n",
" read_duration: int = Field(description=\"The estimated time it takes to read the news article\")\n",
" topics: list[str] = Field(description=\"A list of topics the news article is about\")\n",
"\n",
"class NewsArticlesSchema(BaseModel):\n",
" news_articles: list[IndividualArticle] = Field(\n",
" description=\"A list of news articles extracted from the page\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above, we define a Pydantic schema that specifies the structure of the data we want to extract. The schema consists of two models:\n",
"\n",
"`IndividualArticle` defines the structure for individual news articles with fields for:\n",
"- `title`: The article headline\n",
"- `subtitle`: Secondary headline text\n",
"- `url`: Link to the full article\n",
"- `author`: Article writer's name\n",
"- `date`: Publication date\n",
"- `read_duration`: Estimated reading time in minutes\n",
"- `topics`: List of article subject categories\n",
"\n",
"`NewsArticlesSchema` acts as a container model that holds a list of `IndividualArticle` objects, representing multiple articles extracted from the page. If we don't use this container model, Firecrawl will only return the first news article it finds.\n",
"\n",
"Each model field uses Pydantic's `Field` class to provide descriptions that help guide the LLM in correctly identifying and extracting the requested data. This structured approach ensures consistent output formatting.\n",
"\n",
"The next step is passing this schema to the `extract` parameter of `scrape_url`:"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://nytimes.com\"\n",
"\n",
"structured_data = app.scrape_url(\n",
" url,\n",
" params={\n",
" \"formats\": [\"extract\", \"screenshot\"],\n",
" \"extract\": {\n",
" \"schema\": NewsArticlesSchema.model_json_schema(),\n",
" \"prompt\": \"Extract the following data from the NY Times homepage: news article title, url, author, date, read_duration for all news articles\",\n",
" \"systemPrompt\": \"You are a helpful assistant that extracts news article data from NY Times.\",\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While passing the schema, we call its `model_json_schema()` method to automatically convert it to valid JSON. Let's look at the output:"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'news_articles': [{'title': 'How Google Spent 15 Years Creating a Culture of Concealment',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/technology/google-antitrust-employee-messages.html',\n",
" 'author': 'David Streitfeld',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 9,\n",
" 'topics': []},\n",
" {'title': 'World Leaders Seek Stability With China as President Biden Exits the Stage',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/us/politics/trump-world-leaders-china-xi-stability.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 5,\n",
" 'topics': []},\n",
" {'title': 'Linda McMahon Is Chosen by Trump to Be Education Secretary',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/19/us/politics/linda-mcmahon-education-secretary-trump.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 4,\n",
" 'topics': []},\n",
" {'title': 'Trump Selects Mehmet Oz to Run Centers for Medicare and Medicaid Services',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/19/us/politics/trump-dr-oz-medicare-medicaid.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 2,\n",
" 'topics': []},\n",
" {'title': 'Trump Promises Clean Water. Will He Clean Up Forever Chemicals?',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/climate/trump-pfas-lead-clean-water.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'Medicaid May Face Big Cuts and Work Requirements',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/health/medicaid-cuts-republican-congress.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'As Donald Trump Pushes Peace, Russia Intensifies Assaults on Ukraine',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/world/europe/russia-ukraine-war-attacks-trump.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'U.S. Closes Its Kyiv Embassy, Warning of Significant Air Attack',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/world/europe/us-embassy-ukraine-warning.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 4,\n",
" 'topics': []},\n",
" {'title': 'The Secret to the Best Turkey Came From a Reader',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/19/dining/best-thanksgiving-turkey-recipe.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 4,\n",
" 'topics': []},\n",
" {'title': 'A Key to Trumps Win: Heavy Losses for Harris Across the Map',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/interactive/2024/11/19/us/politics/voter-turnout-election-trump-harris.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 2,\n",
" 'topics': []},\n",
" {'title': 'As Democrats Question How to Win Back Latinos, Ruben Gallego Offers Answers',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/us/politics/ruben-gallego-arizona-latino-voters-democrats.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'Flying Above the Bombs, a Lebanese Airline Becomes an Unlikely National Hero',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/world/middleeast/middle-east-airlines-lebanon.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 5,\n",
" 'topics': []},\n",
" {'title': 'Is the Northeast Entering Its Wildfire Era?',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/nyregion/new-york-wildfires-drought.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'Harris Loss Has Democrats Fighting Over How to Talk About Transgender Rights',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/us/politics/presidential-campaign-transgender-rights.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 7,\n",
" 'topics': []},\n",
" {'title': 'House Republicans Target McBride With Capitol Bathroom Bill',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/19/us/politics/sarah-mcbride-nancy-mace-capitol-bathroom-bill.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 5,\n",
" 'topics': []},\n",
" {'title': 'Two Apartment Buildings Were Planned. Only One Went Up. What Happened?',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/interactive/2024/11/19/nyregion/affordable-housing-nyc-rent.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 0,\n",
" 'topics': []},\n",
" {'title': 'Long Tied to Russia, Georgias Winemakers Tip a Glass to the West',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/world/europe/georgia-russia-wine-sanctions.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 5,\n",
" 'topics': []},\n",
" {'title': 'A Dissidents Final Act of Protest Stuns Iran',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/19/world/middleeast/iran-kianoosh-sanjari-suicide.html',\n",
" 'author': '',\n",
" 'date': '2024-11-19',\n",
" 'read_duration': 6,\n",
" 'topics': []},\n",
" {'title': 'The Reintroduction of Daniel Craig',\n",
" 'subtitle': '',\n",
" 'url': 'https://www.nytimes.com/2024/11/20/movies/daniel-craig-queer.html',\n",
" 'author': '',\n",
" 'date': '2024-11-20',\n",
" 'read_duration': 9,\n",
" 'topics': []}]}"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"structured_data['extract']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, the response fields exactly match the fields we set during schema definition:\n",
"\n",
"```python\n",
"{\n",
" \"news_articles\": [\n",
" {...}, # Article 1\n",
" {...}, # Article 2,\n",
" ...\n",
" ]\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When creating the scraping schema, these best practices can go a long way in ensuring reliable and accurate data extraction:\n",
"\n",
"1. Keep field names simple and descriptive\n",
"2. Use clear field descriptions that guide the LLM\n",
"3. Break complex data into smaller, focused fields\n",
"4. Include validation rules where possible\n",
"5. Consider making optional fields that may not always be present\n",
"6. Test the schema with a variety of content examples\n",
"7. Iterate and refine based on extraction results\n",
"\n",
"To follow these best practices, the following Pydantic tips can help:\n",
"\n",
"1. Use `Field(default=None)` to make fields optional\n",
"2. Add validation with `Field(min_length=1, max_length=100)` \n",
"3. Create custom validators with @validator decorator\n",
"4. Use `conlist()` for list fields with constraints\n",
"5. Add example values with `Field(example=\"Sample text\")`\n",
"6. Create nested models for complex data structures\n",
"7. Use computed fields with `@property` decorator\n",
"\n",
"If you follow all these tips, your schema can become quite sophisticated like below: "
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"from typing import Optional, List\n",
"from datetime import datetime\n",
"\n",
"\n",
"class Author(BaseModel):\n",
" # Required field - must be provided when creating an Author\n",
" name: str = Field(\n",
" ...,\n",
" min_length=1,\n",
" max_length=100,\n",
" description=\"The full name of the article author\",\n",
" )\n",
"\n",
" # Optional field - can be None or omitted\n",
" title: Optional[str] = Field(\n",
" None, description=\"Author's title or role, if available\"\n",
" )\n",
"\n",
"\n",
"class NewsArticle(BaseModel):\n",
" # Required field - must be provided when creating a NewsArticle\n",
" title: str = Field(\n",
" ...,\n",
" min_length=5,\n",
" max_length=300,\n",
" description=\"The main headline or title of the news article\",\n",
" example=\"Breaking News: Major Scientific Discovery\",\n",
" )\n",
"\n",
" # Required field - must be provided when creating a NewsArticle\n",
" url: str = Field(\n",
" ...,\n",
" description=\"The full URL of the article\",\n",
" example=\"https://www.nytimes.com/2024/01/01/science/discovery.html\",\n",
" )\n",
"\n",
" # Optional field - can be None or omitted\n",
" authors: Optional[List[Author]] = Field(\n",
" default=None, description=\"List of article authors and their details\"\n",
" )\n",
"\n",
" # Optional field - can be None or omitted\n",
" publish_date: Optional[datetime] = Field(\n",
" default=None, description=\"When the article was published\"\n",
" )\n",
"\n",
" # Optional field with default empty list\n",
" financial_amounts: List[float] = Field(\n",
" default_factory=list,\n",
" max_length=10,\n",
" description=\"Any monetary amounts mentioned in the article in USD\",\n",
" )\n",
"\n",
" @property\n",
" def is_recent(self) -> bool:\n",
" if not self.publish_date:\n",
" return False\n",
" return (datetime.now() - self.publish_date).days < 7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The schema above defines two key data models for news article data:\n",
"\n",
"Author - Represents article author information with:\n",
"- `name` (required): The author's full name \n",
"- `title` (optional): The author's role or title\n",
"\n",
"NewsArticle - Represents a news article with:\n",
"- `title` (required): The article headline (5-300 chars)\n",
"- `url` (required): Full article URL\n",
"- `authors` (optional): List of Author objects\n",
"- `publish_date` (optional): Article publication datetime\n",
"- `financial_amounts` (optional): List of monetary amounts in USD\n",
"\n",
"The NewsArticle model includes an `is_recent` property that checks if the article was published within the last 7 days."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, web scraping process becomes much easier and more powerful if you combine it with structured data models that validate and organize the scraped information. This allows for consistent data formats, type checking, and easy access to properties like checking if an article is recent."
]
},
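{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of that validation step, you can round-trip an extracted dictionary through the model itself. Here is a minimal sketch, assuming Pydantic v2's `model_validate` method and the `NewsArticle` model defined above (the article values are made up for demonstration):\n",
"\n",
"```python\n",
"raw_article = {\n",
"    \"title\": \"Breaking News: Major Scientific Discovery\",\n",
"    \"url\": \"https://www.nytimes.com/2024/01/01/science/discovery.html\",\n",
"    \"publish_date\": \"2024-01-01T09:00:00\",\n",
"}\n",
"\n",
"# Raises a ValidationError if any field is malformed\n",
"article = NewsArticle.model_validate(raw_article)\n",
"\n",
"print(article.is_recent)  # computed property defined on the model\n",
"```"
]
},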
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch Operations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up to this point, we have been focusing on scraping pages one URL at a time. In reality, you will work with multiple, perhaps, thousands of URLs that need to be scraped in parallel. This is where batch operations become essential for efficient web scraping at scale. Batch operations allow you to process multiple URLs simultaneously, significantly reducing the overall time needed to collect data from multiple web pages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batch Scraping with `batch_scrape_urls`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `batch_scrape_urls` method lets you scrape multiple URLs at once.\n",
"\n",
"Let's scrape all the news article links we obtained from our previous schema extraction example."
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"articles = structured_data['extract']['news_articles']\n",
"article_links = [article['url'] for article in articles]\n",
"\n",
"class ArticleSummary(BaseModel):\n",
" title: str = Field(description=\"The title of the news article\")\n",
" summary: str = Field(description=\"A short summary of the news article\")\n",
"\n",
"batch_data = app.batch_scrape_urls(article_links, params={\n",
" \"formats\": [\"extract\"],\n",
" \"extract\": {\n",
" \"schema\": ArticleSummary.model_json_schema(),\n",
" \"prompt\": \"Extract the title of the news article and generate its brief summary\",\n",
" }\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is what is happening in the codeblock above:\n",
"\n",
"- We extract the list of news articles from our previous structured data result\n",
"- We create a list of article URLs by mapping over the articles and getting their 'url' field\n",
"- We define an `ArticleSummary` model with title and summary fields to structure our output\n",
"- We use `batch_scrape_urls()` to process all article URLs in parallel, configuring it to:\n",
" - Extract data in structured format\n",
" - Use our `ArticleSummary` schema\n",
" - Generate titles and summaries based on the article content\n",
"\n",
"The response from `batch_scrape_urls()` is a bit different:"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data'])"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_data.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It contains the following keys:\n",
"\n",
"- `success`: Boolean indicating if the batch request succeeded\n",
"- `status`: Current status of the batch job\n",
"- `completed`: Number of URLs processed so far\n",
"- `total`: Total number of URLs in the batch\n",
"- `creditsUsed`: Number of API credits consumed\n",
"- `expiresAt`: When the results will expire\n",
"- `data`: The extracted data for each URL\n",
"\n",
"Let's focus on the `data` key where the actual content is stored:"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(batch_data['data'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The batch processing completed successfully with 19 articles. Let's examine the structure of the first article:"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['extract', 'metadata'])"
]
},
"execution_count": 147,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_data['data'][0].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response format here matches what we get from individual `scrape_url` calls."
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'title': 'Ukrainian Forces Face Increasing Challenges Amidst Harsh Winter Conditions', 'summary': 'As the war in Ukraine enters its fourth winter, conditions are worsening for Ukrainian soldiers who find themselves trapped on the battlefield, surrounded by Russian forces. Military commanders express concerns over dwindling supplies and increasingly tough situations. The U.S. has recently allowed Ukraine to use American weapons for deeper strikes into Russia, marking a significant development in the ongoing conflict.'}\n"
]
}
],
"source": [
"print(batch_data['data'][0]['extract'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scraping was performed according to our specifications, extracting the metadata, the title and generating a brief summary."
]
},
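{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every item under `data` follows the same `extract`/`metadata` shape, looping over the whole batch is straightforward. Here is a minimal sketch:\n",
"\n",
"```python\n",
"for item in batch_data[\"data\"]:\n",
"    summary = item[\"extract\"]\n",
"    print(f\"- {summary['title']}\")\n",
"```"
]
},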
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Asynchronous batch scraping with `async_batch_scrape_urls`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scraping the 19 NY Times articles in a batch took about 10 seconds on my machine. While that's not much, in practice, we cannot wait around as Firecrawl batch-scrapes thousands of URLs. For these larger workloads, Firecrawl provides an asynchronous batch scraping API that lets you submit jobs and check their status later, rather than blocking until completion. This is especially useful when integrating web scraping into automated workflows or processing large URL lists.\n",
"\n",
"This feature is available through the `async_batch_scrape_urls` method and it works a bit differently:"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [],
"source": [
"batch_scrape_job = app.async_batch_scrape_urls(\n",
" article_links,\n",
" params={\n",
" \"formats\": [\"extract\"],\n",
" \"extract\": {\n",
" \"schema\": ArticleSummary.model_json_schema(),\n",
" \"prompt\": \"Extract the title of the news article and generate its brief summary\",\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using `async_batch_scrape_urls` instead of the synchronous version, the response comes back immediately rather than waiting for all URLs to be scraped. This allows the program to continue executing while the scraping happens in the background."
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'success': True,\n",
" 'id': '77a94b62-c676-4db2-b61b-4681e99f4704',\n",
" 'url': 'https://api.firecrawl.dev/v1/batch/scrape/77a94b62-c676-4db2-b61b-4681e99f4704'}"
]
},
"execution_count": 151,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_scrape_job"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response contains an ID belonging the background task that was initiated to process the URLs under the hood. \n",
"\n",
"You can use this ID later to check the job's status with `check_batch_scrape_status` method:"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['success', 'status', 'total', 'completed', 'creditsUsed', 'expiresAt', 'data', 'error', 'next'])"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_scrape_job_status = app.check_batch_scrape_status(batch_scrape_job['id'])\n",
"\n",
"batch_scrape_job_status.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the job finished scraping all URLs, its `status` will be set to `completed`:"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'completed'"
]
},
"execution_count": 155,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_scrape_job_status['status']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at how many pages were scraped:"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"batch_scrape_job_status['total']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response always includes the `data` field, whether the job is complete or not, with the content scraped so far. It has `error` and `next` fields to indicate if any errors occurred during scraping and whether there are more results to fetch."
]
},
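{
"cell_type": "markdown",
"metadata": {},
"source": [
"In an automated workflow, you would typically poll the job until it finishes before reading `data`. Here is a minimal polling sketch built on the methods shown above (the 30-second interval is an arbitrary choice):\n",
"\n",
"```python\n",
"import time\n",
"\n",
"while True:\n",
"    status = app.check_batch_scrape_status(batch_scrape_job[\"id\"])\n",
"    if status[\"status\"] == \"completed\":\n",
"        break\n",
"    print(f\"{status['completed']}/{status['total']} URLs scraped so far...\")\n",
"    time.sleep(30)\n",
"\n",
"results = status[\"data\"]\n",
"```"
]
},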
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scraping JavaScript-based Dynamic Websites"
]
},
{
"attachments": {
"image.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABLAAAAPrCAYAAABIz3V1AAAAAXNSR0IArs4c6QAAADhlWElmTU0AKgAAAAgAAYdpAAQAAAABAAAAGgAAAAAAAqACAAQAAAABAAAEsKADAAQAAAABAAAD6wAAAAAxam4+AABAAElEQVR4Aey9B6BvSVXmWyfe1H1DZzrS0E2yYQAHZdQn8iSISHiAwgACKjkp6KgjKghIcAAjoyI+QQFFgo7SJAlCDzhiAiUIdNM5d98cT5zv961ae+9/OuGG7tvtqXvPv3alVWutWpXWXlV7bPy+j18cK6Us6m9l/mIZV94FFbCvkmP6IzyBrwf8WeWZOrhPv0McFVW3IN8IZET1x0BIjjrSLQ7LmInCYgEKKuzxRQBUIE2elT+MjwGog2gtOq+oRcEOHgi+87V4Br6D5aI4+cUjBQL+UjhG/WMV/oTKLOp5QTRR/3zFZ5hHe/SQTjUdN2bedCL6011nl/OdvAOPKlxpyqSgLUN9PvyrNLUpYBw8M+49yCtXH35tOT2pWBdT5DBgRK6QmYCd5cbHe8MZj4/8ti5kmzp68WvD0EqbjCvT2MSY/YXK30XDivZKmLRnt4ourkvSmQD6/HH1icQNPOg3dBPwCVd9eySKRvJlB+uD1w0m78yuyjNzJ/mniuYV0dJA3w8HH+EDspouZbmV/cRROfS4pNwkkCX8rlwNyLjKkW5ZGZA/2t2jRw90y0KHT8Gzlp5u5oDc0uM2yXpqdPT7rrR2ICySGvIHP0fkijyVp8l32inau63fGevPuICljBBFOdOWpFCstmnKPzLC/3QTnWfkK53bW2HiYJXLkSgcoajBkQaujrZxW4g/yAS40yez/efmQnbIblyH1N3UpbQsF+DJXAuAV6WDcbs3Hzir/o58Uj77bsBSGfBuUVc4YJu6Wo3z1jyNuNRwA14J0OJ26DIwK2p8aojCA7Kq+sbGeiXDMDv4NWCGPERbACTomnc9DuhpCSBNftWdMg18xc8vzJYN69eXc887r2zburXs3rOnXH/9DWXHzh1lYXbOckBWeA/uEwhjdd32SL7HmNm2Q8OzLFRxIWg+IT/+G486kOPEEfxquRiXFNFxE1PTZf26dWV+bqHMzB4qC3Pzyi8JVl8c0/g1j5w2EKJg1uWQWEkY3D0OagyBJwuuBl4JR9ELb0/edlI57bTTVWyx3HDjjWXXzp2trKncwvy8+Qj6B2YOCm4pM/v3C8ZimdywwX0r234RGVDdwMaNKxx/4ha0u158yVL9G5+aLFOid2pyWgu1yTKh8NjUVJlUHDSMCRb8mNafgav45NSEykwqKPrm5/QHb8QT8WZ8bKJMi3fwjzrnDh0qMzMzZU50LEIP+fRH/54YnxB82j8kcMOG9WWLZGVcdW7fsUO82GWcp6fXlamJcY9NwEjH2AwOtCtwF+aEC3EKG1clpUtaXL/yzM7O+o8wfZO2XRSOOc65nOCE2LSyadjUpT/PYfXZYxXVqszExESZFC8Zu+DNPHCVDzc1OVlOOOEE839e+B44dLDMqT9QL93f7UVG04AXtCxKkBbE69n5WeNJO9MGtMvU5ETZsmWz+HyoHNh/QDCRWfUxOfeFBfHe0qY6EAL9h4+0yZzaZl51T6/fUDaqv06q7eEV6bOqb+bQrNKBFfh7HBM8IRM0QRfsAXd+/Ni2Ef2adiYLbQMv5tWfZudm9DfnepBf06L61yNnqh+ZOaQ2grZNGzcZLrJG/N69+9xeRNIGC+qnCyojyfXcYr4rjb5NP58QfYFRtNn7fvXZ5SUv/sly2hlnlh3bbykPechDy0/8+I+WV7/61aqzlLe86Q3ld976++Uzn/m08Xv8459Qnv70p5SXv/xlZe/+2bJrx03lZS97WTnr7LPL85/3/PLmN7+5bDph02iYao+3vOXXy9v+4I/Kxz/2kXLGXc4qV1z+jfImlTth06by4hf/VDn9jLPL9u03lYd+3/eUZz3rWeWXX/krZdu2reXFL3qh5HS+fOrTnyqXfPYz5fIrry5nnnlu2bz5JFG25tY4sMaBNQ7cPhz43E0x3jPuz2uuw2eNw7qGX8Z8wvZrLLMIqcfan9TM4AmVmrz4WUHYCzTlZzJhA7DglYEmERPFki8WC8J/pMupj8kcInEQnK7dEGVMb3obO/wpFpHD04iF2U3FPPc58Ftyn9EUTuwBAAVdKogbdLGMI15lxyonFo1RX2YvQxznzYZ4zaQPXigJc3HuDF00FJH8JW0YZJe5HX7AOfFJOcrwSHRgaR99TV7Fd2ltedvkOKoP2SIseLOtc0HHIgvZT4TYzPdvkL2oV8mEc1SRO8bA3AQinIVoNIcW8SY2ZJ7Fe8okMfAo8gVi3efkHSmpUOymB8Qod1v8GgchkLLkdusg0au8AtNOIiGNgV38+wePhMsYifOGWgWajUtP4chzLH5pkwUNIHXv6yoSt576lsEnZb7bwh4v+8qNkvPsF/QH5p15MYbyTf7KJ8LJWeqc1EZnUpvDOTZ52hCyITXjBWOUS/rSJ9+oMceTc6cl6cMtrUNq8E6vjYf8hJ20gFpDGxn6Ue3jWQutfUJOvOFUwwV8NnGqAWWLyvO8amc8UEbwMFg+8O9Htq1lbHHeZVH03HLzLWXXrl3l0KGZsm/vHm+sF9kAa6yw4kJ1SI0RhV1fC+dwnywZ0K76y7g2t4I/zgCkOs1kVef+JaGn7XvaEVS0yT5wYL+VR0E+kZXPtV0DUssby7ihKWvT0iqldqAOxka31SS+/tReSNCJW7aUc6Tkg9cHDh4se3btFt7taLkgBcg+KSXWb1xXtm7eXKbWTUm+N5f9+/YJx31WLsXcBh9Za8nXH20ETMuGfFgceMSmnv5OX0dZNCmFi/Ehj5QM4/qjb41JCTM1OVWmpECaRGFlOkqZnkbpJXUBAMQC81Jpc1JMzKmdUYzsl5LNihsUOChQBBNnfoOPn/ViQ+XBlf5KmUMzs4Z3UJt+4sal/EjHmAAhVjbpGYUIcMEDGsZRvjhPlgifthJQWKCyol9wcQu0DcrUPseYn20EbMpTdyqhAvugBYwMHzoArj8rLdXGE+LdpGVOmcQTys/OzZZ9ajt4aphkppwclMaTg/5JmNAMLusm11umiJ8Q7Olp1aFxb4vkaJ2Uhvv27yu3qs/t3C05Em2ef1AUClpIhepw+6o+2nfTlNtpRsqigwcWyrqywcpklGooxA5KYeR2EADaKV5YuMUAKNqDG/bdrqSZYx6/4Tvl5+RbzqSIpMisFnmM77QESlrWfItqexwKQPIwF+xFWSu6pxQX/GPsILOz+odH8ua6kTUxiuYFqlIFbm2lL7rvLpZDB6XgUx+DX5PCB2Uada6XAm1GzwCfOaQ2kjzNSJ5V1HHI85f+5QvlggsucH76DnKyfv068e5gwKSvqI8Dh3ahDpSlwNy3f6+Vu+ukLBwX79cp70H1+XnhwktOWINMU+5EKcT+9tOfLrulxH3EIx9eHvUDj/TfO975TsV/xgqscZUxH8wPIalghPt98t226eefd67Hs
xtuuKm215HVf5973qNcdvnlnkcG6eyl74ILzi+XXnp5hx9t+kkoyMXoW27ZPjT9eOFfLx4t/r3x2c5r6be1fPe2w39M/nuNoG5tfU8db4MvwY8cj0mnfXIfsFJ90pHk1+qkTviqW/892YQ/Osy0BdLhxzRG2BZYgsIEPQuQZVwsBkdn6qb37RNUqL+C/jBw++PUCit1/UVHlUuQyg8/EudBfPsBUDAWWP0pEVZ6g0Pz4CTzPRNJipm3AdMPlWVq14KiyXg7PvTjeDuicphV97aJgRx3RA3iaBQHo1fFgxT5KBTS6OdcYCng8UGdoTfvMtVQHtcnzxF5dH6hv9n8ZH2AFqLDm0849XTm0RQltSztR7q6qdJK1v17kU1UrTitAryBEgg427o+mKClf2xwot6+9FpQa28vvnvyiJ5ea9agKetv6zw2T2wo7SorbVGiiAUhG6pRaKl/TIjQqLYKRUuxpcpBbTiwnJllw6fGy811AO79zTG5G4t8DuWYInOzlhvZ9F3e6V1IwGn55w0VebTh
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"Out in the wild, most of the websites you encounter will be dynamic, meaning their content is generated on-the-fly using JavaScript rather than being pre-rendered on the server. These sites often require user interaction like clicking buttons or typing into forms before displaying their full content. Traditional web scrapers that only look at the initial HTML fail to capture this dynamic content, which is why browser automation capabilities are essential for comprehensive web scraping.\n",
"\n",
"Firecrawl supports dynamic scraping by default. In the parameters of `scrape_url` or `batch_scrape_url`, you can define necessary actions to reach the target state of the page you are scraping. As an example, we will build a scraper that will extract the following information from `https://weather.com`:\n",
"\n",
"- Current Temperature\n",
"- Temperature High\n",
"- Temperature Low\n",
"- Humidity\n",
"- Pressure\n",
"- Visibility\n",
"- Wind Speed\n",
"- Dew Point\n",
"- UV Index\n",
"- Moon Phase\n",
"\n",
"These details are displayed for every city you search through the website:\n",
"\n",
"![A screenshot of the weather.com website for London](attachment:image.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike websites such as Amazon where you can simply modify the URL's search parameter (e.g. `?search=`), weather.com presents a unique challenge. The site generates dynamic and unique IDs for each city, making traditional URL manipulation techniques ineffective. To scrape weather data for any given city, you must simulate the actual user journey: visiting the homepage, interacting with the search bar, entering the city name, and selecting the appropriate result from the dropdown list. This multi-step interaction process is necessary because of how weather.com structures its dynamic content delivery (at this point, I urge to visit the website and visit a few city pages).\n",
"\n",
"Fortunately, Firecrawl natively supports such interactions through the `actions` parameter. It accepts a list of dictionaries, where each dictionary represents one of the following interactions:\n",
"\n",
"- Waiting for the page to load\n",
"- Clicking on an element\n",
"- Writing text in input fields\n",
"- Scrolling up/down\n",
"- Take a screenshot at the current state\n",
"- Scrape the current state of the webpage\n",
"\n",
"Let's define the actions we need for weather.com:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"actions = [\n",
" {\"type\": \"wait\", \"milliseconds\": 3000},\n",
" {\"type\": \"click\", \"selector\": 'input[id=\"LocationSearch_input\"]'},\n",
" {\"type\": \"write\", \"text\": \"London\"},\n",
" {\"type\": \"screenshot\"},\n",
" {\"type\": \"wait\", \"milliseconds\": 1000},\n",
" {\"type\": \"click\", \"selector\": \"button[data-testid='ctaButton']\"},\n",
" {\"type\": \"wait\", \"milliseconds\": 3000},\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's examine how we choose the selectors, as this is the most technical aspect of the actions. Using browser developer tools, we inspect the webpage elements to find the appropriate selectors. For the search input field, we locate an element with the ID \"LocationSearch_input\". After entering a city name, we include a 3-second wait to allow the dropdown search results to appear. At this stage, we capture a screenshot for debugging to verify the text input was successful. \n",
"\n",
"The final step involves clicking the first matching result, which is identified by a button element with the `data-testid` attribute \"ctaButton\". Note that if you're implementing this in the future, these specific attribute names may have changed - you'll need to use browser developer tools to find the current correct selectors.\n",
"\n",
"Now, let's define a Pydantic schema to guide the LLM:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"class WeatherData(BaseModel):\n",
" location: str = Field(description=\"The name of the city\")\n",
" temperature: str = Field(description=\"The current temperature in degrees Fahrenheit\")\n",
" temperature_high: str = Field(description=\"The high temperature for the day in degrees Fahrenheit\")\n",
" temperature_low: str = Field(description=\"The low temperature for the day in degrees Fahrenheit\")\n",
" humidity: str = Field(description=\"The current humidity as a percentage\")\n",
" pressure: str = Field(description=\"The current air pressure in inches of mercury\")\n",
" visibility: str = Field(description=\"The current visibility in miles\")\n",
" wind_speed: str = Field(description=\"The current wind speed in miles per hour\")\n",
" dew_point: str = Field(description=\"The current dew point in degrees Fahrenheit\")\n",
" uv_index: str = Field(description=\"The current UV index\")\n",
" moon_phase: str = Field(description=\"The current moon phase\")"
]
},
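{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, `model_json_schema()` converts this class into a plain JSON Schema dictionary that is passed along with the request. If you're curious, you can preview it before scraping:\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Inspect the JSON schema that will guide the LLM extraction\n",
"print(json.dumps(WeatherData.model_json_schema(), indent=2))\n",
"```"
]
},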
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's pass these objects to `scrape_url`:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://weather.com\"\n",
"\n",
"data = app.scrape_url(\n",
" url,\n",
" params={\n",
" \"formats\": [\"screenshot\", \"markdown\", \"extract\"],\n",
" \"actions\": actions,\n",
" \"extract\": {\n",
" \"schema\": WeatherData.model_json_schema(),\n",
" \"prompt\": \"Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase\",\n",
" },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scraping only happens once all actions are performed. Let's see if it was successful by looking at the `extract` key:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'location': 'London, England, United Kingdom',\n",
" 'temperature': '33°',\n",
" 'temperature_high': '39°',\n",
" 'temperature_low': '33°',\n",
" 'humidity': '79%',\n",
" 'pressure': '29.52in',\n",
" 'visibility': '10 mi',\n",
" 'wind_speed': '5 mph',\n",
" 'dew_point': '28°',\n",
" 'uv_index': '0 of 11',\n",
" 'moon_phase': 'Waning Gibbous'}"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['extract']"
]
},
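{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we already have the `WeatherData` model, we can validate this raw dictionary immediately; Pydantic raises a `ValidationError` if any field is missing or malformed. We'll reuse this pattern in the reusable function later on:\n",
"\n",
"```python\n",
"# Validate the extracted dictionary against the schema\n",
"weather = WeatherData(**data[\"extract\"])\n",
"print(weather.temperature, weather.moon_phase)\n",
"```"
]
},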
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All details are accounted for! But, for illustration, we need to take a closer look at the response structure when using JS-based actions:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['markdown', 'screenshot', 'actions', 'metadata', 'extract'])"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response has a new actions key:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'screenshots': ['https://service.firecrawl.dev/storage/v1/object/public/media/screenshot-16bf71d8-dcb5-47eb-9af4-5fa84195b91d.png'],\n",
" 'scrapes': []}"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['actions']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The actions array contained a single screenshot-generating action, which is reflected in the output above.\n",
"\n",
"Let's look at the screenshot:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAB4AAAAQ4CAIAAABnsVYUAAAAAXNSR0IArs4c6QAAIABJREFUeJzs3WdcFFcXB+CzjYWl9947ooKCCDZUFLFiib3XWGJJTIzGN8YYNcXYYjdqTKyJXUTBgg0rKqKAIh2k9162vB9GN4iARFjW8n9+fti5c+fOmXVd4cydc1k0ahUBAAAAAAAAAAAAADQ3trwDAAAAAAAAAAAAAIAPExLQAAAAAAAAAAAAACATSEADAAAAAAAAAAAAgEwgAQ0AAAAAAAAAAAAAMoEENAAAAAAAAAAAAADIBBLQAAAAAAAAAAAAACATSEADAAAAAAAAAAAAgEwgAQ0AAAAAAAAAAAAAMoEENAAAAAAAAAAAAADIBBLQAAAAAAAAAAAAACATSEADAAAAAAAAAAAAgEwgAQ0AAAAAAAAAAAAAMsGVdwAAIEOH5/rrqyvLO4o32BR878jtJ5a66mY66lejkyXyjgcAAAAAAAAAAJoLEtAAHzIvOxMTLVV5R/EGJ8NizHXUujuZx2bmD2xve/LeM3lHBAAAAAAAAAAAzQMlOABA/hyNdHKKyztYG1VWi+QdCwAAAAAAAAAANBskoAFA/qLTctqa61VUC1UUFeQdCwAAAAAAAAAANBuU4AAA+UvKKVpxPFTeUQAAAAAAAAAAQDPDDGgAAAAAAAAAAAAAkAkkoAEAAAAAAAAAAABAJlCCAwAAoKm6OpiO8HRqb2mgIeBrKCvqqyvLOyIAeCdkFpYWlFbkl1bciUs/cvvJtacp8o4IAAAAAKClIQENAADwlgw1lD/v6zGuizMyzgBQJ311Zeb7oaOt8dw+bpmFpX9de7w28HZ6Qam8QwMAAAAAaCFIQAMAAPxnOqpKS/07zfNzZzaFIjGXg6pWANAQoUisr668sL/Hwv4ea8/cWX3qRk5xubyDAgAAAACQOSSgAQAA/pvZvdr9MqanksK//4dyOezYzPxrT1Jux6YVlFYUllcWllVWCUVyDRMA5EmBy1EX8NWV+Joqih2sjbo6mFnra0j3ft6vw8xe7WbvCdpzJUKuYQIAAAAAyBwS0AAAAI3FYbH2fNp/XBdnIhJLJGwWi4j+vhX957VHGXigHgBqqBKKsovKsovKKJPuxqVvDr5nqKEysVuboR3smS8QJQXu7hn92lkYzP/zvEgikXe8AAAAAACyggQ0AABAo2gI+Ge+Gu5lZ8JsslmsO3FpP526mZRTJO/QAOA9kF5QsvrkjUM3or4e5Nne0oBpnOPbvq253sA1/xSUVco7QAAAAAAAmUDBSgAAgDdjsyhw0QgvOxOhSMy0rDlze9buIGSfAeA/ScgumPH72XWBd5hNoUjcxcE04MvhbJa8IwMAAAAAkA0koAE+ZGLxGx7pjX6esyvk4eOU7JaKqA546hjeCzun9fW0NWYWGyyuqJq648yhG1HyDgoA3lf7QyNn7j5XVF7F5bCFInEne5MdU/vKOygAAAAAAJlACQ6Aj8Kf1x5N2BpQq3HZ0C5EtPzotc2Tejub6sopNID3wLw+bpO924olEib7PG7zqdS8YnkHBQDvt7tx6RO2nv5z1gBVRQWxRDKle9uI5KyNQWHyjgsAAAAAoJlhBjTAh0zycnqxvrqyt5N5N0czZtPbydzbybybo6makgIRlVcJ5RomwDvNQkf917E+TNFnoUj82R/ByD4DQLNIyS2at/e8SPxiRdO143xMtFTlHRQAAAAAQDPDDGiADxmLXlSU9G1j5dvGiohYo1cr8rghS0cz7RHJWUTEYrHOP0q4E5fmYq7fz9WG2RX1PCc4IsFYS6Wfi42Az5PfRQDI2YrhXTlslkRCLBYtO3JNviVrAOADE5GctepE6P+GdJZIiMNmrRrhPX7raXkHBQAAAADQnJCABviQSRpXYHn50etF5ZXM65ClY7ydzPZefTRxW0BrU93c4vLP/jgf+t04a31NGQcL8C6yM9Aa3akVk32+9iQlKCJe3hEBwDvN2UR3oJutjb6mrYFWYVllbGb+w6TMgzciK6pF9R1y8t4zn9aWnrbGRDSmc6uVJ0Kfpue1bNTwsWtnod/XxcZcV81QQ6WiWhiZmvM0LTckKim9oFTeoQEAAMCHAAloACAfZ4tfxvQ4fDNqyeEr9xLS3awMZu4+p68uCFs56Wp0Sq/VBzcGhW0Y30veYQLIwXw/dzaLJZGQWCJZG3hH3uHAxyts5SS3b/bIOwpoiAKXM9/PfXhHR2mLkgLXQEO5s73JUA+HZf9cvZeQUd+x68/e8bQdLJEQm8X6zNdtzh/BLRU1fNRYRJO92y4d3MlCV71m+9AOREQiseTwzajfgsJuxabJLUSoh7qAP6i93RB3OwdjbSNNVVVFBXlH1CjFFVXP84qfpOUeu/v01L1nhWWVLXNeMx11T1sjV3N9F3N9Fwt9fXXlljnv28koKAlPygxPynqQmHkz5nlKXlELB7DAz335sK6qSs35oSour1p25Oq6s3ebcUwAeL8gAQ3wIZOW4GiYl52xlZ6GmY46EQlF4rvx6eVVQgUOx3f1oYpqIRGFJ2bJPliAd9Fgd3siYrHo9L3YlNyW/gUAAN4jq0d6M2stJOYUHgqNSsguYLFYA9rZ9nS2MFBX3jbFb+K205GpOXUeG5dZEBge19fFmvnaQQIaWoC5jtrf8wZ3sDZ6mpabll+iIeCn5Bax2SxbAy2mw42Y1OVHr8dkYD7+u0VfXXnJIK/pPV0Uee/f7/KqigoORtoORtr+bnYV1cLtFx+sPnkzs1C2E+3n9XHr6mBWJRQ9SMpce/ZOeGJmVlGZTM/YRPrqyi7m+i7mekPc7UZ5OV18nLgp+F5LBrBooGfzZp+JSFVJYdFATySgAT5m799/WgAga2KxhIjYbBaxWIoKPG8nc2t9DXkHBSAHHawNDTSURWIJh806fDNK3uFAS7j8vzGn78f+eua2LAbXEPAvfDN60YGQi5GJshgf5KhPWysm+7z94oOdl8Kl7WHx6RvP3d01o5+JluqyoV1GbDwuqac41uGbUX1drEViiZGmSntLgwamSwM0nZWexpVvx5poqUam5mw5f+/i40QXc30jTRV3a6NdIQ+VFHjKfJ6Luf6OaX6LDoTcjnu3ZkBLDixmjV4t7yjko5+r9YHZg9QEfHkH0gwUedx5fdwndW0zbuvpU/eeyeIUWiqKxxcMvZ+YOXT9MVmMLyOZhaVBEfHSsm8bJ/QKWTraf+3RFpswfjIsZkLXNnwepxnHrKwWnQyLacYBAeC9gwQ0ANTWykSXiHRUBcxahWWV1ViEED5O/m52RMRhs3KKy1GSFQAaMLNXOyK68DixZvaZkVtSPv/P84fmDrbS0/BtY3XuYd2l5CNTcwrKKjQEikQ0qL0tEtAgU2n5xS5f78otKWfujRWUVT5Nz+tgbXjs7lN9deUp3m2vP01deSI04MvhM3xcb8elqSvx/d3srkQnJ+YUyjv2jxSLRcuGdP52SBdWo55vfG+oCfgnPh/27ZGrK0+E1nd/7u0Yaao8+mmq/9qj156kNOe4L
W7u3vPdHM2SNsyyX7g9s7Al5m7P2HVuxq5zLXAiAPioIAENALUZaCgP7eBw9M4T96V/dLQx+vtW9KKBnp/37SDvuABaWntLQ+ZFSGSSvGMBgHdXGzM9Y01VIloTcKvODonZhafuxQxxt+9dfwKa+aphyv60szSQZbzwsRvWwaFHK/M7cWl/XH1ERAUvp1XeiUvvbG9ybdm4ympRSFTSpG5tKoWiSd3aPEjM6NnKYpCbXWB4XL+f/27MKa4vG1dRLfRZdbBmo4ORdvSa6Z+sP37kzhMiUuHzFvbvOMLT0VRbjcdhx2cV7L8e+fPpm1UicdOvUVVRYdOk3h2sjS48Svx834VqkdjFXO/B6imv99x/PXLsllNNP6OsjfZqtWxoF3lHIRMsFq34pGtqbhHzgWwWWiqKj36aqj19fXMNKF9XopM1pq0r/P1z0882S5eOBwB4vyABDfBx8XYyV+CypZsmWqreTuam2mpMuTFvJ3OmEvTuGX3VlBSCIuL3h0Z6O5kzVSkBPja6agLmReTzbHn
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import Image\n",
"\n",
"Image(data['actions']['screenshots'][0])"
]
},
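{
"cell_type": "markdown",
"metadata": {},
"source": [
"The screenshot lives at a URL on Firecrawl's storage. To keep a local copy, you can download it; this sketch assumes the `requests` package is installed:\n",
"\n",
"```python\n",
"import requests\n",
"\n",
"# Save the screenshot returned by the screenshot action to disk\n",
"response = requests.get(data[\"actions\"][\"screenshots\"][0], timeout=30)\n",
"response.raise_for_status()\n",
"\n",
"with open(\"london_search.png\", \"wb\") as f:\n",
"    f.write(response.content)\n",
"```"
]
},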
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The image shows the stage where the scraper just typed the search query. \n",
"\n",
"Now, we have to convert this whole process into a function that works for any given city:"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"from typing import Optional, Dict, Any\n",
"\n",
"\n",
"class WeatherData(BaseModel):\n",
" location: str = Field(description=\"The name of the city\")\n",
" temperature: str = Field(\n",
" description=\"The current temperature in degrees Fahrenheit\"\n",
" )\n",
" temperature_high: str = Field(\n",
" description=\"The high temperature for the day in degrees Fahrenheit\"\n",
" )\n",
" temperature_low: str = Field(\n",
" description=\"The low temperature for the day in degrees Fahrenheit\"\n",
" )\n",
" humidity: str = Field(description=\"The current humidity as a percentage\")\n",
" pressure: str = Field(description=\"The current air pressure in inches of mercury\")\n",
" visibility: str = Field(description=\"The current visibility in miles\")\n",
" wind_speed: str = Field(description=\"The current wind speed in miles per hour\")\n",
" dew_point: str = Field(description=\"The current dew point in degrees Fahrenheit\")\n",
" uv_index: str = Field(description=\"The current UV index\")\n",
" moon_phase: str = Field(description=\"The current moon phase\")\n",
"\n",
"\n",
"def scrape_weather_data(app: FirecrawlApp, city: str) -> Optional[WeatherData]:\n",
" try:\n",
" # Define the actions to search for the city\n",
" actions = [\n",
" {\"type\": \"wait\", \"milliseconds\": 3000},\n",
" {\"type\": \"click\", \"selector\": 'input[id=\"LocationSearch_input\"]'},\n",
" {\"type\": \"write\", \"text\": city},\n",
" {\"type\": \"wait\", \"milliseconds\": 1000},\n",
" {\"type\": \"click\", \"selector\": \"button[data-testid='ctaButton']\"},\n",
" {\"type\": \"wait\", \"milliseconds\": 3000},\n",
" ]\n",
"\n",
" # Perform the scraping\n",
" data = app.scrape_url(\n",
" \"https://weather.com\",\n",
" params={\n",
" \"formats\": [\"extract\"],\n",
" \"actions\": actions,\n",
" \"extract\": {\n",
" \"schema\": WeatherData.model_json_schema(),\n",
" \"prompt\": \"Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase\",\n",
" },\n",
" },\n",
" )\n",
"\n",
" # Return the extracted weather data\n",
" return WeatherData(**data[\"extract\"])\n",
"\n",
" except Exception as e:\n",
" print(f\"Error scraping weather data for {city}: {str(e)}\")\n",
" return None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code is the same but it is wrapped inside a function. Let's test it on various cities:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"cities = [\"Tashkent\", \"New York\", \"Tokyo\", \"Paris\", \"Istanbul\"]\n",
"data_full = []\n",
"\n",
"for city in cities:\n",
" weather_data = scrape_weather_data(app, city)\n",
" data_full.append(weather_data)"
]
},
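{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each call replays a full browser session, so scraping cities one at a time is slow. If your plan's rate limits allow concurrent requests, a thread pool is a straightforward speedup. This is a sketch, not part of the original workflow; tune `max_workers` to your limits:\n",
"\n",
"```python\n",
"from concurrent.futures import ThreadPoolExecutor\n",
"\n",
"# Scrape several cities in parallel\n",
"with ThreadPoolExecutor(max_workers=3) as executor:\n",
"    results = list(executor.map(lambda c: scrape_weather_data(app, c), cities))\n",
"\n",
"data_full = [r for r in results if r is not None]\n",
"```"
]
},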
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert the data for all cities into a DataFrame now:"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>location</th>\n",
" <th>temperature</th>\n",
" <th>temperature_high</th>\n",
" <th>temperature_low</th>\n",
" <th>humidity</th>\n",
" <th>pressure</th>\n",
" <th>visibility</th>\n",
" <th>wind_speed</th>\n",
" <th>dew_point</th>\n",
" <th>uv_index</th>\n",
" <th>moon_phase</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Tashkent, Uzbekistan</td>\n",
" <td>48</td>\n",
" <td>54</td>\n",
" <td>41</td>\n",
" <td>81</td>\n",
" <td>30.30</td>\n",
" <td>2.5</td>\n",
" <td>2</td>\n",
" <td>43</td>\n",
" <td>0</td>\n",
" <td>Waning Gibbous</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>New York City, NY</td>\n",
" <td>48°</td>\n",
" <td>49°</td>\n",
" <td>39°</td>\n",
" <td>93%</td>\n",
" <td>29.45 in</td>\n",
" <td>4 mi</td>\n",
" <td>10 mph</td>\n",
" <td>46°</td>\n",
" <td>0 of 11</td>\n",
" <td>Waning Gibbous</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Tokyo, Tokyo Prefecture, Japan</td>\n",
" <td>47°</td>\n",
" <td>61°</td>\n",
" <td>48°</td>\n",
" <td>95%</td>\n",
" <td>29.94 in</td>\n",
" <td>10 mi</td>\n",
" <td>1 mph</td>\n",
" <td>45°</td>\n",
" <td>0 of 11</td>\n",
" <td>Waning Gibbous</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Paris, France</td>\n",
" <td>34°</td>\n",
" <td>36°</td>\n",
" <td>30°</td>\n",
" <td>93%</td>\n",
" <td>29.42 in</td>\n",
" <td>2.4 mi</td>\n",
" <td>11 mph</td>\n",
" <td>33°</td>\n",
" <td>0 of 11</td>\n",
" <td>Waning Gibbous</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Istanbul, Türkiye</td>\n",
" <td>47°</td>\n",
" <td>67°</td>\n",
" <td>44°</td>\n",
" <td>79%</td>\n",
" <td>29.98 in</td>\n",
" <td>8 mi</td>\n",
" <td>4 mph</td>\n",
" <td>41°</td>\n",
" <td>0 of 11</td>\n",
" <td>Waning Gibbous</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" location temperature temperature_high \\\n",
"0 Tashkent, Uzbekistan 48 54 \n",
"1 New York City, NY 48° 49° \n",
"2 Tokyo, Tokyo Prefecture, Japan 47° 61° \n",
"3 Paris, France 34° 36° \n",
"4 Istanbul, Türkiye 47° 67° \n",
"\n",
" temperature_low humidity pressure visibility wind_speed dew_point uv_index \\\n",
"0 41 81 30.30 2.5 2 43 0 \n",
"1 39° 93% 29.45 in 4 mi 10 mph 46° 0 of 11 \n",
"2 48° 95% 29.94 in 10 mi 1 mph 45° 0 of 11 \n",
"3 30° 93% 29.42 in 2.4 mi 11 mph 33° 0 of 11 \n",
"4 44° 79% 29.98 in 8 mi 4 mph 41° 0 of 11 \n",
"\n",
" moon_phase \n",
"0 Waning Gibbous \n",
"1 Waning Gibbous \n",
"2 Waning Gibbous \n",
"3 Waning Gibbous \n",
"4 Waning Gibbous "
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Convert list of WeatherData objects into dictionaries\n",
"data_dicts = [city.model_dump() for city in data_full]\n",
"\n",
"# Convert list of dictionaries into DataFrame\n",
"df = pd.DataFrame(data_dicts)\n",
"\n",
"df.head()"
]
},
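{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that LLM extraction isn't perfectly consistent: the Tashkent row came back without unit suffixes (`48` instead of `48°`). Since every field is a string, a small normalization pass helps before any numeric analysis. Here is a sketch that pulls the numeric part out of the temperature column:\n",
"\n",
"```python\n",
"# Strip units and symbols so temperature can be treated as a number\n",
"df[\"temperature_f\"] = df[\"temperature\"].str.extract(r\"(-?\\\\d+\\\\.?\\\\d*)\")[0].astype(float)\n",
"\n",
"df[[\"location\", \"temperature\", \"temperature_f\"]]\n",
"```"
]
},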
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have successfully scraped weather data from multiple cities using and organized it into a structured DataFrame. This demonstrates how we can efficiently collect and analyze data generated by dynamic websites for further analysis and monitoring."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this comprehensive guide, we've explored Firecrawl's `/scrape` endpoint and its powerful capabilities for modern web scraping. We covered:\n",
"\n",
"- Basic scraping setup and configuration options\n",
"- Multiple output formats including HTML, markdown, and screenshots\n",
"- Structured data extraction using both natural language prompts and Pydantic schemas\n",
"- Batch operations for processing multiple URLs efficiently\n",
"- Advanced techniques for scraping JavaScript-heavy dynamic websites\n",
"\n",
"Through practical examples like extracting news articles from the NY Times and weather data from weather.com, we've demonstrated how Firecrawl simplifies complex scraping tasks while providing flexible output formats suitable for data engineering and AI/ML pipelines.\n",
"\n",
"The combination of LLM-powered extraction, structured schemas, and browser automation capabilities makes Firecrawl a versatile tool for gathering high-quality web data at scale, whether you're building training datasets, monitoring websites, or conducting research."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
}
},
"nbformat": 4,
"nbformat_minor": 2
}