# GPT-4.1 Web Crawler

A smart web crawler powered by GPT-4.1 that intelligently searches websites to find specific information based on user objectives.

## Features

- Intelligently maps website content using semantic search
- Ranks website pages by relevance to your objective
- Extracts structured information using GPT-4.1
- Returns results in clean JSON format

## Prerequisites

- Python 3.8+
- Firecrawl API key
- OpenAI API key (with access to GPT-4.1 models)
## Installation

1. Clone this repository:

```
git clone https://github.com/yourusername/gpt-4.1-web-crawler.git
cd gpt-4.1-web-crawler
```

2. Install the required dependencies:

```
pip install -r requirements.txt
```
3. Set up environment variables:

```
cp .env.example .env
```
Then edit the `.env` file and add your API keys.
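For reference, the edited `.env` file typically ends up looking something like the snippet below; the exact variable names should match whatever `.env.example` defines (the names shown here are assumptions):

```
FIRECRAWL_API_KEY=fc-your-firecrawl-key
OPENAI_API_KEY=sk-your-openai-key
```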
## Usage
Run the script:

```
python gpt-4.1-web-crawler.py
```

The program will prompt you for:

1. The website URL to crawl
2. Your specific objective (what information you want to find)

Example:

```
Enter the website to crawl: https://example.com
Enter your objective: Find the company's leadership team with their roles and short bios
```

The crawler will then:

1. Map the website
2. Identify the most relevant pages
3. Scrape and analyze those pages
4. Return structured information if the objective is met (an illustrative output shape is shown below).
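As a purely hypothetical illustration (the fields depend entirely on your objective, so this is not a fixed schema), the leadership-team objective above might yield output shaped like:

```
{
  "leadership_team": [
    {
      "name": "...",
      "role": "...",
      "bio": "..."
    }
  ]
}
```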
## How It Works
1. **Mapping**: The crawler uses Firecrawl to map the website structure and find relevant pages based on search terms derived from your objective.

2. **Ranking**: GPT-4.1 analyzes the URLs to determine which pages are most likely to contain the information you're looking for.

3. **Extraction**: The top pages are scraped and analyzed to extract the specific information requested in your objective.

4. **Results**: If found, the information is returned in a clean, structured JSON format (see the sketch below).
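To make the four steps concrete, here is a minimal sketch of the flow in Python. It is not the repository's exact code: the firecrawl-py calls (`FirecrawlApp`, `map_url`, `scrape_url`) and their parameter and return shapes are assumptions that vary between SDK versions, and the prompts, model name, environment variable names, and helper function are illustrative only.

```
"""Illustrative sketch of the map -> rank -> extract flow (not the repo's exact code)."""
import json
import os

from firecrawl import FirecrawlApp  # pip install firecrawl-py
from openai import OpenAI           # pip install openai

# Environment variable names are assumed to match .env.example.
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def crawl_for_objective(url: str, objective: str, top_n: int = 3):
    # 1. Mapping: ask Firecrawl for site URLs related to the objective.
    #    The 'search' parameter and the result shape are assumptions that
    #    depend on the installed firecrawl-py version.
    site_map = app.map_url(url, params={"search": objective})
    links = site_map["links"] if isinstance(site_map, dict) else site_map.links

    # 2. Ranking: ask GPT-4.1 to pick the most promising URLs as JSON.
    ranking = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"Objective: {objective}\n"
                "Candidate URLs:\n" + "\n".join(map(str, links[:100])) + "\n"
                f'Return JSON of the form {{"urls": [...]}} with the {top_n} most relevant URLs.'
            ),
        }],
    )
    top_urls = json.loads(ranking.choices[0].message.content)["urls"][:top_n]

    # 3. Extraction: scrape each top-ranked page and try to pull out the
    #    information the objective asks for.
    for page_url in top_urls:
        page = app.scrape_url(page_url, params={"formats": ["markdown"]})
        markdown = page["markdown"] if isinstance(page, dict) else page.markdown
        answer = client.chat.completions.create(
            model="gpt-4.1",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": (
                    f"Objective: {objective}\n\nPage content:\n{markdown}\n\n"
                    'If this page satisfies the objective, return the extracted data as JSON; '
                    'otherwise return {"found": false}.'
                ),
            }],
        )
        result = json.loads(answer.choices[0].message.content)

        # 4. Results: stop at the first page that satisfies the objective.
        if result.get("found") is not False:
            return result

    return None  # objective not met on the top-ranked pages
```

A real run follows the same map, rank, and extract flow driven by the interactive prompts shown above, and would also want error handling and rate limiting around the API calls, which this sketch omits.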
## License

[MIT License](LICENSE)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.