crawl4ai/docs/examples/llm_extraction_openai_pricing.py

import asyncio
import json
import os

from pydantic import BaseModel, Field

from crawl4ai import AsyncWebCrawler, LLMConfig, LLMExtractionStrategy

url = "https://openai.com/api/pricing/"


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output tokens for the OpenAI model."
    )

async def main():
    # Open a crawler session; the context manager handles startup and cleanup.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                # Alternative: llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")),
                llm_config=LLMConfig(provider="groq/llama-3.1-70b-versatile", api_token=os.getenv("GROQ_API_KEY")),
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="From the crawled content, extract all mentioned model names along with their "
                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
                "One extracted model JSON format should look like this: "
                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
            ),
        )
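        print("Success:", result.success)
        if not result.success or not result.extracted_content:
            # Nothing to parse if the crawl or the LLM extraction failed.
            print("Extraction failed:", result.error_message)
            return

        model_fees = json.loads(result.extracted_content)
        print(len(model_fees))

        # Create the output directory if it does not exist yet.
        os.makedirs(".data", exist_ok=True)
        with open(".data/data.json", "w", encoding="utf-8") as f:
            f.write(result.extracted_content)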


if __name__ == "__main__":
    asyncio.run(main())
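
The script above trusts that the LLM returned records matching the `OpenAIModelFee` schema. As a minimal sketch of a post-extraction check, the helper below splits the output into complete and incomplete records using only the standard library. The name `parse_model_fees` is hypothetical and not part of crawl4ai, and the code assumes `extracted_content` is a JSON array of objects (or empty when nothing was extracted), which is what the `json.loads` and `len` calls in the example suggest.

```python
import json

# Field names mirror the OpenAIModelFee schema from the example above.
REQUIRED_FIELDS = ("model_name", "input_fee", "output_fee")


def parse_model_fees(extracted_content):
    """Hypothetical helper: split LLM output into complete and incomplete records.

    Assumes extracted_content is a JSON string holding an array of objects,
    or None/empty when extraction produced nothing.
    """
    if not extracted_content:
        return [], []

    items = json.loads(extracted_content)
    if not isinstance(items, list):
        # Tolerate a single object instead of an array.
        items = [items]

    valid, rejected = [], []
    for item in items:
        is_complete = isinstance(item, dict) and all(
            isinstance(item.get(field), str) and item[field].strip()
            for field in REQUIRED_FIELDS
        )
        if is_complete:
            # Keep only the schema fields; drop any extra keys the LLM added.
            valid.append({field: item[field] for field in REQUIRED_FIELDS})
        else:
            rejected.append(item)
    return valid, rejected


if __name__ == "__main__":
    sample = json.dumps(
        [
            {
                "model_name": "GPT-4",
                "input_fee": "US$10.00 / 1M tokens",
                "output_fee": "US$30.00 / 1M tokens",
            },
            {"model_name": "incomplete-row"},
        ]
    )
    valid, rejected = parse_model_fees(sample)
    print(len(valid), "valid,", len(rejected), "rejected")
```

In the script, this could be called as `parse_model_fees(result.extracted_content)` before writing the file, so that incomplete rows are logged or dropped rather than silently saved.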