mirror of
				https://github.com/mendableai/firecrawl.git
				synced 2025-10-22 05:24:37 +00:00 
			
		
		
		
	Nick: tutorials
		
							parent
							
								
									de7e1f501b
								
							
						
					
					
						commit
						18450b5f9a
					
				
							
								
								
									
										95
									
								
								tutorials/data-extraction-using-llms.mdx
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							| @ -0,0 +1,95 @@ | ||||
| --- | ||||
| title: "Extract website data using LLMs" | ||||
| description: "Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code." | ||||
| 'og:image': "/images/og.png" | ||||
| 'twitter:image': "/images/og.png" | ||||
| --- | ||||
| 
 | ||||
| ## Setup | ||||
| 
 | ||||
| Install the Python dependencies, groq and firecrawl-py. | ||||
| 
 | ||||
| ```bash | ||||
| pip install groq firecrawl-py | ||||
| ``` | ||||
| 
 | ||||
| ## Getting your Groq and Firecrawl API Keys | ||||
| 
 | ||||
| To use Groq and Firecrawl, you will need an API key for each. You can get your Groq API key from [groq.com](https://groq.com) and your Firecrawl API key from [firecrawl.dev](https://firecrawl.dev). | ||||
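
Rather than pasting keys into the source, you may prefer to read them from environment variables. The helper below is a minimal sketch of that idea. The function name `load_api_key` and the variable names `FIRECRAWL_API_KEY` and `GROQ_API_KEY` are assumptions chosen for illustration, not names required by either library.

```python
import os


def load_api_key(name: str) -> str:
    """Return the API key stored in the environment variable `name`.

    Raises RuntimeError if the variable is unset or blank, so a missing key
    fails early instead of surfacing later as an authentication error.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the tutorial.")
    return value


# Hypothetical variable names; use whatever names fit your setup.
# firecrawl_key = load_api_key("FIRECRAWL_API_KEY")
# groq_key = load_api_key("GROQ_API_KEY")
```

The returned strings can then be passed as the `api_key` arguments used in the rest of this tutorial.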
| 
 | ||||
| ## Load website with Firecrawl | ||||
| 
 | ||||
| To get all the data from a web page in a clean format, we will use [Firecrawl](https://firecrawl.dev). It handles pages that require JavaScript to render, extracts the main content, and outputs it in an LLM-readable format for increased accuracy. | ||||
| 
 | ||||
| Here is how we will scrape a URL using Firecrawl. We also set `pageOptions` so that only the main content of the page is extracted (`onlyMainContent: True`), excluding the navs, footers, etc. | ||||
| 
 | ||||
| ```python | ||||
| from firecrawl import FirecrawlApp  # Import the Firecrawl client | ||||
| 
 | ||||
| url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/" | ||||
| 
 | ||||
| firecrawl = FirecrawlApp( | ||||
|     api_key="fc-YOUR_FIRECRAWL_API_KEY", | ||||
| ) | ||||
| page_content = firecrawl.scrape_url(url=url,  # Target URL to scrape | ||||
|     params={ | ||||
|         "pageOptions":{ | ||||
|             "onlyMainContent": True # Ignore navs, footers, etc. | ||||
|         } | ||||
|     }) | ||||
| print(page_content) | ||||
| ``` | ||||
| 
 | ||||
| Now we have clean data from the page, ready to be fed to the LLM for data extraction. | ||||
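
Long pages may not fit into the model's context window (the `8192` in the model name used below suggests a limit of 8192 tokens). One simple option is to trim the text before sending it. The helper below is a sketch under stated assumptions: the name `truncate_for_prompt` is hypothetical, and the 16000-character default is a rough guess based on a few characters per token, not a measured limit.

```python
def truncate_for_prompt(text: str, max_chars: int = 16000) -> str:
    """Cut `text` to at most `max_chars` characters.

    When the text has to be cut, prefer ending on the last paragraph break
    inside the limit so the prompt does not stop mid-sentence.
    """
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    boundary = cut.rfind("\n\n")  # Last blank line within the limit, or -1
    if boundary > 0:
        cut = cut[:boundary]
    return cut
```

For example, `truncate_for_prompt(str(page_content))` could be passed to the model in place of the raw `page_content`. The `str(...)` call is there because the scrape result may not be a plain string.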
| 
 | ||||
| ## Extraction and Generation | ||||
| 
 | ||||
| Now that we have the website data, let's use Groq to pull out the information we need. We'll use a Llama 3 model on Groq in JSON mode and pick out certain fields from the page content. | ||||
| 
 | ||||
| We are using the Llama 3 8B model for this example. Feel free to use bigger models for improved results. | ||||
| 
 | ||||
| ```python | ||||
| import json | ||||
| from groq import Groq | ||||
| 
 | ||||
| client = Groq( | ||||
|     api_key="gsk_YOUR_GROQ_API_KEY",  # Note: Replace 'gsk_YOUR_GROQ_API_KEY' with your actual Groq API key | ||||
| ) | ||||
| 
 | ||||
| # Here we define the fields we want to extract from the page content | ||||
| extract = ["summary","date","companies_building_with_quest","title_of_the_article","people_testimonials"] | ||||
| 
 | ||||
| completion = client.chat.completions.create( | ||||
|     model="llama3-8b-8192", | ||||
|     messages=[ | ||||
|         { | ||||
|             "role": "system", | ||||
|             "content": "You are a legal advisor who extracts information from documents in JSON." | ||||
|         }, | ||||
|         { | ||||
|             "role": "user", | ||||
|             # Here we pass the page content and the fields we want to extract | ||||
|             "content": f"Extract the following information from the provided documentation:\nPage content:\n\n{page_content}\n\nInformation to extract: {extract}" | ||||
|         } | ||||
|     ], | ||||
|     temperature=0, | ||||
|     max_tokens=1024, | ||||
|     top_p=1, | ||||
|     stream=False, | ||||
|     stop=None, | ||||
|     # We set the response format to JSON object | ||||
|     response_format={"type": "json_object"} | ||||
| ) | ||||
| 
 | ||||
| 
 | ||||
| # Parse the JSON string returned by the model, then pretty print it | ||||
| dataExtracted = json.dumps(json.loads(completion.choices[0].message.content), indent=4) | ||||
| 
 | ||||
| print(dataExtracted) | ||||
| ``` | ||||
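
JSON mode asks the model for a JSON object, but it does not by itself guarantee that every requested field is present. A small check like the one below can catch gaps before the data is used. This is a sketch: the helper name `missing_fields` is hypothetical, and it only checks top-level keys.

```python
import json


def missing_fields(raw_json: str, expected: list[str]) -> list[str]:
    """Return the expected field names that are absent from a JSON object string."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level.")
    return [field for field in expected if field not in data]
```

With the variables from the example above, `missing_fields(completion.choices[0].message.content, extract)` returns an empty list when every field in `extract` came back.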
| 
 | ||||
| ## And Voila! | ||||
| 
 | ||||
| You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website. | ||||
| 
 | ||||
| If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev). | ||||
							
								
								
									
										91
									
								
								tutorials/rag-llama3.mdx
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							| @ -0,0 +1,91 @@ | ||||
| --- | ||||
| title: "Build a 'Chat with website' using Groq Llama 3" | ||||
| description: "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot." | ||||
| --- | ||||
| 
 | ||||
| ## Setup | ||||
| 
 | ||||
| Install the Python dependencies: langchain, groq, faiss, ollama, and firecrawl-py. | ||||
| 
 | ||||
| ```bash | ||||
| pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py | ||||
| ``` | ||||
| 
 | ||||
| We will be using Ollama for the embeddings. You can download Ollama [here](https://ollama.com/), but feel free to use any other embeddings you prefer. | ||||
| 
 | ||||
| ## Load website with Firecrawl | ||||
| 
 | ||||
| To get all the data from a website in a clean format, we will use Firecrawl. Firecrawl integrates with LangChain as a document loader. | ||||
| 
 | ||||
| Here is how you can load a website with Firecrawl: | ||||
| 
 | ||||
| ```python | ||||
| from langchain_community.document_loaders import FireCrawlLoader  # Importing the FireCrawlLoader | ||||
| 
 | ||||
| url = "https://firecrawl.dev" | ||||
| loader = FireCrawlLoader( | ||||
|     api_key="fc-YOUR_API_KEY", # Note: Replace 'fc-YOUR_API_KEY' with your actual Firecrawl API key | ||||
|     url=url,  # Target URL to crawl | ||||
|     mode="crawl"  # Mode set to 'crawl' to crawl all accessible subpages | ||||
| ) | ||||
| docs = loader.load() | ||||
| ``` | ||||
| 
 | ||||
| ## Set up the Vectorstore | ||||
| 
 | ||||
| Next, we will set up the vectorstore. The vectorstore is a data structure that allows us to store and query embeddings. We will use the Ollama embeddings and the FAISS vectorstore. | ||||
| We split the documents into chunks of 1000 characters each, with a 200 character overlap. This keeps the chunks from being too small or too big, so that the retrieved chunks fit into the LLM's context when we query it. | ||||
|   | ||||
| ```python | ||||
| from langchain_community.embeddings import OllamaEmbeddings | ||||
| from langchain_text_splitters import RecursiveCharacterTextSplitter | ||||
| from langchain_community.vectorstores import FAISS | ||||
| 
 | ||||
| text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) | ||||
| splits = text_splitter.split_documents(docs) | ||||
| vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings()) | ||||
| ``` | ||||
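
To see what the chunk size and overlap mean in practice, here is a simplified sliding-window splitter in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. The function name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.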
| 
 | ||||
| ## Retrieval | ||||
| 
 | ||||
| Now that our documents are loaded and the vectorstore is set up, we can run a similarity search on the user's question to retrieve the most relevant documents. These documents are then fed to the LLM. | ||||
| 
 | ||||
| 
 | ||||
| ```python | ||||
| question = "What is firecrawl?" | ||||
| docs = vectorstore.similarity_search(query=question) | ||||
| ``` | ||||
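
Conceptually, a similarity search embeds the question and ranks the stored chunks by how close their vectors are to it. The sketch below shows that idea with cosine similarity over plain Python lists. It is an illustration only: the FAISS index may use a different distance metric, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the `k` document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


print(top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
# prints [1, 2]
```

A vectorstore does the same ranking with real embeddings and an index built for speed, so you do not need this code in the tutorial itself.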
| 
 | ||||
| ## Generation | ||||
| Finally, we use Groq to generate a response to the question, based on the documents we retrieved. | ||||
| 
 | ||||
| ```python | ||||
| from groq import Groq | ||||
| 
 | ||||
| client = Groq( | ||||
|     api_key="YOUR_GROQ_API_KEY", | ||||
| ) | ||||
| 
 | ||||
| completion = client.chat.completions.create( | ||||
|     model="llama3-8b-8192", | ||||
|     messages=[ | ||||
|         { | ||||
|             "role": "user", | ||||
|             "content": f"You are a friendly assistant. Your job is to answer the user's question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}" | ||||
|         } | ||||
|     ], | ||||
|     temperature=1, | ||||
|     max_tokens=1024, | ||||
|     top_p=1, | ||||
|     stream=False, | ||||
|     stop=None, | ||||
| ) | ||||
| 
 | ||||
| print(completion.choices[0].message.content) | ||||
| ``` | ||||
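
Interpolating `{docs}` directly into the prompt inserts the Python representation of the document list, metadata included. If you would rather send only the text, you can join each document's `page_content` first. The sketch below assumes the retrieved items expose a `page_content` attribute, as LangChain documents do. `SimpleDoc`, `format_docs`, and `build_prompt` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimpleDoc:
    """Stand-in for a LangChain Document; only `page_content` is used here."""
    page_content: str


def format_docs(docs, separator: str = "\n\n---\n\n") -> str:
    """Join the text of each document, dropping metadata and surrounding whitespace."""
    return separator.join(doc.page_content.strip() for doc in docs)


def build_prompt(docs, question: str) -> str:
    """Assemble a user prompt from the retrieved documents and the question."""
    return (
        "Answer the user's question using only the documentation below.\n"
        f"Docs:\n\n{format_docs(docs)}\n\nQuestion: {question}"
    )
```

The string returned by `build_prompt(docs, question)` could be used as the `content` of the user message in the example above.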
| 
 | ||||
| ## And Voila! | ||||
| 
 | ||||
| You have now built a 'Chat with your website' bot using Groq Llama 3, LangChain, and Firecrawl. You can now use this bot to answer questions based on the documentation of your website. | ||||
| 
 | ||||
| If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev). | ||||
	 Nicolas