Mirror of https://github.com/mendableai/firecrawl.git, synced 2025-08-04 06:49:31 +00:00

scrapeURL
New URL scraper for Firecrawl
Signal flow
flowchart TD;
scrapeURL-.->buildFallbackList;
buildFallbackList-.->scrapeURLWithEngine;
scrapeURLWithEngine-.->parseMarkdown;
parseMarkdown-.->wasScrapeSuccessful{{Was scrape successful?}};
wasScrapeSuccessful-."No".->areEnginesLeft{{Are there engines left to try?}};
areEnginesLeft-."Yes, try next engine".->scrapeURLWithEngine;
areEnginesLeft-."No".->NoEnginesLeftError[/NoEnginesLeftError/]
wasScrapeSuccessful-."Yes".->asd;
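The fallback loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the actual Firecrawl implementation: the `Engine` shape, the `quality` ranking, the tag-stripping `parseMarkdown` stand-in, and the "non-empty markdown means success" check are all assumptions made for the example. Only the names `scrapeURL`, `buildFallbackList`, `scrapeURLWithEngine`, `parseMarkdown`, and `NoEnginesLeftError` come from the flowchart.

```typescript
// Hypothetical engine shape, for illustration only.
interface Engine {
  name: string;
  quality: number; // assumed ranking; higher is tried first
  scrape: (url: string) => Promise<string>; // resolves to raw HTML, rejects on failure
}

interface ScrapeResult {
  engine: string;
  markdown: string;
}

class NoEnginesLeftError extends Error {
  tried: string[];
  constructor(tried: string[]) {
    super("All scraping engines failed: " + tried.join(", "));
    this.name = "NoEnginesLeftError";
    this.tried = tried;
  }
}

// Order the engines from most to least preferred (assumed criterion: quality).
function buildFallbackList(engines: Engine[]): Engine[] {
  return engines.slice().sort((a, b) => b.quality - a.quality);
}

// Stand-in for a real HTML-to-markdown converter: strips tags and trims.
function parseMarkdown(html: string): string {
  return html.replace(/<[^>]+>/g, "").trim();
}

// Run one engine and convert its output to markdown.
async function scrapeURLWithEngine(url: string, engine: Engine): Promise<string> {
  const html = await engine.scrape(url);
  return parseMarkdown(html);
}

// Try each engine in order until one yields a successful scrape.
async function scrapeURL(url: string, engines: Engine[]): Promise<ScrapeResult> {
  const tried: string[] = [];
  for (const engine of buildFallbackList(engines)) {
    tried.push(engine.name);
    try {
      const markdown = await scrapeURLWithEngine(url, engine);
      // Assumed success criterion: the engine produced non-empty markdown.
      if (markdown.length > 0) {
        return { engine: engine.name, markdown };
      }
    } catch {
      // This engine failed; fall through and try the next one.
    }
  }
  // No engines left to try.
  throw new NoEnginesLeftError(tried);
}
```

Callers would pass in whatever engines are available; an engine that throws or returns an empty document is treated as a failed attempt, and `NoEnginesLeftError` is raised only after every engine has been tried.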
Differences from WebScraperDataProvider
- The job of WebScraperDataProvider.validateInitialUrl has been delegated to the zod layer above scrapeUrl.
- WebScraperDataProvider.mode has no equivalent; only scrape_url is supported.
- You may no longer specify multiple URLs.
- Built on v1 definitions, instead of v0.
- PDFs are now converted straight to markdown using LlamaParse, instead of converting to just plaintext.
- DOCXs are now converted straight to HTML (and then later to markdown) using mammoth, instead of converting to just plaintext.
- Uses the new JSON Schema OpenAI API, so schema failures with LLM Extract will be basically non-existent.