mirror of https://github.com/mendableai/firecrawl.git synced 2025-10-24 14:30:49 +00:00

Nicolas f3aa32863f Revert "Merge branch 'nsc/crawl-n--1-fixes'"

This reverts commit 6d325b7ce7af912b326369eace62f89f897b536b, reversing
changes made to 3d5704b73e6c4802f0344dc2e17042af9b6de0f5.

2024-12-03 20:53:14 -03:00

engines

Revert "Merge branch 'nsc/crawl-n--1-fixes'"

2024-12-03 20:53:14 -03:00

lib

fix(scrapeURL/pdf): retry

2024-11-12 22:26:36 +01:00

transformers

Nick: fixed /extract without a schema

2024-12-03 12:08:15 -03:00

error.ts

fix(scrapeURL/fire-engine): fast fail on chrome error

2024-11-28 18:41:48 +01:00

index.ts

fix(scrapeURL/fire-engine): fast fail on chrome error

2024-11-28 18:41:48 +01:00

README.md

WebScraper refactor into scrapeURL (#714)

2024-11-07 20:57:33 +01:00

scrapeURL.test.ts

fix(scrapeURL, logger): remove buggy ArrayTransport that causes memory leak

2024-11-11 10:27:55 +01:00

README.md

`scrapeURL`

New URL scraper for Firecrawl

Signal flow

flowchart TD;
    scrapeURL-.->buildFallbackList;
    buildFallbackList-.->scrapeURLWithEngine;
    scrapeURLWithEngine-.->parseMarkdown;
    parseMarkdown-.->wasScrapeSuccessful{{Was scrape successful?}};
    wasScrapeSuccessful-."No".->areEnginesLeft{{Are there engines left to try?}};
    areEnginesLeft-."Yes, try next engine".->scrapeURLWithEngine;
    areEnginesLeft-."No".->NoEnginesLeftError[/NoEnginesLeftError/]
    wasScrapeSuccessful-."Yes".->asd;
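The loop in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the actual Firecrawl implementation: the `EngineSpec` shape, the tag-stripping `parseMarkdown` stand-in, the "non-empty markdown" success check, and the `triedEngines` field are all assumptions made for the example. It is also kept synchronous for brevity, whereas real engines would be asynchronous.

```typescript
// Hypothetical engine description; the real engine definitions may differ.
interface EngineSpec {
  name: string;
  run: (url: string) => string; // returns raw HTML, throws on failure
}

interface ScrapeResult {
  engine: string;
  markdown: string;
}

// Thrown when every engine in the fallback list has been tried without success.
class NoEnginesLeftError extends Error {
  triedEngines: string[]; // assumed field, for illustration only
  constructor(triedEngines: string[]) {
    super("All scraping engines failed: " + triedEngines.join(", "));
    this.name = "NoEnginesLeftError";
    this.triedEngines = triedEngines;
  }
}

// Stand-in for buildFallbackList: here it simply keeps the given order.
function buildFallbackList(engines: EngineSpec[]): EngineSpec[] {
  return [...engines];
}

// Stand-in for parseMarkdown: strips tags instead of doing real conversion.
function parseMarkdown(html: string): string {
  return html.replace(/<[^>]+>/g, "").trim();
}

// Stand-in for scrapeURLWithEngine: run one engine, then convert its output.
function scrapeURLWithEngine(url: string, engine: EngineSpec): string {
  return parseMarkdown(engine.run(url));
}

function scrapeURL(url: string, engines: EngineSpec[]): ScrapeResult {
  const tried: string[] = [];
  for (const engine of buildFallbackList(engines)) {
    tried.push(engine.name);
    try {
      const markdown = scrapeURLWithEngine(url, engine);
      // Assumed success criterion: the engine produced non-empty markdown.
      if (markdown.length > 0) {
        return { engine: engine.name, markdown };
      }
    } catch {
      // This engine failed; fall through and try the next one.
    }
  }
  // No engines left to try.
  throw new NoEnginesLeftError(tried);
}
```

Calling `scrapeURL` with a list whose first engines throw or return empty output falls through to the first engine that yields content; if none does, the caller receives a `NoEnginesLeftError`.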

Differences from `WebScraperDataProvider`

- The job of `WebScraperDataProvider.validateInitialUrl` has been delegated to the zod layer above `scrapeURL`.
- `WebScraperDataProvider.mode` has no equivalent; only `scrape_url` is supported.
- You may no longer specify multiple URLs.
- Built on v1 definitions instead of v0.
- PDFs are now converted straight to markdown using LlamaParse, instead of being converted to plaintext only.
- DOCX files are now converted straight to HTML (and later to markdown) using mammoth, instead of being converted to plaintext only.
- Uses the new JSON Schema OpenAI API, so schema failures with LLM Extract should be practically non-existent.
