firecrawl

mirror of https://github.com/mendableai/firecrawl.git synced 2025-11-13 08:47:43 +00:00

Author	SHA1	Message	Date
Gergő Móricz	5e760aacbb	fixes	2025-06-20 11:28:22 +02:00
Gergő Móricz	3cf22f9167	feat(scrape): support Google Docs	2025-06-20 11:12:05 +02:00
Gergő Móricz	f8983fffb7	Concurrency limit refactor + `maxConcurrency` parameter (FIR-2191) (#1643)	2025-06-20 10:45:36 +02:00
Gergő Móricz	a8e3c29664	feat(scrape, extract): creditsUsed, tokensUsed fields (FIR-2336) (#1683) * fix(scrape): log FIRE-1 credits billed on failures properly * fix dumb thinbgs * feat(scrape, extract): creditsUsed fields * fix(extract): call it tokensUsed * Trigger Build * dumb mistake, search does separate billing	2025-06-18 21:49:20 +02:00
Gergő Móricz	fbd81b4168	fix(scrape): log FIRE-1 credits billed on failures properly (FIR-2331) (#1682) * fix(scrape): log FIRE-1 credits billed on failures properly * fix dumb thinbgs	2025-06-18 21:47:58 +02:00
Gergő Móricz	ebc1de9d60	feat(crawl-status): refactor to work after a redis flush (#1664)	2025-06-18 18:58:04 +02:00
Thomas Kosmas	199115c7be	stop testing new mu	2025-06-18 00:48:50 +03:00
Thomas Kosmas	f46f845efc	fix: send the request to new mu version before the main one to achieve better sync	2025-06-17 20:37:45 +03:00
Thomas Kosmas	ee7b29b3f6	feat: Test mu v3 (#1678) * Test mu v3 * fix env	2025-06-17 20:13:19 +03:00
Gergő Móricz	5ca8e2e98e	feat(index): store short titles and descriptions (#1677)	2025-06-17 19:09:07 +02:00
devin-ai-integration[bot]	9710bdffc0	Improve URL filtering error messages with specific denial reasons (FIR-2352) (#1676) * Improve URL filtering error messages with specific denial reasons - Add FilterResult and FilterLinksResult interfaces for structured error reporting - Define DenialReason enum with specific, human-readable error messages - Update filterURL method to return structured results with denial reasons - Update filterLinks method to collect and return denial reasons for each URL - Modify error handling in queue-worker.ts to use specific denial reasons - Add comprehensive tests for different URL filtering scenarios - Maintain backward compatibility while improving error specificity Fixes: Misleading 'includePaths/excludePaths rules' error now shows actual denial reason (robots.txt, exclude patterns, depth limits, etc.) Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev> * Fix test compilation error for FilterLinksResult interface - Update crawler.test.ts to use filteredLinks.links.length instead of filteredLinks.length - Update test expectations to use filteredLinks.links array - Resolves TypeScript compilation error preventing CI from passing Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: mogery@sideguide.dev <mogery@sideguide.dev>	2025-06-17 19:00:29 +02:00
Nicolas	c6482eaf2d	Nick: prevent additional logging on /extract scrapes	2025-06-13 18:17:17 -03:00
Gergő Móricz	ea321b4936	fix search test timeouts	2025-06-13 17:42:55 +02:00
Thomas Kosmas	38c5795282	feat(vertex): fix vertex ai provider bug and update model references to use "gemini-2.5-pro" (#1668)	2025-06-13 18:29:03 +03:00
Gergő Móricz	0bf23071ff	feat(index): add domain splitting for improved map querying (#1666)	2025-06-13 15:22:45 +02:00
Gergő Móricz	07224b8cd4	feat: use index in search and extract (#1660)	2025-06-13 12:30:28 +02:00
Gergő Móricz	f296342731	feat(index): remove unused columns (#1662)	2025-06-12 16:51:40 +02:00
Gergő Móricz	89e42b1137	fix(api): remove query parameter sanitization that was breaking extracts (#1661)	2025-06-12 15:37:45 +02:00
Gergő Móricz	3c03d07051	feat: add credits_billed everywhere (FIR-2286) (#1655) * feat: add credits_billed everywhere also a bit of logging improvement for logJob * fix(queue-worker): db auth check before doing rpc for crawl/batch_scrape	2025-06-11 23:06:55 +02:00
Nicolas	bf3b2a359a	Improve concurrency limit email notifications (#1658) * Update email_notification.ts * Update email_notification.ts * Update email_notification.ts	2025-06-11 17:14:54 -03:00
Pulkit Saini	255be2a2ff	Fix PLAYWRIGHT_MICROSERVICE_URL env var to use /scrape endpoint (#1654) The correct environment variable should be PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape instead of PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html	2025-06-11 16:53:32 +02:00
Gergő Móricz	19dd086eb3	improve auto recharge logging	2025-06-11 16:26:06 +02:00
Nicolas	9964d11c20	Update v1.ts	2025-06-09 17:13:56 -03:00
Gergő Móricz	4659155b76	remove logs	2025-06-06 00:25:28 +02:00
Gergő Móricz	3b6be76d3e	debug(index): time insights	2025-06-06 00:03:34 +02:00
Gergő Móricz	8ef3e8484a	feat(gcs-jobs): ditch exists check to cut lookup time in half (#1641)	2025-06-05 23:43:30 +02:00
Gergő Móricz	6d1b9bf1fe	debug(api/scrape): more logging	2025-06-05 22:52:41 +02:00
Gergő Móricz	0c7f864ea4	debug(api/scrape): increased logging to diagnose scrape fluke length	2025-06-05 22:51:25 +02:00
Gergő Móricz	1de0ae392c	Index testing improvements (FIR-2214) (#1637) * feat(api/tests/scrape): index improvements * fix(api/test/scrape): add waits to allow batch insert to happen * fix: ...	2025-06-05 22:10:06 +02:00
Gergő Móricz	78580f65df	feat(webhook): refactor callWebhook and add logWebhook (FIR-2218) (#1629) * feat(webhook): refactor callWebhook and add logWebhook * feat(queue-worker): fix crawl pre-finishing logic (#1628) * feat(ci): verify typescript errors * fix(ci): * feat(api/tests): add webhook tests + refactor batch scrape lib (#1630) * feat(api/tests): add webhook tests + refactor batch scrape lib * fix(ci): * feat(webhook/log): insert queue	2025-06-05 22:04:22 +02:00
Gergő Móricz	f050b169e2	feat(api/index): port queryIndexAtSplitLevel to RPC (FIR-2241) (#1640) * feat(api/index): port queryIndexAtSplitLevel to RPC * Update apps/api/src/services/index.ts	2025-06-05 22:02:41 +02:00
Gergő Móricz	a08d52e45d	feat(scrapeURL/index): don't put results by "dumb" engines into the index	2025-06-05 22:01:29 +02:00
Thomas Kosmas	af88218fad	feat: update mu (#1639) * update to mu v2 * feat(ci): add RUNPOD_MUV2_POD_ID * stupid change to make CI run --------- Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>	2025-06-05 22:27:00 +03:00
Nicolas	6ca551a887	Merge branch 'main' of https://github.com/mendableai/firecrawl	2025-06-05 15:39:50 -03:00
Nicolas	8c40271796	Update map.ts	2025-06-05 15:39:48 -03:00
Thomas Kosmas	4bf64d2c01	feat(scraper): runpod v2 parallel testing (#1636) * feat(scraper): runpod v2 parallel testing * fix catch	2025-06-05 20:28:01 +03:00
Ademílson Tonato	6e1f8d6c10	feat(api): propagate integration field in queue worker job processing	2025-06-05 16:39:20 +01:00
Ademílson Tonato	71caf8ae57	Merge pull request #1632 from mendableai/feat/api-integration-parameter feat(api): add integration field to jobs and update related controllers and types	2025-06-05 11:44:02 +01:00
Gergő Móricz	34a18a2d2f	feat(changeTracking): support tags (FIR-1940) (#1631) * feat(changeTracking): support tags * test 408 fixes	2025-06-04 23:50:54 +02:00
Gergő Móricz	6e63528b61	fix(crawl-redis): bad logic	2025-06-04 20:46:04 +02:00
Ademílson Tonato	4c49bb9fc6	refactor: remove unnecessary logs and set integration as default to null	2025-06-04 19:29:47 +01:00
Ademílson Tonato	57a0aed484	feat(api): add integration field to jobs and update related controllers and types	2025-06-04 19:17:57 +01:00
Gergő Móricz	a05c4ae97d	feat(api): GET /crawl/ongoing (FIR-2189) (#1620) * feat(api): GET /crawl/ongoing * fix: routers in wrong order * feat(api/crawl/ongoing): return more details --------- Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2025-06-04 18:14:23 +02:00
Gergő Móricz	077c5dd8ec	feat(api/tests): add webhook tests + refactor batch scrape lib (#1630) * feat(api/tests): add webhook tests + refactor batch scrape lib * fix(ci):	2025-06-04 16:11:47 +02:00
Gergő Móricz	0f394a10c6	feat(queue-worker): fix crawl pre-finishing logic (#1628)	2025-06-04 15:19:48 +02:00
Gergő Móricz	8dd5bf7bd9	feat(api/tests/scrape): Playwright test improvements (#1626) * feat(api/tests/scrape): verify that proxy works on Playwright * debug: logs * remove logs * feat(playwright): add contentType relaying * fix tests * debug * fix json	2025-06-04 01:24:19 +02:00
Gergő Móricz	95f204aab7	Index (FIR-2177) (#1605) * poc progress * poc * url splits and better url normalization * feat(index): integrate into map * fix on selfhost * feat: modifiers * separate index supa logic * debug * fix language comparison * feat: dontStoreInCache * feat(index): some rudimentary testing * feat: use url split columns * feat(queue-worker/kickoff): use index links to kickoff crawl * feat(scrapeURL/index): behaviour on non-200 index entries * feat/added benchmark for scrapes * feat(map): ignoreIndex * feat(index): batch insert * fix(api/tests/scrape): fix index test to work with batching * disable cacheable lookup for self hosting tests * feat(js-sdk): dontStoreInCache * chore(js-sdk): bump * feat(index): FIRECRAWL_INDEX_WRITE_ONLY * feat(api/test): index envs * map benchmarks * cleanup * further fixes * clean up on map * remove extraneous log * workflow test run * asd * improve fns * try again * wow i'm an idiot * ok fixed * wth * revert * async saving to index * feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624) * feat(selfhost): deploy a playwright image (#1625) * Testing improvements (FIR-2209) (#1623) * yeet ad blocking tests until further notice * feat: re-enable billing tests * more timeout * cache issues with billing test * weird thing * fix(api/tests/scrape/status): propagation time * stupid * no log * sws --------- Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com> Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>	2025-06-03 21:30:19 +02:00
Gergő Móricz	406d696667	Testing improvements (FIR-2209) (#1623) * yeet ad blocking tests until further notice * feat: re-enable billing tests * more timeout * cache issues with billing test * weird thing * fix(api/tests/scrape/status): propagation time * stupid * no log * sws	2025-06-03 21:16:36 +02:00
Ademílson Tonato	41897139da	feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)	2025-06-03 18:16:46 +02:00
Nicolas	e108ff3525	Update search.ts	2025-06-02 23:46:55 -03:00

2263 Commits