firecrawl

mirror of https://github.com/mendableai/firecrawl.git synced 2025-10-21 21:13:25 +00:00

Author	SHA1	Message	Date
Gergő Móricz	b1eaecfdb0	fix 2	2024-11-20 20:19:16 +01:00
Gergő Móricz	e2ddc6c65c	fix handling of badly formatted URLs	2024-11-20 20:18:40 +01:00
Gergő Móricz	79a75e088a	feat(crawl): allowSubdomain	2024-11-19 18:38:59 +01:00
Gergő Móricz	350d00d27a	fix(crawler): treat XML files as sitemaps (temporarily)	2024-11-15 20:09:20 +01:00
Nicolas	7f084c6c43	Nick:	2024-11-14 17:44:32 -05:00
Gergő Móricz	0310cd2afa	fix(crawl): redirect rebase	2024-11-13 21:38:44 +01:00
Gergő Móricz	fbabc779f5	fix(crawler): relative URL handling on non-start pages (#893) * fix(crawler): relative URL handling on non-start pages * fix(crawl): further fixing	2024-11-12 18:20:53 +01:00
Gergő Móricz	8d467c8ca7	`WebScraper` refactor into `scrapeURL` (#714) * feat: use strictNullChecking * feat: switch logger to Winston * feat(scrapeURL): first batch * fix(scrapeURL): error swallow * fix(scrapeURL): add timeout to EngineResultsTracker * fix(scrapeURL): report unexpected error to sentry * chore: remove unused modules * feat(transfomers/coerce): warn when a format's response is missing * feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support * (add note) * feat(scrapeURL): wip readme * feat(scrapeURL): LLM extract * feat(scrapeURL): better warnings * fix(scrapeURL/engines/fire-engine;playwright): fix screenshot * feat(scrapeURL): add forceEngine internal option * feat(scrapeURL/engines): scrapingbee * feat(scrapeURL/transformars): uploadScreenshot * feat(scrapeURL): more intense tests * bunch of stuff * get rid of WebScraper (mostly) * adapt batch scrape * add staging deploy workflow * fix yaml * fix logger issues * fix v1 test schema * feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions * scrapeURL: v0 backwards compat * logger fixes * feat(scrapeurl): v0 returnOnlyUrls support * fix(scrapeURL/v0): URL leniency * fix(batch-scrape): ts non-nullable * fix(scrapeURL/fire-engine/chromecdp): fix wait action * fix(logger): remove error debug key * feat(requests.http): use dotenv expression * fix(scrapeURL/extractMetadata): extract custom metadata * fix crawl option conversion * feat(scrapeURL): Add retry logic to robustFetch * fix(scrapeURL): crawl stuff * fix(scrapeURL): LLM extract * fix(scrapeURL/v0): search fix * fix(tests/v0): grant larger response size to v0 crawl status * feat(scrapeURL): basic fetch engine * feat(scrapeURL): playwright engine * feat(scrapeURL): add url-specific parameters * Update readme and examples * added e2e tests for most parameters. Still a few actions, location and iframes to be done. * fixed type * Nick: * Update scrape.ts * Update index.ts * added actions and base64 check * Nick: skipTls feature flag? * 403 * todo * todo * fixes * yeet headers from url specific params * add warning when final engine has feature deficit * expose engine results tracker for ScrapeEvents implementation * ingest scrape events * fixed some tests * comment * Update index.test.ts * fixed rawHtml * Update index.test.ts * update comments * move geolocation to global f-e option, fix removeBase64Images * Nick: * trim url-specific params * Update index.ts --------- Co-authored-by: Eric Ciarla <ericciarla@yahoo.com> Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com> Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2024-11-07 20:57:33 +01:00
rafaelsideguide	367af9512f	added iframe links to extractLinksFromHTML	2024-10-31 10:53:47 -03:00
Thomas Kosmas	acde353e56	skipTlsVerification on robots.txt scraping	2024-10-23 01:07:03 +03:00
rafaelsideguide	180801225b	fix/check files on crawl	2024-10-14 15:44:45 -03:00
Gergő Móricz	e7f267b6fe	Merge branch 'main' into v1-webscraper	2024-08-23 17:21:54 +02:00
Gergő Móricz	8e3c2b2855	fix(crawler): verify URL	2024-08-22 23:30:19 +02:00
Gergő Móricz	fbbc3878f1	fix(crawler): make sure includes/excludes is an array	2024-08-22 13:18:26 +02:00
Gergő Móricz	55009e51f5	fix: filter out invalid URLs from crawl links	2024-08-21 20:49:25 +02:00
rafaelsideguide	e1c9cbf709	bug fixed. crawl should not stop if sitemap url is invalid	2024-08-20 09:11:58 -03:00
Gergő Móricz	aabfaf0ac5	clean up crawl-status, fix db ddos	2024-08-16 23:29:39 +02:00
Gergo Moricz	86e136beca	feat: crawl to scrape conversion	2024-08-13 20:51:43 +02:00
rafaelsideguide	8568b61015	bugfix for sitemaps	2024-08-02 11:03:01 -03:00
rafaelsideguide	f48ff36b32	added .inc files and forced lower case comparison	2024-07-31 09:28:43 -03:00
Nicolas	e5b797549e	Merge branch 'main' into feat/scrape-monitoring	2024-07-25 16:21:02 -04:00
rafaelsideguide	e720e1bacf	Merge remote-tracking branch 'origin/main' into feat/logger	2024-07-25 09:49:27 -03:00
Gergo Moricz	7cd9bf92e3	feat: scrape event logging to DB	2024-07-24 14:31:25 +02:00
Rafael Miller	5e728c1a4d	Update apps/api/src/scraper/WebScraper/crawler.ts no need for regex Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>	2024-07-24 08:33:00 -03:00
rafaelsideguide	6208ecdbc0	added logger	2024-07-23 17:30:46 -03:00
rafaelsideguide	a684bd3c5d	added regex for links in sitemap	2024-07-23 09:07:23 -03:00
rafaelsideguide	5c02dbe20c	fix(isFile): added .tiff extension	2024-07-18 17:07:21 -03:00
Gergo Moricz	f0e95ce399	fix(WebCrawler): filter out file URLs when taking URLs from sitemap	2024-07-18 21:49:37 +02:00
Nicolas	e098e88ea7	Nick:	2024-07-12 22:02:08 -04:00
rafaelsideguide	9ad06fdf56	added fire-engine fallback for getting sitemaps	2024-07-09 16:07:53 -03:00
Nicolas	90c54c32fd	Nick: refactor	2024-07-03 18:01:17 -03:00
rafaelsideguide	4d6e25619b	minor spacing and comment stuff	2024-07-01 16:05:34 -03:00
Jeff Pereira	a5fb45988c	new feature allowExternalContentLinks	2024-06-28 17:23:40 -07:00
Nicolas	90b7fff366	Update crawler.ts	2024-06-24 16:52:01 -03:00
rafaelsideguide	3ebdf93342	removed console.logs	2024-06-24 16:43:12 -03:00
Nicolas	56d42d9c9b	Nick:	2024-06-24 16:33:07 -03:00
rafaelsideguide	21d29de819	testing crawl with new.abb.com case many unnecessary console.logs for tracing the code execution	2024-06-24 16:25:07 -03:00
Eric Ciarla	b1eb608295	Merge branch 'main' into feat/maxDepthRelative	2024-06-15 16:50:27 -04:00
Eric Ciarla	34e37c5671	Add unit tests to replace e2e	2024-06-15 16:43:37 -04:00
Eric Ciarla	a6b7197737	Fix for maxDepth	2024-06-14 19:40:37 -04:00
Nicolas	e88cb314c8	Update crawler.ts	2024-06-14 13:44:54 -07:00
Eric Ciarla	2c5f5c0ea2	Merge branch 'main' into feat/maxDepthRelative	2024-06-14 11:49:12 -04:00
Eric Ciarla	ab9de0f5ab	Update maxDepth tests	2024-06-13 18:46:30 -04:00
rafaelsideguide	bb859ae9a7	Added metadata.pageStatusCode and metadata.pageError properties to the responses	2024-06-13 17:08:40 -03:00
rafaelsideguide	ee282c3d55	Added allowBackwardCrawling option	2024-06-11 15:24:39 -03:00
Nicolas	f6b06ac27a	Nick: ignoreSitemap, better crawling algo	2024-06-10 18:12:41 -07:00
Nicolas	3091f0134c	Nick:	2024-06-10 16:27:10 -07:00
rafaelsideguide	f4a3469b9e	Merge branch 'main' into bug/crawl-limit	2024-05-22 14:27:28 -03:00
Nicolas	0d187f0425	Merge pull request #77 from tractorjuice/patch-1 Add additional file extensions to crawler.ts	2024-05-22 10:16:49 -07:00
Nicolas	9e61d431f0	Nick: hyper dx integration init	2024-05-20 13:36:34 -07:00

67 Commits