* Improve URL filtering error messages with specific denial reasons
- Add FilterResult and FilterLinksResult interfaces for structured error reporting
- Define DenialReason enum with specific, human-readable error messages
- Update filterURL method to return structured results with denial reasons
- Update filterLinks method to collect and return denial reasons for each URL
- Modify error handling in queue-worker.ts to use specific denial reasons
- Add comprehensive tests for different URL filtering scenarios
- Maintain backward compatibility while improving error specificity
Fixes: the misleading 'includePaths/excludePaths rules' error; the message now shows the actual denial reason (robots.txt, exclude patterns, depth limits, etc.)
Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>
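
For reviewers, here is a minimal sketch of how the structured results described above could fit together. Only the names `FilterResult`, `FilterLinksResult`, `DenialReason`, `filterURL`, `filterLinks`, and the `links` field come from this change. The enum members, message strings, remaining fields, `FilterOptions`, `urlDepth`, and the order of the checks are all assumptions made for illustration, not the actual implementation.

```typescript
// Illustrative sketch only. Enum members, messages, field names other than
// `links`, and the FilterOptions shape are hypothetical.
enum DenialReason {
  ROBOTS_TXT = "URL blocked by robots.txt",
  EXCLUDE_PATTERN = "URL matches an exclude pattern",
  DEPTH_LIMIT = "URL exceeds the maximum crawl depth",
}

interface FilterResult {
  allowed: boolean;
  url?: string; // set when the URL is allowed
  denialReason?: DenialReason; // set when the URL is denied
}

interface FilterLinksResult {
  links: string[]; // URLs that passed every check
  denialReasons: Record<string, DenialReason>; // denied URL -> reason
}

// Hypothetical options bag standing in for the crawler's real configuration.
interface FilterOptions {
  excludePatterns: RegExp[]; // tested against the URL path
  maxDepth: number; // maximum number of path segments
  isAllowedByRobots: (href: string) => boolean; // stand-in for a robots.txt check
}

// Depth is taken here as the number of non-empty path segments.
function urlDepth(url: URL): number {
  return url.pathname.split("/").filter((segment) => segment.length > 0).length;
}

// Returns the first failing check as a specific reason instead of one
// generic "includePaths/excludePaths rules" error.
function filterURL(href: string, opts: FilterOptions): FilterResult {
  const url = new URL(href);
  if (!opts.isAllowedByRobots(href)) {
    return { allowed: false, denialReason: DenialReason.ROBOTS_TXT };
  }
  if (opts.excludePatterns.some((pattern) => pattern.test(url.pathname))) {
    return { allowed: false, denialReason: DenialReason.EXCLUDE_PATTERN };
  }
  if (urlDepth(url) > opts.maxDepth) {
    return { allowed: false, denialReason: DenialReason.DEPTH_LIMIT };
  }
  return { allowed: true, url: href };
}

// Collects allowed links plus the denial reason for each rejected URL.
function filterLinks(hrefs: string[], opts: FilterOptions): FilterLinksResult {
  const links: string[] = [];
  const denialReasons: Record<string, DenialReason> = {};
  for (const href of hrefs) {
    const result = filterURL(href, opts);
    if (result.allowed) {
      links.push(href);
    } else if (result.denialReason !== undefined) {
      denialReasons[href] = result.denialReason;
    }
  }
  return { links, denialReasons };
}
```

Under this shape, a caller that previously treated the return value of `filterLinks` as a plain array would read `result.links` instead, and could look up `result.denialReasons[url]` to report why a specific URL was rejected.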
* Fix test compilation error for FilterLinksResult interface
- Update crawler.test.ts to use filteredLinks.links.length instead of filteredLinks.length
- Update test expectations to use filteredLinks.links array
- Resolves TypeScript compilation error preventing CI from passing
Co-Authored-By: mogery@sideguide.dev <mogery@sideguide.dev>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: mogery@sideguide.dev <mogery@sideguide.dev>
* feat: add credits_billed everywhere
also a bit of logging improvement for logJob
* fix(queue-worker): db auth check before doing rpc for crawl/batch_scrape
* feat(api): GET /crawl/ongoing
* fix: routers in wrong order
* feat(api/crawl/ongoing): return more details
---------
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
* poc progress
* poc
* url splits and better url normalization
* feat(index): integrate into map
* fix on selfhost
* feat: modifiers
* separate index supa logic
* debug
* fix language comparison
* feat: dontStoreInCache
* feat(index): some rudimentary testing
* feat: use url split columns
* feat(queue-worker/kickoff): use index links to kickoff crawl
* feat(scrapeURL/index): behaviour on non-200 index entries
* feat: add benchmark for scrapes
* feat(map): ignoreIndex
* feat(index): batch insert
* fix(api/tests/scrape): fix index test to work with batching
* disable cacheable lookup for self hosting tests
* feat(js-sdk): dontStoreInCache
* chore(js-sdk): bump
* feat(index): FIRECRAWL_INDEX_WRITE_ONLY
* feat(api/test): index envs
* map benchmarks
* cleanup
* further fixes
* clean up on map
* remove extraneous log
* workflow test run
* asd
* improve fns
* try again
* wow i'm an idiot
* ok fixed
* wth
* revert
* async saving to index
* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)
* feat(selfhost): deploy a playwright image (#1625)
* Testing improvements (FIR-2209) (#1623)
* yeet ad blocking tests until further notice
* feat: re-enable billing tests
* more timeout
* cache issues with billing test
* weird thing
* fix(api/tests/scrape/status): propagation time
* stupid
* no log
* sws
---------
Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>