* feat(api): remove old indexes pt. 1
* feat(map): better subdomain support
* more culling
* adjust map maxage
* feat(api/tests): add tests for pdf caching
* fix(scrapeURL/index): pdf caching
* restore pdf cache
* fix __experimental_cache
* sitemap fetching
* remove extra var
* poc progress
* poc
* url splits and better url normalization
* feat(index): integrate into map
* fix on selfhost
* feat: modifiers
* separate index supa logic
* debug
* fix language comparison
* feat: dontStoreInCache
* feat(index): some rudimentary testing
* feat: use url split columns
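The url-normalization and split-column commits above suggest URLs are canonicalized and pre-split into prefix columns so index lookups can use plain indexed equality. A minimal sketch under that assumption; `normalizeURL` and `urlSplits` are illustrative names, not the actual implementation:

```ts
// Hypothetical sketch of URL normalization and "url split" columns.
function normalizeURL(raw: string): string {
  const url = new URL(raw);
  url.hash = "";
  url.hostname = url.hostname.toLowerCase();
  // Drop a trailing slash so "/docs" and "/docs/" index identically.
  if (url.pathname.endsWith("/") && url.pathname !== "/") {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// Split a URL into prefix columns (host, host/segment1, ...) so lookups
// like "everything under example.com/docs" become equality matches.
function urlSplits(raw: string): string[] {
  const url = new URL(normalizeURL(raw));
  const segments = url.pathname.split("/").filter(Boolean);
  const splits = [url.hostname];
  let acc = url.hostname;
  for (const segment of segments) {
    acc += "/" + segment;
    splits.push(acc);
  }
  return splits;
}
```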
* feat(queue-worker/kickoff): use index links to kickoff crawl
* feat(scrapeURL/index): behaviour on non-200 index entries
* feat: add benchmark for scrapes
* feat(map): ignoreIndex
* feat(index): batch insert
* fix(api/tests/scrape): fix index test to work with batching
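Because inserts are batched, index writes only become visible after a flush, which is what the test fix above accounts for. A rough sketch of the batching pattern, with all names illustrative:

```ts
// Minimal batched-insert sketch; not the real module.
type IndexRow = { url: string; status: number };

const BATCH_SIZE = 100;
const FLUSH_INTERVAL_MS = 5000;
let pending: IndexRow[] = [];

// Stand-in for the real bulk write (e.g. a single Supabase insert call).
async function insertRows(rows: IndexRow[]): Promise<void> {
  console.log(`flushing ${rows.length} index rows`);
}

async function flush(): Promise<void> {
  if (pending.length === 0) return;
  const batch = pending;
  pending = [];
  await insertRows(batch);
}

export function enqueueIndexInsert(row: IndexRow): void {
  pending.push(row);
  if (pending.length >= BATCH_SIZE) void flush();
}

// Periodic flush so small batches still land; tests that read the index
// back have to wait out (or force) this interval.
setInterval(() => void flush(), FLUSH_INTERVAL_MS);
```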
* disable cacheable lookup for self hosting tests
* feat(js-sdk): dontStoreInCache
* chore(js-sdk): bump
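Usage sketch for the new `dontStoreInCache` SDK option; the option name comes from the commit above, while its exact placement in the scrape params is an assumption:

```ts
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

const doc = await app.scrapeUrl("https://example.com", {
  formats: ["markdown"],
  dontStoreInCache: true, // scrape fresh and skip persisting the result to the cache/index
});
```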
* feat(index): FIRECRAWL_INDEX_WRITE_ONLY
* feat(api/test): index envs
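A minimal sketch of what a write-only index mode gated on `FIRECRAWL_INDEX_WRITE_ONLY` could look like; `lookupIndex` is a hypothetical helper:

```ts
// Hypothetical read gate: keep populating the index, never serve from it.
declare function lookupIndex(url: string): Promise<string | null>;

const indexWriteOnly = process.env.FIRECRAWL_INDEX_WRITE_ONLY === "true";

export async function readFromIndex(url: string): Promise<string | null> {
  if (indexWriteOnly) {
    return null; // writes continue elsewhere; reads always miss
  }
  return lookupIndex(url);
}
```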
* map benchmarks
* cleanup
* further fixes
* clean up on map
* remove extraneous log
* workflow test run
* asd
* improve fns
* try again
* wow, I'm an idiot
* ok fixed
* wth
* revert
* async saving to index
* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)
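The itemprop change (#1624) extends metadata extraction to microdata attributes alongside the usual `name`/`property` ones. A hedged sketch of the idea using cheerio; the selector logic is illustrative, not the actual implementation:

```ts
import * as cheerio from "cheerio";

// Collect <meta> tags keyed by name, property, or (new) itemprop.
function extractMeta(html: string): Record<string, string> {
  const $ = cheerio.load(html);
  const out: Record<string, string> = {};
  $("meta").each((_, el) => {
    const key =
      $(el).attr("name") ?? $(el).attr("property") ?? $(el).attr("itemprop");
    const content = $(el).attr("content");
    if (key && content) out[key] = content;
  });
  return out;
}
```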
* feat(selfhost): deploy a playwright image (#1625)
* Testing improvements (FIR-2209) (#1623)
* yeet ad blocking tests until further notice
* feat: re-enable billing tests
* more timeout
* cache issues with billing test
* weird thing
* fix(api/tests/scrape/status): propagation time
* stupid
* no log
* sws
---------
Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>
* feat: pdf-parser, implementation in scrapeURL
* use pdf-parser for page count instead of mu
* fix(pdf-parser): bindings
* feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE
* implement post-runsync polling and fix
* fix(Dockerfile): copy in the pdf-parser source code
* fix(scrapeURL/pdf): better error for timeout below 0
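The two timeout commits above imply the PDF budget is derived from the page count and validated before use, erroring clearly instead of passing a negative timeout downstream. A sketch under those assumptions (the constant's value is made up):

```ts
// Illustrative page-count-based PDF timeout.
const MILLISECONDS_PER_PAGE = 150;

function pdfTimeout(pageCount: number, remainingBudgetMs: number): number {
  const needed = pageCount * MILLISECONDS_PER_PAGE;
  if (remainingBudgetMs <= 0) {
    // Better error for a timeout at or below 0: fail loudly up front.
    throw new Error(
      `PDF processing needs ~${needed}ms but the scrape timeout is already exhausted`,
    );
  }
  return Math.min(needed, remainingBudgetMs);
}
```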
* Add caching for RunPod PDF markdown results in GCS
Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>
* Update PDF caching to hash base64 directly and add metadata
Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>
* Fix PDF caching to directly hash content and fix test expectations
Co-Authored-By: thomas@sideguide.dev <thomas@sideguide.dev>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: thomas@sideguide.dev <thomas@sideguide.dev>
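The caching commits in the block above describe hashing the PDF's base64 payload directly to build the cache key, then storing the parsed markdown plus metadata in GCS. A rough sketch, with the bucket layout and names assumed:

```ts
import { createHash } from "crypto";

// Cache key = sha256 of the base64 content itself (no decode step).
function pdfCacheKey(base64Content: string): string {
  return createHash("sha256").update(base64Content).digest("hex");
}

// Structural type stands in for a GCS bucket; paths are assumptions.
interface BucketLike {
  file(name: string): { save(data: string, opts?: object): Promise<void> };
}

async function cachePdfResult(
  bucket: BucketLike,
  base64Content: string,
  markdown: string,
): Promise<void> {
  const key = pdfCacheKey(base64Content);
  await bucket.file(`pdf-cache/${key}.md`).save(markdown, {
    metadata: { contentType: "text/markdown" },
  });
}
```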
* feat(scrapeURL): use cacheableLookup
* feat(queue-worker): add cacheableLookup
* fix(cacheable-lookup): make it work with tailscale on local
* add devenv
* try again
* allow querying all
* log
* fixes
* asd
* fix:
* fix(lookup):
* lookup
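The cacheableLookup commits above wire DNS caching into the HTTP agents so repeated scrapes reuse resolved addresses instead of re-resolving per request. A minimal sketch using the cacheable-lookup package's documented `install()` hook; disabling it (as the self-host tests above do) just means skipping this step:

```ts
import CacheableLookup from "cacheable-lookup";
import http from "http";
import https from "https";

// Cache DNS answers and attach the lookup to the global agents.
// (Presumably local resolvers like Tailscale's needed special handling,
// per the "make it work with tailscale on local" fix above.)
const cacheable = new CacheableLookup();
cacheable.install(http.globalAgent);
cacheable.install(https.globalAgent);
```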