19 Commits

Author SHA1 Message Date
UncleCode
b1693b1c21 Remove old quickstart files 2025-04-05 23:10:25 +08:00
unclecode
eb131bebdf Create series of quickstart files. 2024-09-04 15:33:24 +08:00
unclecode
4d283ab386 ## [v0.2.74] - 2024-07-08
A slew of exciting updates to improve the crawler's stability and robustness! 🎉

- 💻 **UTF encoding fix**: Resolved the Windows "charmap" error by adding UTF encoding.
- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
2024-07-08 16:33:25 +08:00
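The "charmap" error mentioned in the v0.2.74 notes typically appears on Windows when a file is opened without an explicit encoding and Python falls back to a legacy code page. The commit's actual diff is not shown here, so the following is only a minimal sketch of the general technique; the helper names `save_text` and `load_text` and the sample string are hypothetical and not taken from the repository.

```python
import os
import tempfile


def save_text(path: str, text: str) -> None:
    """Write text with an explicit UTF-8 encoding (hypothetical helper)."""
    # Without encoding=..., open() uses the platform default, which on
    # Windows is often a legacy code page that cannot encode all characters.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path: str) -> str:
    """Read text back using the same explicit UTF-8 encoding (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Sample content with characters outside typical legacy code pages.
sample = "Crawled title: café ✓ 日本語"
tmp_path = os.path.join(tempfile.mkdtemp(), "page.txt")
save_text(tmp_path, sample)
round_trip = load_text(tmp_path)
```

Passing the same encoding on both the write and the read side keeps the round trip independent of the machine's locale settings.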
unclecode
3f0e265baf Merge branch 'format-inline-tags' 2024-06-19 00:48:38 +08:00
unclecode
21e2538e57 Update quickstart.py 2024-06-19 00:37:53 +08:00
unclecode
77da48050d chore: Add custom headers to LocalSeleniumCrawlerStrategy 2024-06-17 15:50:03 +08:00
unclecode
9a97aacd85 chore: Add hooks for customizing the LocalSeleniumCrawlerStrategy 2024-06-17 15:37:18 +08:00
unclecode
a19379aa58 Add recipe images, update README, and REST api example 2024-06-07 20:43:50 +08:00
unclecode
226a62a3c0 feat: Add screenshot functionality to crawl_urls 2024-06-07 15:33:15 +08:00
unclecode
8e73a482a2 feat: Add screenshot functionality to crawl_urls
The code changes in this commit add the `screenshot` parameter to the `crawl_urls` function in `main.py`. This allows users to specify whether they want to take a screenshot of the page during the crawling process. The default value is `False`.

This commit message follows the established convention of starting with a type (feat for feature) and providing a concise and descriptive summary of the changes made.
2024-06-07 15:23:32 +08:00
unclecode
51f26d12fe Update for v0.2.2
- Support multiple JS scripts
- Fixed some bugs
- Resolved a few issues related to Colab installation
2024-06-02 15:40:18 +08:00
unclecode
13a3b21d19 - Add ONNX embedding model for CPU devices, update the similarity threshold, improve the embedding speed. 2024-05-19 22:30:10 +08:00
unclecode
eb6423875f chore: Update Selenium options in crawler_strategy.py and add verbose logging in CosineStrategy 2024-05-18 14:13:06 +08:00
unclecode
b6319c6f6e chore: Add support for GPU, MPS, and CPU 2024-05-17 21:56:13 +08:00
unclecode
957a2458b1 chore: Update web crawler URLs to use NBC News business section 2024-05-17 18:11:13 +08:00
unclecode
1cc67df301 chore: Update pip installation command and requirements, add new dependencies 2024-05-17 16:53:03 +08:00
unclecode
a5f9d07dbf Remove dependency on Spacy model. 2024-05-17 15:08:03 +08:00
UncleCode
6fcaf26b4f Update quickstart.py: Add counting items 2024-05-16 22:49:12 +08:00
unclecode
c8589f8da3 Update:
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00