# Proxy & Security

Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.

## Basic Proxy Setup

Simple proxy configuration with `BrowserConfig`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

# Using a proxy URL
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")

# Using a SOCKS proxy
browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
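
The snippets in this guide use `await` at the top level for brevity. In a standalone script, wrap the calls in an `async` entry point; a minimal sketch:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)  # CrawlResult exposes a success flag

if __name__ == "__main__":
    asyncio.run(main())
```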

## Authenticated Proxy

Use an authenticated proxy with `BrowserConfig`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}

browser_config = BrowserConfig(proxy_config=proxy_config)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
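
Hard-coding credentials is risky. A small sketch that builds the same `proxy_config` dict from environment variables instead (the variable names here are illustrative, not required by Crawl4AI):

```python
import os

from crawl4ai.async_configs import BrowserConfig

# PROXY_SERVER, PROXY_USER, and PROXY_PASS are illustrative names;
# set them in your environment before running
proxy_config = {
    "server": os.environ["PROXY_SERVER"],
    "username": os.environ["PROXY_USER"],
    "password": os.environ["PROXY_PASS"],
}
browser_config = BrowserConfig(proxy_config=proxy_config)
```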

## Rotating Proxies

Example using a proxy rotation service, creating a fresh `BrowserConfig` for each request (browser-level settings take effect when the browser launches, so they cannot be changed on an already-running crawler):

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}

urls = ["https://example.com/page1", "https://example.com/page2"]

# Use a fresh config (and crawler) for each proxy
for url in urls:
    proxy = await get_next_proxy()
    browser_config = BrowserConfig(proxy_config=proxy)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url)
```
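
One minimal way to implement `get_next_proxy()` is to round-robin over a static pool with `itertools.cycle`; the pool entries below are placeholders for whatever your rotation service provides:

```python
import itertools

# Placeholder pool; in practice this might come from a rotation service API
PROXY_POOL = itertools.cycle([
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
])

async def get_next_proxy():
    # Round-robin over the static pool
    return next(PROXY_POOL)
```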

## Custom Headers

Add security-related headers via `BrowserConfig`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

headers = {
    "X-Forwarded-For": "203.0.113.195",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
}

browser_config = BrowserConfig(headers=headers)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
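
If header values come from configuration rather than literals, it can help to sanity-check them before use; a sketch using only the standard library to validate the `X-Forwarded-For` value:

```python
import ipaddress

def validated_forwarded_for(value: str) -> str:
    # Raises ValueError if the value is not a valid IPv4/IPv6 address
    ipaddress.ip_address(value)
    return value

headers = {
    "X-Forwarded-For": validated_forwarded_for("203.0.113.195"),
    "Accept-Language": "en-US,en;q=0.9",
}
```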

## Combining with Magic Mode

For maximum protection, combine a proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    proxy="http://proxy.example.com:8080",
    headers={"Accept-Language": "en-US"}
)
crawler_config = CrawlerRunConfig(magic=True)  # Enable all anti-detection features

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawler_config)
```
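
Putting the pieces together, a sketch that retries a failed fetch through the next proxy in a pool while keeping Magic Mode enabled. This assumes a placeholder pool as in the rotation example and that `CrawlResult` exposes a `success` flag:

```python
import asyncio
import itertools

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

# Placeholder proxy pool, as in the rotation example above
PROXY_POOL = itertools.cycle([
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
])

async def fetch_with_retry(url: str, max_attempts: int = 3):
    crawler_config = CrawlerRunConfig(magic=True)  # anti-detection features
    for _ in range(max_attempts):
        # Each attempt launches a browser behind the next proxy in the pool
        browser_config = BrowserConfig(proxy_config=next(PROXY_POOL))
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url=url, config=crawler_config)
            if result.success:
                return result
    return None  # every attempt failed

result = asyncio.run(fetch_with_retry("https://example.com"))
```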