# Session Management
Session management in Crawl4AI allows you to maintain state across multiple requests and handle complex multi-page crawling tasks, which is particularly useful for dynamic websites.
## Basic Session Usage
Use `session_id` to maintain state between requests:
```python
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # First request
    result1 = await crawler.arun(
        url="https://example.com/page1",
        session_id=session_id
    )

    # Subsequent request using the same session
    result2 = await crawler.arun(
        url="https://example.com/page2",
        session_id=session_id
    )

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
```
## Dynamic Content with Sessions
Here's a real-world example of crawling GitHub commits across multiple pages:
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        # Navigation JavaScript: record the current first commit so the wait
        # condition can detect the change, then click the "next page" button
        js_next_page = """
        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
        if (commits.length > 0) {
            window.firstCommit = commits[0].textContent.trim();
        }
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        # Wait condition: the next page has loaded once the first visible
        # commit differs from the one recorded before the click
        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        # Extraction schema: one item per commit row
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # Crawl multiple pages within the same session
        for page in range(3):
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,  # after the first load, only run JS in the open tab
                bypass_cache=True
            )

            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
```
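Since `crawl_dynamic_content()` is a coroutine, a standalone script can drive it with `asyncio.run`; a minimal sketch:

```python
import asyncio

if __name__ == "__main__":
    # Run the session-based crawl and report the total number of commits
    commits = asyncio.run(crawl_dynamic_content())
    print(f"Collected {len(commits)} commits in total")
```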
## Session Best Practices
1. **Session Naming**:

   ```python
   # Use descriptive session IDs
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```

2. **Resource Management**:

   ```python
   try:
       # Your crawling code
       pass
   finally:
       # Always clean up sessions
       await crawler.crawler_strategy.kill_session(session_id)
   ```

3. **State Management**:

   ```python
   # First page: login
   result = await crawler.arun(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )

   # Second page: verify login success
   result = await crawler.arun(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   ```
## Common Use Cases
1. **Authentication Flows**
2. **Pagination Handling**
3. **Form Submissions** (see the sketch below)
4. **Multi-step Processes**
5. **Dynamic Content Navigation**
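Form submissions and multi-step processes follow the same pattern as the examples above: reuse one `session_id`, drive the page with `js_code`, and continue in the same tab with `js_only=True`. The sketch below combines a form submission with a follow-up step; the URL, CSS selectors, and form field name are illustrative assumptions, not something defined by Crawl4AI.

```python
from crawl4ai import AsyncWebCrawler

async def submit_search_form():
    async with AsyncWebCrawler() as crawler:
        session_id = "form_submission_session"
        try:
            # Step 1: load the page and submit the form via JavaScript
            # (URL, selector, and field name are placeholders)
            await crawler.arun(
                url="https://example.com/search",
                session_id=session_id,
                js_code="""
                    document.querySelector('input[name="q"]').value = 'crawl4ai';
                    document.querySelector('form').submit();
                """
            )

            # Step 2: stay on the resulting page and wait for results to render
            result = await crawler.arun(
                url="https://example.com/search",
                session_id=session_id,
                js_only=True,                   # reuse the open tab; don't re-navigate
                wait_for="css:.search-results"  # placeholder selector
            )
            return result.markdown
        finally:
            # Always release the session
            await crawler.crawler_strategy.kill_session(session_id)
```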