
# Tutorial: Clicking Buttons to Load More Content with Crawl4AI

## Introduction

When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:

1. **Step-by-Step (Session-based) Approach:** Make multiple calls to `arun()` to progressively load more content.
2. **Single-call Approach:** Execute a more complex JavaScript snippet inside a single `arun()` call to handle all the clicks before extraction.
## Prerequisites

- A working installation of Crawl4AI
- Basic familiarity with Python’s `async`/`await` syntax
## Step-by-Step Approach

Use a session ID to maintain state across multiple `arun()` calls:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# This JS finds the "Next" button and clicks it (if present)
js_code = [
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]

wait_for_condition = "css:.new-content-class"

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # 1. Load the initial page
        result_initial = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.BYPASS,
            session_id="my_session",
        )

        # 2. Click the 'Next' button and wait for the new content to appear
        result_next = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            js_code=js_code,
            wait_for=wait_for_condition,
            js_only=True,
            cache_mode=CacheMode.BYPASS,
        )

        # `result_next` now contains the updated HTML after clicking 'Next'

asyncio.run(main())
```

**Key Points:**

- **`session_id`**: Keeps the same browser context open across calls.
- **`js_code`**: Executes JavaScript in the context of the already loaded page.
- **`wait_for`**: Ensures the crawler waits until the new content is fully loaded before extracting.
- **`js_only=True`**: Runs the JS in the current session without reloading the page.

By repeating the `arun()` call and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content. A sketch of such a loop follows.
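
For example, here is a minimal sketch of that loop. The `button.next` selector, the `css:.new-content-class` wait condition, the page cap, and the `class="next"` stop check are all placeholder assumptions for an imaginary site, not Crawl4AI requirements; adapt them to the page you are crawling.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# JS that clicks the "Next" button when it is present
js_click_next = [
    "const btn = document.querySelector('button.next'); btn && btn.click();"
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # Load the initial page in a named session
        result = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            cache_mode=CacheMode.BYPASS,
        )
        pages = [result.html]

        for _ in range(20):  # safety cap on the number of pages
            result = await crawler.arun(
                url="https://example.com",
                session_id="my_session",
                js_code=js_click_next,
                wait_for="css:.new-content-class",
                js_only=True,
                cache_mode=CacheMode.BYPASS,
            )
            pages.append(result.html)
            # Naive stop check: the rendered HTML no longer contains a
            # "Next" button; a real crawler might parse the HTML instead
            if 'class="next"' not in result.html:
                break

        print(f"Collected {len(pages)} page snapshots")

asyncio.run(main())
```

Each iteration reuses the same `my_session` browser tab, so page state (buttons already clicked, items already loaded) carries over between calls.
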
## Single-call Approach

If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:

- Iterates over all the modules or “Next” buttons
- Clicks them one by one
- Waits for content updates between each click
- Returns control to Crawl4AI for extraction once it is done

Example snippet:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode

# Example JS that clicks each module in turn, pausing between clicks
js_code = [
    """
    (async () => {
        const modules = document.querySelectorAll('.module-item');
        for (let i = 0; i < modules.length; i++) {
            modules[i].scrollIntoView();
            modules[i].click();
            // Wait for each module's content to load; adjust 100 ms as needed
            await new Promise(r => setTimeout(r, 100));
        }
    })();
    """
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="css:.final-loaded-content-class",
            cache_mode=CacheMode.BYPASS,
        )

        # `result` now contains all content after every module has been clicked

asyncio.run(main())
```

**Key Points:**

- All interactions (clicks and waits) happen before the extraction.
- Ideal for pages where all the steps can be done in a single pass.
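
If no single “final” CSS class reliably appears, `wait_for` also accepts a `js:` prefix with a JavaScript expression that is polled until it returns true. A small sketch; the selector and the expected count are placeholder assumptions:

```python
# Poll a JS predicate instead of waiting for a CSS selector.
# The '.module-item .detail' selector and the count of 10 are placeholders.
wait_for = "js:() => document.querySelectorAll('.module-item .detail').length >= 10"
```
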
## Choosing the Right Approach

- **Step-by-Step (Session-based)**:
  - Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
  - Useful if the page requires multiple conditions to be checked at runtime.

- **Single-call**:
  - Perfect if the sequence of interactions is known in advance.
  - Cleaner code if the page’s structure is consistent and predictable.
## Conclusion

Crawl4AI makes it easy to handle dynamic content:

- Use session IDs and multiple `arun()` calls for stepwise crawling.
- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.

This flexibility ensures you can handle a wide range of dynamic web pages efficiently.