# Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you’re recognized as a human and not mistaken for a bot. This tutorial covers:

1. **Managed Browsers** – The recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode** – A simplified fallback solution for quick automation without persistent identity.

---

## 1. Managed Browsers: Your Digital Identity Solution

**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
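
To make “session data” concrete: after a single manual login, the profile folder contains a standard Chromium profile layout. A quick way to inspect it (the profile path is a placeholder, and exact files vary by Chromium version):

```python
# Peek inside a managed profile after one manual login.
# "Default" is Chromium's default profile subfolder.
import os

print(os.listdir(os.path.expanduser("~/my_chrome_profile/Default")))
# Typical entries include: 'Cookies', 'Local Storage', 'Preferences'
```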

### Key Benefits

- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you’re a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.

---

## 2. Creating a User Data Directory (Command-Line Approach via Playwright)

This section shows how to create a **user-data directory** using **Playwright’s Chromium** binary rather than a system-wide Chrome/Edge. We’ll **locate** that binary, launch it with a `--user-data-dir` argument to set up your profile, and then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.

If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:

1. **Find** the Playwright Chromium binary:
   - On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or a similar path.
   - To see an overview of installed browsers, run:
     ```bash
     python -m playwright install --dry-run
     ```
     or
     ```bash
     playwright install --dry-run
     ```
     (depending on your environment). This shows where Playwright keeps Chromium. For a programmatic alternative, see the Python sketch after this list.
   - For instance, you might see a path like:
     ```
     ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
     ```
     on Linux, or a corresponding folder on macOS/Windows.
2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
   ```bash
   # Linux example
   ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
       --user-data-dir=/home/<you>/my_chrome_profile
   ```
   ```bash
   # macOS example (Playwright’s internal binary)
   ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
       --user-data-dir=/Users/<you>/my_chrome_profile
   ```
   ```powershell
   # Windows example (PowerShell/cmd)
   "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
       --user-data-dir="C:\Users\<you>\my_chrome_profile"
   ```
   **Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.
   - This **opens** a fresh Chromium with your new or existing data folder.
   - **Log into** any sites or configure your browser the way you want.
   - **Close** when done—your profile data is saved in that folder.
3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
   ```python
   from crawl4ai import BrowserConfig

   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/home/<you>/my_chrome_profile",
       browser_type="chromium"
   )
   ```
   - Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
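
If you’d rather locate the Playwright-managed binary programmatically (the alternative mentioned in step 1), Playwright’s Python API exposes the executable path directly. A minimal sketch (Playwright is installed alongside Crawl4AI):

```python
# Print the path of Playwright's bundled Chromium binary,
# instead of hunting through the cache folder by hand.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    print(p.chromium.executable_path)
```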

---

## 3. Using Managed Browsers in Crawl4AI

Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,             # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )

    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### Workflow

1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`; or script this step, as in the sketch after this list).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl** – The site sees your identity as if you’re the same user who just logged in.
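
If you’d rather script the login step than launch a browser by hand, the sketch below opens a persistent Playwright context on the same profile folder so you can log in once. The login URL and profile path are placeholders; adjust them for your target site:

```python
# One-off login helper: opens a visible browser on the profile folder,
# waits while you log in manually, then saves the session on close.
import asyncio
from playwright.async_api import async_playwright

async def login_once(profile_dir: str) -> None:
    async with async_playwright() as p:
        context = await p.chromium.launch_persistent_context(
            profile_dir,
            headless=False,  # visible, so you can complete the login by hand
        )
        page = await context.new_page()
        await page.goto("https://example.com/login")  # placeholder URL
        input("Log in in the opened browser, then press Enter... ")
        await context.close()  # cookies/local storage persist in profile_dir

asyncio.run(login_once("/path/to/my-chrome-profile"))
```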

---

## 4. Magic Mode: Simplified Automation

If you **don’t** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )
        print("Success:", result.success)

asyncio.run(main())
```

**Magic Mode**:

- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling

**But** it’s no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.
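
If you want some of this behavior with finer control, `CrawlerRunConfig` also exposes individual anti-detection toggles. A sketch, assuming your Crawl4AI version provides the `simulate_user` and `override_navigator` parameters:

```python
# Finer-grained alternative to magic=True (parameter availability
# may vary across Crawl4AI versions).
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    simulate_user=True,       # random mouse movements and clicks
    override_navigator=True,  # mask common navigator automation signals
    remove_overlay_elements=True
)
```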

---

## 5. Comparing Managed Browsers vs. Magic Mode

| Feature                 | **Managed Browsers**                                        | **Magic Mode**                                       |
|-------------------------|-------------------------------------------------------------|------------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in `user_data_dir`       | No persistent data (fresh each run)                  |
| **Genuine Identity**    | Real user profile with full rights & preferences            | Emulated user-like patterns, but no actual identity  |
| **Complex Sites**       | Best for login-gated sites or heavy configuration           | Simple tasks; minimal login or config needed         |
| **Setup**               | External creation of `user_data_dir`, then use in Crawl4AI  | Single-line approach (`magic=True`)                  |
| **Reliability**         | Extremely consistent (same data across runs)                | Good for smaller tasks; can be less stable           |


---

## 6. Summary

- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.

**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.

With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.